Application of CNN and Vision Transformer Models for Classifying Crowns in Pine Plantations Affected by Diplodia Shoot Blight

Wang, Mingzhu; Stone, Christine; Carnegie, Angus J.

doi:10.3390/f17010108


by Mingzhu Wang *, Christine Stone and Angus J. Carnegie

Forest Science, New South Wales Department of Primary Industries and Regional Development, Parramatta, NSW 2150, Australia

* Author to whom correspondence should be addressed.

Forests 2026, 17(1), 108; https://doi.org/10.3390/f17010108

Submission received: 11 December 2025 / Revised: 7 January 2026 / Accepted: 8 January 2026 / Published: 13 January 2026

(This article belongs to the Section Forest Health)


Abstract

Diplodia shoot blight is caused by an opportunistic fungal pathogen that infects many conifer species and has a global distribution. Depending on the duration and severity of the disease, affected needles appear yellow (chlorotic) for a brief period before becoming red or brown in colour. These symptoms can occur on individual branches or over the entire crown. Aerial sketch-mapping and the manual interpretation of aerial photography for tree health surveys are labour-intensive and subjective. Recently, however, the application of deep learning (DL) techniques to detect and classify tree crowns in high-spatial-resolution imagery has gained significant attention. This study evaluated two complementary DL approaches for the detection and classification of Pinus radiata trees infected with diplodia shoot blight across five geographically dispersed sites with varying topographies over two acquisition years: (1) object detection using YOLOv12 combined with the Segment Anything Model (SAM) and (2) pixel-level semantic segmentation using U-Net, SegFormer, and EVitNet. The three damage classes for the object detection approach were ‘yellow’, ‘red-brown’ (both whole-crown discolouration) and ‘dead tops’ (partially discoloured crowns), while for the semantic segmentation the three classes were yellow, red-brown, and background. The YOLOv12m model achieved an overall mAP50 score of 0.766 and mAP50–95 of 0.447 across all three classes, with red-brown crowns demonstrating the highest detection accuracy (mAP50: 0.918, F1 score: 0.851). Among the semantic segmentation models, SegFormer showed the strongest performance (IoU of 0.662 for red-brown and 0.542 for yellow) but at the cost of the longest training time, while EVitNet offered the most cost-effective solution, achieving accuracy comparable to SegFormer with superior training efficiency owing to its lighter architecture. The accurate identification and classification of crown damage symptoms support the calibration and validation of satellite-based monitoring systems and assist in the prioritisation of ground-based diagnosis or management interventions.

Keywords:

tree health; convolutional neural networks; vision transformer; aerial imagery; Pinus radiata; Diplodia sapinea

1. Introduction

The frequency and severity of weather events such as drought and hailstorms are increasing under the effects of climate change [1]. These abiotic stresses increase the susceptibility of trees to attack from biotic agents such as insect pests and fungal pathogens [2,3,4]. Globally, drought-related tree mortality in softwood plantations is becoming a significant management issue [5]. In Australia, two biotic agents that magnify the impacts of drought in Pinus plantations are the five-spined bark beetle (Ips grandicollis (Eichhoff)) and diplodia shoot blight (Diplodia sapinea (Fries) Fuckel). Australian pine plantations consist of numerous management units (MUs), each containing compartments of the same age class. The MUs range in age from newly planted to scheduled for harvesting, and their crown damage symptoms can vary spatially and over time, including when infected with D. sapinea. Diplodia sapinea is an opportunistic pathogen infecting many conifer species and has a global distribution [6,7]. Trees of all growth stages are susceptible to this disease, with symptoms presenting as dead or dying shoots, branches or crown tops. Severe infection can kill the tree. The infected needles may, for a brief period, appear yellow (chlorotic) before becoming red or brown (necrotic) in colour. Over time, the necrotic needles fall, leaving defoliated shoots and branches. These crown-scale symptoms are also common to several other damaging agents in Australia, including Ips grandicollis, and so require diagnosis by a forest health expert.

Conventional forest health monitoring often involves aerial and ground-based manual surveys. Forest health aerial sketch-mapping surveys remain standard practice in North America [8] and for pine plantations in Australia [9] and New Zealand [10] to identify the extent and severity of damaging agents and processes [11,12]. In addition to mapping the extent and severity of damaged tree crowns, an experienced observer can often diagnose the damaging agent from the aircraft. However, this methodology is subjective, can be imprecise, and relies on the experience of the aerial observer. A rapidly expanding alternative is the application of semantic segmentation and classification techniques to high-spatial-resolution, remotely sensed imagery acquired from UAVs, aircraft and high-resolution satellite systems [13,14,15,16,17,18,19].

Recent advances in digital image processing and deep learning (DL) are now being applied to aerial imagery by commercial forest service providers to automate tree inventories and health surveys [20,21,22]. However, their workflows are commercial-in-confidence and so are not freely available to educational institutions, government agencies, or the forest industry more broadly.

AI-based image analysis was initially dominated by traditional machine learning (ML) methods such as random forest (RF) and support vector machine (SVM) but over the past decade has largely been replaced by DL architectures [13,15]. DL techniques such as object detection, pixel-based semantic segmentation and instance segmentation can be applied to tree health surveys [15,23,24]. Object detection techniques identify the presence and location of tree crowns and draw bounding boxes around them, whereas pixel-based semantic segmentation classifies every pixel in an image into a predefined category, producing a pixel-wise mask for each classified crown. These approaches present different trade-offs in computational cost, spatial detail, and suitability for specific monitoring tasks: object detection is typically more efficient for tree counting and individual-level assessment, while semantic segmentation offers finer spatial resolution of within-crown symptom patterns and coverage of affected areas [25]. Instance segmentation, another DL approach that combines object detection with pixel-level masks, can also be used for tree surveys [24,26] and represents a promising direction for future investigation. While previous studies have demonstrated the potential of deep learning for the detection of unhealthy tree crowns, they have typically focused on either object detection approaches or semantic segmentation methods. In our study, we compared both object detection and semantic segmentation approaches on the same dataset for diplodia shoot blight detection. This comparison is critical for determining which approach provides the optimal balance of accuracy, computational efficiency, and practical applicability for operational forest health monitoring. Our study addresses this gap by evaluating state-of-the-art models from both paradigms.

The most common DL architectures applied for automated tree detection have been based on convolutional neural networks (CNNs). CNNs require training using a large number of labelled samples that, for forest health applications, are usually created by manual annotation of tree crowns. Importantly, the training samples must cover sufficient variation to avoid model underfitting or overfitting. A common technique to reduce inaccuracies from this source is to use pre-trained models [13,23]. In addition, DL object detection models can contain one or two stages and often leverage CNNs. Two-stage detectors comprise a region proposal module and an object detection/classification module [13]. Initially, two-stage CNNs such as Faster R-CNN and Mask R-CNN were shown to outperform traditional ML approaches applied for tree crown detection and classification [13,27]. One-stage detectors integrate the tasks of object classification and localisation of the bounding box or mask into a single problem and produce detections in one stage [13,28]. The application of YOLO (You Only Look Once) frameworks has been successfully demonstrated for object detection and instance segmentation of tree health in high-spatial-resolution imagery [13,29,30]. The YOLO architecture family evolves rapidly, with frequent releases of improved versions. YOLOv12, the most recent stable version at the time of our study, was selected as the representative instance segmentation model [31].

Importantly, CNNs are re-trainable and able to incorporate multiple, unique dataset characteristics, which increases their robustness for plantation compartments of variable age classes; however, their performance can be challenged in complex scenes with variable image conditions and complex background features [14]. Far fewer studies have evaluated the performance of vision transformers for the segmentation and classification of tree crowns [14], or of hybrid approaches that combine the strengths of CNN-based and transformer-based architectures [32,33]. Vision transformers undertake computer vision tasks by dividing an image into patches and processing them with a self-attention mechanism to model local and global relationships across image patches. Vision transformers typically require larger amounts of training data, while CNNs can perform well with relatively smaller datasets.

A large proportion of DL studies for forest health have classified damaged tree crowns as a single category [14,19,32,34,35,36]. Some studies have incorporated multiple damage classes, such as distinguishing between infested and dead trees [37,38,39], but these typically represent distinct and visually obvious damage stages. For certain damaging agents, including diplodia shoot blight, the affected tree crowns can present a progression of coloured crown symptoms, which also indicates the duration or severity of the infestation. In contrast to binary or distinct class schemes, our study focused on more transitional symptom classes that represent gradual disease progression. The progression of tree crown symptoms from green to yellow to red also occurs with other important damaging agents of pine trees, for example, numerous bark beetle species [40] and sirex infestation [28]. In addition, in these Australian plantations, unhealthy trees affected by the aphid Essigella californica, cyclaneusma needle cast or magnesium deficiency present crown symptoms that are mostly yellow in colour. Identifying and quantifying the proportion of partially affected crowns can also indicate the severity or duration of a damaging event. Therefore, having separate classes for yellow and red-brown colours or partially damaged crowns can assist targeted on-ground diagnosis and help forest managers make informed decisions regarding treatment priorities. For example, stands with a high proportion of dead trees may warrant salvage to recover timber before trees deteriorate, while stands with predominantly dead tops may warrant thinning to alleviate water stress and halt or slow the progression of the disease.

However, distinguishing these symptom classes presents significant challenges for both manual annotation and automated prediction. The colour gradient from green to yellow to red is continuous rather than discrete, making the definition of class boundaries inherently subjective and dependent on annotator expertise and judgement. This is further complicated by natural variation in needle colour due to stand age, seasonal phenology, site conditions, and silvicultural status. Furthermore, the varying lighting conditions in aerial imagery captured at different times of year and times of day increase the difficulty of creating consistent annotations for each class across different study sites.

To address these challenges and evaluate the feasibility of automated multi-class symptom detection, in this study, we compared the accuracy and computational efficiency of several representative DL approaches across different architectural paradigms for detecting and classifying discoloured Pinus radiata crowns affected by diplodia shoot blight across five plantations in New South Wales (NSW), South Australia (SA) and Victoria (VIC). We evaluated two complementary approaches: (1) object detection using YOLOv12 [31] combined with the Segment Anything Model (SAM) [41] for identifying and classifying individual tree crowns into three severity classes (yellow, red-brown, and dead tops), and (2) pixel-level semantic segmentation using three architectures—U-Net (CNN baseline with ResNet-34 encoder) [42], SegFormer (vision transformer) [43], and EVitNet (CNN-transformer hybrid) [32]—for mapping yellow and red-brown discoloured pixels. These two approaches provide different levels of information for plantation management decisions. While object detection provides location and approximate crown size, it has limited capacity to quantify actual affected crown area, particularly for areas with a large portion of partially discoloured crowns. Semantic segmentation captures subtle pixel-level changes at the crown level that are difficult to detect with object detection alone, especially for early-stage infections where less than 50% of the crown is affected. Furthermore, integrating both methods enables accurate measurement of the proportion of affected area within each crown, providing the detailed spatial information needed for targeted management decisions such as selective thinning or salvage cutting.

Our overall aim was to identify the advantages and disadvantages of these DL models and to provide recommendations for applying DL to high-spatial-resolution imagery for assessing crowns affected by diplodia shoot blight and other damaging agents that present similar crown damage symptoms in Pinus spp. plantations.

2. Materials and Methods

The methodology consisted of seven main stages (Figure 1): data acquisition and preprocessing, manual annotation, data splitting, model training (object detection and semantic segmentation approaches), performance evaluation on the test set, comparative analysis, and integrated assessment.

2.1. Study Sites and Image Acquisition

The study was conducted across five P. radiata plantation sites with varying topographies located in three Australian states: New South Wales (NSW), South Australia (SA), and Victoria (VIC). The NSW sites included Kangaroo Vale and Carabost, while the SA and VIC sites—Langkoop, Myora, and Dartmoor—are situated within the Green Triangle (GT) region (Figure 2). The GT is a major plantation forestry and wood products region spanning the border area between the southeast of SA and southwest of VIC. These timber plantations are subject to multiple abiotic and biotic disturbances. At the time of image acquisition, these sites were all affected by drought and D. sapinea to varying extents. Depending on the duration and severity of the drought and stand conditions, tree crowns present colour symptoms ranging from pale green or yellow to orange, red or brown. Dead tops, individual branches or entire crowns can become necrotic. For our deep learning model development, crown classification was based on predominant visible colour symptoms rather than tree physiological status (details in Section 2.2.1).

High-resolution aerial imagery was acquired by Xmap [44] between 2023 and 2024 under clear sky conditions. The Kangaroo Vale and Langkoop sites were acquired in September 2023, while the Carabost, Myora, and Dartmoor sites were acquired in July 2024. The aerial platform used for the 2023 imagery was a Cessna 172 (Textron Aviation, Wichita, KS, USA) fitted with a nadir camera hatch and external camera pod for dual camera use. The dual camera setup comprised a Nikon D850 (Nikon Corporation, Tokyo, Japan) (Bayer filter) and a modified Nikon D850 (hot filter removed) with a HOYA 52 mm infrared (R72) external filter (Hoya Corporation, Tokyo, Japan) fitted to the lens. The lenses used for Kangaroo Vale were Nikkor 85 mm prime lenses (Nikon Corporation, Tokyo, Japan), while those used for Langkoop were Nikkor 50 mm prime lenses (Nikon Corporation, Tokyo, Japan). The photos were captured with 80% forward overlap and 70% side overlap. The Kangaroo Vale and Langkoop sites were flown at 7800 ft and 4600 ft, respectively. Aerial imagery for the 2024 sites was acquired using a Cessna 172 equipped with a wheel strut camera pod. The sensor was a Fujifilm GFX 100S camera (Fujifilm Corporation, Tokyo, Japan) with a 50 mm Fujinon GF lens (Fujifilm Corporation, Tokyo, Japan) mounted in nadir orientation. Images were captured at 3900 ft above ground level. These acquisition specifications resulted in delivered imagery with a ground sample distance (GSD) of 0.09 m and a positional accuracy of approximately 0.27 m (±3 pixels) for all study sites in both 2023 and 2024. Digital surface models (DSMs) were generated from stereo photogrammetry and used to orthorectify the raw images. Final orthomosaics were then produced using dodging (local brightness adjustment) [45] and colour balancing techniques to ensure radiometric consistency across the imagery. The 2023 imagery included four bands (red, green, blue, and near-infrared), while the 2024 imagery contained only three bands (red, green, and blue). For consistency across all sites, only the three visible bands were used in this study due to the absence of NIR data for the 2024 acquisitions. It is acknowledged that while NIR data are advantageous for detecting unhealthy vegetation, in Australia the more commonly available RGB cameras allow for more rapid and flexible operational deployment.

2.2. Data Preparation

2.2.1. Class Definitions

Based on visual interpretation of the aerial imagery, affected tree crowns were classified according to their dominant colour symptoms. Two classification schemes were developed to accommodate the different requirements of object detection and semantic segmentation approaches. For individual tree crown detection and classification (tree-level analysis), three classes were defined (Figure 3a):

Yellow: Tree crowns showing predominantly yellow-coloured needles.
Red-brown: Tree crowns displaying predominantly orange, red, and/or brown coloured needles.
Dead tops: Tree crowns in which more than 50% but less than 90% of crown pixels exhibited yellow and/or red-brown discolouration, indicating dead branches or shoots, with the remaining crown retaining green needles.

For semantic segmentation (pixel-level analysis), another three-class scheme was used (Figure 3b):

Yellow: Crown pixels showing yellow colours.
Red-brown: Crown pixels displaying orange, red, or brown colours.
Background: All other pixels including green crowns, shadows, and ground.

The different classification schemes reflect the distinct analytical capabilities of each approach. Object detection operates at the tree crown level, enabling assessment of within-crown colour variation and the identification of trees with mixed symptom patterns (e.g., dead tops class). In contrast, semantic segmentation focuses on spectral classification of individual pixels without tree-level aggregation.

2.2.2. Image Annotations and Preprocessing

Due to the different input requirements of object detection and semantic segmentation models, separate annotation datasets were produced for each approach. Annotations were produced by a forest health expert using ArcGIS Pro 3.5.2 (Esri, Redlands, CA, USA) and subsequently verified by an independent GIS technician to ensure labelling accuracy and consistency. For object detection, bounding boxes were drawn to encompass the entire crown extent of each affected tree. The bounding box annotations were converted to text files in YOLO format, containing the class label and normalised coordinates (centre x, centre y, width, height) for each detection instance.
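The YOLO label format described above can be illustrated with a minimal sketch. The function name, the class-index mapping and the assumption that boxes are expressed in pixel coordinates of a 640 × 640 image are all hypothetical; the study does not state how the conversion was implemented.

```python
def to_yolo_line(class_id, x_min, y_min, x_max, y_max, img_w=640, img_h=640):
    """Convert a pixel-space box to one YOLO label line:
    'class cx cy w h', with all four values normalised to the 0-1 range."""
    cx = (x_min + x_max) / 2.0 / img_w
    cy = (y_min + y_max) / 2.0 / img_h
    w = (x_max - x_min) / img_w
    h = (y_max - y_min) / img_h
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"


# Hypothetical class indices; the mapping used in the study is not stated.
CLASS_IDS = {"yellow": 0, "red-brown": 1, "dead tops": 2}

# A 64 x 80 pixel red-brown crown box inside a 640 x 640 image.
line = to_yolo_line(CLASS_IDS["red-brown"], 160, 320, 224, 400)
print(line)
```

Each annotated crown contributes one such line to the text file that accompanies its image.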

For semantic segmentation, a modified image dataset was also created to enable pixel-level classification while managing annotation effort. First, bounding boxes were drawn around all trees containing any yellow or red-brown pixels, excluding those already classified as fully yellow or red-brown trees in the object detection dataset. The Segment Anything Model (SAM) was then applied within these bounding boxes to generate precise crown segments. These crown segments were subsequently masked out from the original imagery to create a modified dataset containing only fully symptomatic yellow and red-brown trees and background. This approach eliminated the need to annotate all individual discoloured pixels while avoiding potential confusion from crowns with mixed symptoms that could introduce ambiguous training signals. Within the remaining areas in the modified images, detailed polygons were manually delineated by the forest health expert following the boundaries of continuous yellow and red-brown crowns. These polygon annotations were rasterised into three-class mask images (yellow, red-brown, and background) matching the spatial resolution of the input imagery. Figure 3 shows examples of the annotation strategies used for both object detection and semantic segmentation.

In total, 10,893 individual tree crowns were annotated across all five study sites for the object detection task, comprising 1751 yellow trees, 5747 red-brown trees, and 3395 dead top trees. For semantic segmentation, 7498 yellow and red-brown tree crowns were annotated with detailed polygons after masking.

To prepare the imagery for model training, a two-stage tiling approach was implemented. First, each orthomosaic was divided into large, non-overlapping tiles of 3040 × 3040 pixels (Figure 4a). These large tiles were then randomly partitioned into training, validation, and test sets at a 7:2:1 ratio. This approach prevented data leakage by ensuring that overlapping subtiles generated in the subsequent step would not span across different dataset splits. Within each large tile, smaller subtiles of 640 × 640 pixels were generated using a sliding window approach with a stride of 480 pixels, creating 25% overlap between adjacent subtiles (Figure 4b). This overlapping strategy ensured that tree crowns near tile boundaries were fully captured in at least one subtile while providing additional training samples to improve model robustness [46]. Tiles and subtiles containing no annotated tree crowns were excluded from the dataset. Following this procedure, the final dataset comprised 2821 training, 746 validation, and 426 test images, all measuring 640 × 640 pixels. We used RGB as three input bands for our DL models as they were all pretrained on natural colour imagery, and the RGB combination has been demonstrated to be effective for detecting unhealthy tree crowns [14,39]. This RGB-only approach establishes baseline performance for future comparison with additional spectral bands or derived indices.
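The two-stage tiling arithmetic can be checked with a short sketch. The tile size, subtile size, stride and 7:2:1 ratio come from the text; the function names, the random seed and the use of 100 example tile identifiers are illustrative assumptions, not the authors' implementation.

```python
import random

TILE = 3040     # large tile edge in pixels (from the text)
SUBTILE = 640   # subtile edge in pixels (from the text)
STRIDE = 480    # sliding-window stride in pixels (from the text)


def subtile_origins(tile=TILE, sub=SUBTILE, stride=STRIDE):
    """Top-left (row, col) offsets of every subtile that fits inside one large tile."""
    starts = range(0, tile - sub + 1, stride)
    return [(r, c) for r in starts for c in starts]


def split_tiles(tile_ids, ratios=(0.7, 0.2, 0.1), seed=0):
    """Randomly assign whole large tiles to train/val/test, so that
    overlapping subtiles cut later can never straddle two splits."""
    ids = list(tile_ids)
    random.Random(seed).shuffle(ids)
    n_train = round(len(ids) * ratios[0])
    n_val = round(len(ids) * ratios[1])
    return {
        "train": ids[:n_train],
        "val": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],
    }


origins = subtile_origins()
overlap = (SUBTILE - STRIDE) / SUBTILE   # fraction shared by adjacent subtiles
splits = split_tiles(range(100))         # 100 hypothetical large tiles
print(len(origins), overlap, {k: len(v) for k, v in splits.items()})
```

With these values, a 3040-pixel tile is covered exactly by six window positions per axis (0, 480, ..., 2400), giving 36 subtiles per large tile before empty subtiles are discarded.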

2.3. Individual Tree Crown Detection and Classification

A pipeline combining object detection and instance segmentation was investigated in this study to provide an efficient approach for extracting and classifying individual tree crowns.

You Only Look Once (YOLO) is one of the most popular object detection frameworks [47]. The state-of-the-art YOLOv12 model was utilised in this study. Compared to previous YOLO models using traditional CNN-based approaches, YOLOv12 combines convolutional feature extraction with an attention-centric architecture and an improved feature aggregation module based on residual efficient layer aggregation networks (R-ELAN) [31]. It blends CNN and transformer-style components, achieving outstanding speed and accuracy.

We employed the YOLOv12m model, pretrained on the Microsoft COCO (Common Objects in Context) dataset [48], which was implemented using the Ultralytics framework. The training process consisted of 150 epochs with a batch size of 16 and an initial learning rate of 0.001, following a cosine learning rate schedule with a final learning rate fraction of 0.01. The AdamW optimiser was used with a momentum of 0.937 and weight decay of 0.0005. Five warmup epochs were employed to gradually adapt the pretrained model to the target dataset. A comprehensive set of augmentation strategies was applied, including HSV colour augmentation, geometric augmentations (rotation, translation, scaling, and flipping), and advanced augmentations (mosaic, mixup, and copy-paste) [49].
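As a hedged illustration, the stated hyperparameters could be passed to the Ultralytics training interface roughly as follows. The dataset file name `diplodia.yaml`, the weights file name `yolo12m.pt` and the image size are assumptions made for this sketch; augmentation strengths are omitted because the text does not report them.

```python
# Hyperparameters as stated in the text; keys follow Ultralytics train-argument names.
TRAIN_ARGS = dict(
    epochs=150,
    batch=16,
    imgsz=640,             # assumption: matches the 640 x 640 subtiles
    optimizer="AdamW",
    lr0=0.001,             # initial learning rate
    lrf=0.01,              # final learning rate as a fraction of lr0
    cos_lr=True,           # cosine learning rate schedule
    momentum=0.937,        # used as beta1 by Adam-type optimisers in Ultralytics
    weight_decay=0.0005,
    warmup_epochs=5,
)


def train(data_yaml="diplodia.yaml", weights="yolo12m.pt"):
    """Fine-tune COCO-pretrained weights on the crown dataset.
    Both default file names are hypothetical placeholders."""
    # Imported lazily so the configuration can be inspected without the package.
    from ultralytics import YOLO

    model = YOLO(weights)
    return model.train(data=data_yaml, **TRAIN_ARGS)


# Learning rate reached at the end of the cosine schedule.
final_lr = TRAIN_ARGS["lr0"] * TRAIN_ARGS["lrf"]
print(final_lr)
```

Under these settings the cosine schedule decays the learning rate from 0.001 towards 0.00001 over the 150 epochs.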

To obtain precise crown boundaries beyond the rectangular bounding boxes provided by YOLO, SAM was employed for instance segmentation. SAM, developed by Meta AI, is an image segmentation model that can produce high-quality object masks from input prompts such as points or boxes [50]. The bounding boxes predicted by the YOLO model were used directly as prompts for SAM with minimal preprocessing. The SAM 2.1 hiera-large model [41] was utilised in this study, leveraging its improved accuracy and efficiency for generating precise tree crown masks from the YOLO-detected bounding boxes. To remove artefacts from the SAM output, three post-filtering steps were applied. First, any crown segment not intersecting its corresponding bounding box was removed as a likely segmentation error. Second, any crown segment with an area exceeding 1.5 times the area of its bounding box was removed, as such masks likely extended beyond the target tree. Finally, when multiple crown segments were generated for a single bounding box, only the segment with the largest area was retained, on the assumption that it represented the primary tree crown. This three-stage filtering process ensured reliable extraction of individual tree crown boundaries.
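The three post-filtering rules can be sketched with NumPy boolean masks. This is an illustrative reading rather than the authors' code: it assumes the candidate segments for one box arrive as a list of full-image boolean arrays and that box coordinates are integer pixels; the function name is hypothetical.

```python
import numpy as np


def filter_crown_masks(masks, box, max_area_ratio=1.5):
    """Return the single retained mask for one box, or None if all are rejected.

    masks: list of boolean (H, W) arrays proposed for this box.
    box:   (x_min, y_min, x_max, y_max) in integer pixel coordinates.
    """
    x0, y0, x1, y1 = box
    box_area = (x1 - x0) * (y1 - y0)
    kept = []
    for m in masks:
        # Rule 1: the segment must intersect its bounding box.
        if not m[y0:y1, x0:x1].any():
            continue
        # Rule 2: the segment must not exceed 1.5 times the box area.
        if m.sum() > max_area_ratio * box_area:
            continue
        kept.append(m)
    # Rule 3: of the surviving segments, keep only the largest.
    return max(kept, key=lambda m: int(m.sum())) if kept else None


H = W = 20
box = (5, 5, 10, 10)                                              # box area: 25 px
inside = np.zeros((H, W), bool); inside[6:9, 6:9] = True          # 9 px, within the box
small = np.zeros((H, W), bool); small[5:7, 5:7] = True            # 4 px, within the box
outside = np.zeros((H, W), bool); outside[15:18, 15:18] = True    # misses the box
huge = np.zeros((H, W), bool); huge[0:10, 0:10] = True            # 100 px > 37.5 px

best = filter_crown_masks([inside, small, outside, huge], box)
print(int(best.sum()))
```

In the example, the non-intersecting and oversized candidates are dropped, and the larger of the two remaining segments is retained.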

2.4. Semantic Segmentation of Affected Tree Crown Classes

To evaluate and compare different deep learning architectures for mapping tree crown symptoms, three representative models were investigated: a CNN-based U-Net as a baseline, a lightweight CNN-transformer hybrid model—Easy Vision Transformer Net (EVitNet), and a vision transformer-based model—SegFormer.

U-Net is a symmetric CNN architecture composed of an encoder–decoder structure with skip connections to recover spatial details [42]. It was originally developed for biomedical image segmentation but has been widely adopted across various domains. The encoder repeatedly applies convolutions and pooling operations to capture increasingly abstract features at progressively lower resolutions, while the decoder upsamples features back to the original resolution. Skip connections pass feature maps from each encoder stage directly to its corresponding decoder stage, preserving spatial details lost through downsampling. In this study, U-Net was initially evaluated with both ResNet-34 and ResNet-50 encoders pretrained on ImageNet, a large-scale dataset containing over 14 million images across 1000 object categories [51]. While not domain-specific to forestry, ImageNet pretraining provides models with general visual feature extraction capabilities (edges, textures, colour patterns) that transfer effectively to specialised tasks, improving performance and training efficiency. Of the two U-Net variants evaluated, ResNet-34 achieved segmentation accuracy comparable to ResNet-50 while requiring substantially shorter training time (6.6 h vs. 16.0 h), and was therefore selected as the baseline CNN model for comparison.

EVitNet is a lightweight hybrid CNN-transformer architecture originally designed in [32] for detecting pine wilt disease in drone imagery. It combines a MobileViT-based encoder that alternates between CNN blocks (for local feature extraction) and lightweight vision transformer blocks (for global context modelling) with a U-Net-style decoder that uses expanded convolutions to improve upsampling accuracy without adding parameters. The model’s hybrid architecture preserves spatial detail through skip connections while capturing global features via self-attention. In this study, we adapted the original EVitNet by replacing the custom MobileViT blocks with Apple’s official ImageNet pretrained MobileViT-XXS model as the encoder backbone.

SegFormer is an efficient transformer-based semantic segmentation architecture that employs a hierarchical transformer encoder (Mix Transformer, MiT) to capture multi-scale features through self-attention mechanisms, combined with a lightweight all-MLP (Multilayer Perceptron) decoder [43]. This design achieves strong segmentation performance while addressing the computational challenges of traditional heavy encoders and complex decoders. In this study, we employed the MiT-B0 encoder pretrained on the ImageNet dataset as the backbone.

All three segmentation models were pretrained on ImageNet and fine-tuned on the study dataset using identical training configurations to ensure fair comparison. Training was conducted for 150 epochs with 15 warmup epochs. The AdamW optimiser [52] was employed with an initial learning rate of 1 × 10⁻⁴ and weight decay of 0.01, using a linear warmup followed by a cosine annealing learning rate schedule with a minimum learning rate of 1 × 10⁻⁶. To address the severe class imbalance inherent in the dataset (where background pixels vastly outnumber symptomatic pixels and imbalance exists between disease severity classes), we employed generalised dice loss. This loss function automatically assigns higher weights to underrepresented classes during training and has been shown to be effective for imbalanced segmentation tasks [53]. A comprehensive data augmentation strategy was implemented using the Albumentations library [54] to improve model robustness and generalisation. Geometric augmentations included random horizontal flipping, vertical flipping, rotation, and affine transformations with scaling and translation. Colour space augmentations comprised random brightness and contrast adjustments, hue-saturation-value shifts, and channel shuffling. To simulate real-world imaging conditions, we applied Gaussian noise, random fog effects, and coarse dropout. Finally, images were normalised using dataset-specific mean and standard deviation values before conversion to tensors.
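The learning rate schedule described above can be written out as a short sketch. The epoch counts and learning rates come from the text; the per-epoch granularity, the warmup starting point and the exact cosine end point are assumptions, since the study does not specify them.

```python
import math


def lr_at_epoch(epoch, total=150, warmup=15, base_lr=1e-4, min_lr=1e-6):
    """Learning rate for a 0-indexed epoch: linear warmup, then cosine annealing."""
    if epoch < warmup:
        # Assumption: the ramp is linear and reaches base_lr at the last warmup epoch.
        return base_lr * (epoch + 1) / warmup
    # progress is 0 at the first cosine epoch and 1 at the final epoch.
    progress = (epoch - warmup) / max(1, total - warmup - 1)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))


schedule = [lr_at_epoch(e) for e in range(150)]
print(schedule[0], schedule[14], schedule[149])
```

The rate rises over the first 15 epochs to 1 × 10⁻⁴ and then decays smoothly, reaching the 1 × 10⁻⁶ floor in the final epoch.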

2.5. Integration of Object Detection and Semantic Segmentation Outputs

To leverage the complementary strengths of both approaches, we developed an integrated assessment workflow that combines outputs from object detection (tree-level) and semantic segmentation (pixel-level). This integration enables within-crown damage quantification by overlaying pixel-level disease classifications onto individual tree crown boundaries identified through the object detection approach (YOLO + SAM). Specifically, for each detected tree crown, we calculate the proportion of pixels classified as yellow or red-brown, providing precise metrics of disease severity and spatial distribution within individual crowns. This dual-level analysis supports more nuanced management decisions compared to either approach alone, enabling forest managers to distinguish between trees with minor branch symptoms versus those with extensive crown damage, even when both fall within the same nominal class.
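A minimal sketch of the per-crown overlay is given below, assuming the semantic segmentation output is an integer class map and each crown from the detection pipeline is a boolean mask of the same shape. The class indices and function name are hypothetical.

```python
import numpy as np

BACKGROUND, YELLOW, RED_BROWN = 0, 1, 2   # hypothetical class indices


def crown_symptom_fractions(class_map, crown_mask):
    """Fractions of one crown's pixels predicted as yellow and as red-brown.

    class_map:  (H, W) integer array from the semantic segmentation model.
    crown_mask: (H, W) boolean array for one crown from the detection pipeline.
    """
    n = int(crown_mask.sum())
    if n == 0:
        return {"yellow": 0.0, "red_brown": 0.0, "affected": 0.0}
    labels = class_map[crown_mask]          # class labels of the crown's pixels only
    yellow = float((labels == YELLOW).sum()) / n
    red_brown = float((labels == RED_BROWN).sum()) / n
    return {"yellow": yellow, "red_brown": red_brown, "affected": yellow + red_brown}


class_map = np.zeros((10, 10), dtype=int)
class_map[2:4, 2:6] = RED_BROWN            # 8 red-brown pixels
class_map[4, 2:4] = YELLOW                 # 2 yellow pixels
crown = np.zeros((10, 10), dtype=bool)
crown[2:6, 2:6] = True                     # 16-pixel crown

fractions = crown_symptom_fractions(class_map, crown)
print(fractions)
```

For the toy crown above, 10 of 16 pixels are symptomatic, so two crowns assigned to the same nominal class can still be ranked by their affected fraction.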

2.6. Implementation Details

All experiments were conducted on a Windows Subsystem for Linux 2 (WSL2) environment running on a desktop workstation equipped with dual Intel Xeon Gold 6136 CPUs (3.00 GHz, 24 cores total) and 256 GB RAM. Model training and inference were performed on an NVIDIA Quadro P6000 GPU with 24 GB memory. The deep learning framework employed was PyTorch 2.6.0 with CUDA 12.6 and Python 3.11.11. All speed measurements were conducted on an NVIDIA Quadro P6000 GPU with batch size 1, input size 3 × 640 × 640, using PyTorch 2.6.0 and CUDA 12.6. Inference times represent the mean of 100 iterations after 10 warmup iterations.

Table 1 summarises the model specifications and computational requirements of all deep learning models employed in this study. GFLOPs (giga floating-point operations) quantify the computational complexity of a single forward pass through the model, serving as a hardware-independent measure of model efficiency. FPS (frames per second) measures the inference speed, representing the number of images the model can process per second under the standardised benchmarking conditions described above, with the warmup iterations serving to ensure stable measurements. SAM 2.1, used for instance segmentation in the detection pipeline, is not included in Table 1 as it was applied in a prompt-based manner using pretrained weights without additional training.
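The timing protocol (10 untimed warmup iterations followed by the mean over 100 timed iterations at batch size 1) can be sketched generically as below. This is an illustrative sketch rather than the authors' benchmarking script: `benchmark` and `dummy_forward` are hypothetical names, and the dummy workload merely stands in for a model forward pass.

```python
import time


def benchmark(run_once, warmup=10, iterations=100):
    """Return (mean latency in seconds, FPS) for a zero-argument callable."""
    for _ in range(warmup):
        run_once()  # untimed warmup passes
    latencies = []
    for _ in range(iterations):
        start = time.perf_counter()
        run_once()
        # On a GPU, a device synchronisation would be needed at this point
        # before reading the clock, because kernels launch asynchronously.
        latencies.append(time.perf_counter() - start)
    mean_latency = sum(latencies) / len(latencies)
    fps = 1.0 / mean_latency if mean_latency > 0 else float("inf")
    return mean_latency, fps


def dummy_forward():
    # Hypothetical stand-in for one forward pass on a 3 x 640 x 640 image.
    return sum(i * i for i in range(10_000))


mean_latency, fps = benchmark(dummy_forward)
print(f"mean latency: {mean_latency * 1e3:.3f} ms, FPS: {fps:.1f}")
```

With batch size 1, FPS is simply the reciprocal of the mean per-image latency, which is why the two quantities can be reported interchangeably.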

2.7. Accuracy Assessment and Evaluation Metrics

The results predicted from the object detection model were evaluated using several standard metrics, including precision, recall, F1 score and mAP.

These metrics rely on the calculation of intersection over union (IoU), a fundamental measure that quantifies the overlap between predicted and ground truth bounding boxes. IoU is defined as

\mathrm{IoU} = \frac{\text{Area of Intersection}\,(\text{Predicted} \cap \text{Ground Truth})}{\text{Area of Union}\,(\text{Predicted} \cup \text{Ground Truth})}
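For axis-aligned bounding boxes, the IoU definition above reduces to a few lines of arithmetic. The sketch below assumes boxes given as (x_min, y_min, x_max, y_max) in pixel coordinates; this convention and the function name are assumptions for illustration.

```python
def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the intersection rectangle.
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # Width and height are clamped at zero when the boxes do not overlap.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


predicted = (0, 0, 10, 10)
ground_truth = (5, 0, 15, 10)
print(box_iou(predicted, ground_truth))  # intersection 50, union 150
```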

IoU values range from 0 to 1, where 0 signifies no overlap and 1 denotes a perfect match, and IoU serves as the threshold that determines whether a detection is counted as correct. For a given IoU threshold (α), true positives (TP) are detections where objects are correctly labelled and the IoU between the predicted and ground truth bounding boxes exceeds the threshold. False positives (FP) occur when objects are incorrectly labelled or the IoU falls below the threshold. False negatives (FN) represent missed detections of objects present in the ground truth. Based on these concepts, precision is defined as:

\mathrm{Precision} = \frac{TP}{TP + FP}

Precision represents the percentage of correctly detected trees among all predicted trees. Similarly, recall is defined as:

\mathrm{Recall} = \frac{TP}{TP + FN}

Recall represents the percentage of correctly detected trees among all ground truth trees.

The F1 score, serving as the harmonic mean of precision and recall, provides a balanced measure of the model’s performance, considering both false positives and false negatives. It is expressed as:

\mathrm{F1\ score} = \frac{2 \times (\mathrm{Precision} \times \mathrm{Recall})}{\mathrm{Precision} + \mathrm{Recall}}
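The three definitions above can be checked with a short worked example. The counts used here (80 true positives, 20 false positives, 10 false negatives) are hypothetical and chosen only for illustration; they are not taken from the study.

```python
def detection_scores(tp, fp, fn):
    """Precision, recall and F1 score from detection counts."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts, for illustration only.
precision, recall, f1 = detection_scores(tp=80, fp=20, fn=10)
print(f"precision={precision:.3f}, recall={recall:.3f}, F1={f1:.3f}")
```

Because the F1 score is a harmonic mean, it lies between precision and recall and is pulled towards the lower of the two.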

Precision-recall values at different confidence thresholds are calculated to form a precision-recall curve. The average precision (AP) is computed as the area under this curve (AUC), representing the trade-off between precision and recall in object detection at a given IoU threshold. Commonly, AP is calculated at an IoU threshold of 0.5 (AP50 or mAP@0.5) as a standard benchmark. Additionally, AP50–95 (or mAP@[0.5:0.95]) represents the mean AP calculated across IoU thresholds ranging from 0.5 to 0.95 in 0.05 increments, providing a more comprehensive evaluation of localization accuracy. To obtain the mean average precision (mAP), the AP values for each individual class are calculated, and the final mAP is derived by averaging these class-specific AP values over the total number of classes:

\mathrm{mAP} = \frac{1}{n} \sum_{i=1}^{n} \mathrm{AP}_{i}

where AP_i is the AP of class i and n is the number of target classes. mAP serves as a comprehensive metric, providing an overall evaluation of the model’s effectiveness across diverse object categories. In this study, both mAP50 and mAP50–95 were computed to evaluate detection performance at different localization precision requirements.
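A minimal sketch of both steps follows: AP as the area under a precision-recall curve, and mAP as the mean of class-specific AP values. The all-point interpolation used here is one common convention and is an assumption, as the exact interpolation used by the detection framework is not stated in the text. The precision-recall points are hypothetical, whereas the three class-level mAP50 values are those reported for red-brown, yellow and dead-top crowns in Section 3.1.

```python
def average_precision(recalls, precisions):
    """Area under a precision-recall curve (all-point interpolation).

    recalls must be sorted in ascending order, with matching precisions.
    """
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Replace each precision by the maximum precision at any higher recall,
    # which yields a monotonically non-increasing envelope.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum the rectangular areas between consecutive recall values.
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))


def mean_average_precision(class_aps):
    """Mean of class-specific AP values."""
    return sum(class_aps) / len(class_aps)


# Hypothetical precision-recall points for a single class.
print(average_precision([0.2, 0.4, 0.6, 0.8], [1.0, 0.8, 0.6, 0.5]))

# Class-level mAP50 values reported in Section 3.1.
print(round(mean_average_precision([0.918, 0.789, 0.591]), 3))
```

Averaging the three reported class-level values gives 0.766, which agrees with the overall mAP50 reported for the YOLOv12m model.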

Semantic segmentation performance was evaluated using the same fundamental metrics (IoU, precision, recall, and F1 score) as object detection, but computed at the pixel level rather than bounding box level (Table 2).
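At the pixel level, the same quantities are obtained by counting agreeing and disagreeing pixels for one class at a time. The sketch below assumes the standard pixel-wise definition IoU = TP / (TP + FP + FN) and a hypothetical integer label encoding; it is illustrative only and does not reproduce Table 2.

```python
def pixel_metrics(pred, truth, target):
    """Pixel-level IoU, precision, recall and F1 for one target class.

    pred and truth are 2D lists of integer class labels of equal shape.
    """
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p == target and t == target:
                tp += 1
            elif p == target:
                fp += 1  # predicted as target, reference says otherwise
            elif t == target:
                fn += 1  # reference target pixel that was missed
    iou = tp / (tp + fp + fn) if (tp + fp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return {"iou": iou, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels: 0 = background, 1 = yellow, 2 = red-brown.
truth = [[0, 1, 1],
         [0, 2, 2]]
pred = [[0, 1, 0],
        [1, 2, 2]]
print(pixel_metrics(pred, truth, target=1))
print(pixel_metrics(pred, truth, target=2))
```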

3. Results

3.1. Tree-Level Detection and Classification Accuracy

The YOLOv12m model achieved an overall mAP50 of 0.766 and mAP50–95 of 0.447 across all three classes (Table 3). Class-specific performance revealed a clear hierarchy, with red-brown crowns demonstrating the highest detection accuracy (mAP50: 0.918, F1 score: 0.851), followed by yellow crowns (mAP50: 0.789, F1 score: 0.729), while dead tops exhibited the most challenging detection with the lowest performance across all metrics (mAP50: 0.591, F1 score: 0.555).

This performance hierarchy can be attributed to multiple factors. Red-brown crowns benefited from both substantially more training samples compared to yellow crowns and stronger spectral contrast. The red and brown hues provide distinctive colour separation against the predominantly green background of healthy crowns. In contrast, yellow crowns’ visual appearance is more similar to green healthy crowns, with subtle spectral transitions resulting in greater classification ambiguity. Although yellow crown detection accuracy was lower, the moderate performance achieved (mAP50: 0.789) remains operationally valuable for early detection and intervention strategies.

Dead tops presented the greatest detection challenges due to the heterogeneous nature of partially affected crowns, which vary in the proportion of discoloured crown area (>50% threshold), the spatial distribution of necrotic needles (e.g., upper crown versus scattered branches), and the possible mixture of yellow and red-brown symptoms within individual crowns. Additionally, the 50% damage threshold meant that partially damaged trees below that threshold with similar heterogeneous patterns were excluded, potentially creating an inconsistent training signal.

The precision-recall curves (Figure 5) illustrate the performance trade-offs across different confidence thresholds for each class. Red-brown crowns maintained high precision (>0.85) across a wide range of recall values, while dead tops exhibited lower precision at equivalent recall levels, reflecting the greater classification uncertainty for this heterogeneous class.

Analysis of the confusion matrix (Figure 6) provided further insights into model classification patterns. Red-brown crowns achieved the highest correct detection rate at 87% (622 out of 711), followed by yellow crowns at 73% (316 out of 431) and dead tops at 62% (161 out of 261). While inter-class confusion among affected crown categories was relatively low, the primary source of error across all classes was misclassification as background, with dead tops being particularly susceptible (124 out of 261 trees, 47.5%), followed by yellow crowns (88 out of 431, 20.4%) and red-brown crowns (114 out of 711, 16.0%). This suggests that trees with extensive but patchy damage may lack sufficient contiguous symptomatic area for reliable detection. Despite these classification challenges, dead top detection remains useful for identifying trees in transitional disease stages. Future work could focus on acquiring additional training data with more consistent annotation criteria for this class to improve detection accuracy.

Figure 7 illustrates representative detection and segmentation outputs from the YOLO + SAM pipeline, comparing manual annotations with model predictions. As shown in this example, the model successfully detected affected tree crowns and classified them by symptom class, with predictions closely aligning with ground truth for the majority of instances. Bounding boxes indicate the spatial extent for each detection. The subsequent application of SAM refined these detections into precise instance-level crown segmentation masks, delineating individual crown boundaries beyond the rectangular bounding boxes. These refined crown masks enabled accurate extraction of affected crown areas while preserving individual tree identity, demonstrating the complementary strengths of object detection and semantic segmentation.

3.2. Semantic Segmentation Performance

Table 4 presents the pixel-level segmentation performance across four approaches: three dedicated semantic segmentation models (U-Net, EVitNet, SegFormer) and the YOLO + SAM detection pipeline. Among the semantic segmentation models, SegFormer consistently demonstrated superior performance across both affected crown classes. For yellow crowns, SegFormer achieved an IoU of 0.542, precision of 0.694, recall of 0.711, and F1 score of 0.703, outperforming EVitNet (IoU: 0.527, F1: 0.690) and U-Net (IoU: 0.504, F1: 0.670). SegFormer also led for red-brown crowns, attaining an IoU of 0.662 and F1 score of 0.797, although by narrower margins: it exceeded EVitNet by 0.6% and U-Net by 2.6% in IoU.

While SegFormer achieved the highest accuracy, it required the longest training time (20.5 h), representing a 210% increase over the U-Net baseline (6.6 h). EVitNet demonstrated a favourable balance between performance and training efficiency, requiring only 9.4 h (42% increase over U-Net) while achieving IoU improvements over U-Net of 4.6% for yellow crowns and 2.0% for red-brown crowns. This moderate accuracy gain at modest computational cost makes EVitNet an efficient alternative for operational applications where training resources are constrained.
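The relative figures quoted above follow from a simple percentage-change calculation, reproduced here as a worked check using only values stated in the text (training hours, and the yellow-crown IoU of U-Net and EVitNet). The function name is hypothetical.

```python
def percent_increase(baseline, value):
    """Relative change of value with respect to baseline, in percent."""
    return 100.0 * (value - baseline) / baseline


UNET_HOURS, EVITNET_HOURS, SEGFORMER_HOURS = 6.6, 9.4, 20.5
UNET_YELLOW_IOU, EVITNET_YELLOW_IOU = 0.504, 0.527

print(f"SegFormer vs U-Net training time: {percent_increase(UNET_HOURS, SEGFORMER_HOURS):.1f}%")
print(f"EVitNet vs U-Net training time:   {percent_increase(UNET_HOURS, EVITNET_HOURS):.1f}%")
print(f"EVitNet vs U-Net yellow IoU:      {percent_increase(UNET_YELLOW_IOU, EVITNET_YELLOW_IOU):.1f}%")
```

These values are relative changes, so a 4.6% IoU improvement corresponds to an absolute IoU difference of 0.023 rather than 4.6 percentage points.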

Figure 8 illustrates the training and validation loss curves for all three semantic segmentation models over 150 epochs. All models demonstrated stable convergence with steadily decreasing losses and no evidence of overfitting. EVitNet exhibited the fastest convergence rate, with validation loss decreasing dramatically during the first 20 epochs and reaching a final loss of 0.36, while SegFormer showed slower initial convergence with two abrupt loss drops within the first 5 epochs and around epochs 35–40, reaching a final validation loss of around 0.34. U-Net converged at an intermediate rate, stabilising around epoch 80 with the highest final validation loss (~0.41). The final validation loss hierarchy (SegFormer < EVitNet < U-Net) directly corresponded to the segmentation performance metrics in Table 4, confirming that lower validation loss translated to superior IoU and F1 scores.

Across all models, red-brown crowns were consistently segmented with higher accuracy than yellow crowns. The relative IoU improvement for red-brown over yellow symptoms was 28.0% for U-Net, 24.9% for EVitNet, and 22.1% for SegFormer. This pattern mirrors the detection results from Section 3.1, reflecting both the larger sample size of red-brown training data and the more distinctive spectral signatures compared to yellow symptoms, which show greater spectral similarity to healthy green crowns.

Although YOLO + SAM was implemented for tree-level detection and classification, its instance segmentation outputs can be evaluated at the pixel level to compare crown delineation accuracy with the semantic segmentation models. Notably, while the three semantic segmentation models were fine-tuned on the study dataset, SAM was applied as a pretrained model without additional training for affected tree crown segmentation. As shown in Table 4, YOLO + SAM achieved lower pixel-level IoU values (0.400 for yellow, 0.484 for red-brown) compared to the semantic segmentation models. This performance difference can be attributed to two key factors: (1) YOLO + SAM was optimised for tree-level classification rather than pixel-wise accuracy, and (2) SAM was applied in a zero-shot manner using only bounding box prompts from YOLO, without fine-tuning on the specific characteristics of symptomatic pine crowns. In contrast, the three semantic segmentation models were specifically trained to distinguish yellow and red-brown pixels from background in the study imagery.

3.3. Integrated Within-Crown Damage Quantification

Integration of object detection and semantic segmentation outputs enabled detailed quantification of damage distribution within individual tree crowns. By overlaying pixel-level classifications onto detected crown boundaries, we extracted precise metrics of affected crown area for each tree.

The pixel-level quantification provides finer information beyond bounding-box detection alone. For example, two trees occupying similar areas may differ substantially in actual diseased tissue extent—one with less than 25% affected crown area versus another with more than 75%. Figure 9 illustrates examples of varying disease severity and spatial distribution patterns detected using the integrated approach.

4. Discussion

4.1. Performance Comparison and Model Analysis

Direct quantitative comparison with previous studies is challenging due to differences in datasets (e.g., UAV vs. aerial imagery), species, damaging agents (e.g., pine wilt disease vs. diplodia shoot blight), imaging conditions, and classification schemes. For object detection, Ref. [28] successfully identified three damage classes (healthy, weakened, and dead) in P. sylvestris stands attacked by the woodwasp Sirex noctilio, achieving a mAP of 0.923 and F1 score of 0.866. That study was based on UAV-acquired RGB imagery, which provided sufficient tree-level detail of colour and texture to incorporate more precise symptoms such as partial needle loss (defoliation). Our YOLOv12m model using aerial imagery achieved comparable accuracy for the red-brown class (mAP50: 0.918, F1: 0.851), while yellow crowns (mAP50: 0.789, F1: 0.729) showed moderate performance [22]. Our workflow is expected to transfer well to UAV-derived orthomosaics and may achieve improved performance due to the higher spatial resolution from lower flight altitudes, potentially enhancing detection of subtle early-stage symptoms. Importantly, while UAV-based approaches provide higher spatial resolution, aerial imagery offers substantial scalability advantages for forest health monitoring. This scalability makes aerial imagery approaches more suitable for regional surveillance programmes and larger plantation estate monitoring. To enhance model robustness, training data included images captured at different times of day across sites with varying topographies, exposing the model to diverse shadow patterns and lighting conditions. Quantifying the specific impact of shadows and acquisition timing on classification performance remains an area for future investigation.

Similarly, for semantic segmentation, our red-brown crown segmentation using SegFormer (IoU: 0.662) is comparable to previous studies using binary classification schemes. For example, Ref. [14] achieved an IoU of 0.54 for drought and invasive insect-affected trees, while Ref. [32] achieved an IoU of 0.655 for pine wilt disease-affected trees using UAV imagery. These results suggest that multi-class pixel-level classification does not necessarily compromise segmentation accuracy for spectrally distinct symptoms such as advanced red-brown discolouration.

The selection of an appropriate DL architecture for forest health monitoring requires careful consideration of both detection accuracy and computational efficiency. Our comparison of multiple architectures, ranging from traditional CNN-based models to modern transformer-based and hybrid approaches, highlights the trade-offs between these competing objectives.

For both classes, SegFormer achieved the highest accuracy across all performance metrics, including IoU, precision, recall, and F1 score. EVitNet showed slightly lower but comparable performance, with IoU decreases of only 0.6% (red-brown) and 2.8% (yellow), while the U-Net baseline showed a larger accuracy gap, with IoU decreases of 2.6% (red-brown) and 7.0% (yellow). However, relative model performance appears context-dependent. Wu et al. [32] reported that EVitNet outperformed SegFormer for binary pine wilt disease detection in UAV imagery, while our multi-class aerial imagery task favoured SegFormer. This suggests that model selection should consider task-specific factors including classification complexity (binary vs. multi-class), spatial resolution (UAV vs. aerial imagery), and symptom characteristics, rather than assuming universal model rankings.

In terms of computational complexity, the U-Net (ResNet-34) architecture exhibited the highest GFLOPs, followed by SegFormer (MiT-B0) and EVitNet (MobileViT-XXS), which is consistent with the number of parameters in each model (Table 1). The symmetric encoder and decoder design of U-Net incorporates sequential convolution/downsampling and upsampling/convolution processes, along with skip connections copying full-resolution feature maps from encoder to decoder [42]. These large intermediate feature maps increase the computational complexity dramatically. In contrast, SegFormer replaces heavy convolutions with a hierarchical transformer encoder employing efficient self-attention mechanisms with reduced spatial dimensions, and replaces the heavy decoder with a lightweight all-MLP decoder head that fuses multi-scale features [43]. With reduced feature map copying and upsampling overhead, SegFormer achieves a lower parameter count and consequently reduced GFLOPs. EVitNet still adopts a U-shaped encoder–decoder architecture, but the encoder alternately employs CNN and vision transformer blocks to extract both local and global features [32]. Its lightweight vision transformer blocks result in the lowest computational complexity among the three models.

Inference speeds varied among the three models (Table 1). U-Net achieved the highest FPS (68.50) due to its GPU-optimised convolutional operations. SegFormer and EVitNet showed comparable but slower inference speeds (49.04 and 44.35 FPS, respectively). While SegFormer’s transformer-based attention mechanisms are inherently less parallelisable than convolutions, EVitNet’s similar speed despite much lower GFLOPs reflects overhead from its hybrid CNN-transformer architecture.

Overall training efficiency, however, showed a different pattern (Table 4). U-Net required the shortest training time (6.6 h), while EVitNet (9.4 h) trained substantially faster than SegFormer (20.5 h) despite comparable inference speeds. This advantage reflects EVitNet’s lightweight design with only 1.16 M parameters compared to SegFormer’s 3.71 M, reducing both memory requirements and optimiser computational load.

In summary, SegFormer achieved the highest accuracy at the cost of the longest training time. For applications where detection performance is the primary objective, SegFormer represents the most favourable choice. However, EVitNet offers the most cost-effective solution, achieving comparable performance with substantially reduced training time, making it well-suited for rapid deployments with limited computational resources or frequent model retraining requirements.

4.2. Application Scenarios

This study investigated two different DL workflows: object detection with instance segmentation, and semantic segmentation. We strategically tested architectural diversity within semantic segmentation (U-Net, SegFormer, EVitNet) while using a single robust object detection workflow (YOLOv12 + SAM), as our objective was paradigm comparison rather than comprehensive model benchmarking. This design enabled efficient resource allocation while addressing our core research questions about detection paradigm performance and integration. While each approach has distinct strengths and limitations, our evaluation demonstrated that rather than one method being universally superior, they are suited to different application scenarios aligned with specific management objectives. Forest health surveillance and monitoring involve diverse tasks ranging from individual tree inventory to pixel-level damage mapping across multiple compartments, which require different forms of spatial information. Both approaches can provide reliable ground-truth information for calibrating and validating broader-scale satellite-based monitoring, and further assist in prioritising on-ground diagnostic or treatment efforts.

The object detection and instance segmentation workflow (YOLO + SAM) is most appropriate for individual tree-level assessment. This workflow identifies and delineates discrete tree crowns, facilitating tree inventory by providing accurate counts of affected trees across different severity classes. Beyond using spectral characteristics alone, object detection also learns from spatial patterns and symptom distributions within crowns to classify heterogeneous conditions such as dead tops, providing insights valuable for early detection. The zero-shot application of pretrained SAM, which requires no additional training, enables efficient deployment to new areas using YOLO-detected bounding boxes as prompts. While individual crown masks derived from SAM were sufficiently representative for tree-level assessment and counting, they achieved lower pixel-level accuracy (Table 4) compared to outputs from trained segmentation models. For heterogeneous classes such as dead tops, the segmented crowns encompassed the entire tree with mixed discoloured and green pixels, limiting their usefulness for precise symptom area quantification.

In contrast, semantic segmentation is better suited to pixel-level assessment and area-based quantification. This approach directly extracts affected pixels, providing accurate estimates of impacted canopies and enabling the identification of disease hotspots. Compared to object detection, it is particularly effective for plantations with widespread partial crown damage or early-stage symptoms, where pixel-level precision is more critical than individual tree delineation. However, its main limitation lies in the absence of individual tree identification, as semantic segmentation produces continuous symptom maps without discrete tree boundaries. Although post-processing techniques such as grouping connected pixels can provide an approximate indication of individual trees, this approach performs well only where affected trees are spatially isolated rather than clustered.
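The post-processing step mentioned above, grouping connected affected pixels into candidate trees, can be sketched with a breadth-first flood fill. This is a minimal illustration, not a method used in the study; the 4-connectivity rule, the binary mask representation and the function name are assumptions.

```python
from collections import deque


def connected_components(mask):
    """Group 4-connected truthy pixels; returns a list of (row, col) lists."""
    rows = len(mask)
    cols = len(mask[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r0 in range(rows):
        for c0 in range(cols):
            if not mask[r0][c0] or seen[r0][c0]:
                continue
            # Breadth-first flood fill starting from an unvisited affected pixel.
            seen[r0][c0] = True
            queue = deque([(r0, c0)])
            pixels = []
            while queue:
                r, c = queue.popleft()
                pixels.append((r, c))
                for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                    if 0 <= nr < rows and 0 <= nc < cols \
                            and mask[nr][nc] and not seen[nr][nc]:
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            components.append(pixels)
    return components


# Toy symptom mask containing two spatially isolated groups of affected pixels.
symptom_mask = [[1, 1, 0, 0],
                [0, 1, 0, 1],
                [0, 0, 0, 1]]
groups = connected_components(symptom_mask)
print(len(groups), sorted(len(g) for g in groups))
```

The sketch also makes the stated limitation concrete: if two affected crowns touch, their pixels form a single connected group and are counted as one tree.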

The complementary nature of these workflows suggests potential value in integrated approaches. For example, the individual crown boundaries from YOLO + SAM can be refined using segmentation outputs. Additionally, the affected pixels identified through semantic segmentation within the partially damaged crowns can be combined with the entire crown masks from YOLO + SAM to calculate the proportion of affected area within each crown, providing both tree-level identity and precise symptom quantification (Figure 9). The detailed within-crown damage quantification enables more targeted field verification and provides forest managers with comprehensive severity information to make evidence-based treatment prioritisation decisions.

Computational efficiency should also be considered when selecting a workflow. While the training time for the YOLO model used in this study was comparable to that of the efficient segmentation model EVitNet (11.0 h vs. 9.4 h), the annotation effort to draw bounding boxes was far less demanding than manually delineating precise polygons along crown boundaries. This difference in annotation requirements can significantly impact the feasibility of developing large training datasets, particularly for applications requiring frequent model updates or expansion to new geographic areas.

5. Conclusions

In this study, we evaluated DL models for identifying discoloured tree crowns affected by diplodia shoot blight in P. radiata plantations, comparing two complementary approaches: object detection using YOLO combined with SAM to detect and classify individual tree crowns, and semantic segmentation using three different architectures including CNN-based U-Net, vision transformer-based SegFormer, and CNN-transformer hybrid EVitNet. Based on crown colour symptoms, three damage classes (yellow, red-brown, dead tops) and three classes (yellow, red-brown and background) were defined for object detection and semantic segmentation, respectively. The YOLO model achieved an overall mAP50 of 0.766 and mAP50–95 of 0.447 across all three classes, with red-brown crowns demonstrating the highest detection accuracy (mAP50: 0.918, F1 score: 0.851). For semantic segmentation, both SegFormer and EVitNet models outperformed the baseline U-Net, with SegFormer showing the strongest performance (IoU of 0.662 for red-brown and 0.542 for yellow). EVitNet achieved slightly lower but comparable accuracy to SegFormer while demonstrating superior training efficiency with its lighter architecture, requiring less than half of the training time (9.4 h vs. 20.5 h). The two approaches serve complementary application roles. Object detection combined with SAM is most effective for tree-level assessment and can detect crowns with heterogeneous symptom patterns, while semantic segmentation excels at providing damage information at the pixel level, facilitating accurate area quantification. Integrating both approaches can provide both tree-level identity and precise symptom quantification within individual crowns, offering comprehensive information spanning individual tree details to spatial symptom mapping. These capabilities support calibration and validation of satellite-based monitoring systems and assist in prioritisation of ground-based diagnosis or interventions. 
Importantly, this study was conducted across five geographically dispersed sites spanning multiple states over two acquisition years, demonstrating the operational scalability of aerial imagery-based deep learning workflows for forest health surveillance.

Future research could explore end-to-end instance segmentation architectures such as YOLO11-Seg [55] or Mask2Former [56] to investigate whether they could potentially streamline the workflow and improve both accuracy and efficiency. Manual annotations are labour-intensive and time-consuming. Semi-automated annotation workflows [57] could be explored to determine whether they can enable more rapid development of larger training datasets, which would facilitate rapid model adaptation to new regions. While our study required substantial manual annotation, the resulting trained models can serve as a foundation for future applications. Through transfer learning and few-shot learning approaches, these models could be adapted to detect similar forest diseases or deployed in new plantation sites with significantly reduced annotation requirements. The consistency and accuracy of training data annotations are key to developing reliable and robust models [14]. Alternative machine learning approaches, such as unsupervised clustering or foundation model-assisted annotation, could be explored to establish more consistent class boundaries that reduce dependence on annotators’ subjective judgements. Vision-Language Models (VLMs) represent another promising direction, potentially reducing annotation requirements by leveraging semantic text-image alignment to identify visual features based on natural language descriptions rather than extensive labelled datasets [58,59]. This capability could be particularly valuable for detecting early symptoms like chlorosis, though rigorous validation against standard deep learning approaches would be essential before operational deployment. 
While this study focused on representative models from CNN, Transformer, and hybrid architectures, future work could expand the comparison to include additional semantic segmentation models (e.g., DeepLabv3+, PSPNet) and conduct ablation studies to identify optimal architectural components for forest disease detection tasks. Finally, the input channels for DL models can be expanded beyond RGB by including NIR (when available), or derived vegetation indices. Such spectral enhancements would be particularly valuable for improving detection of spectrally ambiguous yellow symptoms, which showed consistently lower accuracy than red-brown crowns across both detection and segmentation approaches in this study.

Author Contributions

M.W.: Conceptualization, Methodology, Software, Analysis, Writing—original draft, Writing—review and editing. C.S.: Funding acquisition, Conceptualization, Annotation, Writing—original draft, Writing—review and editing. A.J.C.: Funding acquisition, Damaging agent diagnosis, Writing—review and editing. All authors have read and agreed to the published version of the manuscript.

Funding

Funding was provided by the National Institute for Forest Products Innovation, Project NIF197-223.

Data Availability Statement

The workflow scripts and the datasets used in this study are available from Mingzhu Wang upon request.

Acknowledgments

We thank David Bruce (Flinders University) and Dianne Patzel (University of South Australia) for project management; Dianne Patzel for selection of study sites in the Green Triangle and diagnosis of damaging agents; Forestry Corporation of NSW and growers from the Green Triangle Forest Health Group for access to plantations; and Grant Pearse (Flinders University), who provided very constructive and helpful comments to improve the manuscript. The submitted draft paper was improved in response to the comments from three anonymous reviewers.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Intergovernmental Panel on Climate Change (IPCC). Weather and Climate Extreme Events in a Changing Climate. In Climate Change 2021—The Physical Science Basis; Cambridge University Press: Cambridge, UK, 2023; pp. 1513–1766.
Anderegg, W.R.L.; Hicke, J.A.; Fisher, R.A.; Allen, C.D.; Aukema, J.; Bentz, B.; Hood, S.; Lichstein, J.W.; Macalady, A.K.; McDowell, N.; et al. Tree Mortality from Drought, Insects, and Their Interactions in a Changing Climate. New Phytol. 2015, 208, 674–683.
Hossain, M.; Veneklaas, E.J.; Hardy, G.E.S.J.; Poot, P. Tree Host–Pathogen Interactions as Influenced by Drought Timing: Linking Physiological Performance, Biochemical Defence and Disease Severity. Tree Physiol. 2019, 39, 6–18.
Bracalini, M.; Bălăcenoiu, F.; Panzavolta, T. Forest Health under Climate Change: Impact of Insect Pests. iForest 2024, 17, 295–299.
Carnegie, A.J.; Kathuria, A.; Nagel, M.; Mitchell, P.J.; Stone, C.; Sutton, M. Current and Future Risks of Drought-Induced Mortality in Pinus radiata Plantations in New South Wales, Australia. Aust. For. 2022, 85, 161–177.
Brodde, L.; Stein Åslund, M.; Elfstrand, M.; Oliva, J.; Wågström, K.; Stenlid, J. Diplodia sapinea as a Contributing Factor in the Crown Dieback of Scots Pine (Pinus sylvestris) after a Severe Drought. For. Ecol. Manag. 2023, 549, 121436.
Wingfield, M.J.; Slippers, B.; Barnes, I.; Duong, T.A.; Wingfield, B.D. The Pine Pathogen Diplodia sapinea: Expanding Frontiers. Curr. For. Rep. 2024, 11, 2.
Johnson, E.W.; Wittwer, D. Aerial Detection Surveys in the United States. Aust. For. 2008, 71, 212–215.
Carnegie, A.J.; Cant, R.G.; Eldridge, R.H. Forest Health Surveillance in New South Wales, Australia. Aust. For. 2008, 71, 164–176.
Kershaw, D.J. History of Forest Health Surveillance in New Zealand. N. Z. J. For. Sci. 1989, 19, 375–377.
Stone, C.; Carnegie, A.; Melville, G.; Smith, D.; Nagel, M. Aerial Mapping Canopy Damage by the Aphid Essigella californica in a Pinus radiata Plantation in Southern New South Wales: What Are the Challenges? Aust. For. 2013, 76, 101–109.
Coleman, T.W.; Graves, A.D.; Heath, Z.; Flowers, R.W.; Hanavan, R.P.; Cluck, D.R.; Ryerson, D. Accuracy of Aerial Detection Surveys for Mapping Insect and Disease Disturbances in the United States. For. Ecol. Manag. 2018, 430, 321–336.
Zhao, H.; Morgenroth, J.; Pearse, G.; Schindler, J. A Systematic Review of Individual Tree Crown Detection and Delineation with Convolutional Neural Networks (CNN). Curr. For. Rep. 2023, 9, 149–170.
Joshi, D.; Witharana, C. Vision Transformer-Based Unhealthy Tree Crown Detection in Mixed Northeastern US Forests and Evaluation of Annotation Uncertainty. Remote Sens. 2025, 17, 1066.
Zhi, J.; Li, L.; Zhu, H.; Li, Z.; Wu, M.; Dong, R.; Cao, X.; Liu, W.; Qu, L.; Song, X.; et al. Comparison of Deep Learning Models and Feature Schemes for Detecting Pine Wilt Diseased Trees. Forests 2024, 15, 1706.
Wang, L.; Gao, Y.; Liu, Y.; Zhong, L.; Wang, S.; Ma, Y.; Zhan, Z. Monitoring Pine Shoot Beetle Damage Using UAV Imagery and Deep Learning Semantic Segmentation Under Different Forest Backgrounds. Forests 2025, 16, 668.
Brandt, M.; Chave, J.; Li, S.; Fensholt, R.; Ciais, P.; Wigneron, J.-P.; Gieseke, F.; Saatchi, S.; Tucker, C.J.; Igel, C. High-Resolution Sensors and Deep Learning Models for Tree Resource Monitoring. Nat. Rev. Electr. Eng. 2024, 2, 13–26.
Chiang, C.-Y.; Barnes, C.; Angelov, P.; Jiang, R. Deep Learning Based Automated Forest Health Diagnosis from Aerial Images. IEEE Access 2020, 8, 144064–144076.
Windrim, L.; Carnegie, A.J.; Webster, M.; Bryson, M. Tree Detection and Health Monitoring in Multispectral Aerial Imagery and Photogrammetric Pointclouds Using Machine Learning. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 2554–2572.
ArborCarbon—ArborCarbon Are Experts in Sustainable Vegetation Management Solutions and Urban Forest Monitoring. Available online: https://www.arborcarbon.com.au/index.html (accessed on 3 November 2025).
DeepForest Technologies. Available online: https://deepforest-tech.co.jp/en/ (accessed on 3 November 2025).
Forest AI Experts | Remote Sensing & Forest Carbon Solutions—SKYLAB. Available online: https://skylabglobal.com/ (accessed on 3 November 2025).
Kattenborn, T.; Leitloff, J.; Schiefer, F.; Hinz, S. Review on Convolutional Neural Networks (CNN) in Vegetation Remote Sensing. ISPRS J. Photogramm. Remote Sens. 2021, 173, 24–49.
Xie, Y.; Wang, Y.; Sun, Z.; Liang, R.; Ding, Z.; Wang, B.; Huang, S.; Sun, Y. Instance Segmentation and Stand-Scale Forest Mapping Based on UAV Images Derived RGB and CHM. Comput. Electron. Agric. 2024, 220, 108878.
Kratky, M.; Komarkova, J. Comparison of Pixel and Object-Based Image Classification Based on Very High Spatial Resolution UAV-Borne RGB Imagery—Baroch Case Study. In Proceedings of the 2025 IEEE Zooming Innovation in Consumer Technologies Conference (ZINC), Novi Sad, Serbia, 28–29 May 2025; pp. 131–135.
Wołk, K.; Tatara, M.S. A Review of Semantic Segmentation and Instance Segmentation Techniques in Forestry Using LiDAR and Imagery Data. Electronics 2024, 13, 4139.
Pearse, G.D.; Watt, M.S.; Soewarto, J.; Tan, A.Y.S. Deep Learning and Phenology Enhance Large-Scale Tree Species Classification in Aerial Imagery during a Biosecurity Response. Remote Sens. 2021, 13, 1789.
Yang, W.; Zhao, J.; Zhu, D.; Wang, Z.; Song, M.; Chen, T.; Liang, T.; Shi, J. YOLO-PTHD: A UAV-Based Deep Learning Model for Detecting Visible Phenotypic Signs of Pine Decline Induced by the Invasive Woodwasp Sirex noctilio (Hymenoptera, Siricidae). Insects 2025, 16, 829.
Safonova, A.; Hamad, Y.; Alekhina, A.; Kaplun, D. Detection of Norway Spruce Trees (Picea abies) Infested by Bark Beetle in UAV Images Using YOLOs Architectures. IEEE Access 2022, 10, 10384–10392.
Leidemer, T.; Lopez Caceres, M.L.; Diez, Y.; Ferracini, C.; Tsou, C.Y.; Katahira, M. Evaluation of Temporal Trends in Forest Health Status Using Precise Remote Sensing. Drones 2025, 9, 337.
Tian, Y.; Ye, Q.; Doermann, D. YOLOv12: Attention-Centric Real-Time Object Detectors. arXiv 2025.
Wu, Q.; Chen, M.; Shi, H.; Yi, T.; Xu, G.; Wang, W.; Zhao, C.; Zhang, R. Algorithm for Detecting Trees Affected by Pine Wilt Disease in Complex Scenes Based on CNN-Transformer. Forests 2025, 16, 596.
Yuan, Q.; Zou, S.; Wang, H.; Luo, W.; Zheng, X.; Liu, L.; Meng, Z. A Lightweight Pine Wilt Disease Detection Method Based on Vision Transformer-Enhanced YOLO. Forests 2024, 15, 1050.
Qin, B.; Sun, F.; Shen, W.; Dong, B.; Ma, S.; Huo, X.; Lan, P. Deep Learning-Based Pine Nematode Trees’ Identification Using Multispectral and Visible UAV Imagery. Drones 2023, 7, 183.
Xu, S.; Huang, W.; Wang, D.; Zhang, B.; Sun, H.; Yan, J.; Ding, J.; Wang, J.; Yang, Q.; Huang, T.; et al. Automatic Pine Wilt Disease Detection Based on Improved YOLOv8 UAV Multispectral Imagery. Ecol. Inform. 2024, 84, 102846.
Wang, S.; Cao, X.; Wu, M.; Yi, C.; Zhang, Z.; Fei, H.; Zheng, H.; Jiang, H.; Jiang, Y.; Zhao, X.; et al. Detection of Pine Wilt Disease Using Drone Remote Sensing Imagery and Improved YOLOv8 Algorithm: A Case Study in Weihai, China. Forests 2023, 14, 2052.
Amin, S.U.; Jung, Y.; Fayaz, M.; Kim, B.; Seo, S. Enhancing Pine Wilt Disease Detection with Synthetic Data and External Attention-Based Transformers. Eng. Appl. Artif. Intell. 2025, 159, 111655.
Kanerva, H.; Honkavaara, E.; Näsi, R.; Hakala, T.; Junttila, S.; Karila, K.; Koivumäki, N.; Alves Oliveira, R.; Pelto-Arvo, M.; Pölönen, I.; et al. Estimating Tree Health Decline Caused by Ips typographus L. from UAS RGB Images Using a Deep One-Stage Object Detection Neural Network. Remote Sens. 2022, 14, 6257.
Shen, J.; Xu, Q.; Gao, M.; Ning, J.; Jiang, X.; Gao, M. Aerial Image Segmentation of Nematode-Affected Pine Trees with U-Net Convolutional Neural Network. Appl. Sci. 2024, 14, 5087.
Kapil, R.; Marvasti-Zadeh, S.M.; Goodsman, D.; Ray, N.; Erbilgin, N. Classification of Bark Beetle-Induced Forest Tree Mortality Using Deep Learning. arXiv 2022, arXiv:2207.07241v2.
Ravi, N.; Gabeur, V.; Hu, Y.T.; Hu, R.; Ryali, C.; Ma, T.; Khedr, H.; Rädle, R.; Rolland, C.; Gustafson, L.; et al. SAM 2: Segment Anything in Images and Videos. arXiv 2024.
Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer: Cham, Switzerland, 2015; pp. 234–241.
Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090.
Xmap—Aerial Survey Systems. Available online: https://xmap.com.au/ (accessed on 6 November 2025).
Tian, J.; Li, X.; Duan, F.; Wang, J.; Ou, Y. An Efficient Seam Elimination Method for UAV Images Based on Wallis Dodging and Gaussian Distance Weight Enhancement. Sensors 2016, 16, 662.
Perez, M.I.; Karelovic, B.; Molina, R.; Saavedra, R.; Cerulo, P.; Cabrera, G. Precision Silviculture: Use of UAVs and Comparison of Deep Learning Models for the Identification and Segmentation of Tree Crowns in Pine Crops. Int. J. Digit. Earth 2022, 15, 2223–2238.
Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2014; Volume 8693, pp. 740–755.
Model Training with Ultralytics YOLO—Ultralytics YOLO Docs. Available online: https://docs.ultralytics.com/modes/train/#augmentation-settings-and-hyperparameters (accessed on 4 November 2025).
Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment Anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 4015–4026.
Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet: A Large-Scale Hierarchical Image Database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255.
Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. In Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019.
El Joudi, N.A.; Lazaar, M.; Delmotte, F.; Allaoui, H.; Mahboub, O. Adaptive Transfer Learning Using SegFormer for Imbalanced Pixel in Medical Image Segmentation. Signal Image Video Process. 2025, 19, 617.
Buslaev, A.; Iglovikov, V.I.; Khvedchenya, E.; Parinov, A.; Druzhinin, M.; Kalinin, A.A. Albumentations: Fast and Flexible Image Augmentations. Information 2020, 11, 125.
Instance Segmentation—Ultralytics YOLO Docs. Available online: https://docs.ultralytics.com/tasks/segment/ (accessed on 5 November 2025).
Cheng, B.; Misra, I.; Schwing, A.G.; Kirillov, A.; Girdhar, R. Masked-Attention Mask Transformer for Universal Image Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 1280–1289.
Yang, J.; Yu, W.; Lv, Y.; Sun, J.; Sun, B.; Liu, M. SAM2-ELNet: Label Enhancement and Automatic Annotation for Remote Sensing Segmentation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 22499–22512.
Weng, X.; Pang, C.; Xia, G.S. Vision-Language Modeling Meets Remote Sensing: Models, Datasets, and Perspectives. IEEE Geosci. Remote Sens. Mag. 2025, 13, 276–323.
Woo, S.; Kim, D.; Jang, J.; Choi, Y.; Kim, C. Don’t Miss the Forest for the Trees: Attentional Vision Calibration for Large Vision Language Models. In Findings of the Association for Computational Linguistics: ACL 2025; Association for Computational Linguistics: Stroudsburg, PA, USA, 2025; pp. 1927–1951.

Figure 1. Overall workflow for comparing object detection and semantic segmentation approaches for diplodia shoot blight detection in pine plantations.

Figure 2. Location of the five study sites across Australia. (a) Overview map showing sites in NSW and the GT region. (b) Detailed view of the GT region showing Langkoop, Myora, and Dartmoor sites. (c) Detailed view of NSW sites showing Kangaroo Vale and Carabost sites.

Figure 3. Comparison of annotation strategies for object detection and semantic segmentation. (a) Original RGB aerial imagery with object detection bounding boxes showing three classes: yellow (yellow boxes), red-brown (red boxes), and dead tops (purple boxes). (b) Modified imagery for semantic segmentation with pixel-level polygon annotations for yellow and red-brown crowns; partially affected trees have been masked out to avoid ambiguous training signals.

Figure 4. Two-stage tiling strategy for dataset preparation. Illustration of the tiling workflow showing (a) large non-overlapping tiles (3040 × 3040 pixels) (black boxes) used for train/validation/test splitting, and (b) overlapping subtiles (640 × 640 pixels) (blue boxes) generated within each large tile using a stride of 480 pixels (25% overlap).
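
The subtile layout described in the Figure 4 caption can be sketched in a few lines. This is an illustrative sketch only, not the authors' preprocessing code: the function names are hypothetical, and it assumes subtiles are laid out on a regular grid that stays fully inside each large tile. With the sizes given in the caption, a 640-pixel window moved in 480-pixel steps fits a 3040-pixel tile exactly, giving six positions per axis.

```python
def subtile_offsets(tile_size: int = 3040, subtile_size: int = 640,
                    stride: int = 480) -> list[int]:
    """Top-left offsets along one axis for subtiles that fit fully inside the tile."""
    if subtile_size > tile_size:
        raise ValueError("subtile must not be larger than the tile")
    return list(range(0, tile_size - subtile_size + 1, stride))


def subtile_boxes(tile_size: int = 3040, subtile_size: int = 640,
                  stride: int = 480) -> list[tuple[int, int, int, int]]:
    """Pixel boxes (x_min, y_min, x_max, y_max) in row-major order."""
    offsets = subtile_offsets(tile_size, subtile_size, stride)
    return [(x, y, x + subtile_size, y + subtile_size)
            for y in offsets for x in offsets]


offsets = subtile_offsets()
boxes = subtile_boxes()
# Fraction of each subtile shared with its neighbour along one axis.
overlap_fraction = (640 - 480) / 640
print(offsets, len(boxes), overlap_fraction)
```

Under these assumptions each large tile yields 36 subtiles, and the last subtile ends exactly at pixel 3040, so no padding or partial windows are needed.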

Figure 5. Precision-recall curves showing detection performance of YOLOv12m across three tree crown classes and overall performance (mAP@0.5).

Figure 6. Confusion matrix showing classification performance of YOLOv12m across three tree crown classes and background.

Figure 7. Visual comparison of manual annotations and model predictions for tree crown detection. (a) Manually annotated bounding boxes colour-coded by symptom class: dead tops (purple), yellow (yellow), and red-brown (red). (b) YOLOv12m detection results with SAM instance segmentation masks delineating precise crown boundaries.

Figure 8. Training and validation loss curves for semantic segmentation models. (a) Training loss showing smooth convergence for all models. (b) Validation loss, ordered from lowest to highest: SegFormer < EVitNet < U-Net.

Figure 9. Quantification of within-crown symptom coverage through integrated detection and segmentation. Crown boundaries from YOLO + SAM (white outlines) combined with pixel-level symptom classification from SegFormer (red and yellow). Percentages indicate the proportion of affected crown pixels within each detected crown.
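
The percentage reported for each crown in Figure 9 can be illustrated with a small sketch. It is not the authors' implementation: the function name and the class codes (0 = background, 1 = yellow, 2 = red-brown) are assumptions made for the example, and the masks are plain nested lists rather than raster arrays.

```python
def crown_symptom_percent(crown_mask, class_map, symptom_labels=(1, 2)):
    """Percentage of pixels inside one crown mask that carry a symptom label.

    crown_mask: 2D list of bool, True where the pixel belongs to the crown
                (e.g. a YOLO + SAM instance mask).
    class_map:  2D list of int with the same shape, one class code per pixel
                (assumed codes: 0 background, 1 yellow, 2 red-brown).
    """
    crown_pixels = 0
    affected_pixels = 0
    for mask_row, class_row in zip(crown_mask, class_map):
        for inside, label in zip(mask_row, class_row):
            if inside:
                crown_pixels += 1
                if label in symptom_labels:
                    affected_pixels += 1
    if crown_pixels == 0:
        return 0.0  # empty mask: nothing to quantify
    return 100.0 * affected_pixels / crown_pixels


crown_mask = [[True, True, False],
              [True, True, False]]
class_map = [[2, 1, 2],
             [0, 2, 0]]
percent_affected = crown_symptom_percent(crown_mask, class_map)
print(percent_affected)
```

Symptomatic pixels outside the crown outline (the top-right pixel in the toy example) do not count towards that crown, which is why the crown boundary and the pixel classification have to be combined.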

Table 1. Model specifications and computational requirements.

Task	Model	Encoder/Variant	Params (M)	GFLOPs	FPS
Object detection	YOLO	YOLOv12m	20.20	70.93	68.08
Semantic segmentation	U-Net	ResNet-34	24.44	49.01	68.50
Semantic segmentation	EVitNet	MobileViT-XXS	1.16	7.98	44.35
Semantic segmentation	SegFormer	MiT-B0	3.71	13.18	49.04

Table 2. Comparison of evaluation metrics between object detection and semantic segmentation.

Metric	Object Detection	Semantic Segmentation
IoU	Bounding box overlap	Pixel region overlap
TP	Correct detections & IoU > threshold	Correctly classified pixels
FP	Incorrect detections or IoU < threshold	Background pixels misclassified as target
FN	Missed detections	Target pixels misclassified as background
Precision	Correct detections/All detections	Correct pixels/All predicted pixels
Recall	Correct detections/All ground truth	Correct pixels/All ground truth pixels
F1 score	Harmonic mean of precision & recall	Harmonic mean of precision & recall
mAP	Average AP across classes	Not applicable
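
As a hedged illustration of the semantic segmentation column of Table 2, the sketch below computes per-class pixel metrics from flattened label sequences. The function names and class codes are hypothetical. In this sketch any pixel of another class that is predicted as the target counts as a false positive, which covers the background case named in the table as well as confusion between the two symptom classes.

```python
def count_pixels(truth, pred, target):
    """Count TP, FP and FN pixels for one target class."""
    tp = fp = fn = 0
    for t, p in zip(truth, pred):
        if p == target and t == target:
            tp += 1
        elif p == target:
            fp += 1  # predicted target, truth is something else
        elif t == target:
            fn += 1  # truth is target, prediction missed it
    return tp, fp, fn


def pixel_metrics(tp, fp, fn):
    """IoU, precision, recall and F1 from pixel counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"iou": iou, "precision": precision, "recall": recall, "f1": f1}


# Toy example with assumed codes: 0 background, 1 yellow, 2 red-brown.
truth = [0, 1, 1, 1, 2, 2, 0, 1]
pred = [0, 1, 1, 0, 2, 1, 1, 1]
yellow = pixel_metrics(*count_pixels(truth, pred, target=1))
print(yellow)
```

When all four metrics come from the same pixel counts, F1 and IoU are tied by F1 = 2·IoU / (1 + IoU), so either one can be derived from the other.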

Table 3. Object detection performance of YOLOv12m model across three tree crown classes.

Crown Class	mAP50	mAP50–95	Precision	Recall	F1 Score
Dead tops	0.591	0.283	0.570	0.540	0.555
Yellow	0.789	0.480	0.745	0.713	0.729
Red-brown	0.918	0.578	0.839	0.864	0.851
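
The per-class values in Table 3 can be cross-checked against the overall scores quoted in the abstract. The short sketch below is only a consistency check on the published numbers; the variable names are illustrative. It recomputes each F1 score as the harmonic mean of precision and recall and averages the per-class mAP values.

```python
# Values copied from Table 3: (mAP50, mAP50-95, precision, recall, reported F1).
table3 = {
    "Dead tops": (0.591, 0.283, 0.570, 0.540, 0.555),
    "Yellow": (0.789, 0.480, 0.745, 0.713, 0.729),
    "Red-brown": (0.918, 0.578, 0.839, 0.864, 0.851),
}


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


recomputed_f1 = {name: round(f1_score(row[2], row[3]), 3)
                 for name, row in table3.items()}
mean_map50 = round(sum(row[0] for row in table3.values()) / len(table3), 3)
mean_map50_95 = round(sum(row[1] for row in table3.values()) / len(table3), 3)
print(recomputed_f1, mean_map50, mean_map50_95)
```

The recomputed F1 scores match the table to three decimals, and the class means reproduce the overall mAP50 of 0.766 and mAP50–95 of 0.447 reported in the abstract.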

Table 4. Segmentation performance of three semantic segmentation models and YOLO + SAM approach across two tree crown classes evaluated on the test dataset.

Model	Training Time (h)	Crown Class	IoU	Precision	Recall	F1 Score
U-Net	6.6	Yellow	0.504	0.665	0.674	0.670
U-Net	6.6	Red-brown	0.645	0.778	0.791	0.784
EVitNet	9.4	Yellow	0.527	0.677	0.705	0.690
EVitNet	9.4	Red-brown	0.658	0.769	0.821	0.794
SegFormer	20.5	Yellow	0.542	0.694	0.711	0.703
SegFormer	20.5	Red-brown	0.662	0.770	0.825	0.797
YOLO + SAM	11.0 *	Yellow	0.400	0.526	0.625	0.571
YOLO + SAM	11.0 *	Red-brown	0.484	0.611	0.699	0.652

* Training time is for YOLO only. SAM was used in zero-shot mode without fine-tuning.
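
One way to sanity-check Table 4 is the identity F1 = 2·IoU / (1 + IoU), which holds when both metrics are computed from the same pooled pixel counts. The sketch below assumes that is how the table was produced, which the source does not state explicitly; it is a consistency check, not part of the authors' evaluation code.

```python
# (model, crown class, reported IoU, reported F1) copied from Table 4.
table4 = [
    ("U-Net", "Yellow", 0.504, 0.670),
    ("U-Net", "Red-brown", 0.645, 0.784),
    ("EVitNet", "Yellow", 0.527, 0.690),
    ("EVitNet", "Red-brown", 0.658, 0.794),
    ("SegFormer", "Yellow", 0.542, 0.703),
    ("SegFormer", "Red-brown", 0.662, 0.797),
    ("YOLO + SAM", "Yellow", 0.400, 0.571),
    ("YOLO + SAM", "Red-brown", 0.484, 0.652),
]


def f1_from_iou(iou: float) -> float:
    """F1 (Dice) implied by an IoU computed from the same pixel counts."""
    return 2 * iou / (1 + iou)


# Largest gap between the implied and the reported F1 across all rows.
max_gap = max(abs(f1_from_iou(iou) - f1) for _, _, iou, f1 in table4)
print(round(max_gap, 4))
```

Every row agrees with the identity to within rounding of the published three-decimal values, which suggests the IoU and F1 columns are internally consistent.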
