Vision Transformer-Based Unhealthy Tree Crown Detection in Mixed Northeastern US Forests and Evaluation of Annotation Uncertainty

Joshi, Durga; Witharana, Chandi

doi:10.3390/rs17061066


by Durga Joshi * and Chandi Witharana

Department of Natural Resources and the Environment, Eversource Energy Center, University of Connecticut, Storrs, CT 06269, USA

* Author to whom correspondence should be addressed.

Remote Sens. 2025, 17(6), 1066; https://doi.org/10.3390/rs17061066

Submission received: 6 January 2025 / Revised: 7 March 2025 / Accepted: 12 March 2025 / Published: 18 March 2025

(This article belongs to the Special Issue Computer Vision-Based Methods and Tools in Remote Sensing)


Abstract

Forest health monitoring at scale requires high-spatial-resolution remote sensing images coupled with deep learning image analysis methods. However, high-quality large-scale datasets are costly to acquire. To address this challenge, we explored the potential of freely available National Agriculture Imagery Program (NAIP) imagery. By comparing the performance of traditional convolutional neural network (CNN) models (U-Net and DeepLabv3+) with a state-of-the-art Vision Transformer (SegFormer), we aimed to determine the optimal approach for detecting unhealthy tree crowns (UTC) using a publicly available data source. Additionally, we investigated the impact of different spectral band combinations on model performance to identify the most effective configuration without incurring additional data acquisition costs. We explored various band combinations, including RGB, color infrared (CIR), vegetation indices (VIs), principal components (PC) of texture features (PCA), and spectral bands with a PC (RGBPC). Furthermore, we analyzed the uncertainty associated with potentially subjective crown annotation and its impact on model evaluation. Our results demonstrated that the Vision Transformer-based model, SegFormer, outperformed traditional CNN-based models, particularly when trained on RGB images, yielding an F1-score of 0.85. In contrast, DeepLabv3+ achieved an F1-score of 0.82. Notably, PCA-based inputs yielded reduced performance across all models, with U-Net producing particularly poor results (F1-score as low as 0.03). The uncertainty analysis indicated that the Intersection over Union (IoU) could fluctuate between 14.81% and 57.41%, while F1-scores ranged from 8.57% to 47.14%, reflecting the significant sensitivity of model performance to inconsistencies in ground truth annotations. In summary, this study demonstrates the feasibility of using publicly available NAIP imagery and advanced deep learning techniques to accurately detect unhealthy tree canopies.
These findings highlight SegFormer’s superior ability to capture complex spatial patterns, even in relatively low-resolution (60 cm) datasets. Our findings underline the considerable influence of human annotation errors on model performance, emphasizing the need for standardized annotation guidelines and quality control measures.

Keywords: CNN; vision transformer; tree health; annotation uncertainty


1. Introduction

Forest ecosystems provide essential services, such as carbon sequestration and biodiversity conservation, yet human activities and environmental stressors, such as climate change and insect infestations, increasingly threaten them [1,2,3,4]. These stressors contribute to declining tree health, resulting in biodiversity loss [5], altered hydrological and carbon cycles [6], and increased risks to human safety [7], particularly in vulnerable areas, such as overhead electric utility corridors. Unhealthy or structurally compromised trees within these corridors pose significant risks, as falling branches or entire trees can encounter power lines, leading to infrastructure damage, electrical outages, and service disruptions. In severe cases, downed power lines can create electrocution hazards for the public and emergency responders and increase the likelihood of wildfires, especially in dry conditions [8]. Furthermore, forest disturbances are expected to become more frequent and severe due to the synergistic effects of human activities and climate change, particularly in areas like the Northeastern United States, where forest cover constitutes a significant portion of the landscape [9].

Managing degraded and dying forests at various spatial levels, from individual trees in urban areas to larger stands on a landscape scale, is challenging. Effective tree management relies on accurate and up-to-date information. To maintain forest dynamics and promote sustainable forest management practices, it is essential to monitor forest health and identify unhealthy trees in a timely manner. Traditional inventory methods are often time-consuming and costly, but advancements in remote sensing (RS) technology, digital image processing, and machine learning offer promising alternatives for assessing unhealthy tree crowns (UTCs) [10,11,12].

Automated image analysis methods in RS have advanced significantly, mainly through deep learning (DL) techniques, such as convolutional neural networks (CNNs) for object segmentation [13], classification [14], and detection [15]. CNNs offer a more robust and accurate approach to classification than traditional approaches [16,17] by directly identifying objects in training data and learning hierarchical combinations of image features that focus on object-level representations. CNNs can also be retrained to incorporate unique dataset characteristics [18], and they therefore excel in segmentation and detection tasks, such as individual tree crown detection [19,20], plant health monitoring [10,21,22,23,24], and forest mapping and monitoring [25,26,27]. Kattenborn et al. [28] reviewed the application of CNNs in forest remote sensing and noted that CNNs outperform shallow machine learning techniques, such as random forest or support vector machine algorithms, in a wide range of forest applications.

CNN models, such as Faster R-CNN [29], FCN [30], Mask R-CNN [31], U-Net [32], DeepLab v3+ [33], and YOLO [34], have been widely used for segmentation tasks. For instance, Lobo Torres et al. [35] utilized Unmanned Aerial Vehicle (UAV)-based RGB images to semantically segment individual species in an urban environment. They compared the performance of five different CNNs and achieved accuracy ranging from 88.9% to 96.7% across various architectures in identifying a single endangered species. CNNs have also proven effective in tasks such as tree counting and identification, outperforming traditional methods using high-resolution remote sensing data [10,16,36]. Li et al. [20] designed and implemented a CNN-based deep learning framework to detect oil palm trees in high-resolution images with an accuracy higher than 96% while counting the individual treetops. Brandt et al. [37] mapped more than 1.8 billion trees in the Sahara and Sahel region using more than 11 thousand satellite scenes. This shows the potential of CNNs in pattern recognition for vegetation analysis from RS data. DL-based techniques are also becoming prevalent in leveraging 3D-LiDAR datasets [38,39,40,41] for accurate segmentation tasks. LiDAR-derived canopy height models and other information, coupled with multispectral and/or hyperspectral images, are gaining popularity due to their adaptability and flexibility in both spatial and spectral resolution [18]. However, tree detection and segmentation tasks using CNNs remain a challenge due to the lack of training samples, complex geometries, target–background imbalance, imaging conditions, and sensor characteristics. Moreover, accurately identifying targets in complex scenes, such as mixed dense forest vegetation, solely based on local feature information presents a challenge for CNN-based models.

Vision Transformers (ViTs) [42] are a transformer-based [43] architecture introduced to computer vision and designed specifically for vision tasks [44], where they model the spatial relationships in an image. Transformers are powerful because they can be trained on massive internet data and then fine-tuned to transfer their learned abilities to other domains using smaller, task-specific datasets [42]. ViTs have demonstrated significant application in semantic segmentation tasks, often retaining the encoder-decoder architecture from CNN-based approaches. This makes them versatile tools for a wide range of applications, including remote sensing image analysis [45,46,47,48]. For instance, Kaselimi et al. [46] used a ViT for remote sensing scene classification and achieved more than 94% accuracy on different datasets. Maurício et al. [49] reviewed a large body of studies that compared the performance of CNNs with ViTs and concluded that ViTs outperformed CNNs. However, ViTs still have certain disadvantages when compared to CNNs, such as the necessity for extremely large datasets. Self-supervised techniques can mitigate some of these limitations and enhance ViTs further [50]. On the other hand, combining CNNs and ViTs for classification purposes can help overcome the limitations of either approach. For instance, Li et al. [51] combined a CNN and transformer for agricultural crop classification in time-series satellite images. They first extracted multitemporal features from the CNN and then used transformers to learn the land cover patterns.

ViTs offer a promising alternative to CNNs for semantic segmentation tasks, particularly those requiring global context, high-resolution images, or intricate segmentation. SegFormer [52], Swin Transformer [53], Detectron2 with transformer backbone [54], Pyramid Vision Transformer (PVT) [55], and Masked-attention Mask Transformer (Mask2Former) [56] are popular ViTs for semantic segmentation tasks. Such ViTs are gaining popularity in the field of remote sensing for semantic segmentation tasks [57]. The SegFormer algorithm [52], a notable transformer-based segmentation model inspired by ViTs, partitions the input image into smaller patches and feeds each patch as a sequence into the transformer for analysis. Unlike conventional semantic segmentation techniques, the SegFormer model integrates multi-scale features to capture pixel relationships, thereby reducing parameters and computations in the decoder section effectively. In forest remote sensing, a very limited number of studies have used transformer-based semantic segmentation. ViTs are mainly used in land cover classification and change detection purposes [48,58,59,60]. There is a need and an opportunity to explore the performance of ViTs in forest health monitoring tasks.

Forest health monitoring in a dense mixed forest poses significant challenges, as individual spectral bands may not suffice to detect unhealthy tree canopies. Vegetation indices (VIs), derived from combinations of spectral reflectance values, have been proposed as an alternative to individual spectral bands [61]. However, VIs have limitations in interpreting forest canopy information, especially regarding variables such as chlorophyll content and leaf angle [62]. Peddle et al. [63] found, through performance comparisons with spectral mixture analysis, that VIs have shortcomings in assessing tree canopy cover. Several studies have also claimed that textural information can improve forest health classification [64,65,66]. However, a significant amount of work is still needed to assess the effectiveness of texture analysis in examining tree decay and mortality [67]. Most DL models applied in satellite and aerial imagery analysis, particularly segmentation, predominantly employ end-to-end architectures that utilize multi- or hyper-spectral bands directly in their processing pipeline. However, challenges arise from the complex, multimodal, geolocated, and multitemporal nature of the data, coupled with limited labeled remote-sensing data for model training purposes. Additionally, the growth of high-dimensional data surpasses improvements in GPU (Graphics Processing Unit) computing power. To improve the efficacy of DL models in high-dimensional data analysis, it is essential to incorporate domain knowledge from forest science and remote sensing. Karpatne et al. [68] suggest that the integration of theory-driven insights with data science significantly bolsters model robustness in complex environmental settings. By embedding textural information and VIs as prior knowledge, we aim to refine DL models and enhance their predictive capabilities in forestry applications.

The efficacy of DL models is significantly contingent upon the quality of training samples, where accurate and consistent annotations play a pivotal role in determining model performance. It is imperative to account for annotation uncertainty during model evaluation, as variability in annotations can profoundly affect predictive accuracy and model generalizability. This issue is particularly pronounced in scientific studies that often rely on limited training datasets due to the high costs of data production and the restricted availability of domain experts. Studies by Nowak & Rüger, Rädsch et al., and Vădineanu et al. [69,70,71] underscore the critical impact of annotator variability, highlighting the necessity for systematic analyses of inter-annotator agreement. By rigorously addressing annotation uncertainty, our study not only strengthens the robustness of model evaluation but also introduces a novel framework that extends beyond conventional metrics, thereby enhancing the reliability and validity of findings in complex environmental contexts. This approach underscores the importance of methodological rigor in annotation processes, ultimately contributing to more accurate and explainable DL applications in the field.

In this study, we present a systematic comparative analysis of CNNs and a ViT model to identify unhealthy forest canopies in mixed dense forest areas using publicly available aerial imagery acquired by the National Agriculture Imagery Program (NAIP). We compare the effectiveness of a set of candidate CNNs and a ViT for this task, which, to our knowledge, has not been explored before in forest health monitoring in dense mixed forests. Our specific objectives are to (i) compare the accuracy of the CNN-based models U-Net and DeepLab v3+ and the ViT-based model SegFormer, (ii) evaluate the models across different band combinations derived from the 4-band NAIP images, including VIs, Principal Components, and image textures, and (iii) evaluate manual annotation uncertainty and its impact on evaluation metrics. By examining annotation uncertainty, we aim to shed new light on how inter-annotator variability can impact model performance. Our study uniquely integrates a Vision Transformer-based model (SegFormer) into the forest health monitoring domain, leveraging NAIP imagery and evaluating the impact of annotation uncertainty. This study not only benchmarks different semantic segmentation models but also provides insights into annotation variability, an often-overlooked factor in tree crown detection research. Our findings contribute to refining segmentation practices and highlight the need for improved annotation consistency in forest monitoring applications. Our study seeks to contribute to the development of a cost-effective method for detecting unhealthy trees using high-resolution imagery, which can be valuable for forest management and conservation efforts.

2. Materials and Methods

2.1. Study Area

The study area is in the eastern part of Windham County, Connecticut, USA (Figure 1). Connecticut (CT) is a densely forested state, with mixed forest vegetation occupying around 57% of its land area [9]. The forest type is temperate deciduous, with 58 species of trees in Connecticut [72]. The predominant oak-hickory forest type covers approximately 70% of the total forested area, followed by elm, ash, cottonwood, maple, beech, birch, and other hardwoods [72]. These areas also have substantial numbers of eastern white pine, eastern hemlock, and other softwoods. The forests of CT have been widely infested with invasive insects, leading to defoliation and tree mortality. A massive oak tree die-off occurred in eastern CT during 2017–2018. Prolonged drought followed by severe spongy moth (Lymantria dispar) defoliation weakened the trees, making them susceptible to additional stressors like borers and fungus, leading to widespread mortality [73].

2.2. Modeling Framework

The overall methodology is divided into four steps: (1) data retrieval and preprocessing, (2) DL model training and optimization, (3) DL model comparison and accuracy assessment, and (4) annotation uncertainty analysis (Figure 2). NAIP aerial imagery was annotated in ArcGIS Pro 3.1.2, creating polygon labels for the UTCs in shapefile format, which were later converted into binary mask images. Various combinations of the spectral bands, VIs, and texture metrics and their Principal Components (PCs) were used as the input for training our candidate algorithms. In the model training phase, we used the ViT-based SegFormer architecture and the CNN-based U-Net and DeepLab v3+ architectures with a ResNet101 backbone. Furthermore, we conducted an accuracy assessment based on various evaluation metrics, which are described in detail later.

2.3. Data Retrieval and Preparation

2.3.1. Image Data

We downloaded NAIP imagery from the USGS Earth Explorer archive, managed by the USDA Farm Service Agency (FSA) through the Aerial Photography Field Office (APFO). The imagery has ground sample distances between 0.6 and 1 m and is orthorectified prior to distribution. Images are organized into uniform scenes, each potentially having various flight lines with differing geometries. Images consist of 4 spectral channels (Blue (435–495 nm), Green (525–585 nm), Red (619–651 nm), and Near-Infrared (808–882 nm)) and were acquired with a Leica Geosystems ADS-100 airborne digital sensor [74]. The imagery has an 8-bit radiometric resolution. We did not perform any other processing on the NAIP imagery, as our methods were specifically designed to accommodate the diverse characteristics present within NAIP image scenes. A total of 16 NAIP image scenes were used to cover the study area. To avoid confusion in terminology, we define two key image types: “image scene” and “image patch”. An “image scene” refers to the entire NAIP image downloaded from USGS Earth Explorer, typically covering a large area (3.75 min quarter quadrangle plus a 300 m buffer, i.e., 12,596 × 12,596 pixels) and containing four spectral channels (B, G, R, and NIR). Image scenes are then subdivided into subarrays with a dimension of 254 × 254 pixels (152.4 m × 152.4 m) for further processing. We refer to those subarrays as “image patches” throughout the text.
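The subdivision of an image scene into image patches can be illustrated with a minimal sketch. It assumes a regular, non-overlapping grid and skips partial patches at the scene edges; both choices are assumptions made for illustration, and the function name `patch_windows` is hypothetical rather than part of the study's code.

```python
def patch_windows(scene_height, scene_width, patch_size=254):
    """Yield (row_start, col_start) offsets of non-overlapping square patches.

    Assumption: partial patches at the right and bottom edges are skipped.
    """
    for row in range(0, scene_height - patch_size + 1, patch_size):
        for col in range(0, scene_width - patch_size + 1, patch_size):
            yield row, col


GSD_M = 0.6          # ground sample distance in metres (from the text)
PATCH_SIZE = 254     # patch side length in pixels (from the text)
SCENE_SIZE = 12596   # scene side length in pixels (from the text)

windows = list(patch_windows(SCENE_SIZE, SCENE_SIZE, PATCH_SIZE))
patches_per_side = SCENE_SIZE // PATCH_SIZE
patch_side_m = PATCH_SIZE * GSD_M  # ground footprint of one patch side
```

Under these assumptions, each scene yields a 49 × 49 grid of full patches, and each patch side covers 254 × 0.6 m = 152.4 m on the ground, matching the patch footprint stated above.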

2.3.2. Spectral Band Combination

We derived VIs from multispectral bands as a proxy for vegetation health. Three vegetation indices, Normalized Difference Vegetation Index (NDVI), Enhanced Vegetation Index (EVI), and Atmospherically Resistant Vegetation Index (ARVI), were created for each image patch. Since each of our candidate DL architectures takes an input image with 3 spectral channels, we used only three indices as a proxy for vegetation health. Equations (1)–(3) show the formula used to calculate vegetation indices [75,76,77].

\mathrm{NDVI} = \frac{\mathrm{NIR} - \mathrm{Red}}{\mathrm{NIR} + \mathrm{Red}}

(1)

\mathrm{EVI} = 2.5 \times \frac{\mathrm{NIR} - \mathrm{Red}}{\mathrm{NIR} + 6 \times \mathrm{Red} - 7.5 \times \mathrm{Blue} + 1}

(2)

\mathrm{ARVI} = \frac{\mathrm{NIR} - (2 \times \mathrm{Red} - \mathrm{Blue})}{\mathrm{NIR} + (2 \times \mathrm{Red} - \mathrm{Blue})}

(3)
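As an illustration of Equations (1)–(3), the following minimal sketch computes the three indices for a single pixel. It assumes the band values have been scaled to a reflectance-like range of 0 to 1, which the text does not specify, and it returns 0 where a denominator is zero; the function names are hypothetical.

```python
def _safe_div(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero (an assumption)."""
    return numerator / denominator if denominator != 0 else 0.0


def vegetation_indices(nir, red, blue):
    """Return (NDVI, EVI, ARVI) for one pixel, following Equations (1)-(3).

    Assumption: inputs are reflectance-like values scaled to [0, 1].
    """
    ndvi = _safe_div(nir - red, nir + red)
    evi = 2.5 * _safe_div(nir - red, nir + 6.0 * red - 7.5 * blue + 1.0)
    red_blue = 2.0 * red - blue  # atmospherically corrected red term
    arvi = _safe_div(nir - red_blue, nir + red_blue)
    return ndvi, evi, arvi


# Illustrative pixel with strong NIR reflectance (values are made up).
ndvi, evi, arvi = vegetation_indices(nir=0.5, red=0.1, blue=0.05)
```

Stacking the three index values per pixel gives a 3-channel input of the kind described above.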

Similarly, we derived other 3-band combinations, such as color infrared bands, i.e., red, green, and NIR, as model inputs. Furthermore, we conducted Principal Component Analysis (PCA) on all four bands and selected the first principal component (PC); we then used the corresponding red and green bands as the first two channels and the PC as the third channel to create a three-band raster. With PCA, we can represent the original data in a summarized form as principal components, which maximally account for the variance of the variables used [78]. We also explored whether textural properties may help in identifying vegetation health. We calculated Haralick’s Grey Level Co-occurrence Matrix (GLCM) textural metrics [79] and extracted eight standard textural features (Variance, Entropy, Contrast, Homogeneity, Angular Second Moment, Dissimilarity, Mean, and Correlation) for each spectral band. We used the “GLCMTextures” library in R (v 4.4.1) with a sliding window size of 3 × 3 and a shift of (1, 0). Haralick texture measures can be correlated and redundant; hence, we applied PCA to the derived metrics for each image (32 textural images for four bands) and selected the first three PCs for further analysis. Different combinations of bands used in training data (Table 1) and UTC annotations are shown in Figure 3.

2.3.3. Manual UTC Annotation

We initially identified areas heavily affected by spongy moth infestations and then randomly generated 500 sample points within these locations. NAIP scenes were then tiled into 500 image patches (254 × 254-pixel patches) centering each random point. We manually digitized all the UTCs within each image patch. We annotated a total of 400 image patches that comprise 5133 UTCs. UTCs were identified using a combination of visual indicators and spectral analysis. Visually, UTCs typically exhibited greyish hues, indicating defoliation or canopy stress, though seasonal foliage changes, such as leaf senescence, sometimes resembled unhealthy crowns, leading to potential misclassification. To mitigate this, we incorporated spectral information from NAIP imagery to improve classification accuracy. Key visual criteria for UTC identification included significant canopy defoliation, where trees with notable foliage loss were classified as UTCs, as well as discoloration patterns, particularly greyish, brown, or pale tones inconsistent with normal seasonal changes. Additionally, sparse or degraded canopy structures suggesting thinning or dieback were strong indicators of declining tree health. To further differentiate between seasonal color changes and true UTCs, we used spectral indices and false-color band combinations. The NDVI helped us to assess vegetation health, as UTCs typically exhibited lower NDVI values than healthy trees, indicating reduced photosynthetic activity. False-color infrared composites incorporating near-infrared (NIR) bands further distinguished stressed vegetation, as unhealthy trees reflected less NIR light. We aimed to isolate unhealthy crowns while preserving their shape. However, dense vegetation can be tiring and confusing for human eyes, leading to inconsistencies in crown delineation. Hence, to minimize the bias, we kept the scale of 1:300 while annotating the crown. 
However, occluded and merged canopies posed difficulties in accurately delineating individual tree crowns, often resulting in multiple UTCs being annotated as a single crown, as shown in Figure 4. This challenge was particularly prominent in densely mixed forests, where overlapping branches and foliage obscured distinct crown boundaries. Even though we primarily relied on visual cues, such as differences in texture, color, and shading, to infer separations between crowns, there were some instances where clear boundaries were not discernible. In such scenarios, a conservative approach was taken, where the entire visibly stressed area was annotated as a single UTC rather than making speculative separations.

The study did not account for species-level information of UTCs, as the dense mixed forest setting made species differentiation unfeasible and beyond the study’s scope. Additionally, the complexity of the vegetation prevented an exact determination of the proportion of healthy trees. However, we systematically captured the extent of UTC presence in affected areas through a structured sampling and annotation process. The average crown area of annotated UTCs was approximately 54 m², with each image patch containing an average of 13 UTCs, ranging from 1 to 170 (Table 2). Figure 5 and Figure 6 illustrate the distribution of UTCs per image patch and their sizes, respectively.

2.4. Deep Learning Models

2.4.1. SegFormer Model

The SegFormer [52] model consists of a transformer encoder and an MLP decoder, as shown in Figure 7.

This architecture is notable for its lightweight All-MLP decoder, which sets it apart from other transformer-based segmentation models, such as [80]. With its encoder producing multi-level feature maps that are fused in the decoder, SegFormer can capture both high- and low-resolution information. Additionally, the model employs a “mix-FFN” operation in its encoder that replaces the original positional encoding found in other transformer architectures. Because SegFormer is less complex than other transformer-based models, it requires less data for training and can be applied to real-time applications. Finally, the encoder can be scaled up or down, from B0 to B5, by increasing the number of layers or dimensions of encoder blocks. For this study, we used the SegFormer-B0 architecture, which has 3.75 million parameters.

2.4.2. U-Net Model

We opted for the U-Net and DeepLab v3+ CNN architectures for the pixel-wise semantic segmentation task, as they are computationally efficient and have been widely used in computer vision. Initially, U-Net was devised for biomedical image segmentation tasks [32], but over time, it has been widely adopted across diverse domains, including remote sensing [81]. The U-Net architecture structurally resembles a ‘U’-shape, consisting of an encoding and a decoding path, as shown in Figure 8.

The encoder captures the semantic information from the input image. It uses convolutional and pooling layers to reduce the image size while extracting increasingly complex features. The symmetrical decoder path enables the accurate localization of semantic information. It uses upsampling and concatenation to increase the image size gradually. U-Net has skip connections to preserve spatial information during the upsampling and downsampling processes. While U-Net is powerful, it can face challenges like vanishing gradients, especially in the decoder. We used ResNet101, a deep residual network with 101 layers and about 44.5 million parameters, as the encoder backbone to capture contextual information from the input image while mitigating the vanishing gradient problem encountered in U-Net.

2.4.3. DeepLab Model

The architectural design of DeepLab v3+ (Figure 9) encompasses an encoder-decoder path. It is an enhanced iteration of the model that integrates the encoder component from the DeepLab v3 architecture with a simplified decoder module, and it contains around 58 million parameters. This framework uses ResNet-101 as its backbone and incorporates Atrous Convolution within deep layers to expand the receptive field effectively. Following ResNet-101, an Atrous Spatial Pyramid Pooling (ASPP) module consolidates multi-scale contextual information. The decoder unit combines low-level features from ResNet-101 with upsampled deep-level multi-scale features derived from ASPP. Finally, it upsamples the combined feature maps to produce the final semantic segmentation output.

2.5. DL Model Training

Candidate models were run on PyTorch 2.1.1, a Python-based machine learning framework for implementing and training DL algorithms on GPUs. We used an Intel(R) Core(TM) i9-10900K CPU @ 3.70 GHz with 128 GB RAM, running a 64-bit operating system, and an NVIDIA® Quadro RTX 4000 GPU with 8192 MiB of memory under CUDA version 12.2. To ensure the fairness of the comparative experiments, we trained the three candidate models under similar conditions. For the ViT-based algorithm, we used a batch size of 12 and trained for 100 epochs with early stopping; the Adam optimizer was used, and the learning rate was set to 0.001 to improve the speed and effectiveness of training. We used the weighted focal loss function, giving 15% weight to the background pixels and 85% weight to our target class, to address the class imbalance problem.
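The effect of the class weighting can be illustrated with a framework-free sketch of a class-weighted binary focal loss. The 0.15 and 0.85 weights come from the text; the focusing parameter `gamma = 2.0` is an assumption, since its value is not reported, and the function is a hypothetical illustration rather than the training code used in the study.

```python
import math


def weighted_focal_loss(probs, labels, w_bg=0.15, w_utc=0.85, gamma=2.0):
    """Mean class-weighted binary focal loss over a flat list of pixels.

    probs  : predicted probability of the UTC class for each pixel
    labels : 1 for UTC pixels, 0 for background pixels
    gamma  : focusing parameter (assumed value, not reported in the text)
    """
    eps = 1e-7  # keeps log() finite for probabilities of exactly 0 or 1
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        p_true = p if y == 1 else 1.0 - p      # probability of the true class
        weight = w_utc if y == 1 else w_bg     # class weight from the text
        total += -weight * (1.0 - p_true) ** gamma * math.log(p_true)
    return total / len(probs)


# Equally confident mistakes on a UTC pixel and on a background pixel.
missed_utc = weighted_focal_loss([0.1], [1])
missed_background = weighted_focal_loss([0.9], [0])
```

With these weights, an equally confident error on a UTC pixel contributes 0.85/0.15 (about 5.7) times more to the loss than the same error on a background pixel, which is how the weighting counteracts the dominance of background pixels.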

2.6. Accuracy Assessment

We used standard segmentation quality metrics from computer vision, namely Precision, Recall, F1-score, Intersection over Union (IoU), and overall accuracy, to assess the performance of the candidate algorithms.

The metric precision serves as a pivotal evaluation parameter within classification tasks, especially when confronted with imbalanced class distributions [82]. It quantifies the accuracy of a model in correctly identifying instances belonging to a specific class relative to the total instances it identifies as positive. Conceptually, precision gauges the model’s capability to discern relevant instances among all instances it classifies as positive. It yields a value between 0 and 1, with a higher precision value indicating better performance. Mathematically, precision is calculated as the ratio of true positives to the sum of true positives and false positives, representing both correctly and incorrectly identified instances.

\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}

(4)

Recall, also known as sensitivity or true positive rate, is a fundamental metric in classification tasks, particularly when ensuring that all relevant instances of a class are correctly identified [82]. It quantifies the fraction of correctly labeled instances of each class relative to the total number of instances that truly belong to that class. Essentially, recall measures the model’s effectiveness in capturing all relevant instances of a class among all instances that belong to that class.

\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}

(5)

The F1-score metric is a powerful tool used to evaluate the performance of a classifier by combining precision and recall into a single value [82]. The F1-score is calculated as the harmonic mean of precision and recall. The harmonic mean is used because it punishes extreme values, making it sensitive to cases where either precision or recall is low. As a result, the F1-score tends to favor classifiers that have similar precision and recall values. Its value extends from 0 to 1, where a higher F1 score implies better overall performance in terms of both precision and recall. Mathematically,

\text{F1-score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}

(6)

Intersection over Union (IoU) is a metric widely used in semantic segmentation tasks to assess the accuracy of predicted object masks [83]. It quantifies the overlap between the predicted and ground truth masks, providing a comprehensive measure of the model’s ability to accurately delineate object boundaries.

\text{IoU} = \frac{\text{Intersection Area}}{\text{Union Area}}

(7)

The overall accuracy measures the proportion of correctly classified samples to the total number of samples [84].

\text{Overall accuracy} = \frac{\text{True Positives} + \text{True Negatives}}{\text{True Positives} + \text{False Positives} + \text{True Negatives} + \text{False Negatives}}

(8)

As we had a limited number of UTCs in our dataset, and our masks had more background regions than target class regions, we calculated the recall, precision, F1-score, and overall accuracy per class to gain a meaningful quantitative assessment. We then averaged the metric scores across both classes to provide an overall assessment of each algorithm’s performance.
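The per-class calculation and averaging can be sketched from pixel-level confusion counts. In this minimal example, IoU is expressed in pixel terms as TP / (TP + FP + FN), which is equivalent to the intersection-over-union area ratio for binary masks; the function names and the example counts are hypothetical.

```python
def class_metrics(tp, fp, fn, tn):
    """Precision, recall, F1-score, IoU, and overall accuracy for one class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    iou = tp / (tp + fp + fn) if (tp + fp + fn) else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall,
            "f1": f1, "iou": iou, "accuracy": accuracy}


def two_class_average(tp, fp, fn, tn):
    """Average the metrics over the target (UTC) and background classes.

    For the background class the roles swap: its true positives are the
    target's true negatives, and its false positives are the target's
    false negatives.
    """
    target = class_metrics(tp, fp, fn, tn)
    background = class_metrics(tn, fn, fp, tp)
    return {name: (target[name] + background[name]) / 2 for name in target}


# Made-up pixel counts for a mask dominated by background.
utc = class_metrics(tp=80, fp=20, fn=20, tn=880)
averaged = two_class_average(tp=80, fp=20, fn=20, tn=880)
```

In this example the overall accuracy is high for both classes because background pixels dominate, whereas the UTC precision, recall, and IoU are noticeably lower, which is why per-class reporting is more informative under class imbalance.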

2.7. Manual Annotation Uncertainty Analysis

Data annotations are susceptible to bias and errors and can therefore strongly affect DL model performance. To assess the variability in annotation and the resulting uncertainty in evaluation metrics, we engaged nine individuals to annotate 30 new image patches, each containing at least one tree crown. All annotators had sufficient academic and research background in remote sensing, held at least a graduate-level education, and were proficient in geospatial software. They were instructed to follow the same protocol used during the generation of our initial training dataset. Furthermore, 49 UTC locations (in point form) were provided to guide them toward areas that should be annotated. However, they were not limited to these locations; they could annotate single or multiple UTCs within the designated areas and were also free to annotate other locations outside of the guided areas based on their own judgment. We conducted a comparative analysis to understand the extent of inter-annotator variability and its potential influence on accuracy metrics. We selected 49 polygons covering our guided location points as a reference for a fair comparison. Spatial analysis was then performed to evaluate the geometric congruence of annotators, allowing us to examine differences in annotations despite the use of a standardized image, object, and protocol. First, we analyzed the annotator consensus on unhealthy pixels for the selected UTC polygons. To capture the maximum extent of UTC outline judgment across annotators, we calculated the union area of all manually outlined UTCs. To identify the minimum consensus area, we calculated the intersection of all annotated polygons.

We found that annotators produced one-to-one relationships (Figure 10a) and, in some instances, one-to-many relationships (Figure 10b). Figure 10 illustrates representative annotation scenarios collected from the annotators. For one-to-many relationships, we selected the polygon with the largest crown area for further analysis; for instance, in Figure 10b, Annotator 1 (A₁) annotated the location as two separate UTCs, and we selected the larger polygon (Figure 10b, polygon 2A₁) for calculating the intersection polygon.

To quantify how the evaluation metrics change with varying annotations, we calculated an improvement percentage metric (Equation (9)). The metric expresses, as a percentage, the relative difference between a metric obtained with a given reference annotation and the same metric obtained with the baseline annotation. Due to the lack of field-based measurements (e.g., GPS observations) of UTC extent, we used a median polygon as the baseline or ground truth against the intersection and the union to calculate the improvement percentage using the equation below:

\text{Improvement}\,\% = \frac{M_{x} - M_{md}}{M_{md}} \times 100

(9)

where M_md is the median metric (baseline) and M_x is either the intersection or union metric.

3. Results

3.1. Evaluation Metrics

We evaluated candidate DL model performances under different band combinations. Table 3 reports the test metrics as average and per-class F1-score, recall, precision, IoU, and overall accuracy. Because UTC pixels make up only a small percentage of the dataset, the averaged metrics are heavily influenced by the dominant background class. To address this, Figure 11 provides a more meaningful comparison of test accuracy metrics across different input band combinations and candidate models, focusing on the UTC class.

3.1.1. CNN Models

The DeepLab v3+ model consistently outperforms the U-Net model across all band combinations. The U-Net exhibited its best performance for the VI band combination with an F1-score of 0.49, while its lowest performance came from the PCA band combination with an F1-score of 0.03. Notably, U-Net’s recall stayed below 0.50 for all combinations except VI (0.56), indicating that it had difficulty identifying true positive UTC pixels. In contrast, DeepLab v3+ reported superior results, with F1-scores ranging from 0.38 (PCA) to 0.65 (RGB), an improvement of more than 20% over the U-Net model for the same band combinations. The RGB combination exhibited the highest F1-score (0.65), supported by a recall of 0.71 and a precision of 0.60, showcasing DeepLab v3+’s ability to balance precision and recall effectively. The DeepLab v3+ model was also weakest on the PCA dataset, though its F1-score of 0.38 still surpassed U-Net’s 0.03. These results highlight the consistent advantage of DeepLab v3+ over the U-Net in terms of inference accuracy, particularly with the RGB and CIR band combinations, making it the more reliable CNN-based model for accurate unhealthy tree crown segmentation.

3.1.2. ViT-Based Model

Our evaluation demonstrates that the SegFormer architecture exhibits superior performance compared to conventional CNN architectures. As shown in Figure 11, SegFormer consistently achieves F1-scores exceeding 0.60 across most band combinations, apart from the PCA combination (F1-score of 0.38). This suggests the suitability of SegFormer for UTC mapping, particularly in dense forests with mixed vegetation environments. Notably, the model attains the highest F1-score of 0.70 for the RGB band combination, indicating a well-balanced performance in terms of both accurate UTC pixel identification and minimization of false positives when using natural color images. Its precision of 0.62 indicates that roughly six in ten pixels predicted as UTC are true positives, while the highest recall score of 0.80 suggests that it detects most of the actual UTC pixels.

In addition to F1-Score, SegFormer consistently achieved superior IoU values across all input bands, peaking at 0.54 on the RGB band combination. It maintains IoU values of 0.48 (CIR), 0.51 (VI), 0.24 (PCA), and 0.45 (RGBPC), showing stronger results even on lower-performing bands, such as PCA and RGBPC, indicating reliable segmentation performance. Furthermore, SegFormer’s average precision, recall, and F1-score are highest among its counterparts. This suggests that SegFormer not only identifies UTC pixels effectively but also minimizes misclassifications across the entire image scene. The high performance of the SegFormer architecture across multiple metrics demonstrates its robustness, generalizability, and clear advantage over traditional CNN architectures in UTC extraction tasks.
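When the F1-score and IoU of a single class are computed from the same pixel counts, they are tied by a fixed relation: F1 = 2·IoU / (1 + IoU), or equivalently IoU = F1 / (2 − F1). The short sketch below (helper names are hypothetical) shows that the reported UTC-class values are approximately consistent with this relation, and that IoU is always the lower of the two for an imperfect prediction. The relation need not hold exactly for scores averaged over classes or over image patches.

```python
def f1_from_iou(iou):
    """F1-score implied by an IoU computed from the same TP, FP, and FN counts."""
    return 2 * iou / (1 + iou)

def iou_from_f1(f1):
    """IoU implied by an F1-score computed from the same TP, FP, and FN counts."""
    return f1 / (2 - f1)

# IoU values quoted in the text for SegFormer on the RGB and PCA combinations.
for name, iou in [("RGB", 0.54), ("PCA", 0.24)]:
    print(name, round(f1_from_iou(iou), 3))
```

An IoU of 0.54 corresponds to an F1-score of about 0.70, so a moderate IoU next to a visibly higher F1-score is expected from the definitions alone.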

The evaluation of U-Net, DeepLabv3+, and SegFormer across five band combinations (RGB, CIR, VI, PCA, and RGBPC) revealed distinct trends. RGB emerged as the top-performing band for all models, with SegFormer achieving the highest IoU and F1-Score. CIR also performed well, particularly for DeepLabv3+ and SegFormer, though it still trailed RGB. SegFormer benefited the most from VI bands, while the CNN-based models, U-Net and DeepLabv3+, exhibited a decline in performance. PCA resulted in the lowest IoU across all models, especially for U-Net, which struggled significantly with this combination. However, RGBPC improved SegFormer’s performance, highlighting its ability to leverage additional features from the PCA data, whereas U-Net and DeepLabv3+ showed limited gains.

3.2. Visual Quality Assessment

We corroborated quantitative assessment with extensive visual inspections. Figure 12 shows a test image patch with a complex scene that has seasonal color (leaf senescence) on tree crowns and denser UTCs. The visualization column has input image patches, ground truth annotations, and model detection for the unseen test image patches. The visual inspection results showcase the capabilities of models to accurately detect UTCs in diverse environments. Most importantly, the differentiation between the senescent-colored tree crowns and UTCs was achieved in an RGB band combination by our optimal model, i.e., SegFormer (Figure 7). As seen in Figure 12, the U-Net model struggled to produce crisp boundaries around the detected objects. The outline is fuzzy and pixelated (see yellow arrows) in all the channel combinations. Among the channel combinations, the PCA and RGBPC combinations yielded the least accurate detection mask for the U-Net model, as supported by lower evaluation metrics in Figure 11. For DeepLabv3+, detections on the PCA channel showed under-segmentation, while SegFormer was over-segmented and failed to capture many UTC areas in this channel. DeepLabv3+ performed reliably on other channels but tended to have less precise UTC boundaries, suggesting it could detect the overall shapes but struggled with UTC outline accuracy. SegFormer, however, demonstrated relatively improved boundary definition (see blue circle in the SegFormer column), offering more refined edges and accurate segmentation in complex scenes.

Figure 13 illustrates the training and validation curves for different candidate architectures across all channel combinations. DeepLabv3+ performs well on each channel combination, although its stability varies with the complexity of the input channels. For PCA, however, a noticeable divergence between the training and validation losses later in training indicates that DeepLabv3+ tends to overfit with this channel combination. Similarly, U-Net has higher initial losses, as well as greater fluctuations and divergence of validation loss from training loss, suggesting a struggle with generalization, perhaps due to overfitting or instability when complex or noisy data are provided. However, CIR channels lead to faster and smoother convergence for U-Net, suggesting that this combination is more suitable for this model. SegFormer has the smoothest and most stable loss curves across all channels, showing its strong learning efficiency and robust generalization. Its validation loss stays close to the training loss, indicating minimal overfitting. This suggests that SegFormer is highly adaptable to various spectral combinations, and it is likely the most robust model with respect to different input channels.

We performed a visual comparison for one of the image patches to understand the inference capabilities and the detected mask area discrepancy of our models. In each image patch in Figure 14, the yellow outline represents the ground truth boundary of the object, while the red outline shows the model-inferred boundary for each channel combination. DeepLabv3+ showed better alignment with the ground truth in certain channel combinations, such as RGB and RGBPC; however, it completely missed some of the smaller UTCs. The U-Net model displayed a “breadcrumb” effect in RGB and had no detection for PCA and RGBPC channels. SegFormer shows relatively better alignment with the ground truth in most cases, with smoother boundaries that closely follow the yellow outlines, particularly in RGB and RGBPC channels. However, in some patches, SegFormer’s detections also slightly miss the exact boundary, either under-detecting the crown or failing to capture finer details.

Furthermore, we examined the one-to-one relationship between overlapping masks to understand the nature of the IoU score of our detection results. First, we vectorized the detections and selected only those detected polygons that intersect ground truth data (n = 298). Then, we evaluated the distribution of detected and ground truth polygons by area (see Figure 15a). Detected polygons were larger on average than ground truth polygons, with mean areas of 66.83 and 50.35 square meters, respectively. There were also a few instances where the detected polygons were smaller than the ground truths (see Figure 15c). The overall mean IoU for the detected polygons was 0.52 (Figure 15b). Since these results suggested that UTC boundary shape and size are among the probable causes affecting mean IoU, we further investigated whether the root cause is the uncertainty arising from hand annotations. Where to draw the edge of the UTC during manual annotation is largely a judgment call. Thus, it is necessary to account for the uncertainty from annotation and examine how model inference and the evaluation metrics are impacted.

3.3. Annotation Uncertainty Analysis

Even though we had a relatively high F1-score, the IoU score seemed low. Our uncertainty analysis revealed substantial variability in annotations in both object shape and size (see Figure 16). Across 30 image patches, the number of annotated polygons ranged from 50 to 141. Interestingly, the annotator with the highest polygon count had the second lowest total annotated pixel count in the dataset. Manual annotations of a single UTC area showed considerable variation; for instance, Annotators #1 and #2 reported similar total annotated areas and pixel counts yet differed markedly in the number of individual annotations. Annotator #2 demonstrated a meticulous approach to delineating boundaries, aiming to separate individual tree crowns, and thus identified numerous small polygons as dead tree crowns, resulting in a higher number of segments than those of Annotator #3.

Similarly, we created an intersection, median, and union polygon for each of the guided locations for all the annotators (Figure 17). The shade of the blue color shows the areas of agreement among annotators, with darker shades representing a higher level of consensus for those areas. The visual representation provides us with a clear understanding of variation across annotators. A key observation is the high level of spatial uncertainty in crown boundary delineation. While most annotators showed considerable overlap, there were notable discrepancies for complex or ambiguous features. For example, in ID 9 (Figure 17), one annotator interpreted the UTC as a large tree, resulting in a much larger annotated area compared to the others.

Furthermore, to understand the variability in IoU score, we generated an annotator-to-annotator matrix of IoU (see Figure 18). The IoU scores among the annotators varied from 0.37 to 0.70, which suggests that, despite the common annotation protocol provided in the experiment, interpretations of image features can differ dramatically among annotators. The annotator-to-annotator IoU matrix underscores the challenges posed by manual annotation, especially for complex tasks like delineating tree crowns, in which a given crown boundary does not provide a crisp edge to follow. Without strong visual cues along the crown edge, the annotator can be misled. Such instances represent fiat object boundaries as opposed to bona fide boundaries [85], requiring the annotator to employ best approximations based on limited visual cues. The variability observed indicates that, although annotation protocols can offer helpful guidance, subjective interpretation is inevitable, which can ultimately influence model performance.
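An annotator-to-annotator IoU matrix of this kind can be sketched as below. The example assumes each annotator's polygons have been rasterized to a boolean mask on a common grid, which is a simplification of the polygon-based analysis in the study. The function names and the toy masks are illustrative only.

```python
import numpy as np

def mask_iou(a, b):
    """IoU of two boolean masks; two empty masks count as full agreement (assumption)."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0
    return float(np.logical_and(a, b).sum() / union)

def pairwise_iou_matrix(masks):
    """Symmetric matrix whose entry (i, j) is the IoU between annotators i and j."""
    n = len(masks)
    matrix = np.eye(n)  # each annotator agrees perfectly with themselves
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i, j] = matrix[j, i] = mask_iou(masks[i], masks[j])
    return matrix

def crown(r0, r1, c0, c1, size=8):
    """Toy rectangular 'crown' on a size x size grid."""
    m = np.zeros((size, size), dtype=bool)
    m[r0:r1, c0:c1] = True
    return m

# Three hypothetical annotators outlining the same crown slightly differently.
annotations = [crown(2, 6, 2, 6), crown(2, 6, 3, 7), crown(3, 7, 3, 7)]
print(np.round(pairwise_iou_matrix(annotations), 2))
```

Off-diagonal values well below 1 for outlines that differ by a single pixel of offset illustrate how sensitive IoU is to where the crown edge is drawn.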

Additionally, to understand the range of differences in evaluation metrics, we ran our SegFormer-RGB model on the dataset provided to the annotators to detect UTCs. The detected masks were evaluated against three reference polygon annotations: (1) intersection (maximum consensus area of all annotators), (2) union (maximum extent of all annotators), and (3) median (where most annotators reached a consensus regarding the presence of UTC pixels). The results indicate that the intersection of annotations achieved the highest recall (0.95); however, it exhibited low precision (0.23), F1-score (0.37), and IoU (0.23). In contrast, the union of annotations demonstrated high precision (0.73) alongside a mediocre recall (0.57), F1-score (0.64), and IoU (0.46). Notably, the median polygon annotations yielded the most favorable evaluation metrics, achieving high recall (0.90) and F1-score (0.70) with moderate precision (0.57), and the highest IoU (0.54) among the three reference annotation types (see Table 4).
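The three reference annotations can be illustrated with a small raster sketch. It assumes boolean masks on a shared grid rather than polygons, and it takes the median as a strict-majority vote among annotators. The majority rule is an assumption, since the exact construction of the median polygon is not spelled out here, and all names are hypothetical.

```python
import numpy as np

def consensus_masks(annotator_masks):
    """Return (intersection, median, union) masks from a list of boolean masks."""
    stack = np.stack([np.asarray(m, dtype=bool) for m in annotator_masks])
    votes = stack.sum(axis=0)    # number of annotators marking each pixel
    n = stack.shape[0]
    intersection = votes == n    # marked by every annotator
    median = votes > n / 2       # assumption: marked by a strict majority
    union = votes >= 1           # marked by at least one annotator
    return intersection, median, union

def precision_recall(pred, ref):
    """Pixel-wise precision and recall of a predicted mask against a reference."""
    pred = np.asarray(pred, dtype=bool)
    ref = np.asarray(ref, dtype=bool)
    tp = np.logical_and(pred, ref).sum()
    precision = float(tp / pred.sum()) if pred.sum() else 0.0
    recall = float(tp / ref.sum()) if ref.sum() else 0.0
    return precision, recall

def square(lo, hi, size=6):
    """Toy square crown covering rows and columns lo..hi-1."""
    m = np.zeros((size, size), dtype=bool)
    m[lo:hi, lo:hi] = True
    return m

# Three hypothetical annotators: generous, moderate, and conservative outlines.
masks = [square(1, 5), square(2, 5), square(2, 4)]
intersection, median, union = consensus_masks(masks)
prediction = square(2, 5)  # a detection matching the moderate outline

for name, ref in [("intersection", intersection), ("median", median), ("union", union)]:
    print(name, precision_recall(prediction, ref))
```

In this toy case the same detection scores perfect recall but low precision against the intersection, and perfect precision but reduced recall against the union, which is the same qualitative pattern as the reported results.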

Lastly, to examine how model accuracy changes under different annotation strategies, we calculated the relative change in IoU and F1-score for the intersection and union polygons against the median polygons using Equation (9). The results show that using an intersection polygon reduced the IoU score by 57.41% and the F1-score by 47.14%. In contrast, using a union polygon decreased the IoU and F1-score by 14.81% and 8.57%, respectively.
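As a worked check of Equation (9), the sketch below applies the formula to the IoU and F1-score values quoted above for the intersection, union, and median references. The function name is illustrative. Negative values denote a decrease relative to the median baseline.

```python
def improvement_percent(m_x, m_md):
    """Equation (9): relative change of metric m_x against the median baseline m_md."""
    return (m_x - m_md) / m_md * 100.0

# Metric values as quoted in the text (median is the baseline).
median = {"iou": 0.54, "f1": 0.70}
references = {
    "intersection": {"iou": 0.23, "f1": 0.37},
    "union": {"iou": 0.46, "f1": 0.64},
}

for name, metrics in references.items():
    for key, value in metrics.items():
        change = improvement_percent(value, median[key])
        print(f"{name:12s} {key:3s} {change:7.2f}%")
```

The computed changes are -57.41% and -47.14% for the intersection and -14.81% and -8.57% for the union, matching the magnitudes reported in the text.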

4. Discussion

Our study compared the performance of CNN-based and ViT-based algorithms in detecting UTCs, exploring various multiband channel combinations including textural characteristics and vegetation indices. While Vision Transformer-based models (such as SegFormer) have been rarely evaluated for UTC detection compared to traditional machine learning techniques (e.g., Random Forest, clustering techniques) and CNN-based approaches, our systematic assessment highlights their superiority in capturing canopy-scale spatial patterns. By assessing these models, we evaluated the effectiveness of publicly available aerial imagery and transformer models for detecting UTCs in large, complex environments to aid forest health monitoring. Additionally, we quantified inter-annotator variability in annotation processes, highlighting the critical need for standardized protocols to minimize label noise, a finding consistent with [86], which identified label inconsistency as a key limitation in model reliability.

Our findings revealed significant differences in how each model interprets the boundaries and spatial distributions of UTCs. This study contributes to advancing research in deep-learning-based UTC detection, highlighting the potential of transformer models over CNNs in capturing complex spatial dependencies within large-scale forested environments. Additionally, the annotation uncertainty analysis emphasized the need for standardized guidelines and quality control to enhance labeling consistency.

4.1. Performance of DL Models in UTC Segmentation

Our CNN models (U-Net and DeepLab v3+) struggled to segment UTCs accurately. The F1-scores ranged from 0.06 to 0.41 for U-Net and 0.23 to 0.67 for DeepLab v3+, depending on the input image channel used. This is lower than the results reported in [87], where the authors achieved an F1-score of 0.87 using Mask-RCNN for dead tree instance segmentation. There are a few possible reasons for this discrepancy. They had higher-resolution low-altitude drone images (20 cm) compared to our 60 cm aerial images. Our dataset contained senescent-colored foliage images, which might have introduced more noise compared to their dataset. While [87] focused on instance segmentation (identifying individual objects), we aimed for semantic segmentation (classifying each pixel as a UTC or not). This difference in task complexity could also play a role. However, Tao et al. [88] reported similar results when using two CNN-based models (overall accuracy of 37% for AlexNet and 55% for GoogLeNet CNN models) for a region that has mixed artificial forest consisting of dead pine trees (DPTs), non-DPTs, and red broadleaf trees in RGB images. Tao et al. [88] claimed that confusion from color and texture features between dead pine trees and red broadleaf trees resulted in low detection accuracy. A limitation of CNN-based models is the loss of resolution when processing large images. This occurs because they downsample features to capture context, sacrificing detail around object edges. Captured from a top–down perspective, aerial images lack crucial information about tree shape and vertical structure. Furthermore, the mix of dead and dying trees, shade, dense vegetation, and senescent-colored trees adds extra semantic complexity to the segmentation task.

In Figure 12, the U-Net model exhibited a ‘salt and pepper’ effect (highlighted by yellow arrows) across all channel combinations. This might be due to channel combinations providing less distinct spectral information for differentiating target objects, leading the U-Net model to rely more heavily on minor variations and less relevant features, ultimately resulting in less coherent detections. Unlike transformer-based models such as SegFormer, which can capture global dependencies within the scene, U-Net is more prone to overfitting localized features, especially in diverse environments with significant color variation (such as leaf senescence) and dense UTCs. DeepLabv3+ performed more reliably than U-Net across the channels but tended to have less precise UTC boundaries, suggesting it could detect the overall shapes but struggled with boundary extent accuracy. SegFormer, however, demonstrated relatively improved boundary definition (see Figure 12, blue circle in the SegFormer column), offering more refined edges and accurate segmentation in complex scenes. While all models achieved high overall accuracy (OA), the lower precision and recall for UTCs may be attributed to class imbalance, the visual complexity of UTCs, and boundary variability in annotations. Despite using a weighted loss function, the dominance of background (BG) pixels led to models prioritizing BG classes. Additionally, variability in canopy stress responses and occluding tree crowns may have introduced segmentation challenges, leading to over- or under-segmentation. Furthermore, differences in manual annotation strictness versus model predictions may have caused boundary extensions, increased false positives, and reduced precision.

While ViT models have not been extensively studied in forest health monitoring, existing research comparing the classification accuracies of CNNs and ViT on the ImageNet dataset suggests their potential for high performance [46]. SegFormer, the ViT-based model in our study, outperformed the CNNs, similar to the studies conducted in forestry applications [89]. He et al. [89] used HSI-BERT, a Transformer-based model, to classify hyperspectral images and found that it outperformed state-of-the-art CNNs. The performance of SegFormer compared to its counterpart CNNs used in our study could be attributed to the Transformer’s ability to capture long-range relationships between pixels [90]. In UTC segmentation, such relationships can be crucial for tasks such as analyzing color variations across the entire crown, understanding the relationship between a tree and its surroundings, and contrasting UTCs to surrounding healthy canopies. While the specific reasons for this advantage require further investigation, SegFormer’s Transformer-based architecture might be particularly well-suited for analyzing high-resolution aerial imagery with complex backgrounds, potentially leading to more accurate identification of unhealthy trees. By efficiently utilizing available data and capturing relevant features from the input images, SegFormer can generalize better to unseen data and achieve superior performance in challenging detection tasks involving complex environmental features, such as mixed tree types and varying health conditions. The attention mechanisms in Transformer models allow them to selectively focus on relevant regions of the input image while processing and preserving fine details, such as object edges and texture information, which were crucial for accurate segmentation tasks in our study.
By attending to informative regions of the image, SegFormer can effectively differentiate between different tree shapes, identify subtle cues indicative of dead or dying trees, and differentiate them from senescent-colored tree crowns in images. Shahid et al. [90] came to a similar conclusion when it was used to segment forest fires with high-resolution aerial images.

The distinct patterns observed in the training and validation loss curves of DeepLabV3+, U-Net, and SegFormer illustrated the substantial influence of model architecture and input features on learning dynamics and generalization (Figure 13). For DeepLabV3+ and U-Net, fluctuating validation loss curves indicate overfitting, particularly as feature set dimensionality increases, which introduces high variance. The oscillatory nature of validation loss in the PCA channel combination indicated poor generalization capability, probably due to the limited data or noisy features. In contrast, simpler feature sets, such as CIR or RGB, resulted in more stable validation curves, suggesting a better alignment between model capacity and data variance.

SegFormer incorporates a transformer architecture with lower inductive bias compared to CNNs [91]. It showed smoother training and validation curves across all feature sets, confirming that it captured global dependencies more effectively. It demonstrated reduced variance compared to CNN-based approaches and maintained generalization without overfitting. Overall, these findings supported the notion that Transformer-based models are better for complex spatial patterns and less susceptible to overfitting than CNNs.

4.2. Performance of Band Combination in UTC Segmentation

We evaluated five different combinations of spectral data (band combinations) to train our models for identifying unhealthy tree crowns (UTCs) in aerial images. Surprisingly, for all candidate models, none of these combinations, including those using specialized VIs, performed better than using red, green, and blue channels (RGB). This aligns with other research studies, such as [92], which found that using only RGB data achieved the best results for detecting tree damage in aerial imagery. Similarly, [93] also supported our results, reporting that pre-trained models fine-tuned with RGB + NIR data underperformed compared to those using only RGB data. This performance difference is likely due to pre-trained architectures being typically trained on RGB data exclusively. Due to limited training data, we used a transfer learning strategy in which the backbone was kept frozen and only the segmentation head was trained. The feature extraction backbone was trained on standard RGB channels. Introducing spectral indices might not effectively utilize the full strength of the feature extraction backbone due to differences in data distributions. Training the encoder with a large set of VIs and texture indices might be one way to capitalize on RGB channels. Also, the combination of red, green, and blue channels of natural color images offered an extensive feature space for the model. This enabled the model to learn intricate connections between the pixel values and characteristics of UTCs, thereby enhancing its ability to differentiate between UTC and healthy vegetation, as well as other background elements present within the scene.

While textural information from RGB images and vegetation indices (VIs) were explored for identifying unhealthy trees, neither approach yielded optimal results. False-color composites, which typically highlight unhealthy vegetation in gray, were ineffective likely due to misinterpreting noise from dead branches under healthy foliage. VIs, designed to isolate vegetation health signals mathematically, might be overly sensitive to seasonal variations in leaf color, leading to misclassifications. Textural information, particularly the smoother textures of leafless areas on dead trees, could be beneficial. However, our attempt to emphasize texture through principal component analysis (PCA) of Haralick textures proved ineffective, potentially due to information loss during dimensionality reduction. As suggested by [94], deep learning models might be better suited to directly extract information from RGB imagery, potentially eliminating the need for pre-processing steps, such as VI calculation and PCA. However, our pre-trained model had encoders trained on the RGB channel. If we trained the encoder on VI, PCA, and texture channels, different results might be expected.

4.3. Uncertainty and Limitations

Our study faced several challenges due to the complexity of the forest environment and data characteristics. The dense canopy cover with significant crown occlusion made it difficult to distinguish individual trees, particularly between healthy and unhealthy ones with similar spectral signatures. Acquiring data during early fall introduced additional noise, as the spectral reflectance of unhealthy trees might be similar to that of senescent-colored foliage, leading to segmentation errors. Manual annotation in such dense and visually complex scenes was time-consuming and susceptible to human bias, potentially impacting the accuracy of training data. Subtle variations in how different analysts interpreted and delineated UTC boundaries introduced inconsistencies in the training data, and the model struggled to generalize to unseen data. This led to reduced accuracy on data annotated by different analysts or in different forest conditions. The chosen annotation scale (1:300), while ensuring consistent capture of sharp edges, might not capture finer details crucial for precise segmentation in occluded areas. Furthermore, a 60 cm spatial resolution might be insufficient to resolve details within the dense canopy, potentially leading to spectral mixing and reduced accuracy in differentiating UTCs, and a higher resolution might be needed, as emphasized in [95]. Additionally, even though the models, such as SegFormer-RGB, were successful in covering UTC locations, they exhibited uncertainty in crown boundaries. The bluntness or overestimation of boundary extent was seen in the segmented image (Figure 12). Our best-performing SegFormer-RGB model suggested that smaller ground truth UTCs tended to have more accurate and comparable detection, while larger UTCs were more prone to overestimation (see Figure 15). Although the dataset contained a smaller number of large UTCs, the over-prediction of larger UTC areas contributed to low IoU scores (Figure 15).
Such a performance discrepancy might be attributed to the fact that our annotation process primarily focused on delineating individual crowns, with less emphasis on accommodating noise (probably unhealthy branches inside the shadows) between annotated adjacent crowns. Consequently, this led to instances of under-segmentation during the detection phase.

Our extended investigation of annotation uncertainty led to a conclusion similar to that of earlier studies [96,97]. The annotators had limited consensus when judging UTC pixels (Figure 18), and their disagreement was concentrated along the edges of the UTCs (Figure 17b,c). This might be due to differences in domain expertise and the level of precision needed in their training data, as discussed in [71]. Additionally, bias introduced in regions with occluded and merged UTCs may have influenced the results: some annotators adopted a more conservative approach while others were less restrictive, leading to discrepancies in the total number of UTC annotations. Such variability in annotation style likely contributed to the differences in observed UTC sizes and counts (Figure 16). Furthermore, the limitations of NAIP imagery in resolving fine-scale tree structures became evident in densely forested areas, where overlapping canopies obscured individual tree crowns. The inability to clearly distinguish between adjacent trees in such regions introduces a source of uncertainty in the dataset.

The performance of the SegFormer-RGB model in detecting UTCs highlighted the critical role of the annotation strategy in evaluating model accuracy. Our analysis compared three annotation references (intersection, union, and median) as ground truth, revealing substantial variability in performance metrics. With the intersection reference, which retains only strict-consensus areas, precision was lower because the model also detected UTC regions that only some annotators had marked. Conversely, while the union reference improved precision by considering all marked areas, it compromised recall, because the model missed some low-consensus UTC regions. The median results indicated that prioritizing high-consensus regions yielded the most reliable model performance, aligning detection with broader annotator agreement. The decrease in the IoU and F1-score for both intersection and union (Table 4) showed that annotation uncertainty can affect model evaluation and underscores the importance of reaching consensus or using confidence-weighted annotations in future assessments.
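To illustrate how the choice of reference can shift the metrics, the following minimal Python sketch builds intersection, union, and majority-vote references from a stack of binary annotator masks and scores one prediction against each. The function names, the toy arrays, and the per-pixel majority vote used as a stand-in for the median polygon are assumptions made for illustration; this is not the implementation used in the study.

```python
import numpy as np


def consensus_masks(stack):
    """Build reference masks from annotator masks.

    stack: array of shape (n_annotators, H, W); nonzero means UTC.
    The 'median' entry is a per-pixel strict majority vote, used here
    as an assumed approximation of a median polygon.
    """
    stack = np.asarray(stack, dtype=bool)
    n = stack.shape[0]
    votes = stack.sum(axis=0)  # number of annotators marking each pixel
    return {
        "intersection": votes == n,   # every annotator agrees
        "union": votes >= 1,          # at least one annotator
        "median": votes * 2 > n,      # more than half of the annotators
    }


def iou_f1(pred, ref):
    """Return (IoU, F1) of a binary prediction against a binary reference."""
    pred = np.asarray(pred, dtype=bool)
    ref = np.asarray(ref, dtype=bool)
    tp = np.logical_and(pred, ref).sum()
    fp = np.logical_and(pred, ~ref).sum()
    fn = np.logical_and(~pred, ref).sum()
    denom_iou = tp + fp + fn
    denom_f1 = 2 * tp + fp + fn
    # Two empty masks are treated as perfect agreement.
    iou = tp / denom_iou if denom_iou else 1.0
    f1 = 2 * tp / denom_f1 if denom_f1 else 1.0
    return float(iou), float(f1)


# Toy example: three annotators on a 1 x 6 pixel strip.
annotators = np.array([
    [[0, 1, 1, 1, 0, 0]],
    [[0, 0, 1, 1, 1, 0]],
    [[0, 1, 1, 1, 0, 0]],
])
prediction = np.array([[0, 1, 1, 1, 0, 0]])

for name, ref in consensus_masks(annotators).items():
    iou, f1 = iou_f1(prediction, ref)
    print(f"{name:12s} IoU={iou:.2f} F1={f1:.2f}")
```

On this toy strip, the same prediction scores an IoU of 0.67 against the intersection reference (one false positive), 0.75 against the union reference (one false negative), and 1.00 against the majority-vote reference, which mirrors the precision and recall trade-off described above.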

Several strategies could be employed to mitigate annotation uncertainties and limitations. Although we provided a well-written protocol, the annotations still varied; incorporating visual examples of annotations could help improve consistency and reduce this variability [71]. Considering the limitations of our study, particularly the potential for spectral mixing due to the chosen resolution, incorporating hyperspectral information in future research could offer a more comprehensive analysis of spectral variations across different tree species. In addition, incorporating a lidar-derived canopy height model could reduce uncertainty about boundary extent during the annotation process and guide the model in learning and inferring tree canopy edges. Data augmentation techniques can be used to address seasonal variations and introduce controlled variability into the training data. Similarly, adopting robust, consensus-based annotation strategies might help improve the reliability of performance metrics in machine learning applications. Future research should focus on refining the annotation process, potentially through enhanced annotator training or semi-automated tools, to further improve consistency and applicability in ecological contexts.
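One simple way to monitor annotation consistency is to compute pairwise IoU between annotators, together with a per-pixel agreement fraction that could serve as a confidence weight. The sketch below assumes that each annotator's polygons have already been rasterized to a binary mask on a common grid; the function names and the toy data are hypothetical and are not taken from the study's workflow.

```python
import numpy as np


def pairwise_iou(stack):
    """Return an (n, n) matrix of IoU scores between annotator masks.

    stack: array of shape (n_annotators, H, W); nonzero means UTC.
    """
    stack = np.asarray(stack, dtype=bool)
    n = stack.shape[0]
    flat = stack.reshape(n, -1).astype(np.int64)
    inter = flat @ flat.T                    # shared pixel counts per pair
    area = flat.sum(axis=1)                  # pixels marked by each annotator
    union = area[:, None] + area[None, :] - inter
    with np.errstate(divide="ignore", invalid="ignore"):
        # Pairs of empty masks are treated as perfect agreement.
        return np.where(union > 0, inter / union, 1.0)


def agreement_fraction(stack):
    """Fraction of annotators marking each pixel, from 0.0 to 1.0."""
    return np.asarray(stack, dtype=bool).mean(axis=0)


# Toy example: three annotators on a 1 x 6 pixel strip.
annotators = np.array([
    [[0, 1, 1, 1, 0, 0]],
    [[0, 0, 1, 1, 1, 0]],
    [[0, 1, 1, 1, 0, 0]],
])
print(np.round(pairwise_iou(annotators), 2))
print(np.round(agreement_fraction(annotators), 2))
```

Low off-diagonal values in the matrix flag annotator pairs whose delineations diverge, and the agreement fraction could be used as a soft label or as a per-pixel weight when computing metrics, under the assumption that higher agreement indicates a more trustworthy label.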

4.4. Future Outlook

While this study provides valuable insights by comparing two CNN-based models and a ViT model, it is not comprehensive. Our objective was to investigate the potential of ViTs in forest monitoring applications; hence, we compared a relatively simple ViT against well-established CNNs. However, numerous other ViT models designed for segmentation tasks are now available [98] and can be applied in remote sensing. Future work should expand the comparative analysis to include a wider range of model architectures, as well as tasks such as instance segmentation, to rigorously validate and strengthen the current findings. Furthermore, we observed that poor evaluation metrics do not always mean that the model’s learning process fell short of expectations; the way the training data are constructed can also adversely affect the metrics. Hence, beyond analyzing the effect of annotation uncertainty on evaluation metrics, future work should investigate the impact of annotation discrepancies on model training and learning behavior. By examining how models respond to noisy or ambiguous annotations, we can develop strategies to enhance their robustness and accuracy.

5. Conclusions

Our study presents a comparative analysis of CNN-based and Vision Transformer-based architectures for tree health monitoring in complex forest scenes using publicly available aerial images. For this purpose, two established CNN architectures, U-Net and DeepLabv3+, and one Vision Transformer architecture, SegFormer, were compared across five different band combinations. Our results revealed that, among the CNN models, DeepLabv3+ consistently outperformed U-Net for UTC segmentation in all band combinations considered, and natural color bands (RGB) gave the best performance. However, SegFormer outperformed both CNN-based models in evaluation metrics such as F1-score, recall, and precision, by over 20% compared to DeepLabv3+ and by over 40% compared to U-Net.

The study findings underscore the suitability of Vision Transformers, particularly SegFormer, for forest health monitoring and show how transfer learning helps boost model performance even with small ground truth datasets. Notably, the RGB channel combination alone was highly effective for detecting UTCs. In addition, the analysis of annotation uncertainty highlights the variability in the labeling process, emphasizing the need for improved consistency and validation. The inter-annotator IoU analysis indicated notable differences in agreement, which reflected subjective decisions in boundary delineation. Such annotation discrepancies contribute to segmentation challenges in deep learning models by introducing uncertainties into the training data. This emphasizes the importance of standardized annotation protocols, especially for boundary selection, to improve model generalization and reliability. By addressing these uncertainties, the performance of segmentation and classification models can be enhanced. Overall, SegFormer achieved high accuracy in mapping UTCs within dense and complex forested areas. However, further investigations are needed to capture unhealthy individual crowns that are partly obscured by healthy trees, which is important for studying forest health. Future research should focus on (1) verifying the performance of SegFormer across different severities of crown decline; (2) improving methods for detecting single crowns, including those hidden by canopy layers; and (3) introducing quality control measures to improve consistency and reduce variability in annotation.

Author Contributions

Conceptualization, D.J. and C.W.; methodology, D.J.; software, D.J.; validation, D.J. and C.W.; formal analysis, D.J.; investigation, D.J.; resources, D.J.; data curation, D.J.; writing—original draft preparation, D.J.; writing—review and editing, C.W.; visualization, D.J.; supervision, C.W.; funding acquisition, C.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The NAIP image is freely available through the USGS Earth Explorer archive [https://earthexplorer.usgs.gov/, checked: 6 January 2025]. The raw data supporting the conclusions of this article will be made available by the authors on request.

Acknowledgments

We would like to thank the Eversource Energy Center at the University of Connecticut, USA for supporting this study.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Gauthier, S.; Bernier, P.; Kuuluvainen, T.; Shvidenko, A.Z.; Schepaschenko, D.G. Boreal Forest Health and Global Change. Science 2015, 349, 819–822.
2. Burgess, T.I.; Oliva, J.; Sapsford, S.J.; Sakalidis, M.L.; Balocchi, F.; Paap, T. Anthropogenic Disturbances and the Emergence of Native Diseases: A Threat to Forest Health. Curr. For. Rep. 2022, 8, 111–123.
3. Trumbore, S.; Brando, P.; Hartmann, H. Forest Health and Global Change. Science 2015, 349, 814–818.
4. Dale, V.H.; Joyce, L.A.; McNulty, S.; Neilson, R.P.; Ayres, M.P.; Flannigan, M.D.; Hanson, P.J.; Irland, L.C.; Lugo, A.E.; Peterson, C.J. Climate Change and Forest Disturbances: Climate Change can Affect Forests by Altering the Frequency, Intensity, Duration, and Timing of Fire, Drought, Introduced Species, Insect and Pathogen Outbreaks, Hurricanes, Windstorms, Ice Storms, or Landslides. Bioscience 2001, 51, 723–734.
5. Sapkota, I.P.; Tigabu, M.; Odén, P.C. Changes in Tree Species Diversity and Dominance Across a Disturbance Gradient in Nepalese Sal (Shorea robusta Gaertn. F.) Forests. J. For. Res. 2010, 21, 25–32.
6. Vilà-Cabrera, A.; Martínez-Vilalta, J.; Vayreda, J.; Retana, J. Structural and Climatic Determinants of Demographic Rates of Scots Pine Forests Across the Iberian Peninsula. Ecol. Appl. 2011, 21, 1162–1172.
7. Song, W.; Deng, X. Land-use/Land-Cover Change and Ecosystem Service Provision in China. Sci. Total Environ. 2017, 576, 705–719.
8. Jahn, W.; Urban, J.L.; Rein, G. Powerlines and Wildfires: Overview, Perspectives, and Climate Change: Could there be More Electricity Blackouts in the Future? IEEE Power Energy Mag. 2022, 20, 16–27.
9. USDA Forest Service. Forests of Connecticut, 2021; Resource Update FS-370; U.S. Department of Agriculture, Forest Service, Northern Research Station: Madison, WI, USA, 2022.
10. Han, Z.; Hu, W.; Peng, S.; Lin, H.; Zhang, J.; Zhou, J.; Wang, P.; Dian, Y. Detection of Standing Dead Trees After Pine Wilt Disease Outbreak with Airborne Remote Sensing Imagery by Multi-Scale Spatial Attention Deep Learning and Gaussian Kernel Approach. Remote Sens. 2022, 14, 3075.
11. Krzystek, P.; Serebryanyk, A.; Schnörr, C.; Červenka, J.; Heurich, M. Large-Scale Mapping of Tree Species and Dead Trees in Šumava National Park and Bavarian Forest National Park using Lidar and Multispectral Imagery. Remote Sens. 2020, 12, 661.
12. Kamińska, A.; Lisiewicz, M.; Stereńczak, K.; Kraszewski, B.; Sadkowski, R. Species-Related Single Dead Tree Detection using Multi-Temporal ALS Data and CIR Imagery. Remote Sens. Environ. 2018, 219, 31–43.
13. Yuan, X.; Shi, J.; Gu, L. A Review of Deep Learning Methods for Semantic Segmentation of Remote Sensing Imagery. Expert Syst. Appl. 2021, 169, 114417.
14. Li, Y.; Zhang, H.; Xue, X.; Jiang, Y.; Shen, Q. Deep Learning for Remote Sensing Image Classification: A Survey. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2018, 8, e1264.
15. Li, K.; Wan, G.; Cheng, G.; Meng, L.; Han, J. Object Detection in Optical Remote Sensing Images: A Survey and a New Benchmark. ISPRS J. Photogramm. Remote Sens. 2020, 159, 296–307.
16. Chen, K.; Fu, K.; Yan, M.; Gao, X.; Sun, X.; Wei, X. Semantic Segmentation of Aerial Images with Shuffling Convolutional Neural Networks. IEEE Geosci. Remote Sens. Lett. 2018, 15, 173–177.
17. Audebert, N.; Le Saux, B.; Lefèvre, S. Beyond RGB: Very High Resolution Urban Remote Sensing with Multimodal Deep Networks. ISPRS J. Photogramm. Remote Sens. 2018, 140, 20–32.
18. Weinstein, B.G.; Marconi, S.; Bohlman, S.; Zare, A.; White, E. Individual Tree-Crown Detection in RGB Imagery using Semi-Supervised Deep Learning Neural Networks. Remote Sens. 2019, 11, 1309.
19. Neupane, B.; Horanont, T.; Hung, N.D. Deep Learning Based Banana Plant Detection and Counting using High-Resolution Red-Green-Blue (RGB) Images Collected from Unmanned Aerial Vehicle (UAV). PLoS ONE 2019, 14, e0223906.
20. Li, W.; Fu, H.; Yu, L.; Cracknell, A. Deep Learning Based Oil Palm Tree Detection and Counting for High-Resolution Remote Sensing Images. Remote Sens. 2016, 9, 22.
21. Lee, M.; Cho, H.; Youm, S.; Kim, S. Detection of Pine Wilt Disease using Time Series UAV Imagery and Deep Learning Semantic Segmentation. Forests 2023, 14, 1576.
22. Ulku, I.; Akagündüz, E.; Ghamisi, P. Deep Semantic Segmentation of Trees using Multispectral Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 7589–7604.
23. Zhang, J.; Cong, S.; Zhang, G.; Ma, Y.; Zhang, Y.; Huang, J. Detecting Pest-Infested Forest Damage through Multispectral Satellite Imagery and Improved UNet. Sensors 2022, 22, 7440.
24. Jiang, S.; Yao, W.; Heurich, M. Dead Wood Detection Based on Semantic Segmentation of VHR Aerial CIR Imagery using Optimized FCN-Densenet. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2019, 42, 127–133.
25. Liu, D.; Jiang, Y.; Wang, R.; Lu, Y. Establishing a Citywide Street Tree Inventory with Street View Images and Computer Vision Techniques. Comput. Environ. Urban Syst. 2023, 100, 101924.
26. Schiefer, F.; Kattenborn, T.; Frick, A.; Frey, J.; Schall, P.; Koch, B.; Schmidtlein, S. Mapping Forest Tree Species in High Resolution UAV-Based RGB-Imagery by Means of Convolutional Neural Networks. ISPRS J. Photogramm. Remote Sens. 2020, 170, 205–215.
27. Joshi, D.; Witharana, C. Roadside Forest Modeling using Dashcam Videos and Convolutional Neural Nets. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2022, 46, 135–140.
28. Kattenborn, T.; Leitloff, J.; Schiefer, F.; Hinz, S. Review on Convolutional Neural Networks (CNN) in Vegetation Remote Sensing. ISPRS J. Photogramm. Remote Sens. 2021, 173, 24–49.
29. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149.
30. Long, J.; Shelhamer, E.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440.
31. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969.
32. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part III 18. pp. 234–241.
33. Chen, L.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818.
34. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
35. Lobo Torres, D.; Queiroz Feitosa, R.; Nigri Happ, P.; Elena Cué La Rosa, L.; Marcato Junior, J.; Martins, J.; Ola Bressan, P.; Gonçalves, W.N.; Liesenberg, V. Applying Fully Convolutional Architectures for Semantic Segmentation of a Single Tree Species in Urban Environment on High Resolution UAV Optical Imagery. Sensors 2020, 20, 563.
36. Diez, Y.; Kentsch, S.; Fukuda, M.; Caceres, M.L.L.; Moritake, K.; Cabezas, M. Deep Learning in Forestry using UAV-Acquired RGB Data: A Practical Review. Remote Sens. 2021, 13, 2837.
37. Brandt, M.; Tucker, C.J.; Kariryaa, A.; Rasmussen, K.; Abel, C.; Small, J.; Chave, J.; Rasmussen, L.V.; Hiernaux, P.; Diouf, A.A. An Unexpectedly Large Count of Trees in the West African Sahara and Sahel. Nature 2020, 587, 78–82.
38. Yang, J.; Kang, Z.; Cheng, S.; Yang, Z.; Akwensi, P.H. An Individual Tree Segmentation Method Based on Watershed Algorithm and Three-Dimensional Spatial Distribution Analysis from Airborne LiDAR Point Clouds. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 1055–1067.
39. Dersch, S.; Heurich, M.; Krueger, N.; Krzystek, P. Combining Graph-Cut Clustering with Object-Based Stem Detection for Tree Segmentation in Highly Dense Airborne Lidar Point Clouds. ISPRS J. Photogramm. Remote Sens. 2021, 172, 207–222.
40. Li, Q.; Yuan, P.; Liu, X.; Zhou, H. Street Tree Segmentation from Mobile Laser Scanning Data. Int. J. Remote Sens. 2020, 41, 7145–7162.
41. Solberg, S.; Naesset, E.; Bollandsas, O.M. Single Tree Segmentation using Airborne Laser Scanner Data in a Structurally Heterogeneous Spruce Forest. Photogramm. Eng. Remote Sens. 2006, 72, 1369–1378.
42. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929.
43. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all You Need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008.
44. Khan, S.; Naseer, M.; Hayat, M.; Zamir, S.W.; Khan, F.S.; Shah, M. Transformers in Vision: A Survey. ACM Comput. Surv. (CSUR) 2022, 54, 1–41.
45. Mekhalfi, M.L.; Nicolò, C.; Bazi, Y.; Al Rahhal, M.M.; Alsharif, N.A.; Al Maghayreh, E. Contrasting YOLOv5, Transformer, and EfficientDet Detectors for Crop Circle Detection in Desert. IEEE Geosci. Remote Sens. Lett. 2021, 19, 3003205.
46. Bazi, Y.; Bashmal, L.; Rahhal, M.M.A.; Dayil, R.A.; Ajlan, N.A. Vision Transformers for Remote Sensing Image Classification. Remote Sens. 2021, 13, 516.
47. Chen, H.; Qi, Z.; Shi, Z. Remote Sensing Image Change Detection with Transformers. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5900318.
48. Kaselimi, M.; Voulodimos, A.; Daskalopoulos, I.; Doulamis, N.; Doulamis, A. A Vision Transformer Model for Convolution-Free Multilabel Classification of Satellite Imagery in Deforestation Monitoring. IEEE Trans. Neural Netw. Learn. Syst. 2022, 34, 3299–3307.
49. Maurício, J.; Domingues, I.; Bernardino, J. Comparing Vision Transformers and Convolutional Neural Networks for Image Classification: A Literature Review. Appl. Sci. 2023, 13, 5521.
50. Caron, M.; Touvron, H.; Misra, I.; Jégou, H.; Mairal, J.; Bojanowski, P.; Joulin, A. Emerging Properties in Self-Supervised Vision Transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 9650–9660.
51. Li, Z.; Chen, G.; Zhang, T. A CNN-Transformer Hybrid Approach for Crop Classification using Multitemporal Multisensor Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 847–858.
52. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090.
53. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022.
54. Wu, Y.; Kirillov, A.; Massa, F.; Lo, W.; Girshick, R. Detectron2. 2019. Available online: https://github.com/facebookresearch/detectron2 (accessed on 30 December 2024).
55. Wang, W.; Xie, E.; Li, X.; Fan, D.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 568–578.
56. Cheng, B.; Misra, I.; Schwing, A.G.; Kirillov, A.; Girdhar, R. Masked-Attention Mask Transformer for Universal Image Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1290–1299.
57. Thisanke, H.; Deshan, C.; Chamith, K.; Seneviratne, S.; Vidanaarachchi, R.; Herath, D. Semantic Segmentation using Vision Transformers: A Survey. Eng. Appl. Artif. Intell. 2023, 126, 106669.
58. Gibril, M.B.A.; Shafri, H.Z.M.; Shanableh, A.; Al-Ruzouq, R.; bin Hashim, S.J.; Wayayok, A.; Sachit, M.S. Large-Scale Assessment of Date Palm Plantations Based on UAV Remote Sensing and Multiscale Vision Transformer. Remote Sens. Appl. Soc. Environ. 2024, 34, 101195.
59. Jiang, J.; Xiang, J.; Yan, E.; Song, Y.; Mo, D. Forest-CD: Forest Change Detection Network Based on VHR Images. IEEE Geosci. Remote Sens. Lett. 2022, 19, 2506005.
60. Ghali, R.; Akhloufi, M.A.; Jmal, M.; Souidene Mseddi, W.; Attia, R. Wildfire Segmentation using Deep Vision Transformers. Remote Sens. 2021, 13, 3527.
61. Bannari, A.; Morin, D.; Bonn, F.; Huete, A. A Review of Vegetation Indices. Remote Sens. Rev. 1995, 13, 95–120.
62. Huete, A.R. Vegetation Indices, Remote Sensing and Forest Monitoring. Geogr. Compass 2012, 6, 513–532.
63. Peddle, D.R.; Brunke, S.P.; Hall, F.G. A Comparison of Spectral Mixture Analysis and Ten Vegetation Indices for Estimating Boreal Forest Biophysical Information from Airborne Data. Can. J. Remote Sens. 2001, 27, 627–635.
64. Franklin, S.E.; Waring, R.H.; McCreight, R.W.; Cohen, W.B.; Fiorella, M. Aerial and Satellite Sensor Detection and Classification of Western Spruce Budworm Defoliation in a Subalpine Forest. Can. J. Remote Sens. 1995, 21, 299–308.
65. Coburn, C.A.; Roberts, A.C. A Multiscale Texture Analysis Procedure for Improved Forest Stand Classification. Int. J. Remote Sens. 2004, 25, 4287–4308.
66. Franklin, S.E.; Hall, R.J.; Moskal, L.M.; Maudie, A.J.; Lavigne, M.B. Incorporating Texture into Classification of Forest Species Composition from Airborne Multispectral Images. Int. J. Remote Sens. 2000, 21, 61–79.
67. Moskal, L.M.; Franklin, S.E. Relationship between Airborne Multispectral Image Texture and Aspen Defoliation. Int. J. Remote Sens. 2004, 25, 2701–2711.
68. Karpatne, A.; Atluri, G.; Faghmous, J.H.; Steinbach, M.; Banerjee, A.; Ganguly, A.; Shekhar, S.; Samatova, N.; Kumar, V. Theory-Guided Data Science: A New Paradigm for Scientific Discovery from Data. IEEE Trans. Knowl. Data Eng. 2017, 29, 2318–2331.
69. Nowak, S.; Rüger, S. How Reliable are Annotations Via Crowdsourcing: A Study about Inter-Annotator Agreement for Multi-Label Image Annotation. In Proceedings of the International Conference on Multimedia Information Retrieval, Philadelphia, PA, USA, 29–31 March 2010; pp. 557–566.
70. Vădineanu, Ş.; Pelt, D.M.; Dzyubachyk, O.; Batenburg, K.J. An Analysis of the Impact of Annotation Errors on the Accuracy of Deep Learning for Cell Segmentation. In Proceedings of the International Conference on Medical Imaging with Deep Learning, Zurich, Switzerland, 6–8 July 2022; pp. 1251–1267.
71. Rädsch, T.; Reinke, A.; Weru, V.; Tizabi, M.D.; Schreck, N.; Kavur, A.E.; Pekdemir, B.; Roß, T.; Kopp-Schneider, A.; Maier-Hein, L. Labelling Instructions Matter in Biomedical Image Analysis. Nat. Mach. Intell. 2023, 5, 273–283.
72. Butler, B.J. Forests of Connecticut, 2017; Resource Update FS-159; US Department of Agriculture, Forest Service, Northern Research Station: Newtown Square, PA, USA, 2018; Volume 159, 3p.
73. Elaina, H. Planting New Trees in the Wake of the Gypsy Moths. UConn Today 2019. Available online: https://today.uconn.edu/2019/07/planting-new-trees-wake-gypsy-moths/ (accessed on 24 December 2024).
74. USDA-FPAC-BC-APFO Aerial Photography Field Office. NAIP Digital Georectified Image. Available online: https://apfo-usdaonline.opendata.arcgis.com/ (accessed on 20 November 2024).
75. Tucker, C.J. Red and Photographic Infrared Linear Combinations for Monitoring Vegetation. Remote Sens. Environ. 1979, 8, 127–150.
76. Huete, A.; Didan, K.; Miura, T.; Rodriguez, E.P.; Gao, X.; Ferreira, L.G. Overview of the Radiometric and Biophysical Performance of the MODIS Vegetation Indices. Remote Sens. Environ. 2002, 83, 195–213.
77. Kaufman, Y.J.; Tanre, D. Atmospherically Resistant Vegetation Index (ARVI) for EOS-MODIS. IEEE Trans. Geosci. Remote Sens. 1992, 30, 261–270.
78. Greenacre, M.; Groenen, P.J.; Hastie, T.; d’Enza, A.I.; Markos, A.; Tuzhilina, E. Principal Component Analysis. Nat. Rev. Methods Primers 2022, 2, 100.
79. Haralick, R.M.; Shanmugam, K.; Dinstein, I.H. Textural Features for Image Classification. IEEE Trans. Syst. Man Cybern. 1973, SMC-3, 610–621.
80. Zheng, S.; Lu, J.; Zhao, H.; Zhu, X.; Luo, Z.; Wang, Y.; Fu, Y.; Feng, J.; Xiang, T.; Torr, P.H. Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 6881–6890.
81. Wagner, F.H.; Sanchez, A.; Tarabalka, Y.; Lotte, R.G.; Ferreira, M.P.; Aidar, M.P.; Gloor, E.; Phillips, O.L.; Aragao, L.E. Using the U-net Convolutional Network to Map Forest Types and Disturbance in the Atlantic Rainforest with very High Resolution Images. Remote Sens. Ecol. Conserv. 2019, 5, 360–375.
82. He, H.; Garcia, E.A. Learning from Imbalanced Data. IEEE Trans. Knowl. Data Eng. 2009, 21, 1263–1284.
83. Everingham, M.; Eslami, S.A.; Van Gool, L.; Williams, C.K.; Winn, J.; Zisserman, A. The Pascal Visual Object Classes Challenge: A Retrospective. Int. J. Comput. Vis. 2015, 111, 98–136.
84. Banko, G. A Review of Assessing the Accuracy of Classifications of Remotely Sensed Data and of Methods Including Remote Sensing Data in Forest Inventory; IIASA Interim Report; IIASA: Laxenburg, Austria, 1998; IR-98-081.
85. Smith, B.; Varzi, A.C. Fiat and Bona Fide Boundaries. Philos. Phenomenol. Res. 2000, 60, 401–420.
86. Karimi, D.; Dou, H.; Warfield, S.K.; Gholipour, A. Deep Learning with Noisy Labels: Exploring Techniques and Remedies in Medical Image Analysis. Med. Image Anal. 2020, 65, 101759.
87. Sani-Mohammed, A.; Yao, W.; Heurich, M. Instance Segmentation of Standing Dead Trees in Dense Forest from Aerial Imagery using Deep Learning. ISPRS Open J. Photogramm. Remote Sens. 2022, 6, 100024.
88. Tao, H.; Li, C.; Zhao, D.; Deng, S.; Hu, H.; Xu, X.; Jing, W. Deep Learning-Based Dead Pine Tree Detection from Unmanned Aerial Vehicle Images. Int. J. Remote Sens. 2020, 41, 8238–8255.
89. He, J.; Zhao, L.; Yang, H.; Zhang, M.; Li, W. HSI-BERT: Hyperspectral Image Classification using the Bidirectional Encoder Representation from Transformers. IEEE Trans. Geosci. Remote Sens. 2020, 58, 165–178.
90. Shahid, M.; Chen, S.; Hsu, Y.; Chen, Y.; Chen, Y.; Hua, K. Forest Fire Segmentation Via Temporal Transformer from Aerial Images. Forests 2023, 14, 563.
91. Deininger, L.; Stimpel, B.; Yuce, A.; Abbasi-Sureshjani, S.; Schönenberger, S.; Ocampo, P.; Korski, K.; Gaire, F. A Comparative Study between Vision Transformers and CNNs in Digital Pathology. arXiv 2022, arXiv:2206.00389.
92. Minařík, R.; Langhammer, J.; Lendzioch, T. Detection of Bark Beetle Disturbance at Tree Level using UAS Multispectral Imagery and Deep Learning. Remote Sens. 2021, 13, 4768.
93. Ahlswede, S.; Schulz, C.; Gava, C.; Helber, P.; Bischke, B.; Förster, M.; Arias, F.; Hees, J.; Demir, B.; Kleinschmit, B. TreeSatAI Benchmark Archive: A Multi-Sensor, Multi-Label Dataset for Tree Species Classification in Remote Sensing. Earth Syst. Sci. Data Discuss. 2022, 15, 681–695.
94. Hartling, S.; Sagan, V.; Sidike, P.; Maimaitijiang, M.; Carron, J. Urban Tree Species Classification using a WorldView-2/3 and LiDAR Data Fusion Approach and Deep Learning. Sensors 2019, 19, 1284.
95. Hao, Z.; Lin, L.; Post, C.J.; Mikhailova, E.A.; Li, M.; Chen, Y.; Yu, K.; Liu, J. Automated Tree-Crown and Height Detection in a Young Forest Plantation using Mask Region-Based Convolutional Neural Network (Mask R-CNN). ISPRS J. Photogramm. Remote Sens. 2021, 178, 112–123.
96. Alhazmi, K.; Alsumari, W.; Seppo, I.; Podkuiko, L.; Simon, M. Effects of Annotation Quality on Model Performance. In Proceedings of the 2021 International Conference on Artificial Intelligence in Information and Communication (ICAIIC), Jeju Island, Republic of Korea, 20–23 April 2021; p. 63.
97. Nitze, I.; van der Sluijs, J.; Barth, S.; Bernhard, P.; Huang, L.; Lara, M.; Kizyakov, A.; Runge, A.; Veremeeva, A.; Jones, M.W. A Labeling Intercomparison of Retrogressive Thaw Slumps by a Diverse Group of Domain Experts. Permafr. Periglac. Process. 2025, 36, 83–92.
98. Li, X.; Ding, H.; Yuan, H.; Zhang, W.; Pang, J.; Cheng, G.; Chen, K.; Liu, Z.; Loy, C.C. Transformer-Based Visual Segmentation: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 10138–10163.

Figure 1. Map of the study area. The red-colored rectangle is the extent of image patches used in DL model training and testing. Hand-annotated unhealthy tree crowns are shown in yellow outlines in a zoomed-in image tile.

Figure 2. A DL-based workflow for unhealthy tree crown detection. The workflow consists of four stages: data preprocessing, model training, evaluation, and annotation uncertainty analysis. During data preprocessing, a 4-band NAIP image was used to create different channel combinations, and annotation data were used to create a training and validation dataset. In the model training stage, CNN-based and ViT-based models were trained and tuned. In the evaluation stage, trained models were compared based on evaluation metrics, and finally uncertainty analysis on data annotation was conducted.

Figure 3. Different band combinations used to train our candidate algorithms. (a) Red, green, and blue bands. (b) NDVI, EVI, and ARVI bands. (c) False color (CIR) bands. (d) First three principal components of the red, green, blue, and near-infrared bands. (e) First three principal components of textural features. (f) Natural color image with unhealthy tree crown polygons annotated in yellow.

Figure 4. Example annotated image patch showing the shape and size of the UTCs. Polygon ‘A’ has an area of approximately 400 m² and may contain multiple tree crowns.

Figure 5. Distribution of annotated UTCs per image patch, showing that most patches contain fewer than 11 annotated polygons.

Figure 6. Distribution of UTC sizes (m²). The right-skewed pattern indicates most polygons represent smaller UTCs.

Figure 7. The SegFormer model architecture with Mix-FFN transformers.

Figure 8. U-Net architecture for UTC segmentation.

Figure 9. DeepLabv3+ architecture with a ResNet101 backbone.

Figure 10. Inter-annotator variability. Thin colored outlines represent annotations from individual annotators (A_i, where i indicates each of the nine annotators). Thick outlines show the median (red), union (black), and intersection areas (blue) of all annotation polygons. Panel (a) illustrates a one-to-one relationship where each annotator outlines a single polygon for the same location. Panel (b) illustrates a one-to-many relationship, where annotator 1 drew two polygons (labeled 1A₁ and 2A₁) for a single location, while other annotators represented it as a single polygon.

Figure 11. Comparison of performance of the three candidate architectures on five different input channel combinations using F1-Score, Recall, Precision, and IoU of unseen test data for UTC only (background not included).

Figure 12. Segmentation results of different architectures for different band combinations. The first column shows the input band combination, the second column shows the image patch, the third column shows the ground truth annotations, and the fourth, fifth, and sixth columns show the masks detected by the U-Net, DeepLabv3+, and SegFormer architectures, respectively. Yellow arrows indicate the ‘salt and pepper’ effect in the masks detected by U-Net, and blue circled areas highlight the models’ ability to distinguish the edges of UTCs.

Figure 13. Training and validation loss curves for the candidate architectures across different channel combinations. The x-axis represents the number of epochs, while the y-axis shows the loss values. The blue line corresponds to the training loss, and the orange line represents the validation loss. In the figure, RGB consists of red, green, and blue bands; CIR includes near-infrared, red, and green bands; RGBPC represents red, green, and the first principal component (derived from all four bands); VI includes NDVI, EVI, and ARVI bands; and PCA contains the first three principal components. PCA was applied to eight GLCM texture matrices generated for each band (Variance, Entropy, Contrast, Homogeneity, Angular Second Moment, Dissimilarity, Mean, and Correlation).

Figure 14. Performance of all candidate architectures for crown segmentation on each input channel combination. Ground truth annotations (yellow) and the detected labels (red) for each channel combination are visualized on the original image (RGB).

Figure 15. (a) Histogram displaying the area distribution of ground truth polygons (red) and detected polygons (blue) for the SegFormer-RGB model applied to the selected test dataset. Red and blue solid lines represent the density curves corresponding to the ground truth area and predicted area, respectively. (b) Box plot illustrating the IoU distribution of selected polygons for the SegFormer-RGB model. (c) Scatter plot depicting model performance, with yellow markers indicating detected polygons smaller than the ground truth, and blue markers representing polygons with larger areas than the ground truth.

Figure 16. Bar graph showing descriptive information on the annotations used for uncertainty analysis. A total of nine annotators annotated 30 image patches. The length of the bars shows the number of pixels annotated, i.e., the area covered by the annotated UTCs; the red dot represents the total number of polygons annotated by each annotator.

Figure 17. (a) All annotations made by the nine annotators over guided UTC locations. The varying shades of blue indicate areas where multiple annotations overlap, with darker shades representing higher levels of agreement. The black outline shows the union of all annotations, capturing the full extent covered by any annotator. The red outline represents the median polygon, providing a central approximation of all annotations. The yellow polygon highlights the intersection area, indicating the region where all annotators agreed. (b) Raster showing the percentage of agreement among annotators for id 1. (c) Three-dimensional visuals showing the RGB image (bottom) with UTC and the level of percentage agreement among the annotators (top).
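The agreement raster and the union, intersection, and median references described in this caption can be sketched on rasterized annotations. The following is a minimal illustration, assuming each annotator's polygons have already been rasterized to a binary mask of the same shape. The function name `agreement_summary` is hypothetical, and the per-pixel majority vote is only a stand-in for the median polygon; the authors' exact construction of the median polygon may differ.

```python
import numpy as np

def agreement_summary(masks):
    """Summarize agreement among annotators from binary masks.

    masks: sequence of equally shaped binary arrays, one per annotator
           (assumed to be rasterized annotation polygons).
    """
    stack = np.asarray(masks, dtype=bool)      # shape: (n_annotators, H, W)
    n = stack.shape[0]
    votes = stack.sum(axis=0)                  # annotators marking each pixel
    return {
        # Percentage of annotators that labeled each pixel as UTC.
        "percent_agreement": 100.0 * votes / n,
        # Maximum extent: pixel marked by at least one annotator.
        "union": votes >= 1,
        # Full consensus: pixel marked by every annotator.
        "intersection": votes == n,
        # Assumption: per-pixel strict majority as a proxy for the median.
        "majority": votes * 2 > n,
    }
```

With nine annotators, the majority mask keeps pixels marked by at least five of them, which equals the per-pixel median of the binary masks.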

Figure 18. Inter-annotator agreement matrix showing IoU scores among nine annotators. The matrix visualizes the level of agreement between pairs of annotators regarding the identification and delineation of UTC. Each cell represents the IoU score between two annotators, with values ranging from 0 (no overlap) to 1 (perfect overlap), providing a quantitative measure of annotation consistency. Annotator pairs with higher IoU scores (indicated in red) show stronger agreement, while lower scores (in blue) indicate greater variability in their annotations.
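A pairwise matrix of this kind can be computed directly from binary masks. The sketch below assumes one rasterized mask per annotator; the function name `pairwise_iou` and the conventions for empty masks (diagonal set to 1, empty pairs set to 0) are illustrative choices, not taken from the study.

```python
import numpy as np

def pairwise_iou(masks):
    """Return an (n, n) symmetric matrix of IoU scores between annotators."""
    flat = np.asarray(masks, dtype=bool).reshape(len(masks), -1)
    n = flat.shape[0]
    matrix = np.ones((n, n), dtype=np.float64)   # diagonal: self-agreement = 1
    for i in range(n):
        for j in range(i + 1, n):
            inter = np.logical_and(flat[i], flat[j]).sum()
            union = np.logical_or(flat[i], flat[j]).sum()
            # Convention (assumption): two empty masks give an IoU of 0.
            score = inter / union if union else 0.0
            matrix[i, j] = matrix[j, i] = score
    return matrix
```

For nine annotators this yields a 9 × 9 matrix whose off-diagonal cells correspond to the annotator pairs shown in the figure.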

Table 1. Description of the band combinations used in the study and the abbreviations used.

Name	Abbreviation	Description
Natural Color Image Patch	RGB	Red, Green, and Blue bands
False Color Image Patch	CIR	Near-infrared, Red, and Green bands
RGBPC Image Patch	RGBPC	Red, Green, and 1st Principal Component (derived from all 4 bands)
Vegetation Indices Image Patch	VI	NDVI, EVI, ARVI bands
Textural PC Image Patch	PCA	First three PC bands. PCA was carried out on eight GLCM texture measures generated for each band (Variance, Entropy, Contrast, Homogeneity, Angular Second Moment, Dissimilarity, Mean, and Correlation)
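For readers unfamiliar with the three indices in the VI combination, the sketch below computes them with their commonly used formulations: NDVI = (NIR − R) / (NIR + R), EVI with the standard coefficients (2.5, 6, 7.5, 1), and ARVI with γ = 1. This section does not state the exact constants or scaling used in the study, so these values, the function name `vegetation_indices`, and the assumption that bands are scaled to the range 0 to 1 are illustrative only.

```python
import numpy as np

def _ratio(num, den):
    """Element-wise num / den, returning 0 where the denominator is 0."""
    out = np.zeros_like(num, dtype=np.float64)
    np.divide(num, den, out=out, where=den != 0)
    return out

def vegetation_indices(red, blue, nir):
    """Return (NDVI, EVI, ARVI) for bands assumed to be scaled to [0, 1]."""
    red = np.asarray(red, dtype=np.float64)
    blue = np.asarray(blue, dtype=np.float64)
    nir = np.asarray(nir, dtype=np.float64)

    ndvi = _ratio(nir - red, nir + red)
    # Standard EVI coefficients: G = 2.5, C1 = 6, C2 = 7.5, L = 1.
    evi = _ratio(2.5 * (nir - red), nir + 6.0 * red - 7.5 * blue + 1.0)
    # ARVI with gamma = 1: the red band is replaced by (2 * red - blue).
    rb = 2.0 * red - blue
    arvi = _ratio(nir - rb, nir + rb)
    return ndvi, evi, arvi
```

Stacking the three outputs along a channel axis (for example with `np.stack`) would give a three-band patch in the same layout as the other combinations.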

Table 2. Information on parameters used in training different DL architectures.

Variables	Parameters
Number of NAIP scenes used	16 (each scene 12,596 × 12,596 pixels)
Image resolution	0.6 m
Spectral resolution	4 (Red, Green, Blue, Near infrared)
Date of image acquisition	September 2018
Random points generated	500
Number of image patches used	400
Image patch size	254 × 254 pixels
Total number of crowns annotated	5133
Minimum number of tree crowns in a patch	1
Maximum number of tree crowns in a patch	170
Average number of crowns in a patch	13
Vegetation type	Densely mixed forest
Season	Leaf on, early Fall season
Annotation software	ArcGIS Pro 3.2.2
Average time to manually annotate a single image patch	260 s
Train:Test:Validation data ratio	80:10:10
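As a quick check of what the 80:10:10 ratio implies for the 400 image patches, the sketch below performs a seeded random split. The function name `split_patches`, the seed, and the use of a simple shuffle are assumptions for illustration; the table does not describe how the authors assigned patches to each subset.

```python
import random

def split_patches(patch_ids, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle patch identifiers and split them into train, test, validation."""
    ids = list(patch_ids)
    random.Random(seed).shuffle(ids)          # seeded for reproducibility
    n_train = round(len(ids) * ratios[0])
    n_test = round(len(ids) * ratios[1])
    train = ids[:n_train]
    test = ids[n_train:n_train + n_test]
    val = ids[n_train + n_test:]              # remainder goes to validation
    return train, test, val

train_ids, test_ids, val_ids = split_patches(range(400))
```

With 400 patches this gives 320 training, 40 test, and 40 validation patches.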

Table 3. Performance metrics of UTC segmentation from aerial images using different band combinations and DL architectures on unseen test data. (IoU: Intersection over Union, OA: overall accuracy, BG: background, UTC: unhealthy tree crown, Avg P: average precision, Avg R: average recall, Avg F: average F1-score).

Model	Bands	IoU	OA	Precision (BG)	Precision (UTC)	Recall (BG)	Recall (UTC)	F1-Score (BG)	F1-Score (UTC)	Avg P	Avg R	Avg F
U-Net	RGB	0.30	0.96	0.98	0.46	0.98	0.46	0.98	0.46	0.72	0.72	0.72
U-Net	CIR	0.30	0.96	0.98	0.43	0.98	0.47	0.98	0.45	0.71	0.73	0.72
U-Net	VI	0.32	0.96	0.99	0.43	0.97	0.56	0.98	0.49	0.71	0.77	0.74
U-Net	PCA	0.02	0.96	0.98	0.13	0.99	0.02	0.98	0.03	0.56	0.51	0.51
U-Net	RGBPC	0.09	0.96	0.97	0.26	0.99	0.12	0.98	0.17	0.62	0.56	0.58
DeepLabv3+	RGB	0.48	0.97	0.99	0.60	0.98	0.71	0.99	0.65	0.80	0.85	0.82
DeepLabv3+	CIR	0.45	0.97	0.99	0.59	0.98	0.66	0.99	0.62	0.79	0.82	0.81
DeepLabv3+	VI	0.40	0.96	0.99	0.49	0.97	0.68	0.98	0.57	0.74	0.83	0.78
DeepLabv3+	PCA	0.24	0.96	0.98	0.39	0.98	0.38	0.98	0.38	0.69	0.68	0.68
DeepLabv3+	RGBPC	0.30	0.95	0.99	0.37	0.96	0.61	0.97	0.46	0.68	0.79	0.72
SegFormer	RGB	0.54	0.98	0.99	0.62	0.98	0.80	0.99	0.70	0.81	0.89	0.85
SegFormer	CIR	0.48	0.97	0.99	0.53	0.97	0.83	0.98	0.65	0.76	0.90	0.82
SegFormer	VI	0.51	0.97	0.99	0.62	0.98	0.73	0.99	0.67	0.81	0.86	0.83
SegFormer	PCA	0.24	0.94	0.98	0.30	0.96	0.52	0.97	0.38	0.64	0.74	0.68
SegFormer	RGBPC	0.45	0.97	0.99	0.53	0.98	0.74	0.98	0.62	0.76	0.86	0.80
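The metrics in Table 3 can be derived from a pixel-wise confusion matrix. The Avg P, Avg R, and Avg F columns appear consistent with the unweighted mean of the BG and UTC values (for example, (0.98 + 0.46) / 2 = 0.72 for U-Net with RGB), and the IoU column with the UTC class alone. The sketch below follows that reading; the function name `segmentation_metrics` and the zero-division conventions are assumptions, not the authors' evaluation code.

```python
import numpy as np

def segmentation_metrics(y_true, y_pred):
    """Per-class and class-averaged metrics for binary masks (1 = UTC, 0 = BG)."""
    y_true = np.asarray(y_true, dtype=bool).ravel()
    y_pred = np.asarray(y_pred, dtype=bool).ravel()
    tp = int(np.sum(y_true & y_pred))
    fp = int(np.sum(~y_true & y_pred))
    fn = int(np.sum(y_true & ~y_pred))
    tn = int(np.sum(~y_true & ~y_pred))

    def prf(tp_, fp_, fn_):
        # Precision, recall, F1 for one class; 0.0 when undefined (assumption).
        p = tp_ / (tp_ + fp_) if tp_ + fp_ else 0.0
        r = tp_ / (tp_ + fn_) if tp_ + fn_ else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f

    p_utc, r_utc, f_utc = prf(tp, fp, fn)
    # Treating background as the positive class swaps the roles of the counts.
    p_bg, r_bg, f_bg = prf(tn, fn, fp)
    denom = tp + fp + fn
    return {
        "iou_utc": tp / denom if denom else 0.0,
        "oa": (tp + tn) / y_true.size,
        "precision": {"BG": p_bg, "UTC": p_utc},
        "recall": {"BG": r_bg, "UTC": r_utc},
        "f1": {"BG": f_bg, "UTC": f_utc},
        "avg_p": (p_bg + p_utc) / 2,
        "avg_r": (r_bg + r_utc) / 2,
        "avg_f": (f_bg + f_utc) / 2,
    }
```

Because background pixels dominate the patches, OA and the BG columns stay high even when the UTC scores are low, which is why the UTC columns and IoU are the more informative figures in the table.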

Table 4. Evaluation metrics for the SegFormer-RGB model using three reference polygon annotations: intersection (maximum consensus area), union (maximum extent), and median (most common consensus among annotators) for unhealthy tree crowns (UTCs). The table presents the intersection over union (IoU) scores and F1-scores for each annotation strategy, along with the relative percentage changes in these metrics compared to the median, which serves as the baseline. Negative values indicate a decrease in model accuracy relative to the baseline.

Reference polygon	Recall	Precision	F1-Score	IoU	Relative Change in IoU (%)	Relative Change in F1-Score (%)
Intersection	0.95	0.23	0.37	0.23	−57.41	−47.14
Union	0.57	0.73	0.64	0.46	−14.81	−8.57
Median	0.90	0.57	0.70	0.54	0	0
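The relative changes in Table 4 follow from (value − baseline) / baseline × 100, with the median reference as the baseline, and the F1-scores follow from the listed precision and recall. The short sketch below reproduces both from the rounded table values; the helper names `relative_change` and `f1_score` are illustrative.

```python
def relative_change(value, baseline):
    """Percentage change of value relative to baseline."""
    return (value - baseline) / baseline * 100.0

def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rounded values taken from Table 4 (median is the baseline).
table4 = {
    "Intersection": {"recall": 0.95, "precision": 0.23, "f1": 0.37, "iou": 0.23},
    "Union": {"recall": 0.57, "precision": 0.73, "f1": 0.64, "iou": 0.46},
    "Median": {"recall": 0.90, "precision": 0.57, "f1": 0.70, "iou": 0.54},
}

baseline = table4["Median"]
changes = {
    name: (
        round(relative_change(row["iou"], baseline["iou"]), 2),
        round(relative_change(row["f1"], baseline["f1"]), 2),
    )
    for name, row in table4.items()
}
```

For the intersection reference, for instance, the IoU change is (0.23 − 0.54) / 0.54 × 100 ≈ −57.41%, matching the table.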


© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).


Joshi, D.; Witharana, C. Vision Transformer-Based Unhealthy Tree Crown Detection in Mixed Northeastern US Forests and Evaluation of Annotation Uncertainty. Remote Sens. 2025, 17, 1066. https://doi.org/10.3390/rs17061066

