Precision Detection and Assessment of Ash Death and Decline Caused by the Emerald Ash Borer Using Drones and Deep Learning

Emerald ash borer (Agrilus planipennis) is an invasive pest that has killed millions of ash trees (Fraxinus spp.) in the USA since its first detection in 2002. Although the current methods for trapping emerald ash borers (e.g., sticky traps and trap trees) and visual ground and aerial surveys are generally effective, they are inefficient for precisely locating and assessing the declining and dead ash trees in large or hard-to-access areas. This study was conducted to develop and evaluate a new tool for safe, efficient, and precise detection and assessment of ash decline and death caused by emerald ash borer by using aerial surveys with unmanned aerial systems (a.k.a. drones) and a deep learning model. Aerial surveys with drones were conducted to obtain 6174 aerial images of deciduous forests containing ash decline in West Virginia and Pennsylvania, USA. The ash trees in each image were manually annotated for training and validating deep learning models. The models were evaluated using the object recognition metrics: mean average precision (mAP) and two threshold average precisions (AP50 and AP75). Our comprehensive analyses with instance segmentation models showed that Mask2former was the most effective model for detecting declining and dead ash trees, with 0.789, 0.617, and 0.542 for AP50, AP75, and mAP, respectively, on the validation dataset. A follow-up in-situ field study conducted in nine locations with various levels of ash decline and death demonstrated that deep learning along with aerial survey using drones could be an innovative tool for rapid, safe, and efficient detection and assessment of ash decline and death in large or hard-to-access areas.


Introduction
As a member of the Oleaceae family, ash trees (Fraxinus spp.) are known for their beauty and timber and are economically and ecologically important species [1,2]. Ash trees are distributed primarily in the Northern Hemisphere and some tropical forests in Mexico and Java. Approximately 18 species of ash are found in the USA, of which several are valued for the wood they produce [2]. They comprise approximately 7% of the forests in the eastern USA, and the most commonly used species for timber are white ash (F. americana) and green ash (F. pennsylvanica) [1,2]. White and green ash trees are scattered throughout the eastern and central parts of the United States, extending towards the southern parts of Canada [2]. The wood of these species is known for its resilience, stiffness, and being lightweight. This wood is used to make a wide range of products, such as baseball bats, pool cues, and electric and acoustic guitars. In addition to these benefits, ash seeds are used to feed various animals and insects. Green ash trees are planted as landscape trees and have thus proven to be a valuable asset for wetlands.
Foresters and land managers seek a tool that is efficient and logistically simple not only for the early detection of invasive pests but also for long-term forest health monitoring. Currently, emerald ash borer adults, larvae, pupae, and associated signs and symptoms are targeted in monitoring and surveillance for early detection and rapid response [5,6]. The surveillance generally includes visual inspection of ash trees for signs of exit holes, leaf feeding, and canopy dieback or death, as well as trapping adult emerald ash borers using purple traps with volatile pheromones to attract males, or trapping larvae using trap trees [6]. One novel tool that can aid detection and precise assessment of ash decline and death is high-resolution aerial imagery, which can cover a large area or hazardous locations (e.g., cliffs) where ground surveys are limited. Several previous studies examined the potential of identifying ash trees or ash decline using remotely sensed spectral data acquired from satellites or aircraft. Fraser et al. [28] used hyperspectral sensors and non-parametric algorithms for ash tree identification, and Chan et al. [29] demonstrated that hyperspectral sensors in a manned airplane could distinguish ash dieback caused by a fungus (Hymenoscyphus fraxineus). Sapkota and Liang [30] and Murfitt et al. [31] used satellite-based spectral data to map and assess ash decline. Although these previous studies provided excellent insights into using spectral sensors and image analysis for the identification of ash trees and detection of ash decline, a field method is still needed for foresters and pest management practitioners to conduct real-time, in-situ, safe, automated (low logistical need), and high-precision surveys.
Therefore, we developed and tested the capability of deep learning coupled with high-resolution aerial images acquired with drones to detect ash decline and death by determining the efficiency of a transformer-based deep learning architecture. We conducted a 2-year field survey to develop a deep learning tool (i.e., CNN- and transformer-based models, including the state-of-the-art Mask2former) to assist land managers and foresters in the precise assessment of ash decline and death. We hypothesized that aerial surveys with drones could detect ash decline and death, and that deep learning could enhance detection with automation. Specifically, we developed and tested the deep learning models with aerial survey data obtained with drones and validated their capabilities of detecting ash decline and death. In addition, we conducted an in-situ field validation study to further test the deep learning capabilities along with drone aerial surveys and their applicability to in-situ detection and generation of spatial distribution maps of ash trees.

Ash Decline and Death Assessed by Ground Survey
A total of 6174 images were annotated and split into training (70% of the images), validation (20% of the images), and testing (10% of the images) datasets for deep learning (Table 1). Our ground survey confirmed that all the annotated trees for deep learning were indeed declining or dead ash trees. Most of the ash trees (Figure 1a-c) had D-shaped exit holes indicating adult emerald ash borer emergence from the infested trees (Figure 1d), which further indicated the high emerald ash borer infestation causing ash decline and death at the sites.
Figure 1. Example aerial view of declining or dead ash trees from aerial images obtained with drones (a), ground view of declining or dead ash trees (b), and a noticeable opposite branching pattern of ash trees on the aerial images (c). The ground survey confirmed the presence of emerald ash borers with adults' exit holes (d). Yellow arrows in (a,b) indicate declining and dead ash trees, and arrows in (c,d) indicate an opposite branching pattern and D-shaped exit holes made by adult emerald ash borers, respectively.

Evaluation of Instance Segmentation Models in Deep Learning
Using the annotated data (see Figure S1 for example), we trained deep learning models for instance segmentation tasks. To analyze the performance of the instance segmentation models, Microsoft Common Objects in Context (MSCOCO) [32] evaluation metrics including mean average precisions (mAP) and the threshold average precisions at 50% and 75% (AP50 and AP75, respectively) were used. In addition, Intersection over Union (IoU), precision, and recall were used to further evaluate the model performance. IoU or the Jaccard index is the ratio of the intersection of the actual and predicted bounding box to their union. IoU gives the correctness of the prediction, precision gives the positive predictive value, and recall gives the true positive rate.
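As a minimal sketch of these three metrics (assuming axis-aligned boxes in (x1, y1, x2, y2) format; the helper names are ours, not from the study's evaluation code):

```python
def iou(box_a, box_b):
    """Intersection over Union (Jaccard index) of two axis-aligned
    bounding boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def precision_recall(tp, fp, fn):
    """Precision (positive predictive value) and recall (true positive rate)
    from true positive, false positive, and false negative counts."""
    return tp / (tp + fp), tp / (tp + fn)
```

A detection is typically counted as a true positive when its IoU with a ground-truth box exceeds the chosen threshold (0.5 for AP50, 0.75 for AP75).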
The area under the precision-recall (P-R) curve gives the average precision (AP) at IoU threshold T (for AP50, T = 50%; for AP75, T = 75%), which can be calculated as:

AP_T = ∫_0^1 p_T(r) dr

where p_T(r) is the precision at recall r on the P-R curve computed at IoU threshold T. The mean AP (mAP) over all classes is given by:

mAP = (1 / N_classes) × Σ_(all classes) AP(class)

Since the number of classes is one in our dataset, mAP is calculated for a single class (i.e., ash trees).
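The AP computation can be sketched as follows (a simplified, COCO-style monotone-envelope integration of the P-R curve; the function is illustrative, not the evaluation code used in the study):

```python
import numpy as np

def average_precision(precisions, recalls):
    """AP as the area under the precision-recall curve, using the
    COCO-style monotone precision envelope before integration."""
    order = np.argsort(recalls)
    r = np.concatenate(([0.0], np.asarray(recalls, dtype=float)[order], [1.0]))
    p = np.concatenate(([0.0], np.asarray(precisions, dtype=float)[order], [0.0]))
    # Make precision monotonically non-increasing from right to left.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Integrate: sum rectangle areas wherever recall changes.
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

# mAP is the mean of per-class AP values; with a single class
# (ash tree), mAP equals the AP of that class.
```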


Deep Learning for Declining and Dead Ash Tree Detection
We comprehensively examined the performance of various object recognition models on the ash tree dataset. Specifically, we trained an object detection model (i.e., RetinaNet) and instance segmentation models (i.e., Yolact and Mask R-CNN), and then compared their performance with the Masked-attention Mask transformer (Mask2former) model. Mask2former with the Swin transformer's small (SWIN-S) variant as backbone outperformed the other state-of-the-art baselines such as Mask R-CNN on all AP50, AP75, and mAP metrics by +0.056, +0.128, and +0.118, respectively, for bounding boxes and by +0.16, +0.208, and +0.161, respectively, for segmentation masks (Table 2). Therefore, Mask2former was chosen for detecting declining or dead ash trees on the segmentation maps generated from the aerial images obtained by drones.
Table 2. Assessment of different state-of-the-art baselines and backbones for the detectability of declining or dead ash trees by using mean average precisions (mAP) and the threshold average precisions at 50% and 75% (AP50 and AP75, respectively).

Ablation Study of the Mask2former Model
To further analyze the performance of Mask2former, we employed different CNN and transformer backbones to extract the features from each image. Specifically, we studied the performance of Mask2former on our dataset using ResNets with 50 and 101 layers and the Swin transformer's tiny (SWIN-T) and small (SWIN-S) variants. Furthermore, we used ImageNet-1K [33] weights for all of the backbones.
The results of the test showed that the Mask2former model with a ResNet-101 backbone outperformed the SWIN-S transformer model for bounding box mAP by +0.017 (Table 3), whereas for mask mAP, SWIN-S was the better choice, surpassing the ResNet-101 model by +0.005.
Table 3. Instance segmentation results on the ash tree dataset using Mask2former with different backbones by using mean average precisions (mAP) and the threshold average precisions at 50% and 75% (AP50 and AP75, respectively).
In addition to the quantitative analysis, we performed a detailed analysis of the false positives using precision-recall (P-R) curves [34]. We classified the declining or dead ash trees into small (area < 32² pixels), medium (32² pixels < area < 96² pixels), and large (area > 96² pixels) objects of the segmentation mask. Each plot comprised a series of P-R curves, including the standard P-R curves at IoU = 0.75 (C75) and IoU = 0.5 (C50), and the false negative (FN) P-R curve, which was obtained by removing all the errors, making the total area under the curve 1.00. In addition to these P-R curves, based on the differences between the predicted bounding boxes and the associated ground truths, the incorrect detection results were given by localization (Loc), similar objects (Sim), others (Oth), and background (BG) P-R curves. Loc is the P-R at IoU = 0.1, ignoring localization errors. Sim provides the P-R after the supercategory false positives (FPs) are removed, whereas Oth provides the P-R after eliminating all class confusion. Loc, Sim, and Oth were the same in our case, as we had a single class (i.e., ash tree) instead of multiple classes. Finally, BG gives the P-R when all background FPs are removed. C75 and C50 were much higher for larger ash trees than for medium and small trees, indicating that performance was degraded for medium and smaller areas of ash decline and death (Figure 2).
Figure 2. Oth, Other; BG, background; FN, false negative. The sizes of the canopy dieback were categorized based on pixel area: small (area < 32² pixels), medium (32² pixels < area < 96² pixels), and large (area > 96² pixels).
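The size stratification used in this analysis can be expressed as a small helper (thresholds follow the COCO convention cited in the paper; the function name is ours):

```python
def size_category(mask_area_px):
    """Bucket a segmentation-mask area (in pixels) into the COCO-style
    small/medium/large categories used to stratify the P-R analysis."""
    if mask_area_px < 32 ** 2:   # below 1024 px
        return "small"
    if mask_area_px < 96 ** 2:   # 1024 px up to 9216 px
        return "medium"
    return "large"
```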

Visualization of Predictions of Declining or Dead Ash Trees Using Mask2former
We tested our model's performance by conducting a validation with the testing dataset. For the testing dataset, we used 0.75 as the confidence threshold to detect the presence of declining or dead ash trees. The prediction results showed that the model performed well in detecting declining or dead ash trees (Figure 3). However, cars or other human-made objects were sometimes predicted to be declining or dead ash.
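Applying the confidence threshold can be sketched as below (the detection record format with a 'score' key is an assumption for illustration, not the actual Mask2former output schema):

```python
def filter_detections(detections, score_threshold=0.75):
    """Keep only predicted instances whose confidence score meets the
    threshold; each detection is assumed to be a dict with a 'score' key."""
    return [d for d in detections if d["score"] >= score_threshold]
```

A high threshold such as 0.75 trades recall for precision; the in-situ field validation lowered the threshold because field images differed from the training data.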

Figure 3. Two examples of Mask2former's predictions of declining or dead ash trees with a threshold of 0.75. Input aerial images were given to the model, and then the predictions were compared with the ground-truth data.

In-Situ Field Validation Study
To demonstrate the ability of Mask2former, along with the use of drones for aerial surveys, to detect declining or dead ash trees under a variety of environmental conditions, we conducted aerial surveys at nine sites for in-situ field validation. We used lower thresholds (i.e., 0.1 and 0.5) because the field validation images differed in lighting and flight altitude from the training data. A composite image of each site was subjected to automated detection of declining or dead ash trees (Figure 4). The results showed that the Mask2former model correctly detected 81% ± 10.7 of declining or dead ash trees (ranging from 59% to 100%) at the confidence level of 0.1; the model did not detect the remaining 19% (i.e., false negative or type II error). A total of 30% ± 33.1 of trees (ranging from 0% to 83%) were falsely detected as declining or dead ash trees (i.e., false positive or type I error) at the confidence level of 0.1. At the confidence level of 0.5, the model correctly detected 27% ± 13.6 of declining or dead ash trees and falsely detected 6% ± 8.8 (i.e., type I error). We also observed no false positive detections of declining or dead ash trees in four of the nine sites (see Table S1 for detailed data).
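Under one plausible convention (both error rates expressed relative to the number of ground-truth declining or dead trees, which reproduces the site-level percentages above; the paper does not spell out its exact formula), the per-site error rates can be computed as:

```python
def error_rates(n_ground_truth, n_correctly_detected, n_false_detections):
    """Type I (false positive) and type II (false negative) rates for one
    site, both expressed relative to the ground-truth count of declining
    or dead ash trees (an assumed convention)."""
    type_i = n_false_detections / n_ground_truth
    type_ii = (n_ground_truth - n_correctly_detected) / n_ground_truth
    return type_i, type_ii
```

For example, a site with 100 ground-truth trees, 81 correctly detected, and 30 false detections yields a 30% type I and 19% type II error rate, matching the averages reported at the 0.1 confidence level.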

Figure 4. White boxes indicate declining or dead ash trees located by ground survey, and purple areas (bounding boxes) indicate declining or dead ash trees detected and located by Mask2former. By comparing the ground survey data with the Mask2former prediction map, type I and II errors could be calculated.

Discussion and Conclusions
This study aimed at developing an application with deep learning and automated drone operation to precisely assess the ash decline and death caused by emerald ash borers. Specifically, this study demonstrated that a deep learning model (i.e., Mask2former) coupled with aerial images acquired with autopiloted drones could generate distribution maps of ash decline and death, providing detailed images and geo-coordinates of individual ash trees for foresters and land managers to use for their management planning and action. In this study, we first tested various deep learning models and found that Mask2former, a supervised deep learning model, performed the best for the instance segmentation task on the ash tree images. Most previous studies on crop detection using aerial images focused on classification [28] or semantic segmentation tasks [22,23]. However, classification models cannot accurately locate the ash trees, and the semantic segmentation models mark all the ash trees as one single ash tree. Therefore, our model was trained to perform instance segmentation tasks by marking each ash tree as a separate instance. Moreover, we used a transformer decoder with a masked attention mechanism to extract the localized features from the ash tree images by constraining the cross attention within the foreground regions of the predicted mask. Based on our results, ash trees with more canopy dieback were detected more precisely and the performance of the model was degraded for medium and smaller symptoms of ash decline and death.
Although our deep learning model worked well with unseen testing data, it failed to predict ash decline in images with seasonal change (e.g., autumn). For example, we tried to predict the dead or declining ash trees by feeding autumn-season images to the deep learning model (Figure 5a). With a threshold of 0.5 (Figure 5b), the model could not detect any of the dead or declining ash trees, and it detected only a few of the declining ash trees with a threshold of 0.1 (Figure 5c). We hypothesize that the cause of this deficiency is imprecise manual annotations around the possible ash trees, which included some green regions. As a result, the model failed to detect ash trees in autumn because the leaves are no longer green. Furthermore, Mask2former failed to detect smaller ash trees on the aerial images.

The results of our aerial surveys with drones were very promising. The aerial images were high-resolution (4K), and even individual branches (declining or dead) of an ash tree could be examined (see Figure S3 for example). Our study suggests that pixel-based modeling would be possible because C75 and C50 were much higher for larger ash trees than for medium and small trees, indicating that performance was degraded for medium and smaller areas of ash decline and death (Figure 2). In addition, the in-situ validation study demonstrated a promising ability to detect declining or dead ash trees. We trained our model on images captured 50-70 m above the ground in summer and tested its performance at nine sites in Pennsylvania and West Virginia, USA. We used lower thresholds (0.1 and 0.5) for in-situ field validation due to the dissimilarity in drone-flight altitudes and lighting compared to the training images. We found a higher type I error (29%) and lower type II error (19%) at the threshold of 0.1.
Type I error (i.e., false positive) is generally acceptable in pest management: although the model's output prompts foresters or land managers to visit and check the falsely detected trees, those trees are in fact neither declining nor dead. However, type II error (false negative) is more critical for pest management because declining or dead ash trees are present, but the model's output indicates that there are none. In this case, foresters and land managers may not check the false-negative trees, which can be a source of future spread of ash decline and death by emerald ash borer.
At the confidence level of 0.5, the type II error was higher (71%), which may not be acceptable for pest management although the type I error was only 6%. In addition, we noticed that type I and II error rates varied by site (see Table S1 for details). Four of the nine sites had 0% type I error, while the other sites had type I error as high as 83% at the threshold of 0.1; only one site had 0% type II error. Therefore, to develop a tool that is applicable to a wide variety of forests, deep learning models need training data acquired from various environments and locations.
Although our study is the first to use high-resolution drone imagery and the Mask2former deep learning model together, previous studies used satellites and spectral sensors to identify ash trees and detect ash dieback. Spectral analyses from the previous studies [29-31,35] showed promising results for the detection of ash trees infested with emerald ash borers, although those studies also found some limitations in using spectral sensors and satellites. The four main limitations were: (1) the differences in spectral reflectance between the leaves of infested and non-infested ash trees are sometimes difficult to distinguish; (2) emerald ash borer data were collected once during the growing season, while the factors that affect spectral response can vary across a growing season; (3) the high cost of hyperspectral data may prohibit use by many foresters and land managers; and (4) some ash trees were confused with other trees because ash detection is based on classification maps. In addition to satellites, piloted or manned airplanes have been used for aerial visual surveys (e.g., aerial sketch mapping), which can cover a large area while still providing acceptable precision for forest health monitoring [28,36-38]. Aerial sketch mapping methods are well adopted by federal and state agencies in the USA to cover large areas in a short period [39,40]. However, aerial sketch mapping is restricted by weather, cost, availability of airplanes and pilots, and safety.
Drones have become an excellent platform for easy, cheap, high-resolution, and real-time aerial surveys [41]. With the continuing advancement and growing application of drones, foresters and land managers have used them for evaluating and monitoring forests and even individual trees [28]. The application developed in this study using deep learning and drones generates distribution maps of declining and dead ash trees. The maps can help locate areas with declining and dead ash trees, especially in very dense, large forests or hard-to-access areas. Moreover, the method we developed can identify the location (i.e., geo-coordinates) of individual ash trees, which can help foresters and land managers develop an efficient forest health monitoring plan or survey route for periodic or repeated surveys of declining ash trees using pre-planned, autopiloted drone flights [39,41]. In addition to mapping ash declines, drones can be used with ground robots or surveyors to enhance the real-time detection of ash decline and death. Using aerial surveys with drones (or even with satellites and piloted airplanes), it is extremely difficult to identify the cause of ash decline and death based on the symptoms shown at the top of the canopy. The symptoms are visible from the sky when the upper branches of ash trees are deprived of water and nutrients, so the trees are defoliated and parts of the crown lose leaves in spring and summer [35]. However, the symptoms of emerald ash borer damage can be similar to those caused by other environmental factors and causal agents such as fungal pathogens [35]. The presence of emerald ash borers is evidenced by the D-shaped exit holes in the rough and grooved ash bark and the larval tunneling under the bark [1], indicating a need for a ground survey.
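Recovering the geo-coordinates of a detected tree from a georeferenced composite image can be sketched with a GDAL-style affine geotransform (the function is illustrative; the study does not specify its georeferencing pipeline):

```python
def pixel_to_geo(col, row, geotransform):
    """Map a pixel (col, row) in a georeferenced orthomosaic to map
    coordinates using a GDAL-style affine geotransform:
    (x_origin, pixel_width, row_rotation, y_origin, col_rotation, pixel_height).
    For north-up imagery the rotation terms are zero and pixel_height is negative."""
    x0, dx, rx, y0, ry, dy = geotransform
    x = x0 + col * dx + row * rx
    y = y0 + col * ry + row * dy
    return x, y
```

Applied to the centroid pixel of each predicted mask, this yields per-tree coordinates that could seed a survey route for follow-up ground checks or low-altitude flights.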
In addition, the visual symptoms of emerald ash borer infestation do not usually appear until 2-4 years after the arrival of emerald ash borers [40], and thus foresters and land managers may not notice signs of infestation until they begin to see bare branches [1]. Our study using aerial surveys did not attempt to detect emerald ash borer directly, but the methods developed in this study can guide foresters and land managers on where to go to find declining or dead ash trees with the ground survey or low-altitude drone flights to confirm and assess the ash decline and death. The methods also would help prioritize areas for the management of ash decline and emerald ash borer.
Our study was not designed to detect the early stage of ash decline (e.g., symptoms of yellowing) or to detect the insects directly; previous studies with hyperspectral sensors showed the potential for early detection of ash decline [28-30], which encourages the use of drones and hyperspectral sensors together to enable early detection of and rapid response to emerald ash borers. Our study instead focused on the precise detection of ash decline and death using deep learning and drones together to develop a method that is (1) logistically simple, (2) automated, (3) cheap, (4) safe, (5) real-time, and (6) able to cover hard-to-access areas. These advantages overcome the limitations of satellite- and airplane-based forest health monitoring. A major disadvantage of drones is their limited coverage of survey areas compared to satellites and airplanes. However, the method we developed can be combined with satellites or piloted airplanes, which would enhance detectability and compensate for the limitations of each method (i.e., data fusion [28]). Data or image fusion allows users to overcome the shortcomings of a single data source. For example, with the fusion of satellite and drone imagery, foresters and land managers could overcome the low spatial resolution of most satellite sensors and the limited coverage of drones [28,41]. In addition, deep learning, which has emerged as one of the most powerful tools for visual recognition tasks, could add to the data fusion and be used for short-term and long-term forest health monitoring. Federal and state agencies in the USA are currently using drones for routine forest health monitoring. As more aerial image data are collected routinely, the data can be added to our deep learning model for training and validation, which will eventually increase the precision and accuracy of the model.
In addition, this method can be applied to detect trees damaged by other stressors, such as invasive insects, diseases, or weeds.
Although this study demonstrated the potential for the use of drones and deep learning together to overcome current limitations in the assessment and management of ash decline and death, further studies are needed to increase the precision and accuracy of the detection of ash decline and death caused by emerald ash borers. Because other environmental or biotic stressors (e.g., H. fraxineus, a fungal pathogen in Europe) can also cause ash decline and death, additional studies to differentiate the causal agents would be necessary. In addition, we aim to improve the process of marking ash trees by incorporating multispectral, hyperspectral, and thermal data (i.e., fusion techniques) into our dataset, along with the use of dilated backbones to segment smaller objects, thereby making our model more robust. Furthermore, we plan to expand our ash tree dataset by collecting images across seasonal changes and re-training our model using transfer learning.

Study Sites and Types of Aerial Surveys
This study was conducted in Pennsylvania and Northern West Virginia, USA (Figure 6). The size of the sites ranged from 1.8 ha to 36 ha. Ash trees in these study sites had experienced emerald ash borer infestations for at least 5 years, causing severe decline and death of ash trees. Two separate sets of aerial surveys were conducted in this study: aerial surveys for obtaining data for deep learning and aerial surveys for the in-situ field validation study (see below).

Acquisition of Aerial Imagery Using Drones for Deep Learning
Aerial surveys for deep learning were conducted in a 20-ha area located in Warren County, Pennsylvania, USA, during the summers of 2020 and 2021. Various levels of ash tree decline and death have been found in the area, making it an ideal area to obtain aerial images for deep learning.
Two rotary-wing drones (DJI Mavic 2 Enterprise Advanced and DJI Mavic 2 Pro; SZ DJI Technology Co., Ltd., Shenzhen, China) were used. The DJI Mavic 2 Enterprise Advanced carried a 3.3-megapixel thermal camera as well as an RGB camera with 48-megapixel photography and 4K videography capability. The DJI Mavic 2 Pro carried an RGB camera with 20-megapixel photography and 4K videography capability.
The drones were flown 50-75 m above the ground depending on the terrain and obstacles in the study sites. Our preliminary study showed that aerial surveys with 4K videography at <80 m above ground could detect declining or dead ash trees (Figure S3). After the flight missions were completed, aerial images were downloaded from the drone and aligned according to the pre-planned flight path.

Data Processing and Labeling
From the 4K video footage from the drones, 3423 image frames were captured. Every 10th frame was chosen for this study to maintain positional variation, and the resulting 343 images were manually annotated using the BeFunky Photo Editor (BeFunky, Portland, OR, USA). Annotating an image with dead ash trees (Figure 7) took approximately 2 min per image. Ideally, the outline would be tighter to the branches, but that would increase the time required to annotate. Dead or declining ash trees could be distinguished from an aerial view via their oppositely arranged, stout, and relatively straight branches, which can be seen in the ground view image (Figure 7a). They also appear distinctly white in color. In addition, ground surveys were conducted to confirm that the declining or dead trees were indeed ash trees.

As each image's dimensions are 3840 pixels × 2160 pixels, it is computationally expensive to train a deep learning model on these 4K images. Therefore, non-overlapping patches with dimensions of 640 pixels × 720 pixels were cropped from each 4K image, producing 18 patches from each image (Figure 8). This patch extraction leads to many negative samples (i.e., no ash trees) (Figure 8a,b). This yields 6174 images in total for deep learning. Figure 8. An illustration of patches generated from a 4K image acquired with drones. A total of 18 patches with dimensions of 640 pixels by 720 pixels were extracted from each 4K image for deep learning. Image patches (a,b) are without declining ash trees, and image patch (c) has a declining or dead ash tree.
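The patch extraction described above can be sketched in Python; the only values taken from the text are the 3840 × 2160 frame size and the 640 × 720 patch size (the function name and implementation here are illustrative, not from the study's code):

```python
import numpy as np

def extract_patches(image, patch_w=640, patch_h=720):
    """Crop a frame into non-overlapping patches.

    For a 3840 x 2160 (4K) frame with 640 x 720 patches this yields a
    6-column by 3-row grid, i.e., 18 patches per image.
    """
    h, w = image.shape[:2]
    patches = []
    for top in range(0, h - patch_h + 1, patch_h):
        for left in range(0, w - patch_w + 1, patch_w):
            patches.append(image[top:top + patch_h, left:left + patch_w])
    return patches

# A dummy 4K RGB frame: 2160 rows x 3840 columns x 3 channels.
frame = np.zeros((2160, 3840, 3), dtype=np.uint8)
patches = extract_patches(frame)
print(len(patches))  # 18 patches, each 720 x 640 x 3
```

Because the patches are non-overlapping and tile the frame exactly, no pixels are lost or duplicated, which keeps the per-patch annotations consistent with the original 4K masks.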
The annotated aerial images were converted into binary masks, which served as the ground-truth for training the deep learning models (Figure 9). These binary masks for ash trees were generated using MATLAB R2021a (The MathWorks Inc., Natick, MA, USA). The workflow (Figure 9) included extracting the red channel and grayscale information of the RGB images (Figure 9b,c), subtracting the grayscale image from the red channel image (Figure 9d), converting the subtracted image to a binary image with a 0.35 threshold that could include the small areas of annotations (Figure 9e), and using a morphological operation to fill the outlined zones once the annotation outlines were obtained. The final ground-truth mask with the white regions constituted the presence of declining or dead ash trees, and the black areas were the backgrounds (Figure 9f).
The dataset for deep learning, consisting of 6174 images, was split into 4357 training images, 1200 validation images, and 617 test images (7:2:1). As mentioned earlier, each 4K image was divided into 18 patches to make the dataset computationally efficient. Potentially, we could downsize the 4K images, but the downsizing would lead to information loss while generating the annotation masks (Figure S1). Thus, we used image patches with different augmentation techniques. These training, validation, and testing datasets had positive (presence of ash decline and death) and negative (absence of ash decline and death) images.
These images and masks were used to create annotations in the standard MSCOCO dataset format. The procedure for the prediction of ash trees from a given video is given as a flowchart in Figure 10.
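The mask-generation workflow above used MATLAB; the following is a hedged Python/NumPy sketch of the same steps (red-minus-grayscale difference, 0.35 threshold, morphological hole filling). The grayscale weights and the SciPy hole-filling routine are assumptions standing in for MATLAB's rgb2gray and imfill:

```python
import numpy as np
from scipy.ndimage import binary_fill_holes

def annotation_to_mask(rgb, threshold=0.35):
    """Convert an annotated RGB image (red outlines) to a binary mask.

    Steps mirror the described workflow: extract the red channel and a
    grayscale version, subtract grayscale from red (so red annotations
    stand out), binarize at the threshold, then fill the outlined zones.
    """
    img = rgb.astype(np.float64) / 255.0
    red = img[..., 0]
    # Standard luminance weights; an assumption for MATLAB's rgb2gray.
    gray = 0.2989 * img[..., 0] + 0.5870 * img[..., 1] + 0.1140 * img[..., 2]
    outline = (red - gray) > threshold
    return binary_fill_holes(outline)

# Demo: a closed red outline on a black background is filled in.
demo = np.zeros((20, 20, 3), dtype=np.uint8)
demo[5, 5:15, 0] = demo[14, 5:15, 0] = 255   # top and bottom edges
demo[5:15, 5, 0] = demo[5:15, 14, 0] = 255   # left and right edges
mask = annotation_to_mask(demo)
```

On a pure red pixel the red-minus-gray difference is about 0.70, well above the 0.35 threshold, while gray or green vegetation pixels stay near zero, which is why the red annotation outlines survive the binarization.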

Overview of Deep Learning Procedures
To detect declining or dead ash trees, we used Mask2former [18]. Unlike Mask R-CNN, the segmentation task is not limited to detected bounding boxes. Instead, each segment in a digital image is converted into a C-dimensional vector (DETR [42], Maskformer [17]) and fed to a trained decoder. The architecture consists of a backbone, a pixel decoder, and a transformer decoder with masked attention. The purpose of the backbone is to extract low-resolution features. The pixel decoder upsamples these low-resolution features and converts them into high-resolution per-pixel embeddings. The transformer decoder takes in the image features and converts them into object queries. From these per-pixel embeddings with object queries, the final binary mask predictions are obtained. The input RGB image of size W × H was given to the backbone to extract the feature map F ∈ ℝ^(C × H/S × W/S), where C was the number of channels, H and W were the height and width of the input image, and S was the feature map's stride. In this study, we used RGB images, so the number of channels was three, but the stride of the feature map varied with different backbones. We explored CNN-based architectures (ResNet50 and ResNet101 [43]) and a transformer-based architecture (SWIN [27]) as backbones for this study. These low-resolution features were then fed to a pixel decoder to gradually upsample and produce per-pixel embeddings E_pixel ∈ ℝ^(C_g × H × W), where C_g was the dimension of the embeddings. The overall architecture is shown in Figure 11. The chief component of the transformer block was the masked attention operator. The function of this operator was to extract features from the predicted mask by limiting the cross-attention [20] to just the foreground instead of the entire feature map. In addition, the pixel decoder's successive feature maps were fed to successive transformer decoder layers, thereby utilizing the high-resolution features to extract smaller objects.
The standard cross-attention at any layer l was given as:
X_l = softmax(Q_l K_l^T) V_l + X_{l−1},
where X_l refers to the query features at the l-th layer. Q_l, K_l, and V_l are the query and image features under the linear transformations f_Q, f_K, and f_V, respectively. X_0 denotes the transformer decoder's input query features. By adding the attention mask M_{l−1} to the standard cross-attention, we had:
X_l = softmax(M_{l−1} + Q_l K_l^T) V_l + X_{l−1}.
The matrix M_{l−1} at any feature point (x, y) is derived from the binarized mask (with a threshold of 0.5) predicted by the previous, (l − 1)-th, transformer layer: it is 0 at foreground points and −∞ elsewhere.
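A minimal NumPy sketch of the masked-attention update can make the mechanism concrete; the dimensions are toy values, and the projections f_Q, f_K, f_V are assumed to have been applied already:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_attention(X_prev, Q, K, V, prev_mask):
    """One masked cross-attention update: X_l = softmax(M + Q K^T) V + X_{l-1}.

    prev_mask is the binarized foreground prediction of the previous
    layer, flattened over the feature map; background positions receive
    -inf so the softmax assigns them zero attention weight.
    """
    M = np.where(prev_mask, 0.0, -np.inf)   # (num_keys,)
    attn = softmax(Q @ K.T + M, axis=-1)    # (num_queries, num_keys)
    return attn @ V + X_prev

# Toy shapes: 2 queries, 4 feature-map positions, embedding dim 8.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=s) for s in [(2, 8), (4, 8), (4, 8)])
X_prev = np.zeros((2, 8))
fg = np.array([True, True, False, False])  # only first two positions are foreground
X = masked_attention(X_prev, Q, K, V, fg)
```

Because exp(−∞) is exactly zero, the two background positions contribute nothing to the weighted sum, which is precisely the "attend only to the predicted foreground" behavior described above.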
The query features in the decoder block (Figure S2) are processed in the following order: cross-attention, self-attention, and a feed-forward network (FFN). The query features are associated with learnable positional embeddings and are initialized to zero before being fed into the decoder.

Data Augmentation and Training Procedure
Mask2former is a supervised deep learning model. Training such a model requires many image examples with proper annotations to learn complex patterns for higher-level data abstraction. Annotating large datasets is very time-consuming (e.g., 2 min per image in our study). To improve the data efficiency of the deep learning models, we employed data augmentation techniques [44]. We used the large-scale jittering (LSJ) technique [45,46] as our primary data augmentation strategy. We used a fixed crop size of 1024 pixels × 1024 pixels and an image resize technique with a ratio range of 0.1 to 2.0. Along with LSJ, padding and normalization techniques were also used for training the model. We used test-time augmentation (TTA) with multiple scales and flipping provided by mmdetection [47]. Inspired by the standard Mask R-CNN inference settings, the image scale was set to 800 pixels and 1333 pixels on the shorter and longer sides, respectively.
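As a hedged illustration of LSJ, the sketch below resizes by a random factor in [0.1, 2.0] and then takes a fixed 1024 × 1024 crop, padding with zeros when the resized image is smaller; the nearest-neighbor resize is a dependency-free stand-in for the interpolation a real pipeline such as mmdetection's transforms would use:

```python
import numpy as np

def large_scale_jitter(image, rng, scale_range=(0.1, 2.0), crop=1024):
    """Large-scale jittering sketch: random resize, then fixed-size crop.

    The image is zero-padded when the resized result is smaller than the
    crop window; when it is larger, a random crop position is chosen.
    """
    scale = rng.uniform(*scale_range)
    h, w = image.shape[:2]
    nh, nw = max(1, int(h * scale)), max(1, int(w * scale))

    # Nearest-neighbor resize via integer index sampling.
    rows = np.arange(nh) * h // nh
    cols = np.arange(nw) * w // nw
    resized = image[rows][:, cols]

    canvas = np.zeros((crop, crop) + image.shape[2:], dtype=image.dtype)
    top = rng.integers(0, nh - crop + 1) if nh > crop else 0
    left = rng.integers(0, nw - crop + 1) if nw > crop else 0
    window = resized[top:top + crop, left:left + crop]
    canvas[:window.shape[0], :window.shape[1]] = window
    return canvas

rng = np.random.default_rng(42)
out = large_scale_jitter(np.ones((720, 640, 3), dtype=np.uint8), rng)
print(out.shape)  # (1024, 1024, 3)
```

Whatever scale is drawn, the output always has the fixed 1024 × 1024 training size, which is what lets the augmented patches be batched together.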
We performed all our experiments on a single NVIDIA GeForce RTX 3090. Our implementations are based on mmdetection [47]. Furthermore, we used the AdamW optimizer with a learning rate of 0.0001 and a weight decay of 0.05 for all the baselines. The batch size was set to 2 and the other environment details are as follows: Python 3.10.4, GCC 9.3, CUDA 11.6, PyTorch 1.12.1, MMdetection 2.25.1, CuDNN 8.5, and Ubuntu 20.04.1.
We employed weighted loss functions for calculating the loss for training the model. For the mask loss, L_mask, we used binary cross-entropy loss with a weight λ_mask = 5.0, and for the class loss, L_cls, we used binary cross-entropy loss instead of cross-entropy (i.e., a single-class dataset) with a weight λ_cls = 2.0. Along with these loss functions, we used the Dice loss [48], L_Dice, with a weight λ_dice = 5.0. The total loss (L) was a combination of L_mask, L_cls, and L_Dice and expressed as
L = λ_mask L_mask + λ_cls L_cls + λ_dice L_Dice.
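Under the stated weights (λ_mask = 5.0, λ_cls = 2.0, λ_dice = 5.0), the combined loss can be sketched as follows; the helper implementations are illustrative NumPy stand-ins for the framework's loss functions:

```python
import numpy as np

# Loss weights as stated in the text.
LAMBDA_MASK, LAMBDA_CLS, LAMBDA_DICE = 5.0, 2.0, 5.0

def bce(pred, target, eps=1e-7):
    """Binary cross entropy over predicted probabilities in [0, 1]."""
    p = np.clip(pred, eps, 1.0 - eps)
    return -np.mean(target * np.log(p) + (1 - target) * np.log(1 - p))

def dice_loss(pred, target, smooth=1.0):
    """Soft Dice loss: 1 - 2|P ∩ T| / (|P| + |T|), with smoothing."""
    inter = np.sum(pred * target)
    return 1.0 - (2.0 * inter + smooth) / (np.sum(pred) + np.sum(target) + smooth)

def total_loss(mask_pred, mask_gt, cls_pred, cls_gt):
    """L = λ_mask * L_mask + λ_cls * L_cls + λ_dice * L_Dice."""
    return (LAMBDA_MASK * bce(mask_pred, mask_gt)
            + LAMBDA_CLS * bce(cls_pred, cls_gt)
            + LAMBDA_DICE * dice_loss(mask_pred, mask_gt))
```

With a perfect prediction all three terms vanish, so the total loss approaches zero; a fully wrong mask is penalized by both the weighted BCE and Dice terms at once.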

In-Situ Field Validation Study
To assess the detection ability of the finalized deep learning models, we conducted an in-situ field validation study. This type of study also implies the applicability of the developed deep learning models to real-world problem-solving.
A total of nine sites were selected for in-situ validation study, and the sites were located in Pennsylvania (Greene, Warren, Elk, McKean, and Forest counties) and Northern West Virginia (Monongalia County). These sites experienced low to heavy emerald ash borer infestations, leading to the death and decline of ash trees. A detailed description of the sites can be found in Table S1.
A protocol to obtain aerial images for the detection of declining or dead ash trees was developed to efficiently and promptly complete the survey. The main steps in the protocol included: (1) a target area for the aerial survey was defined and a drone flight path was planned, (2) a drone was deployed with an autopilot function for surveys with aerial photography over the target area, and (3) aerial images were downloaded and stitched to generate a composite image of the target area (Figure 12).
The drone was flown 50-75 m above the ground depending on terrain and obstacles. After the flight missions were completed, aerial images were downloaded from the drone and aligned according to the pre-planned flight path. The aerial images were stitched with Pix4DMapper (Pix4D, Prilly, Switzerland) to generate a composite image, which was then georeferenced to confer spatial attributes to the map coordinate system, i.e., ortho-mosaicking with ArcInfo® 10 (ESRI, Redland, CA, USA). The composite images were cropped into patches of 640-pixel-by-720-pixel dimensions and were fed to the trained Mask2former model. The output of the model was a series of predictions on the cropped patches. Finally, these predicted patches were stitched together to re-form the composite image. The final outputs were maps with bounding boxes of declining or dead ash trees.
To compare the outputs of the deep learning model with those of ground-truth data, we conducted ground surveys. The locations of individual declining and dead ash trees were recorded with GPS at each of nine sites. The results of the ground survey were used to validate the deep learning outputs by calculating and comparing the type I error (the aerial survey detected declining or dead ash trees, but these trees were not detected by ground validation) and the type II error (the aerial survey did not detect declining or dead ash trees, but trees were detected by ground survey) [49].
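One plausible way to compute the two error rates from per-site detection counts is sketched below; the count variables and the example numbers are hypothetical, not taken from Table S1:

```python
def survey_error_rates(aerial_only, ground_only, both):
    """Type I and type II error rates for aerial vs. ground detections.

    aerial_only: trees flagged by the aerial survey but not confirmed on
        the ground (type I, false positives);
    ground_only: trees found by the ground survey but missed by the
        aerial survey (type II, false negatives);
    both: trees detected by both surveys.
    """
    detected_aerial = aerial_only + both   # all aerial detections
    detected_ground = ground_only + both   # all ground-confirmed trees
    type1 = aerial_only / detected_aerial if detected_aerial else 0.0
    type2 = ground_only / detected_ground if detected_ground else 0.0
    return type1, type2

# Hypothetical site: 3 false positives, 2 misses, 45 trees found by both.
print(survey_error_rates(3, 2, 45))  # type I = 3/48, type II = 2/47
```

The denominators chosen here (all aerial detections for type I, all ground-confirmed trees for type II) are one reasonable convention; the paper's exact formulas may differ.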
Supplementary Materials: The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/plants12040798/s1, Table S1: The number of declining or dead ash trees detected by ground survey and the Mask2former (deep learning model), and associated statistics including types I and II errors. This in-situ validation study was conducted in nine different sites in Pennsylvania and West Virginia, USA, and two different thresholds (0.1 and 0.5) for the Mask2former were used; Figure S1: Disadvantages of image resizing in segmentation. (a), RGB image; (b), mask generated after cropping the RGB image into patches; (c), mask generated by resizing the RGB image; Figure S2: Transformer decoder architecture of the Mask2former model. The Queries (Q), Keys (K), and Values (V) vectors derived from the RGB image are fed to the cross-attention block, self-attention block, and the feed-forward network (FFN) to obtain segmentation maps and bounding boxes with a class label; Figure S3: Resolutions of aerial images showing ash tree dieback at different drone flight altitudes. Even 80 m above the ground, aerial images show the opposite branching pattern of declining ash trees.

Data Availability Statement:
The data that support the findings of this study will be available upon request.