2.1. Study Areas and Reference Data
The study areas were located in Helsinki and Lahti in Southern Finland and in Ruokolahti in South-Eastern Finland, as shown in Figure 1. There were seven study sites in total: Paloheinä (PH) in the Helsinki city central park (60°15′25.200″N, 24°55′19.200″E), Kerinkallio (KK) and Mukkula (MK) in Lahti (60°59′16.080″N, 25°41′2.040″E), and Murtomäki (MM), Paajasensalo (PS), Ryhmälehdonmäki (RM), and Viitalampi (VL) in Ruokolahti (61°29′21.840″N, 29°3′0.720″E) (Figure 2). The study sites are dominated by mature Norway spruce and were known to have ongoing infestations by I. typographus. The datasets were collected over several years using RGB cameras mounted on different UASs: Lahti in autumn 2013, Ruokolahti in autumn 2015, 2016, 2019, and 2020, and Helsinki in spring and autumn 2020. Details about the Lahti dataset are given in [5] and about the Helsinki dataset in [11].
Tree health data were collected by experts, who evaluated and classified various symptoms in field surveys. All symptoms were rated on a discrete numerical scale representing their severity (Table 1).
In Lahti and Ruokolahti, circular sampling plots were established, and all trees within a plot were surveyed using the approach developed in [19]. The center of each plot was located with a Trimble Geo XT GPS device (Trimble Navigation Ltd., Sunnyvale, CA, USA). Individual trees were located by measuring the distance and azimuth from the plot center to each tree. The radius of the plots was 5 m in the Ruokolahti datasets from 2015 and 2016 and 10 m in all other Ruokolahti and Lahti datasets. Within a sampling plot, the species, diameter at breast height (DBH), and height of each tree were recorded. For spruce trees, five symptoms indicative of bark beetle infestation were recorded: crown discoloration, defoliation, bark beetle entry and exit holes in the trunk, resin flow spots, and declined bark condition. The original scoring was 1–4 for discoloration and defoliation and 1–3 for holes, resin flow, and bark condition.
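For illustration, such distance and azimuth measurements can be converted to map coordinates as in the following minimal sketch (the function name and the assumption of azimuths in degrees clockwise from north are ours, not details from the study):

```python
import math

def tree_position(plot_x, plot_y, distance_m, azimuth_deg):
    """Convert a distance/azimuth measurement from the plot center to
    map coordinates (x east, y north), assuming the azimuth is given
    in degrees clockwise from north."""
    az = math.radians(azimuth_deg)
    return (plot_x + distance_m * math.sin(az),
            plot_y + distance_m * math.cos(az))

# Example: a tree 4.2 m from the plot center at azimuth 135 degrees.
x, y = tree_position(385_210.0, 6_760_430.0, 4.2, 135.0)
```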
In Paloheinä, instead of using sampling plots, health data were recorded for selected individual spruce trees dispersed throughout the area, aiming for a uniform distribution of different health statuses [11]. The positions of the trees were obtained from orthophotos collected before tree selection. The recorded health symptoms were crown color, defoliation, resin flow, bark damage, and decreased canopy height; the scoring was 0–5 for crown color, 0–4 for defoliation, and 0–2 for resin flow and bark damage.
The scoring scales differed slightly between the two schemes, so they were scaled to the same range (Table 1). The scores for entry and exit holes and for canopy height are not shown, because these symptoms were not evaluated in both areas and were therefore not used. Altogether, there were 1616 reference spruce trees and 1743 reference trees in total.
Two symptom rules were considered: one based only on the crown color, and another based on both crown and trunk symptoms. Previous studies have already shown that classification is successful when there are visible crown symptoms [9,13], but it is also of interest to study whether spruce trees with green crowns and stem symptoms can be distinguished from healthy green spruce trees with the proposed method.
Symptom rule A: The crown color was used to classify the spruce trees into the classes healthy, infested, and dead. Only spruce trees were labeled; other trees were considered background. In the Lahti and Ruokolahti datasets, green spruce trees were labeled as healthy; yellowish and reddish as infested; and gray as dead. In the Paloheinä datasets, green and faded green spruce trees were labeled as healthy; yellowish, yellow, and reddish/brown as infested; and gray as dead. Green attack was not included as a separate class in this study, because the field surveys were performed mostly late in the summer, when there was not a significant number of trees in this phase.
Symptom rule B: A health index for the reference trees was calculated using a modified version of the equation initially presented in [11] (Equation (1)). The trees with a reddish, brown, or gray color were classified as dead. Trees with a Health_index of two or lower were classified as healthy, and those with a Health_index greater than two as infested.
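The labeling logic of rule B can be summarized in a short sketch. The computation of the Health_index itself (Equation (1)) is not reproduced here; the function below assumes it has been computed from the scaled symptom scores, and all names are illustrative:

```python
DEAD_COLORS = {"reddish", "brown", "gray"}

def classify_rule_b(crown_color, health_index):
    """Label a reference spruce according to symptom rule B.
    `health_index` is assumed to be computed from the scaled crown and
    trunk symptom scores with Equation (1), whose exact form is not
    reproduced here."""
    if crown_color in DEAD_COLORS:
        return "dead"
    return "healthy" if health_index <= 2 else "infested"
```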
The number of trees in each class is presented in Table 2. With symptom rule A, the class healthy was the largest and the class infested the smallest; the class infested was particularly small in Lahti and Ruokolahti, with 14 and 17 trees, respectively. With symptom rule B, the class distribution was more even. In Lahti and Ruokolahti, many trees classified as healthy using rule A were moved to the class infested. These were trees that had a green or faded green crown color despite having stem symptoms.
The class distribution represented the outbreak status in the areas well. In Paloheinä, the outbreak was in an active state, with many fresh infestations and high mortality. In Ruokolahti, the high number of dead spruce trees indicated strong earlier colonization pressure; the proportion of freshly infested reference trees was small because the outbreak peak had already passed. The fact that several trees with trunk symptoms had green crowns was due to either low population densities or an early phase of colonization.
2.2. Remote Sensing Datasets
Remote sensing images were collected during the field inspections in 2013–2020 (Table 3). Different RGB cameras and multirotor UASs were used in different years. The cameras included the Samsung NX1000 (20.3 megapixel 23.5 × 15.7 mm CMOS sensor and 16 mm lens), Samsung NX300 (20.3 megapixel 23.5 × 15.7 mm CMOS sensor and 16 mm lens), Sony A7R (36.4 megapixel 35.9 × 24 mm CMOS sensor and 28 mm lens), and Sony A7R II (42.4 megapixel 35.9 × 24 mm CMOS sensor and 35 mm lens), operated in single and dual camera modes. The flying altitudes were 90–140 m. Datasets were collected under cloudy, sunny, and varying conditions.
Orthophotos were calculated with ground sample distances (GSDs) of 2.4–6 cm using the Agisoft Metashape software and the methods described in [20]. The images were scaled to a digital number range of 0–255 without further radiometric calibration.
Annotated datasets were created using the orthophotos and reference trees. Tree crown segmentations were made to determine the extent of individual reference treetops (Figure 3). For the Paloheinä dataset, this was done by automatically segmenting airborne laser scanning (ALS) data provided by the city of Helsinki and visually checking potential fallen trees by comparing the segments to orthophotos [11]. Segmentations for the Lahti and Ruokolahti data were made manually using the orthophotos and height data. Finally, bounding boxes were created around each segmented tree. The ground-surveyed reference tree coordinates in the Lahti and Ruokolahti datasets did not match perfectly with the tree positions in the orthophotos; the displacements were corrected during the manual segmentation process. The Paloheinä reference trees had been located directly in the orthophotos from the outset.
As the YOLOv4-p5 network can only process images up to a certain size without loss of information, the orthophotos were divided into several smaller images. For the Lahti and Ruokolahti areas, rectangular images corresponding roughly to the sample sites were cropped (Figure 3a). Possible unlabeled trees on the perimeters of the plots were removed by replacing them with zero-value pixels. In the Paloheinä area, the individual reference trees were spread over the area. In this case, convex hulls with an additional 0.5 m buffer were created around the tree crown segments, and the resulting shapes were cut from the orthoimages. Cropped images with a black background were then created (Figure 3b). The resulting images were not as realistic as the sampling plot images, as the trees were not surrounded by normal forest vegetation.
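As an illustration of this cropping step, the following sketch masks an orthoimage outside the buffered convex hull of a crown segment. It assumes Shapely 2.x and pixel-coordinate inputs; the function and its interface are ours, not the study's actual processing chain:

```python
import numpy as np
import shapely
from shapely.geometry import MultiPoint

def crop_to_hull(image, crown_pixels, buffer_m, gsd_m):
    """Zero out everything outside the buffered convex hull of a
    tree-crown segment. `crown_pixels` is an (N, 2) array of
    (col, row) pixel coordinates; the buffer is converted from
    metres to pixels using the GSD."""
    hull = MultiPoint(crown_pixels).convex_hull.buffer(buffer_m / gsd_m)
    rows, cols = np.mgrid[0:image.shape[0], 0:image.shape[1]]
    inside = shapely.contains_xy(hull, cols.ravel(), rows.ravel())
    masked = image.copy()
    masked[~inside.reshape(image.shape[:2])] = 0  # black background
    return masked
```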
2.3. YOLOv4-p5 Implementation
The original YOLO network was published in 2016 [12]. It learns to detect the location and class of objects in an image. As a supervised deep learning approach, it requires a large amount of labeled training data to learn the task. Since the first YOLO release [12], several refined network versions have been presented, for example, YOLOv2 [21], YOLOv3 [22], and YOLOv4 [23]. Scaled-YOLOv4 [24], specifically its version YOLOv4-p5, which has been scaled up from YOLOv4, was used in this study. It has a deeper network than YOLOv4, and its input size (896 × 896 pixels) is larger than that of YOLOv4 (608 × 608 pixels). Scaled-YOLOv4 builds on the earlier YOLO models, as well as on the Ultralytics YOLOv3 implementation [25], their YOLOv5 [26], and concepts from other neural networks.
Scaled-YOLOv4 consists of a backbone network, a neck, and a detection head [24]. The backbone is a CNN that extracts hierarchical features from three-band input images. The neck consists of add-on blocks that combine information from the feature maps at different levels. Finally, the detection head predicts object locations and their classes based on the combined features. It predicts a fixed number of bounding boxes, outputting for each the coordinates of the box center, the width and height of the box, an objectness score, and a class probability for each possible class [24,27]. The objectness score estimates the probability of the box containing an object of any class, while the class probabilities represent the conditional probability of the box containing an object of a specific class, given that it contains some object [12,21].
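A minimal sketch of how these outputs combine into final per-class detection confidences, i.e., the product of the objectness score and the conditional class probability (tensor shapes and names here are illustrative):

```python
import torch

def detection_scores(objectness, class_probs):
    """Combine objectness and conditional class probabilities into
    per-class detection confidences.
    objectness:  (N,) probability that each box contains any object
    class_probs: (N, C) probability of each class, given an object
    returns:     (N, C) unconditional per-class confidences"""
    return objectness.unsqueeze(1) * class_probs

obj = torch.tensor([0.9, 0.2])
cls = torch.tensor([[0.7, 0.2, 0.1], [0.3, 0.4, 0.3]])
scores = detection_scores(obj, cls)  # scores[0, 0] ≈ 0.63 (= 0.9 * 0.7)
```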
The network is trained by minimizing the error between its outputs and the ground-truth labels for a set of training images. The error is measured by a loss function (Equation (2)), which is the weighted sum of the objectness (L_obj), classification (L_cls), and bounding box (L_bb) losses:

L = λ_1 L_obj + λ_2 L_cls + λ_3 L_bb,  (2)

where λ_1, λ_2, and λ_3 are the respective weights for the objectness, classification, and bounding box losses. Binary cross-entropy (BCE) loss is used for the objectness and classification losses, and Complete IoU loss [28] for the bounding box loss. Training is done using mini-batch gradient descent (PyTorch SGD optimizer) with Nesterov momentum [29,30], weight decay, and a cosine learning rate scheduler. The network is trained for a chosen number of epochs. An exponential moving average (EMA) of the model weights is maintained during training, and the EMA model with the best performance on the validation set is chosen as the final model [27]. The performance is evaluated using the fitness score

fitness = 0.1 × [email protected] + 0.9 × [email protected]:0.95,  (3)

where AP indicates average precision; [email protected] and [email protected]:0.95 are defined in Section 2.4.
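A minimal sketch of the loss in Equation (2), assuming logit-space predictions and precomputed CIoU values for the matched boxes; the weight values shown are illustrative defaults, not the values used in the study:

```python
import torch
import torch.nn.functional as F

def total_loss(pred_obj, gt_obj, pred_cls, gt_cls, ciou,
               lambda1=1.0, lambda2=0.5, lambda3=0.05):
    """Weighted sum of the three component losses (Equation (2)).
    `ciou` holds the Complete IoU between matched predicted and
    ground-truth boxes; the lambda weights are illustrative."""
    l_obj = F.binary_cross_entropy_with_logits(pred_obj, gt_obj)
    l_cls = F.binary_cross_entropy_with_logits(pred_cls, gt_cls)
    l_bb = (1.0 - ciou).mean()  # CIoU loss = 1 - CIoU
    return lambda1 * l_obj + lambda2 * l_cls + lambda3 * l_bb
```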
Models were not trained from scratch but by fine-tuning a pre-trained model. Fine-tuning refers to taking a model trained for one task and continuing its training on another dataset. This way, the knowledge learned for a more general task can be used as a starting point for learning a specific task, which can be beneficial when only a limited amount of data is available for the task at hand. All layers of the network may be fine-tuned, or some may be frozen before training. As the starting point for training, a pre-trained YOLOv4-p5 was used; it had been trained on the COCO 2017 object detection dataset, which contains 118,000 training and 5000 validation images with labeled instances of 80 common object classes [31]. Additionally, models were fine-tuned for different subareas by first fine-tuning the COCO-trained model on the full tree health dataset and then further fine-tuning it with data from one subarea. All layers of the network were fine-tuned.
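A generic PyTorch sketch of this fine-tuning setup (not the actual Scaled-YOLOv4 training script); the checkpoint layout and optimizer settings are assumptions based on common Ultralytics-style defaults:

```python
import torch

# Start from COCO-trained weights and continue training every layer
# on the tree health data (no layers are frozen).
ckpt = torch.load("yolov4-p5.pt", map_location="cpu")  # pre-trained weights
model = ckpt["model"].float()      # assumed checkpoint layout
for p in model.parameters():
    p.requires_grad = True         # fine-tune all layers

optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.937, nesterov=True,
                            weight_decay=5e-4)  # illustrative values
```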
Training details can be adjusted through the model hyperparameters. Models were initially trained using the default hyperparameters; the hyperparameters of two models were then optimized. The hyperparameters control details of the optimization algorithm, the loss function, and data augmentation. The initial learning rate, momentum, and weight decay parameters affect the optimization algorithm, while the objectness, classification, and bounding box loss gains adjust the contribution of each component loss to the total loss. The BCE loss positive weights for the objectness and classification losses control the influence of positive samples on these losses. The anchor-multiple threshold controls the matching of predicted and ground-truth bounding boxes when computing the loss. The focal loss gamma is a parameter of focal loss [32], which can optionally be used in place of BCE loss. The remaining hyperparameters control data augmentation during training.
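The hyperparameters named above map to entries in an Ultralytics-style hyperparameter file; the sketch below lists them with indicative values (key names follow that convention as an assumption here, and the values are not the study's settings):

```python
# Illustrative hyperparameter set in the Ultralytics-style hyp format.
hyp = {
    "lr0": 0.01,          # initial learning rate
    "momentum": 0.937,    # SGD momentum
    "weight_decay": 0.0005,
    "obj": 1.0,           # objectness loss gain
    "cls": 0.5,           # classification loss gain
    "box": 0.05,          # bounding box loss gain
    "obj_pw": 1.0,        # BCE positive weight, objectness
    "cls_pw": 1.0,        # BCE positive weight, classification
    "anchor_t": 4.0,      # anchor-multiple threshold
    "fl_gamma": 0.0,      # focal loss gamma (0 = plain BCE)
    "hsv_h": 0.015, "hsv_s": 0.7, "hsv_v": 0.4,   # HSV augmentation
    "translate": 0.5, "scale": 0.5, "fliplr": 0.5,  # geometric aug.
    "mosaic": 1.0,        # mosaic probability
    "mixup": 0.2,         # MixUp probability
}
```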
Scaled-YOLOv4 uses a set of data augmentation techniques on the training set to improve performance. Data augmentation provides more data for training and creates new types of examples, which helps learning. The dataset is augmented by combining multiple images into one training image using a mosaic technique [26] and MixUp augmentation [33], and then applying standard photometric and geometric distortions: random hue-saturation-value (HSV) changes, translation, scaling, and flipping. Additional augmentation options include rotation, shear, and perspective shifts. Each augmentation hyperparameter controls the probability or the possible range of one augmentation method.
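For instance, MixUp blends two training images and keeps the labels of both; a minimal sketch (the alpha value is illustrative):

```python
import numpy as np

def mixup(img_a, labels_a, img_b, labels_b, alpha=8.0):
    """MixUp augmentation: blend two training images with a
    Beta(alpha, alpha) mixing coefficient [33] and keep the
    bounding-box labels of both images."""
    lam = np.random.beta(alpha, alpha)
    mixed = (lam * img_a + (1 - lam) * img_b).astype(img_a.dtype)
    return mixed, labels_a + labels_b  # both sets of boxes are kept
```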
The hyperparameters were optimized for two of the models using a genetic algorithm (e.g., [34]) available in the Scaled-YOLOv4 implementation [27]. In each generation, the model was trained for 70 epochs, and the hyperparameters were tuned by selection and mutation according to the performance on the validation set. The performance was measured with the fitness score (Equation (3)). The genetic algorithm was run for 300 generations. The hyperparameters were initialized with the default values, except for those with a default value of 0 (details are given in Section 3.1).
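A simplified sketch of such a mutation-selection loop (the actual implementation also selects among the best previous generations and clips mutated values; all names here are illustrative):

```python
import numpy as np

def evolve(train_fn, hyp, generations=300, sigma=0.2):
    """Hyperparameter evolution sketch: mutate the current best
    hyperparameters, train briefly, and keep the mutant if the
    validation fitness (Equation (3)) improves."""
    best_hyp, best_fit = dict(hyp), train_fn(hyp)  # 70-epoch training
    for _ in range(generations):
        mutant = {k: v * float(np.random.normal(1.0, sigma))
                  for k, v in best_hyp.items()}
        fit = train_fn(mutant)
        if fit > best_fit:
            best_hyp, best_fit = mutant, fit
    return best_hyp
```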
When making detections with the trained model, Scaled-YOLOv4 applies some post-processing to its outputs. Non-maximum suppression (NMS) is used to remove redundant detections: it compares the confidence scores of overlapping boxes of the same class and removes the predictions with lower confidence. The model also uses a confidence threshold (CT) to filter the outputs, removing predictions with a score lower than the threshold. In this study, the confidence threshold was selected by balancing the precision and recall values on the validation dataset, and it varied from 0.1 to 0.6 for the different models.
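A minimal sketch of this post-processing using torchvision's class-wise NMS; the threshold values shown are examples within the ranges reported above:

```python
import torch
from torchvision.ops import batched_nms

def postprocess(boxes, scores, class_ids, conf_thr=0.3, iou_thr=0.5):
    """Filter detections by confidence, then apply class-wise NMS.
    boxes: (N, 4) in (x1, y1, x2, y2); conf_thr corresponds to the
    model-specific confidence threshold (0.1-0.6 in this study)."""
    keep = scores >= conf_thr                 # confidence filtering
    boxes, scores, class_ids = boxes[keep], scores[keep], class_ids[keep]
    kept = batched_nms(boxes, scores, class_ids, iou_thr)  # per-class NMS
    return boxes[kept], scores[kept], class_ids[kept]
```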
The official PyTorch implementation of YOLOv4-p5 [27] was used. The code was run inside a Docker container created from the NVIDIA PyTorch container image, release 20.06 [35]. The container provided an Ubuntu 18.04 environment with Python 3.6, PyTorch 1.6.0a0+9907a3e, and NVIDIA CUDA 11.0.167; for full details, see [35]. The host system was a desktop computer with an Ubuntu 18.04 operating system and an NVIDIA GeForce GTX 1080 Ti (11 GB) graphics processing unit (GPU).
Multiple trained YOLOv4-p5 models were produced by varying the training data, symptom rule, and hyperparameters (Table 4). A batch size of four was used for training. The impact of hyperparameter optimization was evaluated for two models (trFull-syrA-hyopt, trFull-syrB-hyopt). Models trained with the full data (all areas combined), as well as separately with the Paloheinä (PH) and Lahti/Ruokolahti (LR) subarea data, were evaluated. Three model options were evaluated in the subareas: (1) training the model with the full data and testing with the subarea test data (trFull-syrA-PHtest, trFull-syrA-LRtest); (2) training and testing the model using only the subarea data (trPH-syrA, trLR-syrA); and (3) initial training using the full dataset, followed by fine-tuning and testing using the subarea data (trFull-syrA-PHft, trFull-syrA-LRft). For clarity, only the model names for symptom rule A are given above; the evaluations were made similarly for symptom rule B.
The full dataset was divided randomly into training, validation, and test datasets: 20% of the images were reserved for testing, and the remaining 80% were divided 80:20 into training and validation datasets, respectively. The subarea datasets were created after this division by removing the images from the other areas from each dataset. A summary of the samples in the training dataset is given in Table 5, and in the validation and testing datasets in Table 6.
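A minimal sketch of this split (the random seed and function name are illustrative):

```python
import random

def split_dataset(images, seed=0):
    """Random 20% test split; the remaining 80% is split 80:20 into
    training and validation (i.e., 64/16/20 overall)."""
    imgs = list(images)
    random.Random(seed).shuffle(imgs)
    n_test = round(0.2 * len(imgs))
    n_val = round(0.2 * 0.8 * len(imgs))
    test = imgs[:n_test]
    val = imgs[n_test:n_test + n_val]
    train = imgs[n_test + n_val:]
    return train, val, test
```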
2.4. Performance Assessment
The trained models were evaluated using test datasets that consist of labeled data not used in training (see Table 6). To determine whether a predicted box correctly detects an object, it is compared to the ground-truth (GT) labels. A prediction is considered correct, or a true positive (TP), if its Intersection over Union (IoU) [36] with a ground-truth box of the same class is greater than a chosen threshold; an IoU threshold of 0.5 was used in this study. If a prediction overlaps with multiple ground-truth boxes by more than the threshold, it is considered to predict the ground-truth object with which it has the highest IoU. Predictions that do not overlap a GT box of the same class by more than the threshold are considered false positives (FP). FPs are thus caused by location and classification errors.
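A minimal sketch of this matching criterion for a single prediction (for brevity, it omits the bookkeeping that prevents two predictions from matching the same GT box):

```python
def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def is_true_positive(pred_box, pred_cls, gt_boxes, gt_classes, thr=0.5):
    """A prediction is a TP if its best IoU among same-class
    ground-truth boxes exceeds the threshold (0.5 in this study)."""
    ious = [iou(pred_box, g) for g, c in zip(gt_boxes, gt_classes)
            if c == pred_cls]
    return bool(ious) and max(ious) > thr
```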
Detection results of the trained models were evaluated with the metrics precision, recall, F-score, and average precision (AP) [36]. Precision and recall are calculated using a TP IoU threshold of 0.5. Precision is defined as the fraction of predictions that are correct:

Precision = TP / (TP + FP),  (4)

and recall is the fraction of ground-truth objects that were identified correctly, i.e.,

Recall = TP / (TP + FN),  (5)

where FN indicates the number of false negative detections. Precision and recall may be adjusted by changing the confidence threshold. Recall decreases as the confidence threshold increases, while precision may increase or decrease. The F-score [36] is the harmonic mean of precision and recall:

F = 2 × Precision × Recall / (Precision + Recall).  (6)
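These metrics follow directly from the TP, FP, and FN counts; a minimal sketch:

```python
def precision_recall_f(tp, fp, fn):
    """Precision, recall, and F-score (Equations (4)-(6)) from the
    true positive, false positive, and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f
```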
A precision-recall curve may be plotted by computing precision against recall at different confidence thresholds from 0 to 1. An ideal object detector has high precision at all recall levels, and thus the area under the curve (AUC) is close to 1. AP is an estimate of the AUC, computed at a chosen true positive IoU threshold [36]. AP at IoU threshold 0.5 ([email protected]) and the mean of the APs at 10 equally spaced IoU thresholds from 0.5 to 0.95 ([email protected]:0.95) were computed for the different classes. The term mean average precision (mAP) refers here to the mean of the APs of the different classes; in other works, it is sometimes used interchangeably with AP, or to denote the average over different IoU thresholds. Overall precision, recall, [email protected], and [email protected]:0.95 were computed as the means of the class-wise results.
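A sketch of the AP estimate using the common precision-envelope interpolation (it assumes recall values sorted in ascending order; this is the standard benchmark computation, not code from the study):

```python
import numpy as np

def average_precision(precisions, recalls):
    """Estimate AP as the area under the precision-recall curve,
    using the monotone (envelope) interpolation common in detection
    benchmarks. Inputs are sampled across confidence thresholds,
    with `recalls` in ascending order."""
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([1.0], precisions, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]  # precision envelope
    idx = np.where(r[1:] != r[:-1])[0]        # points where recall changes
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))
```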