Automatic Delineation and Height Measurement of Regenerating Conifer Crowns under Leaf-Off Conditions Using UAV Imagery

The increasing use of unmanned aerial vehicles (UAV) and high spatial resolution imagery from associated sensors necessitates the continued advancement of efficient means of image processing to ensure these tools are utilized effectively. This is exemplified in the field of forest management, where the extraction of individual tree crown information stands to benefit operational budgets. We explored training a region-based convolutional neural network (Mask R-CNN) to automatically delineate individual tree crown (ITC) polygons in regenerating forests (14 years after harvest) using true colour red-green-blue (RGB) imagery with an average ground sampling distance (GSD) of 3 cm. We predicted ITC polygons to extract height information using canopy height models generated from digital aerial photogrammetric (DAP) point clouds. Our approach yielded an average precision of 0.98, an average recall of 0.85, and an average F1 score of 0.91 for the delineation of ITC. Remote height measurements were strongly correlated with field height measurements (r2 = 0.93, RMSE = 0.34 m). The mean difference between DAP-derived and field-collected height measurements was −0.37 m and −0.24 m for white spruce (Picea glauca) and lodgepole pine (Pinus contorta), respectively. Our results show that accurate ITC delineation in young, regenerating stands is possible with fine-spatial resolution RGB imagery and that predicted ITC can be used in combination with DAP to estimate tree height.


Introduction
Successful reforestation following harvest is an integral component of sustainable forest management, as the degree to which stands become established will influence the entire lifecycle of a forest [1]. Given that regenerating stands are harvested approximately 80 years after planting, the reforestation phase represents a significant financial and ecological investment, the success of which rests heavily upon the ability to monitor and manage these young, regenerating stands [1,2]. The frequency at which stands are monitored, however, is currently limited by the cost of field-based assessment methods, which can result in poorly performing stands being assessed with insufficient frequency [3].
The pronounced size, abundance, and remoteness of Alberta's forested land base relative to other jurisdictions increases the cost and difficulty of operating field-based assessments. These factors are confounded by the restricted availability of field crews as well as poor road access to many harvested sites. Moreover, licensee holders are obligated to return harvested stands to a free-to-grow status, and silvicultural surveys determining this status present significant operational challenges, including additional financial constraints [4]. However, the increasing use of remote sensing datasets for forest inventory and assessment has seen marked improvements to the speed, scale, and cost of forest monitoring and the continued development of these applications provides new solutions to address some of the difficulties managers are faced with around regeneration assessment [5].
One area of long-term investment in optical remote sensing applications is the extraction of information at the individual tree level at a pace that is more efficient and less expensive than field surveying or photo-interpretation. This need has driven research into tree crown detection and delineation algorithms since the 1980s, when advances in digital optical sensors made individual tree crowns (ITC) visible in imagery and increased computing power made automated crown delineation possible [6]. The foundational concept driving these algorithms is that in nadir imagery, treetops will receive more solar illumination than other parts of the crown. Accordingly, pixels corresponding to treetops will be the brightest pixels within a tree crown. In early studies (e.g., [7]), tree crowns were detected by identifying the brightest pixels (local maxima) within a moving window of fixed dimensions (for example, 5 by 5 pixels) as the window scanned across an image. This technique was named local maxima filtering. Subsequently, the valley-following algorithm was developed [8], which isolated ITC by detecting local minima in shaded gaps between crowns rather than by detecting local maxima within them. An extension of the valley-following algorithm, watershed algorithms saw wide-spread adoption for ITC delineation in the early 2000s [5]. Watershed algorithms generate a surface in which the brightest pixels correspond to the lowest elevations where water would gather if the surface were flooded, thus representing watersheds. Boundaries formed where watersheds met, which allowed for discrete tree crowns to be delineated.
While effective for stands with evenly sized and well-spaced crowns, these algorithms faced challenges with complex forest conditions where crowns varied in diameter and grew in clustered formations [5]. The fixed window size of local maxima filtering approaches was found to be inadequate under these conditions. Large windows captured large crowns and missed small ones, while small windows captured small crowns and counted large crowns multiple times. Local maxima filtering algorithms with variable window sizes were developed; however, overlapping crowns with minimal shadow between them were still a hindrance to both local maxima filtering and watershed algorithms. These algorithms also demonstrated sensitivity to solar angle and elevation as shadow-obscured crowns tended to be missed. In addition to these challenges, spatial resolution proved to be a barrier to the detection and delineation of ITC in regenerating forests. When Gougeon and Leckie [9] applied their valley-following methods to regeneration in multispectral aerial imagery with a 30 cm spatial resolution, the small crowns of five-year old trees were completely missed. Later inquiries by Gougeon and Leckie [10] yielded the delineation of approximately 85% of crowns with 2.5-4.0 m diameters using a semiautomatic methodology applied to IKONOS (GSD 1 m) imagery resampled to 50 cm spatial resolution. However, clusters of multiple crowns were commonly delineated as a single entity. Leckie [11] reported similar errors for the delineation of 20 year-old conifers in multispectral aerial imagery with a 60 cm spatial resolution, which they attributed to imagery coarseness.
Currently, high spatial resolution RGB imagery is widely available, allowing even small tree crowns to be visible from practical altitudes. A new generation of algorithms have been developed to overcome some of the aforementioned challenges in delineating ITC. Convolutional neural networks (CNN) have been increasingly utilized for automatic semantic segmentation (the prediction of class labels for each pixel in an input image) in remotely sensed images and have been proven capable of outperforming conventional techniques [12][13][14][15][16]. CNN are comprised of layers of convolution filters that work in succession to extract increasingly nuanced object features from input imagery. By analyzing annotated reference images, a CNN learns by example to activate layers that appropriately capture aspects of its target objects, resulting in powerful image segmentation capabilities that are not based exclusively on pixel brightness. Furthermore, trained CNN are highly transferable; the layer activation patterns learned by a CNN are stored in a single file, which can either be disseminated for wide-spread use or to initialize the training of a new CNN (the latter is known as transfer learning) [17]. The aptitude of CNN for regeneration monitoring has been demonstrated by Fromm et al. [18] and Pearse et al. [19] who applied CNN to automatically detect individual conifer seedlings in UAV-acquired RGB imagery (DJI Mavic Pro and DJI Phantom 4 Pro, respectively). Likewise, Fuentes-Pacheco et al. [20] employed a CNN to detect and delineate phenotypically variable fig plants and Kentsch et al. [21] trained a CNN to detect and classify patches of invasive and native vegetation. Both of these studies used RGB imagery collected with a DJI Phantom 4 UAV.
The proliferation of CNN applications over the last decade and their application to remotely sensed datasets has precipitated the ongoing improvement of network architectures better designed to identify features on image datasets [14]. One key improvement has been the increased use of region-based CNN (R-CNN), which address the difficulty of detecting multiple objects of varying aspect ratios by using a selective search algorithm [22] to generate a set amount of region proposals from which object bounding boxes may be extracted [23]. Building on this, Mask R-CNN [24] provides an additional step to R-CNN with an additional network branch dedicated to generating a per-pixel, binary mask for each instance of an object detected within a bounding box [25]. This distinguishes Mask R-CNN from other R-CNN, whose processing pipelines terminate with outputting bounding boxes. While Mask R-CNN was designed for object instance segmentation in natural imagery, the advantage of per-pixel object masks has seen quick uptake in medical imaging [26][27][28] and subsequently, remote sensing [29][30][31][32]. In the context of forest monitoring, Braga et al. [32] employed Mask R-CNN to delineate ITC in a tropical forest with RGB, UAV-acquired imagery. They reported precision, recall, and F1 scores of 0.91, 0.81, and 0.86, respectively.
In addition to accurate delineation of tree crowns, a key requirement for forest management is the ability to extract information such as height from ITC. Recent advances in digital aerial photogrammetry (DAP) have enabled the construction of three-dimensional (3D) point clouds from overlapping UAV-acquired imagery [33][34][35]. From these 3D point clouds, canopy height models (CHM) [36] and fine-resolution orthomosaics can be constructed simultaneously, providing access to valuable height and spectral data.
In this study, working in regenerating stands 12-14 years postharvest, we assess the ability of Mask R-CNN to delineate individual coniferous trees in imagery collected during the early spring, before deciduous green-up (leaf-off conditions). We then extract height information from DAP-derived CHM and compare the results to reference data collected in the field. To do so, we first train Mask-R-CNN to automatically delineate regeneration crowns using UAV-acquired RGB imagery. We assess the efficacy of this approach across a range of regenerating stands of different boreal forest types and growing conditions in managed forests in west-central Alberta, Canada. We then explore the potential to use predicted ITC polygons to automatically extract tree-level height information from a DAP-generated CHM. This research serves as a contribution towards the goal of automating regeneration monitoring using remote sensing techniques.

Study Sites
Eighteen regenerating stands in the Drayton Valley and Rocky Mountain House regions of Alberta, Canada, were selected, representing 7 of the 10 typical stand types in Alberta [37] (Table 1, Figure 1). Dominant species were lodgepole pine (Pinus contorta), white spruce (Picea glauca), and aspen (Populus tremuloides). Balsam poplar (Populus balsamifera), white birch (Betula papyrifera), and black spruce (Picea mariana) were also present. These stands are typical of other stands of similar types Remote Sens. 2020, 12, 4104 4 of 26 across the province because: (1) forest companies are required by law to create the stand type that was harvested; and (2) all stands across the province are required to meet the same base 10 stratum specific minimum stocking standard based on a minimum tree height requirement [2]. Four of the selected stands are located in the Upper Foothills Natural Subregion with the remaining fourteen being located in the Lower Foothills Natural Subregion. All sites were within 50 km of each other. The Upper Foothills Natural Subregion is characterized by conifer and mixed conifer stands while the Lower Foothills Natural Subregion is characterized by codominance of conifer (lodgepole pine, white spruce, black spruce) and deciduous (aspen, balsam poplar, white birch) stands [38]. These stands and tree species represent five of the seven commercial tree species harvested in Alberta. The Lower Foothills Natural Subregion represents a transition from the aspen-white spruce dominated forests of the Central Mixedwoods Natural Subregion and the lodgepole pine dominated forests of the Upper Foothills Natural Subregion. Topography is generally rolling (morainal deposits) in the Lower Foothills Natural Subregion and strongly rolling in the Upper Foothills Natural Subregion [38]. However, most of the selected stands displayed mild slopes.

Field Measurements
To meet the regulatory requirements of the Reforestation Standard of Alberta, each stand was visited by a trained field crew in the fall of 2017. Circular plots (10 m 2 ) were established using a systematic sampling design, with the number of plots proportional to the size of each sampled stand. At each plot location, trees were tallied by species. In addition, at every 4th plot a larger 100 m 2 circular plot was also established with the same center. At each larger plot location, top height trees (largest diameter trees at breast height by Aw, Pl, Sw, and Sb tree species groups) were identified by trees species and measured for total height and total age for site index determination. Plots were spatially located using a Garmin GPSMAP 64s device and height measurements were taken with a Vertex hypsometer.
In the spring of 2018 prior to bud flush, plot centers sharing both the small and large plot were relocated and marked with spray paint. Height and species data were verified for trees within each plot. Measured top height trees were marked with a spray paint halo to ensure visibility in the imagery. Azimuth and distance from plot center for top height trees were also recorded. Once the plots were marked and measured, imagery was collected. Measured trees were then related to corresponding orthomosiacs using their azimuth and distance to plot centers. During this step, some points required manual adjustments and fifteen field-measured conifers could not be located in orthomosaics with certainty. Of the remaining 214 field-measured conifers that were located, 76 were lodgepole pine

Field Measurements
To meet the regulatory requirements of the Reforestation Standard of Alberta, each stand was visited by a trained field crew in the fall of 2017. Circular plots (10 m 2 ) were established using a systematic sampling design, with the number of plots proportional to the size of each sampled stand. At each plot location, trees were tallied by species. In addition, at every 4th plot a larger 100 m 2 circular plot was also established with the same center. At each larger plot location, top height trees (largest diameter trees at breast height by Aw, Pl, Sw, and Sb tree species groups) were identified by trees species and measured for total height and total age for site index determination. Plots were

Leaf-Off Conditions
Imagery was acquired in early spring after snow-melt but before deciduous green-up had occurred. At this time of year, deciduous trees were leafless and grasses were still dormant and therefore light brown in color. Hence, under these leaf-off conditions, conifers (lodgepole pine and white spruce) were almost exclusively the only source of green reflectance (some patches of green moss were present). Furthermore, conifers situated beneath deciduous overstorey, which would otherwise be obscured during periods of deciduous foliage, were visible under leaf-off conditions.

Acquisition Parameters
True colour RGB imagery was acquired with a DJI Phantom 4 pro quadcopter over a 6-day period in May 2018. Imagery was acquired with a 20-megapixel 1" CMOS sensor. Prior to flights, at least three 60 × 60 cm targets (bicolor panels) were evenly distributed throughout each site. The location of the center point of each target was measured utilizing a Trimble Geo7× to produce high-accuracy ground control point coordinates ( Table 2). Flight patterns were programmed to achieve approximately 85% and 80% along-and across-track overlap, respectively. Flights were 100 m above ground level. Ground sampling distance (GSD) was approximately 3 cm. Flights occurred throughout the day under mostly clear sky conditions with minimal wind. Table 2. Input UAV imagery, ground control point counts with mean root mean square error (RMSE), and output point density (mean and standard deviation) for each of the eighteen stands. Mean point density was calculated by sampling ten 50 m 2 plots, which were evenly distributed throughout each site.

Images
Ground

Point Cloud Processing
UAV imagery was georeferenced using ground control point locations prior to the construction of DAP point clouds and orthomosaics with Pix4D software [39]. After tie-point pixels between overlapping imagery were matched, DAP point clouds were produced and 3-band (RGB) orthomosaics were exported. DAP point clouds were processed in LAStools [40] where they were classified into ground or nonground points. Points with maximum height values were used to construct digital surface models. Ground points were used to construct digital terrain models. Canopy height models (CHM) with a 3 cm resolution were then created according to the methods detailed by Khosravipour et al. [36]. DAP processing took approximately 3.5 h per hectare of imagery.
For processing within Mask R-CNN (described below), we subset the 3-band (RGB) orthomosaics into 600 × 600-pixel image tiles derived by splitting said orthomosaics with a 20% overlap (120 pixels in all directions). Orthomosaics were subset with overlap to account for trees that may be occluded by image tile boundaries, ensuring that all trees were fully visible in at least one image tile.

Training and Validation Data
One hundred and ten subset image tiles were selected for manual annotation, during which ITC were delineated as polygons. Fifty-two percent of these image tiles were from stands in which lodgepole pine were the dominant coniferous species. Remaining image tiles contained dominant stands of white spruce. Manually annotated subset image tiles were verified by a qualified Alberta Vegetation Inventory Level II interpreter. Figure 2 provides examples of these manually annotated subset image tiles. The 110 tiles were divided into 88 training tiles, 11 internal validation tiles, and 11 external validation tiles, following a 80:10:10 ratio [29].

Mask R-CNN
In the following section, we provide a high-level explanation of our Mask R-CNN approach for this study and refer readers to He et al. [41] for a more detailed background to the approach.

The Learning Process
The learning process for Mask R-CNN consists of two phases: training and validation [24]. During the training phase, Mask R-CNN applies a succession of convolutional filters to input imagery

Mask R-CNN
In the following section, we provide a high-level explanation of our Mask R-CNN approach for this study and refer readers to He et al. [41] for a more detailed background to the approach. The learning process for Mask R-CNN consists of two phases: training and validation [24]. During the training phase, Mask R-CNN applies a succession of convolutional filters to input imagery with the goal of distinguishing target ITC from background features [41]. The validation phase then tests filter combinations on examples that Mask R-CNN has not yet been exposed to-the internal validation set [24]. During the validation phase, Mask R-CNN will score its ability to delineate ITC using the filter combinations developed during the training phase and subsequently determine a new configuration of convolutional filters that are likely to yield a higher score. The training and validation phases are then repeated across the entire training and internal validation datasets. Each repetition of the training and validation phases is called an epoch and the total number of epochs Mask R-CNN performs is dictated by the user. Following the completion of an epoch, we accessed Mask R-CNN's performance on the external validation set, using average F1 score (described in Section 2.7) as a determinant of success.

Pretrained Networks
Previous studies have shown that initializing training using a pretrained CNN (transfer learning) reduced the duration of training time and subsequently improved results [17,18,42]. One key dataset that has been widely used to pretrain CNN across a variety of applications is the Microsoft Common Objects in Context (hereby referred to as MS COCO) dataset [43], which contains upwards of 91 object types ("Dog", "Person", "Plane", etc.). While a new model may not be trained to identify these object classes, the low-level features a CNN is trained to detect, including edges, textures, and simple geometry, are components of all objects and thus, transfer learning has demonstrated the ability of CNN to repurpose low-level features for the semantic segmentation of new object classes [17]. As demonstrated by Hu et al. [44] and Zhao, et al. [17], initializing training on remotely sensed imagery using a CNN pretrained with natural imagery improved results. Likewise, Fromm et al. [18] found pretraining to positively influence the performance of Faster R-CNN, a progenitor of Mask R-CNN.
We trained Mask R-CNN in separate instances using two Mask R-CNN models that were pretrained on different versions of the MS COCO dataset. The first pretrained network (hereby referred to as COCO), like many, was trained on the MS COCO dataset as presented by [43]. The second pretrained network (hereby referred to as Balloon) was trained on an expanded version of the MS COCO dataset that included a new object class for the semantic segmentation of individual and overlapping balloons in natural imagery [24]. We hypothesized that the similarities between the roughly circular footprints of our target conifer ITC and those of balloons would pose an advantage for the Balloon pretrained network over the COCO pretrained network.

Training Strategy
The architecture of Mask R-CNN is comprised of two main components: the network backbone and the network heads [25]. The backbone consists of convolutional layers used for feature extraction, while the heads consist of convolutional layers used for mask and bounding box prediction [25]. During the learning process, convolutional filters of the backbone or heads can be separately adjusted or frozen [24]. Because the backbone of a pretrained network is already trained to extract object features, it may not be necessary to adjust the backbone and adjusting only the heads will suffice to adapt to a new task [17]. However, allowing for the network's backbone and heads to be adjusted simultaneously can facilitate the learning of new features which will allow for Mask R-CNN to better solve a new task [17].
We investigated in separate instances (1) adjusting only network heads and (2) adjusting the heads and backbone simultaneously (all layers), where the first training strategy would elucidate the transferability of pretrained features and the second would demonstrate whether performance improvements could be made if the entire network were adjusted to solve our problem. In all instances, Remote Sens. 2020, 12, 4104 9 of 26 performance on the external validation set plateaued before 60 epochs had elapsed. Each epoch took approximately 18 min using a NVIDIA Quadro M4000 GPU.

Hyperparameter Changes
While repurposing a pretrained network is likely to improve results, some hyperparameters may need to be adjusted to account for differences between applications. For example, in order to adapt a version of Mask R-CNN pretrained with COCO to detect nuclei in medical imagery, Waleed [24] needed to account for differences between images in each dataset. Namely, images in the MS COCO dataset contained fewer object instances than did images in the nucleus dataset. Objects in the MS COCO dataset were generally larger than those in the nucleus dataset. To account for this, Waleed [24] increased the maximum number of object instances utilized per image during training from 100 to 200 and increased the maximum number of final detections from the default of 100 to 400. Additionally, to increase the number of detections, the non-maximum suppression threshold for the region proposal network was increased from 0.7 to 0.9, which increased the likelihood of detecting all nuclei in a given image.
In this research, the maximum number of detections was set to 300 on the assumption that training image tiles contain between approximately 1 and 250 delineated trees. The threshold for the number of ground truth instances used during training was set to 300 to ensure that all delineated trees within an image tile were used. To increase the number of proposals and detections during training, the region proposal network's non-maximum suppression threshold was set to 0.9 following [24]. The number of post non-maximum suppression regions of interest was set to 2000 to increase the number of opportunities for Mask R-CNN to make positive detections (especially in images with many densely clustered ITC). The minimum acceptable detection confidence was set to 0.5 to reduce the likelihood of incurring false negatives for small ITC that blend in with the background. Additionally, we used the Adaptive Moment Estimation (Adam) optimization algorithm, which has seen extensive use in CNN applications and has been recommended as the best general option for optimization algorithms [45].

Image Augmentation
To reduce the impact of overfitting these parameters on small training datasets, image augmentation in the form of simple geometric transitions, including flipping and rotating input training data, have been employed to expose CNN to statistically different examples of their assigned target objects [18,42]. Likewise, blurring images and altering their brightness can improve the robustness of a neural network towards radiometric variation between images [29]. For this research, we used the default Mask R-CNN image augmentation routine. This included blurring original images and altering their brightness in addition to performing random vertical and horizontal flips, and 90-, 180-, and 270degree rotations. Each time Mask R-CNN loaded an image during training, it performed zero to two of these image augmentations.

Model Evaluation
To quantify agreement between predicted ITC delineations and ground truth ITC, we used intersection over union (IoU), which is calculated as the number of pixels shared between predicted ITC and ground truth ITC divided by the total number of pixels contained in both entities [18] (Equation (1)). We considered a predicted ITC to be a true positive (TP) when the IoU value was greater than or equal to 0.5.
We calculated precision, recall, and F1 score at an IoU threshold of 0.5 for each image tile in the external validation set. Precision is the percentage of predicted ITC that were deemed true positives, and recall is the percentage of ground truth ITC that were correctly predicted. F1 Score is the harmonic mean of precision and recall, which we used as an indicator of overall accuracy [46]. Precision, recall, and F1 score values range from zero to one, where higher values indicate better results.
To facilitate comparison between models, we calculated average precision (AP) and average recall (AR) for all image tiles in the external validation set. We then calculated the average F1 score for each model as the harmonic mean of AP and AR.

Height Extraction and Comparison to Field Measurements
Following model evaluation, we applied our best-performing model to delineate ITC in each stand, using overlapping image tiles as input. We employed an automated postprocessing routine to eliminate redundant ITC polygons generated from predictions made on overlapping input image tiles. ITC polygons located within the 20% overlap zone of each output tile were labeled and shrunk to minimize the number of ITC polygons with shared borders. ITC polygons that overlapped after shrinkage were merged, thereby eliminating redundant or partial ITC polygons generated from overlapping input image tiles and unifying output tiles for each study site as single shapefile spanning the extent of that site. The maximum CHM value within each ITC polygon was then appended as an attribute. Predicted ITC within the surveyed plots were visually inspected. Those ITC that encompassed field-measured tree locations were deemed correct detections. Field-measured and CHM-extracted height measurements for detected ITCs were then compared by simple linear regression.
CHM-extracted height measurement error was quantified using Equations (5)- (7). Error was calculated as remote height minus field height (Equation (5)) so as to indicate over-and underestimation with positive and negative error values, respectively. Because error values were inherently smaller for shorter conifers, unsigned relative error (Equation (7)) was used to express discrepancies between field and remote height measurements as a percentage of field height, which required the calculated of unsigned error (Equation (6)). Error = Height remote − Height field (5) Unsigned Error = Error 2 Unsigned Relative Error = Unsigned Error Height field × 100 (7)

Mask R-CNN
When only the network heads were adjusted during training, the performance of the Balloon pretrained network plateaued at epoch 27 (Figure 3a). The COCO pretrained network plateaued at epoch 34. Despite the extended training period of the COCO pretrained network, its average F1 score was approximately 0.12 less than that of the Balloon pretrained network (Table 3). These factors support our hypothesis that the features learned by the Balloon pretrained network were more transferable to solving the task of ITC delineation.     When all network layers (heads + backbone) were adjusted during training, the impact of the selected pretrained networks on performance was minimal (Figure 3b, Table 3). The Balloon pretrained network plateaued at epoch 45 with an average F1 score of 0.91 and the COCO pretrained network plateaued at epoch 47 with an average F1 score of 0.89. Both of these F1 scores were greater than any of those yielded by training only the network heads.

Individual External Validation Tiles
Differences between the Balloon and COCO pretrained networks where all layers were adjusted during training were more pronounced for some individual external validation tiles than others. These external validation tiles are shown in Figure 4 with each tile name corresponding to the items in the first column of Tables 3 and 4. Generally, the Balloon pretrained network yielded more TP at the cost of a slight increase in FP (Tables 4 and 5). This was true for all external validation tiles except for tiles g and k, where the COCO pretrained network yielded slightly more TP. The greatest difference in F1 score was observed for tile e (+0.06 for the Balloon pretrained network).
Remote Sens. 2020, 12, x FOR PEER REVIEW 14 of 28 Figure 4. External validation tiles. The 3-digit codes denote which stand the tile was sampled from. See Table 1 for more information regarding these forest stands.

Case Study 1: External Validation Tile b (B21)
Tile b is representative of the smallest ITC found in all training and validation tiles. Tile b yielded the lowest F1 and recall scores among all external validation tiles. Mean crown area was 0.23 m 2 and 0.17 m 2 for predicted and annotated ITC, respectively ( Table 6). The mean inter-crown distance was 0.50 m and 0.35 m for predicted and annotated ITC, respectively. Predicted ITC in Tile b displayed the worst fit to  Table 1 for more information regarding these forest stands. interpreted mean crown area and annotated mean inter-crown distance. However, both the annotated mean crown area and predicted mean crown area also were lowest found in any external validation tile.  Tile d was the densest external validation tile with respect to tree crowns, with the highest number of total trees (157) and the lowest mean inter-crown distance. Similar conditions are found in Tile i, which exhibits clustered crowns and a high number of total trees. Localized clustering of crowns of varying sizes is also found in Tile c. Tile d contained both the highest number of FP and the highest number of FN. Tile d yielded the second lowest mean IoU between predicted and annotated ITC. The mean inter-crown distance was 0.09 m and 0.07 m for predicted and annotated ITC, respectively. Mean crown area was 0.81 m 2 and 0.75 m 2 for predicted and annotated ITC, respectively (Table 6).  Tile e is representative of the shading and obscuration by adjacent deciduous trees found in hard-and mixed-wood forests. This tile contains the lowest mean IoU between predicted and annotated ITC as well as the second lowest F1 and recall scores among all external validation tiles. External validation tiles, j and h display similar deciduous forest components. However, Tile j and Tile h contain fewer total trees and contain ITC that are more dispersed than those found in Tile e.

Case Study 4: External Validation Tile g (B39)
Tile g yielded the highest F1 and recall scores as well as the greatest mean IoU between predicted and annotated ITC among all external validation tiles. The mean inter-crown distance was 0.80 m for both predicted and annotated ITC. Mean crown area was 0.64 m 2 and 0.63 m 2 for predicted and annotated ITC, respectively. Tile g displayed the closest fit to annotated mean crown area. External validation tiles, a and k, exhibit similar conditions as those found in Tile g, with Tile a containing more total trees and higher global clustering, and Tile k containing fewer total trees, which are more dispersed than those found in Tile a or Tile g.

Height Extraction and Comparison to Field Measurements
Of the 214 field-measured conifers, 90.6% were detected (Table 7).  Mean crown area, mean inter-crown distance, and mean IoU between predicted and interpreted ITC are also provided to aid the reader in characterizing each case study ( Table 6). Each of these case studies are representative of various conditions found across all training and validation data, which are discussed in detail below. Tile b is representative of the smallest ITC found in all training and validation tiles. Tile b yielded the lowest F1 and recall scores among all external validation tiles. Mean crown area was 0.23 m 2 and 0.17 m 2 for predicted and annotated ITC, respectively ( Table 6). The mean inter-crown distance was 0.50 m and 0.35 m for predicted and annotated ITC, respectively. Predicted ITC in Tile b displayed the worst fit to interpreted mean crown area and annotated mean inter-crown distance. However, both the annotated mean crown area and predicted mean crown area also were lowest found in any external validation tile.

Case Study 2: External Validation Tile d (B29)
Tile d was the densest external validation tile with respect to tree crowns, with the highest number of total trees (157) and the lowest mean inter-crown distance. Similar conditions are found in Tile i, which exhibits clustered crowns and a high number of total trees. Localized clustering of crowns of varying sizes is also found in Tile c. Tile d contained both the highest number of FP and the highest number of FN. Tile d yielded the second lowest mean IoU between predicted and annotated ITC. The mean inter-crown distance was 0.09 m and 0.07 m for predicted and annotated ITC, respectively. Mean crown area was 0.81 m 2 and 0.75 m 2 for predicted and annotated ITC, respectively (Table 6).

Case Study 3: External Validation Tile e (B33)
Tile e is representative of the shading and obscuration by adjacent deciduous trees found in hardand mixed-wood forests. This tile contains the lowest mean IoU between predicted and annotated ITC as well as the second lowest F1 and recall scores among all external validation tiles. External validation tiles, j and h display similar deciduous forest components. However, Tile j and Tile h contain fewer total trees and contain ITC that are more dispersed than those found in Tile e.

Case Study 4: External Validation Tile g (B39)
Tile g yielded the highest F1 and recall scores as well as the greatest mean IoU between predicted and annotated ITC among all external validation tiles. The mean inter-crown distance was 0.80 m for both predicted and annotated ITC. Mean crown area was 0.64 m 2 and 0.63 m 2 for predicted and annotated ITC, respectively. Tile g displayed the closest fit to annotated mean crown area. External validation tiles, a and k, exhibit similar conditions as those found in Tile g, with Tile a containing more total trees and higher global clustering, and Tile k containing fewer total trees, which are more dispersed than those found in Tile a or Tile g.

Height Extraction and Comparison to Field Measurements
Of the 214 field-measured conifers, 90.6% were detected (Table 7). Table 7. Count and summary information for 194 field-measured conifers. The best-performing version of Mask R-CNN (Balloon pretrained network where all layers were adjusted during training) was applied to delineate 214 conifers that were measured in the field. Ten of these conifers were not detected. Of the detected field-measured conifers, there were approximately two white spruce for each lodgepole pine. Mean error, unsigned error, and unsigned relative error were −0.33 m, 0.42 m, and 20%, respectively for all field-measured conifers (Table 8). Thirty-two DAP-derived height measurements overestimated their corresponding field measurements, with the remaining 162 being underestimated. Mean unsigned error was 0.27 m and 0.45 m for overestimated and underestimated trees, respectively. Mean unsigned relative error was 11% and 22% for overestimated and underestimated trees, respectively. A simple linear regression (r 2 = 0.93, RMSE = 0.34 m) revealed that field and remote height measurements were highly correlated (Figure 9). Furthermore, the line of best fit runs nearly parallel to that of the 1:1 line, revealing a systematic trend of height underestimation (Figure 9).

Species
Median error was closer to zero for lodgepole pine individuals than it was for white spruce individuals across all height classes (Figure 10a). However, lodgepole pine error was less consistent than white spruce error, which is demonstrated by greater interquartile ranges for lodgepole pine across all height classes (Figure 10a). The 1st and 3nd height classes contained the closest median error values to zero for white spruce and lodgepole pine, respectively (Figure 10a). The difference in median error between the lodgepole pine and white spruce distributions was greatest at the 5th height class interval (>4 m) (Figure 10a). Median lodgepole pine error undulated across all height class intervals it appeared in, whereas median white spruce error became increasingly distant from zero between the 1st and 5th height class intervals (<1 m to >4 m) (Figure 10a). Unsigned relative error was greatest for conifers beneath 2 m in height ( Figure 10b). As with error in Figure 10a, the difference in median unsigned relative error between the white spruce and lodgepole pine distributions was greatest at the 5th height class interval (Figure 10b).

Species Mean Median Std Range
Error (m) A simple linear regression (r 2 = 0.93, RMSE = 0.34 m) revealed that field and remote height measurements were highly correlated (Figure 9). Furthermore, the line of best fit runs nearly parallel to that of the 1:1 line, revealing a systematic trend of height underestimation (Figure 9).

Species
Median error was closer to zero for lodgepole pine individuals than it was for white spruce individuals across all height classes (Figure 10a). However, lodgepole pine error was less consistent than white spruce error, which is demonstrated by greater interquartile ranges for lodgepole pine across all height classes (Figure 10a). The 1st and 3nd height classes contained the closest median error values to zero for white spruce and lodgepole pine, respectively (Figure 10a). The difference in median error between the lodgepole pine and white spruce distributions was greatest at the 5th

Forest Type
Median values for white spruce error distributions are clustered between approximately −0.5 m and −0.3 m, with the highest and lowest median values coming from PlHw and SwHw stands, respectively (Figure 11a). Error and unsigned relative error distributions for white spruce sampled from SwHw stands have the most variation for this species (Figure 11). The greatest median unsigned relative error values for white spruce came from Hw, HwPl, SwHw, and PlHw stands (in descending order) (Figure 11). The Hw unsigned relative error distribution has the highest minimum, maximum, and median values for white spruce (Figure 11b). For lodgepole pine, those sampled from Hw stands had the furthest median error value from zero (Figure 11a). Error distributions for lodgepole pine sampled from Pl stands have the most variation for this species (Figure 11a). Unlike with white spruce, the forest type associated with the greatest variation in unsigned relative error for lodgepole pine (HwSx) is not the same as the forest type associated with the greatest variation in error (Pl) (Figure 11).
Remote Sens. 2020, 12, x FOR PEER REVIEW 20 of 28 Figure 11. Error (a) and unsigned relative error (b) distributions for lodgepole pine and white spruce individuals sampled from seven different forest types.

ITC Delineation
Due to the scarcity of recent studies on the delineation of regenerating, as opposed to mature, conifer crowns, in this section, we draw on literature pertaining to ITC detection and delineation in various forest types of various ages for comparison to our research. To facilitate comparison to studies reporting detection rates as a percentage, we calculate our delineation accuracy as the total number of TP divided by the total number of annotated reference trees (83.5%) and we use average F1 score for comparison with other studies using CNN for ITC delineation. All studies hereafter mentioned in this section were carried out in mature forests or plantations, except for [19,47]. Generally speaking, increasingly heterogeneous forest conditions (age, species, etc.) will have a negative effect on detection and delineation accuracy and the results reviewed in this section should be interpreted with this in mind. The stands sampled in this study were managed sites following harvest. They were approximately 14 years in age with crowns much smaller than those found in mature forests (approximately 42% of field-measured trees had crown diameters less than one metre and 93% were

ITC Delineation
Due to the scarcity of recent studies on the delineation of regenerating, as opposed to mature, conifer crowns, in this section, we draw on literature pertaining to ITC detection and delineation in various forest types of various ages for comparison to our research. To facilitate comparison to studies reporting detection rates as a percentage, we calculate our delineation accuracy as the total number of TP divided by the total number of annotated reference trees (83.5%) and we use average F1 score for comparison with other studies using CNN for ITC delineation. All studies hereafter mentioned in this section were carried out in mature forests or plantations, except for [19,47]. Generally speaking, increasingly heterogeneous forest conditions (age, species, etc.) will have a negative effect on detection and delineation accuracy and the results reviewed in this section should be interpreted with this in mind. The stands sampled in this study were managed sites following harvest. They were approximately 14 years in age with crowns much smaller than those found in mature forests (approximately 42% of field-measured trees had crown diameters less than one metre and 93% were below two metres). With regard to spacing and age, conditions ranged from open, evenly spaced and evenly aged crowns (similar to a planation), to dense, clustered and overlapping crowns of mixed ages.
In the context of natural regeneration in a Norway spruce dominant forest, Röder et al. [47] detected individual trees with a success rate of 39.1% using a local maxima-based approach. Yancho et al. [48] achieved 47.6% detection accuracy with a subcrown point clustering approach applied in a mature, coniferous dominant, temperate rainforest. In a mature, mixed-wood forest, Berra [49] yielded overall accuracies as high as 80% for the delineation of coniferous trees in plots containing approximately 40 trees using a watershed-based approach. In plantation settings, Kattenborn et al. [50] and Torres-Sánchez et al. [51] yielded overall accuracies of 86% and 90%, respectively for ITC delineation with object-based image analysis (OBIA) approaches.
Weinstein et al. [52] applied a semi-supervised CNN approach to detect ITC in an open woodland in central California, achieving an F1 score of 0.65. In a complex, tropical forest, Braga et al. [32] achieved an F1 score of 0.86 using a Mask R-CNN-based workflow to delineate ITC. Li et al. [53] achieved an F1 score of 0.95 using a CNN approach to delineate ITC in a palm oil plantation. Finally, Pearse et al. [19] achieved F1 scores of approximately 0.99 and 0.98 for the detection of coniferous seedlings in two regeneration trials in New Zealand using a Faster-RCNN architecture. Detection was carried out on RGB imagery with an approximate GSD of 1 cm, in which seedlings size and density was mostly uniform [19].
Despite the heterogeneity of our study sites, our results are comparable to those of Pearse et al. [19], demonstrating the feasibility of the proposed methods for operational delineation of ITC in regenerating boreal forests. However, small, and clustered ITC-challenges of past studies [5,9,10,54]-remain contemporary challenges as well [47]. Heterogeneity in crown size and spacing proved to be challenges for this research, which Röder et al. [47] note as sources of error for their approach to automatic ITC detection. Namely, tightly clustered ITC whose branches overlap were a source of error for our approach to ITC delineation. Juvenile trees tend to grow in tight clumps where multiple seeds had become established in close proximity. This is exemplified in External Validation Tile b, the lowest scoring external validation tile, where several small ITC (area ≤ 0.1 m 2 ) were omitted. Likewise, Mask R-CNN had difficulty delineating all ITC in external validation tiles such as Tile d, where ITC of varying sizes were adjacent and overlapping. Difficulty delineating all ITC in Tile d was compounded by the number of total trees contained in this tile (157). However, given that Tile b and Tile e contained fewer total trees than Tile d and that Mask R-CNN performed better on Tile d, crown size may be considered to pose a greater challenge for our approach to ITC delineation.

Height
Correlation between field and remote height measurements (r 2 = 0.93) was comparable to what has been reported by other studies using only optical sensors. Röder et al. [47] and Dandois et al. [55] yielded r 2 values of 0.75 and 0.86, respectively. In regenerating forests, Castilla et al. [35] reported an r 2 of 0.67 for the measurement of coniferous seedlings. However, most of the seedlings in this study were less than 1 m in height. Goodbody et al. [34] achieved a Spearman's r of 0.91, which is comparable to results from Lisein et al. [56] (r 2 = 0.91) and Puliti et al. [57] (r 2 = 0.97) who produced remote height measurements with hybrid LiDAR-optical data products in mature forests. Variation between results from mature forests [47,55] and regenerating forests [34] suggests that canopy closure may impede the production of accurate ground models from DAP and in turn, the extraction of accurate remote height measurements. The high r 2 values of Lisein et al. [56] and Puliti et al. [57] demonstrate the utility of LiDAR data in this regard as only active sensors can penetrate dense canopies to normalize terrain occluded by forest canopy. However, regenerating forests in which canopies have not yet fully closed may be uniquely suited for terrain characterization and remote height extraction by DAP as indicated by the results yielded by Goodbody et al. [34] and this study. With this being said, the remote measurement of small trees is not trivial and as increasingly fine GSD are utilized to measure increasingly small targets, error in photogrammetric reconstruction is likely to affect height estimates [35].

Underestimation: Whorl-Height and Top Height
The majority of field height measurements were underestimated by our remote sensing approach. This discrepancy may be driven by the structure of measured trees. Field measurements were taken from the base of a tree to the top of its terminal leader (annual growth). Leaders are isolated growths that extend vertically from the highest whorl of a tree. They are narrow structures (~1-2 cm wide) and as such, provide limited surface area from which pixels can be extracted during photogrammetric processing. Remote height measurements may therefore correspond to the highest whorl of a tree rather than its leader. This would explain discrepancies between field and remote height measurements for white spruce trees as white spruce leaders are between 0.32 and 0.44 m tall in trees of this age [58] and mean error for white spruce trees was 0.38 m.
Discrepancies between field and remote height measurements are not as easily explained for lodgepole pine trees as their leaders are between 0.20 and 0.40 m in height in trees of this age [59] and mean error for lodgepole pine trees was 0.22 m. However, this may be explained by structural differences between white spruce and lodgepole pine trees. First, the needles of lodgepole pine are approximately 1-2 cm longer than those of white spruce and are more dispersed [60], meaning an average lodgepole pine leader will contain more surface area than an average white spruce leader. Moreover, the branches of lodgepole pine trees arc upwards, growing alongside that tree's leader, while those of white spruce trees tend to grow outwards on a horizontal plane, perpendicular to their leaders. Therefore, the greatest available surface area towards the top of lodgepole pine trees occurs where top-whorl branches terminate around leader, which is approximately at 2/3 the height of the leader. For a lodgepole pine tree with a leader of 0.40 m height, top-whorl branches will terminate at approximately 0.26 m, which is within 0.04 m of the mean error for lodgepole pine trees in this study.
Additional field measurements and image analysis should be undertaken and include height as measured to the top of a tree's leader in addition to height as measured to the base of the tallest whorl.

Overestimation: Deciduous and Adjacent Conifers
Thirty-two DAP-derived height measurements overestimated their corresponding field measurements. Of these 32 measurements, 13 overestimated field measurements by greater than 0.3 m. Manual inspection of the ITC polygons from which these height measurements were extracted revealed that the pixel coordinates of maximum CHM values were located in the peripheries of said polygons rather than their approximate centers. These pixels likely fall outside of the actual boundary of the tree crowns they were attributed to and rather, they likely belong to taller, adjacent trees whose crowns overlap those of the field-measured trees (Exemplified in Figures 12 and 13).
Overestimation is exemplified in Figure 12 where the border of a target ITC polygon (yellow) encloses pixels from a neighboring crown, which can be distinguished by its lighter colour. The field height measurement for this tree was 4.18 m and the remote height measurement was 4.88 m. Figure 12b displays the CHM within the polygon delineating this tree, in which two distinct clusters of pixels with high values are visible. The right-hand cluster (2) contains the maximum CHM value (4.88 m) within the polygon and overlaps with the light green pixels of the neighboring crown, while the left-hand cluster (1) is better aligned with the center of the polygon. The greatest CHM value within the left-hand cluster (1) is 4.28 m, which is closer to the field-measured height (4.18 m). This example highlights the sensitivity of our height extraction method to the quality of ITC delineation; when conifers are clustered and overlapping, there is an increased likelihood that a given predicted ITC polygon will contain pixels that belong to an adjacent conifer. If that adjacent conifer is taller than the target conifer, the maximum CHM value extracted from the predicted ITC polygon will correspond with the adjacent conifer rather than the intended target.

Overestimation: Deciduous and Adjacent Conifers
Thirty-two DAP-derived height measurements overestimated their corresponding field measurements. Of these 32 measurements, 13 overestimated field measurements by greater than 0.3 m. Manual inspection of the ITC polygons from which these height measurements were extracted revealed that the pixel coordinates of maximum CHM values were located in the peripheries of said polygons rather than their approximate centers. These pixels likely fall outside of the actual boundary of the tree crowns they were attributed to and rather, they likely belong to taller, adjacent trees whose crowns overlap those of the field-measured trees (Exemplified in Figures 12 and 13). Overestimation is exemplified in Figure 12 where the border of a target ITC polygon (yellow) encloses pixels from a neighboring crown, which can be distinguished by its lighter colour. The field height measurement for this tree was 4.18 m and the remote height measurement was 4.88 m. Figure   Figure 12. Predicted ITC polygon (yellow) whose remote height measurement overestimated its corresponding field measurement. (a) displays the polygon in true colour, while (b) shows canopy height models (CHM) values within the polygon. The numbered labels in (b) refer to two distinct clusters of high CHM values found within the polygon. The maximum CHM value within the polygon is contained in Cluster 2, which overestimates the polygon's corresponding field height measurement by 0.70 m. The greatest CHM value contained in Cluster 1 is 0.20 m greater than the polygon's corresponding field height measurement.  (1) is better aligned with the center of the polygon. The greatest CHM value within the left-hand cluster (1) is 4.28 m, which is closer to the field-measured height (4.18 m). This example highlights the sensitivity of our height extraction method to the quality of ITC delineation; when conifers are clustered and overlapping, there is an increased likelihood that a given predicted ITC polygon will contain pixels that belong to an adjacent conifer. If that adjacent conifer is taller than the target conifer, the maximum CHM value extracted from the predicted ITC polygon will correspond with the adjacent conifer rather than the intended target.  Figure 13 provides another example of a conifer whose height was overestimated by our approach. Unlike the previous example, there were no adjacent conifers whose crowns overlapped that of the target ITC polygon (yellow). However, the grey pixels surrounding this polygon belong to a leafless deciduous overstory, which also partially obscures the green response of the conifer in Figure 13. Predicted ITC polygon (yellow) located beneath a deciduous overstory whose remote height measurement overestimated its corresponding field measurement. (a) displays the polygon in true colour, while (b) shows CHM values within the polygon. The greatest CHM values within the polygon are located towards its bottom and right-hand borders, where green values in the true colour image are relatively low. Figure 13 provides another example of a conifer whose height was overestimated by our approach. Unlike the previous example, there were no adjacent conifers whose crowns overlapped that of the target ITC polygon (yellow). However, the grey pixels surrounding this polygon belong to a leafless deciduous overstory, which also partially obscures the green response of the conifer in question. The greatest CHM values occur in regions of the polygon where green values are low, which suggests that the maximum CHM value extracted from this polygon should be attributed to the deciduous overstory and not the conifer situated below it ( Figure 13b). As such, stands with significant deciduous overstory pose a challenge to our approach to height extraction; predicted ITC polygons for conifers located beneath deciduous overstory are likely to contain overstory pixels whose CHM values will exceed those of the conifers. Therefore, maximum CHM values may provide overestimated remote height measurements in stands with significant deciduous overstory.

Limitations
Manually delineating ITC is a difficult and time-consuming exercise in photo-interpretation. While our results indicate that our manually delineated datasets were created with adequate fidelity, it is likely that these datasets contain some elements of user error inherent to photo-interpretation. For example, the inclusion of pixels from adjacent conifers in some predicted ITC (discussed in Section 4.2.2.) may indicate that the boundaries of some ITC were drawn too loosely when they were manually delineated and as a consequence, Mask R-CNN may have learned to reproduce this behavior.
Aside from the possibility of interpretation error, the requirement of manually delineated datasets to train CNN is considered to be a major limitation for their use [32]; performance tends to improve relative to the number of samples available to a CNN [18], yet manually delineating these samples is labor-intensive. While it is probable that this study could have benefitted from a larger manually delineated dataset, we had to compromise due to time constraints. However, future studies may circumvent this trade-off as recently, algorithms have been developed that generate large synthetic datasets using a relatively small number of manually annotated samples as input [32,52].
Another limitation stems from the likely discrepancies between what is visible in the UAV imagery and what exists in the field. ITC were manually delineated with high fidelity on the UAV imagery; however, this may also have a degree of interpreter error which highlights the continued need for high quality field data to calibrate these types of models.

Future Research
While the results of this study are encouraging, additional research is specifically required around issues such as top-height versus whorl-height estimation, reliable species classification for both deciduous and coniferous trees, and the utilization of imagery acquired from piloted aircraft rather than UAV given the large areas of regenerating forest that need to be assessed annually. Investigations into each of these issues are needed to meet Alberta's regeneration monitoring requirements. The use of CNN to delineate and classify crowns is promising, leaving open the question of whether our methods can be applied to with similar success imagery collected using different platforms.

Conclusions
In this study, we have presented a workflow for the automatic delineation of individual conifer crowns in true colour orthomosaics using Mask R-CNN and the subsequent extraction of remote height measurements. Our best model was generated by Balloon pretrained network, where all layers of the network were adjusted during training. This achieved an average precision of 0.98, an average recall of 0.85, and an average F1 score of 0.91. Our results show that Mask R-CNN is capable of automatically delineating individual coniferous crowns in a range of conditions common to western boreal forest regeneration under leaf-off conditions and that these crown delineations can be used to extract height measurements at the individual tree level.
Our height measurement analysis found field and remote height measurements to be highly correlated (r 2 = 0.93). Remote height measurements tended to underestimate field height measurements (162/194 samples) and mean error was greater for white spruce trees than it was for lodgepole pine (−0.37 m and −0.24 m, respectively). We speculated as to why the majority of remote height measurements underestimated their corresponding field samples as well as why lodgepole pines may be less underestimated than white spruce, and we offered suggestions for future studies to address this challenge. Furthermore, we observed that remote height measurements may overestimate their corresponding field height measurements when a delineated conifer crown contains pixels belonging to taller, adjacent coniferous or deciduous trees.
Our proposed methods to automatically delineate and extract height information for individual conifers in true colour imagery have the potential to improve the speed and scale at which forest regeneration monitoring can occur. However, techniques to classify tree species and delineate individual deciduous crowns from true colour imagery are still required if the criteria of field-based regeneration assessments are to be met by remote sensing approaches. As such, our research serves as a stepping-stone towards the remote automation of regeneration monitoring and future studies should focus on the deciduous components of boreal regeneration as well as the classification of individual tree species.