1. Introduction
Individual tree crown (ITC) detection and delineation are foundational elements in contemporary forest inventory systems, providing tree-level insights that enable precise forest management and ecological research. In commercial forestry, ITC data facilitate selective timber identification based on specific metrics, such as diameter at breast height (DBH) and crown volume, optimizing resource utilization while supporting sustainable practices [1]. Such detailed inventories not only maximize economic returns but also align with sustainable forest management principles [2]. Beyond commercial applications, ITC data underpin ecological research by providing critical information on forest structural complexity for biodiversity assessments, carbon sequestration estimation, and ecosystem health monitoring [3]. High-resolution remote sensing data allow researchers to quantify crown dimensions and canopy structure, supporting biomass estimation and revealing indicators of forest vigor and disturbance patterns.
Three-dimensional (3D) ITC data offer value for extracting key biophysical parameters, such as aboveground biomass (AGB), which serves as a critical indicator for carbon sequestration models that predict future carbon fluxes and climate change impacts [4,5]. The demand for accurate 3D ITC information has increased significantly in recent years, driven by the growing complexity of forest management challenges and the pressing need for sustainable resource use. This urgency is compounded by global climate change and natural disturbances, such as fires, storms, and pest infestations, highlighting the pivotal role of ITC data in adaptive forest management and ecological conservation.
Mixed-wood forests present significant challenges for accurate ITC detection and delineation due to overlapping canopies and variations in tree shapes and sizes, which are primarily influenced by species composition and age structure. The diverse structural characteristics of these forests make it difficult to distinguish individual crowns, especially in dense stands, leading to potential errors in ITC detection and delineation accuracy [6,7,8]. High-resolution Light Detection and Ranging (LiDAR) data have revolutionized this field by providing detailed structural information about forest canopies, enabling precise measurements of tree height, crown diameter, and canopy structure. The integration of deep learning techniques with LiDAR data has significantly enhanced the accuracy and efficiency of ITC detection and delineation [6,7,9,10].
Current ITC delineation methodologies using LiDAR data are generally divided into two main approaches: (1) 2D Canopy Height Model (CHM)-based methods and (2) 3D LiDAR point-based methods. CHM-based methods are widely adopted due to their simplicity and computational efficiency [2,8,10]. When integrating deep learning techniques with CHM-based methods, two types of networks are commonly applied: semantic segmentation networks and instance segmentation networks. Semantic segmentation networks, such as U-Net [11], FCN [12], and DeepLabV3+ [13,14], have demonstrated impressive capabilities in semantic segmentation from raster images. While these networks are effective in identifying general tree crown areas, they often struggle to distinguish individual crowns in dense or overlapping canopies. Instance segmentation networks, including Mask R-CNN [15,16] and its variants, like Cascade R-CNN [17], offer an enhanced ability to distinguish individual crowns, even in dense canopies. Traditional approaches include watershed segmentation [18,19], region growing [8], graph-based methods [20], and point cloud clustering [21]. The watershed algorithm, particularly its marker-controlled variants, remains widely used due to its computational efficiency and interpretability. Similarly, Dalponte's algorithm [22] uses region growing from local maxima seed points on Canopy Height Models to delineate individual tree crowns, demonstrating robust performance across diverse forest structures while maintaining computational efficiency. While these methods perform well on 2D representations of LiDAR data (e.g., CHMs), they do not fully leverage the rich 3D structural information available in LiDAR point clouds. As a result, they can struggle in heterogeneous forests with intersecting canopies, reducing performance in areas with overlapping crowns.
In parallel, 3D LiDAR point-based methods offer a more detailed structural characterization, making them particularly well-suited to complex forest environments. Algorithms such as K-means clustering [23], voxel space projection [24], and PointNet++ [25,26,27] process raw point clouds directly and have been successfully applied to ITC delineation. These algorithms operate directly on 3D point clouds rather than rasterized surfaces, leveraging both horizontal and vertical structural characteristics to segment individual trees [28]. The integration of advanced machine learning techniques, including deep learning models like PointNet++, significantly improves ITC delineation accuracy by identifying intricate patterns within the data that traditional methods may miss. For instance, PointNet combined with voxelization and height gradient information from LiDAR data achieved tree crown detection rates ranging from 0.80 to 0.90 and accurate crown breadth estimations (R² > 0.79) across various forest environments [27]. Point-based 3D LiDAR methods better preserve 3D structural details, enabling more accurate crown delineation [12,19]. However, these methods are computationally intensive and can be difficult to scale to large forested areas. Similarly, 3D convolutional networks, including 3D U-Net and VoxelNet, have been adapted to process LiDAR data volumetrically, maintaining 3D structural integrity [29]. These approaches, however, often rely on voxelization, which can lead to the loss of fine-grained structural information.
In summary, despite advancements in integrating deep learning with LiDAR data, several challenges remain. Many CHM-based deep learning methods (e.g., FCN and U-Net) excel in either detection or delineation but rarely achieve both simultaneously. Advanced networks, such as Mask R-CNN, can perform treetop detection, bounding box regression, and ITC mask prediction in an integrated manner. However, they face limitations in fully utilizing the 3D structural richness of LiDAR data. Furthermore, 3D LiDAR point-cloud-based methods often suffer from computational inefficiency and a lack of comprehensive 3D training references, especially when applied to extensive forest environments. This highlights a critical research gap: the need for a deep learning framework that can fully exploit 3D LiDAR data while addressing computational constraints and improving the preparation of training datasets.
To address these challenges and research gaps in ITC detection and delineation within mixed-wood forests, this study proposes a novel two-stage deep learning strategy that integrates CHM-based treetop region segmentation using Mask R-CNN with LiDAR point cloud clustering using the 3D U-Net model. In the first stage, Mask R-CNN is applied to a CHM, leveraging its robust instance segmentation capabilities while maintaining computational efficiency by avoiding direct processing of LiDAR point clouds. The second stage refines crown boundaries in 3D space through a specialized 3D U-Net architecture that integrates spatial and structural information from LiDAR point clouds. The main objectives of this study are as follows:
- i. To propose a two-stage deep learning framework for improving ITC detection and delineation by fully exploiting LiDAR data.
- ii. To create a benchmark 2D and 3D reference dataset for airborne LiDAR (LAS) data in mixed-wood forests.
- iii. To produce 3D and 2D ITC products for practical forestry applications.
3. Methodology
This paper proposes a two-stage framework for individual tree crown (ITC) detection and delineation by integrating CHM-based Mask R-CNN detection and delineation with 3D U-Net-based LiDAR point cloud clustering.
In the first stage, Mask R-CNN serves as the foundation for identifying initial treetop regions. This deep learning architecture employs a ResNet-50 backbone augmented with a Feature Pyramid Network (FPN) to process multi-resolution remote sensing imagery. The model systematically generates region proposals across varying scales, enabling it to detect trees of different sizes. The network's architecture consists of four main components: (1) a backbone network for feature extraction, (2) a Region Proposal Network (RPN) that identifies potential crown locations, (3) a Region of Interest (RoI) alignment layer that precisely maps features to proposed regions, and (4) parallel branches for bounding box regression, classification, and mask generation. Training incorporates overlapping patch sampling with carefully tuned overlap ratios to ensure detection continuity across patch boundaries, while non-maximum suppression resolves duplicate detections in overlapping areas. The model effectively handles the spectral variability of tree crowns through data augmentation techniques, including rotation, scaling, and intensity adjustments.
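As a concrete illustration, the overlapping patch sampling can be sketched as below; the 256-pixel patch size, array shapes, and edge handling are illustrative assumptions, while the 0.2 overlap ratio matches the inference setting reported in Section 3.1.

```python
import numpy as np

def extract_patches(chm, patch_size=256, overlap=0.2):
    """Tile a CHM raster into overlapping patches.

    Overlapping sampling exposes the model to both crown centers and crown
    fragments at patch borders; detections are later mapped back to CHM
    coordinates and de-duplicated with non-maximum suppression.
    """
    stride = int(patch_size * (1.0 - overlap))  # e.g., 0.2 overlap -> stride of 204
    rows, cols = chm.shape
    patches = []
    # Remainder strips at the right/bottom edges are omitted in this sketch.
    for r in range(0, max(rows - patch_size, 0) + 1, stride):
        for c in range(0, max(cols - patch_size, 0) + 1, stride):
            patches.append((chm[r:r + patch_size, c:c + patch_size], (r, c)))
    return patches

chm = np.random.rand(1024, 1024).astype(np.float32)  # stand-in CHM raster
tiles = extract_patches(chm)
```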
The second stage employs a 3D U-Net to refine the initial treetop regions from the first stage by incorporating structural information from LiDAR point clouds. This specialized 3D neural network processes point-wise features through a series of operations: initial point cloud data are processed through MLP (Multi-Layer Perceptron) convolution layers, followed by four successive down-sampling blocks that reduce spatial resolution while increasing feature depth. The encoder-decoder architecture with skip connections preserves fine-grained spatial details while capturing broader contextual information. The network's key innovation lies in its offset prediction mechanism, which calculates displacement vectors from each crown point to its corresponding stem location. This offset-based approach enables precise clustering of points belonging to individual crowns, effectively handling complex crown structures and occlusions. The integration of Mask R-CNN and 3D U-Net creates a complementary framework that combines the precise treetop region detection from CHM imagery with the rich structural information of 3D point clouds, resulting in more accurate 3D ITC detection and delineation.
For the 3D point cloud clustering stage, 3D U-Net was selected after comparison with PointNet++ [34] and the traditional K-means clustering algorithm [35]. 3D U-Net demonstrated superior performance in capturing the complex spatial relationships within the point cloud while maintaining computational efficiency. PointNet++ showed promise but required significantly more computational resources when applied to our high-density LiDAR data. The hierarchical feature extraction capabilities of 3D U-Net proved particularly effective for discriminating between adjacent crowns in 3D space, especially in mixed-wood environments where deciduous and coniferous trees exhibit vastly different structural characteristics.
The model was implemented using Python version 3.11.4 within the PyCharm IDE (version 2023.2) [36], leveraging packages for data processing, machine learning, and visualization. In the first stage, Mask R-CNN was applied to the CHM to accurately detect treetop masks, providing precise initial localizations of individual trees. In the second stage, a 3D U-Net architecture was employed to cluster LiDAR points based on the treetop masks, delineating ITC boundaries in three-dimensional space. This refinement process utilizes both the spatial and height information of the LiDAR point cloud data. By combining the high-resolution segmentation capability of Mask R-CNN with the spatial and structural refinement provided by the 3D U-Net model, the two-stage approach bridges the gap between CHM-based and LiDAR point-cloud-based delineation. It offers a robust framework for capturing both the horizontal extent and the vertical structure of tree crowns. The proposed approach was evaluated using accuracy metrics and detailed result analysis.
Figure 4 illustrates the workflow of the two-stage framework proposed in this study.
3.1. First Stage: 2D Treetop Region Segmentation Using Mask R-CNN
Mask R-CNN represents a state-of-the-art deep learning architecture that integrates object detection and segmentation, enabling end-to-end training for treetop region segmentation tasks [37]. The selection of Mask R-CNN for CHM processing and 3D U-Net for point cloud clustering was based on their complementary strengths and on experimentation. Mask R-CNN was chosen for the first stage due to its superior instance segmentation capabilities, which are essential for distinguishing individual tree crowns in the 2D CHM representation. Unlike semantic segmentation networks (e.g., the Fully Convolutional Network (FCN) and SegNet), which excel at pixel-level classification but struggle with separating adjacent instances, Mask R-CNN's RPN effectively identifies discrete tree crowns, even in dense canopies. We experimented with several alternative approaches, including Faster R-CNN and YOLO for detection, but found that they lacked the precise mask generation needed for accurate crown delineation. Similarly, the semantic segmentation networks FCN and SegNet were tested but proved less effective at separating adjacent trees in the CHM.
Figure 4 illustrates the workflow of 2D treetop region segmentation using Mask R-CNN, which consists of the following steps: (i) the pre-processed CHM image is fed into the pre-trained ResNet-50 network to extract feature maps, (ii) these feature maps are processed by the RPN to identify Regions of Interest (RoIs), (iii) the RoI Align operation standardizes the RoIs into a uniform shape, (iv) fully connected layers classify ITCs and refine the bounding box positions and sizes, and (v) a Fully Convolutional Network generates masks by segmenting the pixels corresponding to each tree crown.
The Mask R-CNN model was trained using a dataset of approximately 1400 tree crowns. Random patches were generated using a sliding window approach, ensuring the model was exposed to diverse portions of the image during training. These patches included overlapping regions, allowing the model to learn from both the central and boundary areas of tree crowns. To improve generalization across various tree crown shapes and orientations, data augmentation techniques (random rotations and flipping) were applied. The model was trained to predict both bounding boxes and segmentation masks for each tree crown, minimizing a combined loss function comprising classification, bounding box regression, and mask prediction terms. Key hyperparameters, including the learning rate, batch size, and IoU threshold, were optimized through a series of methodical tests to enhance performance and ensure accurate tree crown delineation. The optimal learning rate was determined to be 1 × 10⁻⁵ using the AdamW optimizer; this conservative rate allowed the model to converge effectively while avoiding local minima. A batch size of 2 images provided the best balance between training stability and computational efficiency given our hardware constraints: smaller batch sizes resulted in noisy gradient updates, while larger batches required excessive memory without proportional improvements in accuracy. The IoU threshold for non-maximum suppression was set at 0.5 after testing values ranging from 0.3 to 0.7; this threshold optimally balanced the trade-off between detecting closely positioned trees (which requires a lower threshold) and avoiding duplicate detections (which demands a higher threshold). The model used the default anchor scales of the Mask R-CNN implementation to accommodate the variable sizes of tree crowns in our dataset. Training ran for up to 20,000 epochs with 100 iterations per epoch, with early stopping based on the validation mean IoU (mIoU) and a minimum improvement threshold of 0.001; the best model was saved whenever the validation mIoU improved beyond this threshold. During inference, patches were created with an overlap ratio of 0.2 to ensure complete coverage of the study area, which was particularly important given the high-resolution (0.15 m) Canopy Height Model (CHM) imagery.
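A minimal sketch of this training configuration is shown below, assuming torchvision's Mask R-CNN implementation as a stand-in for whatever implementation was actually used; the data loaders, the mIoU evaluation helper, and the patience value are assumptions, while the learning rate, optimizer, epoch budget, and improvement threshold follow the text.

```python
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

def train_crown_detector(train_loader, val_loader, evaluate_miou,
                         max_epochs=20000, min_delta=0.001, patience=50):
    """Train Mask R-CNN for crown detection with mIoU-based checkpointing.

    train_loader/val_loader yield (images, targets) in torchvision's detection
    format; evaluate_miou is a user-supplied helper returning validation mIoU.
    """
    model = maskrcnn_resnet50_fpn(num_classes=2)  # background + tree crown
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # rate from the text
    best_miou, epochs_since_best = 0.0, 0
    for epoch in range(max_epochs):  # upper bound; early stopping ends much sooner
        model.train()
        for images, targets in train_loader:  # batches of 2 CHM patches per the text
            loss_dict = model(images, targets)  # classification + box + mask losses
            loss = sum(loss_dict.values())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        miou = evaluate_miou(model, val_loader)
        if miou > best_miou + min_delta:  # save only on meaningful improvement
            best_miou, epochs_since_best = miou, 0
            torch.save(model.state_dict(), "best_maskrcnn.pth")
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:  # assumed patience; text gives no value
                break
    return model
```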
Post-processing included an optimized polygon merging algorithm that used spatial indexing through a cKDTree to efficiently identify and merge overlapping tree crown detections based on centroid proximity (within a search radius of 8) and an overlap ratio threshold of 0.5, thereby preventing duplicate detections while preserving unique crown morphologies. For prediction filtering, we applied a distance transformation to each Mask R-CNN-generated treetop region, retaining only pixels with confidence values above 0.7; this effectively preserves the core structure of each crown for the second stage while reducing edge localization errors. For validation, the best-trained model was evaluated on a separate validation set of about 400 trees across three plots.
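A sketch of how such cKDTree-based merging might look is given below; the function name and the use of shapely geometries are assumptions, while the search radius of 8 and the 0.5 overlap threshold follow the text.

```python
from scipy.spatial import cKDTree
from shapely.ops import unary_union

def merge_crowns(polygons, radius=8.0, overlap_thresh=0.5):
    """Merge duplicate crown polygons arising from overlapping patches.

    Candidate pairs are found via a cKDTree on polygon centroids; pairs whose
    intersection covers more than overlap_thresh of the smaller polygon are
    unioned into a single crown.
    """
    centroids = [(p.centroid.x, p.centroid.y) for p in polygons]
    tree = cKDTree(centroids)
    merged, used = [], set()
    for i, poly in enumerate(polygons):
        if i in used:
            continue
        group = [poly]
        for j in tree.query_ball_point(centroids[i], r=radius):
            if j == i or j in used:
                continue
            other = polygons[j]
            inter = poly.intersection(other).area
            if inter / min(poly.area, other.area) > overlap_thresh:
                group.append(other)  # duplicate detection of the same crown
                used.add(j)
        used.add(i)
        merged.append(unary_union(group))
    return merged
```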
3.2. Second Stage: 3D ITC Delineation Using 3D U-Net Model
In the second stage, ITC points were clustered based on treetop masks generated in the first stage using a 3D U-Net regression model. This model predicted 3D offset vectors (ΔX, ΔY, ΔZ) for each voxel, representing the displacement from the voxel’s current position to the nearest ITC point or crown centroid. The 3D U-Net architecture was designed to extract and encode features essential for ITC classification. It employed a series of convolutional layers in the encoder to capture features at progressively coarser spatial scales, followed by decoder layers that reintroduced finer spatial details through concatenation and interpolation. The predicted offsets were then used to adjust the position of each voxel, and the resulting positions were mapped back to the original point cloud by assigning voxel labels to the corresponding points.
The 3D U-Net model is built on a U-Net backbone for point cloud processing. The network begins with an input layer that processes the initial point cloud data through an MLP convolution layer. The encoder path consists of four successive down-sampling blocks (marked as DownConv with channel dimensions [32, 64, 128, 256]), each followed by 3D convolution operations to capture hierarchical features at different scales. The decoder path mirrors the encoder with four up-sampling blocks (UpConv with channel dimensions [32, 64, 128, 256]), each followed by fusion blocks that combine features from the corresponding encoder level through skip connections. These skip connections help preserve fine-grained spatial information that might be lost during down-sampling. The fusion blocks integrate information from both the encoder and decoder paths, enabling the network to effectively combine both local and global features. After each fusion operation, 3D convolution layers further process the combined features. The network culminates in a head layer that processes the final features, ultimately producing offset predictions that help identify and cluster crown points to their corresponding treetop regions. The network employs LeakyReLU (α = 0.2) for intermediate layers and tanh for the output layer, with batch normalization after each convolution and 0.2 dropout in the bottleneck. The model was trained using Adam optimizer (learning rate = 0.0005, β1 = 0.9, β2 = 0.999) with a batch size of 8 voxel blocks (64 × 64 × 32) for 50 epochs (with early stopping after 10 epochs of no improvement), minimizing mean squared error loss for offset prediction to cluster crown points to their corresponding treetop regions. The systematic use of down-sampling and up-sampling operations allows the network to capture both fine-grained local details and broader contextual information necessary for accurate crown point clustering.
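A minimal PyTorch sketch consistent with this description is given below: encoder channels [32, 64, 128, 256], batch normalization with LeakyReLU (α = 0.2), 0.2 dropout in the bottleneck, skip-connection fusion in the decoder, and a tanh offset head. The two-channel input (voxel occupancy plus the binary treetop mask introduced in the next paragraph) and the transposed-convolution up-sampling are our assumptions, not the authors' exact layer choices.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    # 3x3x3 conv -> batch norm -> LeakyReLU(0.2), per the description above
    return nn.Sequential(
        nn.Conv3d(c_in, c_out, kernel_size=3, padding=1),
        nn.BatchNorm3d(c_out),
        nn.LeakyReLU(0.2),
    )

class OffsetUNet3D(nn.Module):
    """Minimal 3D U-Net regressing per-voxel (dx, dy, dz) offsets."""

    def __init__(self, in_ch=2, channels=(32, 64, 128, 256)):
        # in_ch=2 assumes occupancy plus the binary treetop mask (an assumption)
        super().__init__()
        self.encoders = nn.ModuleList()
        c = in_ch
        for ch in channels:
            self.encoders.append(conv_block(c, ch))
            c = ch
        self.pool = nn.MaxPool3d(2)
        self.bottleneck = nn.Sequential(
            conv_block(channels[-1], channels[-1]),
            nn.Dropout3d(0.2),  # bottleneck dropout per the text
        )
        self.ups, self.decoders = nn.ModuleList(), nn.ModuleList()
        prev = channels[-1]
        for ch in reversed(channels):
            self.ups.append(nn.ConvTranspose3d(prev, ch, kernel_size=2, stride=2))
            self.decoders.append(conv_block(ch * 2, ch))  # fuse skip + upsampled
            prev = ch
        self.head = nn.Sequential(nn.Conv3d(channels[0], 3, kernel_size=1), nn.Tanh())

    def forward(self, x):
        skips = []
        for enc in self.encoders:
            x = enc(x)
            skips.append(x)  # kept for the skip connections
            x = self.pool(x)
        x = self.bottleneck(x)
        for up, dec, skip in zip(self.ups, self.decoders, reversed(skips)):
            x = dec(torch.cat([up(x), skip], dim=1))
        return self.head(x)  # (B, 3, D, H, W) offsets scaled to [-1, 1]

# Training per the text: Adam(lr = 0.0005, betas = (0.9, 0.999)) with MSE loss
# on batches of 8 voxel blocks of size 64 x 64 x 32.
model = OffsetUNet3D()
optimizer = torch.optim.Adam(model.parameters(), lr=0.0005, betas=(0.9, 0.999))
loss_fn = nn.MSELoss()
```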
To improve performance, a binary voxel block was introduced as a secondary input, indicating the initial treetop crowns identified in the previous stage. This modification was essential for guiding the 3D U-Net to recognize treetops and compute offsets relative to these anchor points. The model was trained with the mean squared error (MSE) loss function to ensure high precision in the offset predictions. By applying the computed offsets, ITC points were clustered around their corresponding treetops, streamlining the isolation process. The points belonging to each treetop were then assigned a unique ID, enabling accurate clustering and delineation of tree crowns.
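Assuming that each shifted point is simply snapped to its nearest treetop anchor, the clustering step could look like the sketch below; the function and the nearest-neighbor assignment rule are illustrative, not the authors' exact implementation.

```python
import numpy as np
from scipy.spatial import cKDTree

def cluster_by_offset(points, offsets, treetops):
    """Assign each LiDAR point a tree ID by shifting it along its predicted
    offset and snapping it to the nearest treetop anchor.

    points:   (N, 3) xyz coordinates mapped back from the voxel grid
    offsets:  (N, 3) predicted displacements, rescaled to map units
    treetops: (M, 3) treetop anchor coordinates from the first stage
    Returns a length-N array of treetop IDs in [0, M).
    """
    shifted = points + offsets                        # move points toward stems
    _, tree_ids = cKDTree(treetops).query(shifted)    # nearest-anchor assignment
    return tree_ids
```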
3.3. Evaluation and Accuracy Assessment
To evaluate the proposed framework, we performed a comparative analysis against the itcSegment algorithm (an established direct 3D point cloud segmentation approach) and Mask R-CNN, using manually delineated reference data. As the benchmark method, we selected the algorithm implemented in the itcLiDAR function from the R package itcSegment [22]. Unlike methods relying on 2D CHM-based segmentation, this technique preserves the complete vertical structure and architectural complexity of forest canopies. The process begins by analyzing the spatial distribution of points within the cloud, using the provided EPSG code (32617) for spatial context and operating at a specified resolution of 0.15 m. Potential initial points or "seeds" within crowns are identified using a multi-scale search strategy (ranging from 3 to 9) combined with a relative height threshold (0.55). Following seed identification, the algorithm partitions the point cloud by assigning spatially proximate points to individual crown segments. This assignment is governed by distance constraints (1 m to 10 m) and criteria evaluating 3D structure and height similarity between points, influenced by parameters such as TRESHCrown (0.6) and, potentially, cell weighting (1). Points falling below an absolute height threshold (1 m above ground) are excluded from the crown segments. Ultimately, the algorithm assigns a unique numerical identifier (itcSegmentTreeID) to each point belonging to a delineated tree, partitioning the point cloud into discrete 3D tree entities and enabling comparative analysis with the proposed method.
Two accuracy assessments were conducted to evaluate the performance of the proposed two-stage network method [10]. The first assessment focused on the accuracy of Mask R-CNN delineation from the CHM image using manually delineated ground truth data. The second assessment evaluated the final accuracy of the combined two-stage network through a fully automated approach without manual interventions between stages. In this scenario, errors from each network could propagate to subsequent steps, potentially compounding inaccuracies.
For both assessments, reference data for accuracy evaluation were generated stepwise, tailored to the specific requirements of each network. The 3D LiDAR points of the ITC dataset were derived from manually delineated reference trees on the CHM (Figure 3). This structured approach ensured that each network's performance was assessed against precisely relevant criteria, providing detailed insights into both the efficacy of individual networks and the system's overall performance under both manual and automated conditions.
For ITC detection, precision (Equation (1)) and recall (Equation (2)) metrics were utilized to assess the accuracy of detected trees compared with the reference data. The F1 score, calculated from these metrics (Equation (3)), represents the overall detection accuracy. True positive (TP), false positive (FP), and false negative (FN) values were used to quantify detection performance: TP represents the number of correctly detected trees, FP denotes incorrectly detected trees, and FN corresponds to ground truth trees that are overlooked in the detection results.
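With TP, FP, and FN defined in this way, the three metrics take their standard forms:

\[ \text{Precision} = \frac{TP}{TP + FP} \tag{1} \]

\[ \text{Recall} = \frac{TP}{TP + FN} \tag{2} \]

\[ F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{3} \]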
Accuracy indices were calculated based on these three metrics. Precision evaluates the algorithm’s ability to correctly identify trees, while recall assesses its capability to detect all ground truth trees. A detection was considered correct if a single predicted ITC mask was located within the boundaries of a reference tree. If no ITC mask was found or multiple ITC masks were assigned to a single reference tree, the detection was deemed incorrect. Delineation accuracy is defined as the ratio of correctly delineated tree crowns to the total number of reference tree crowns.
In ITC delineation, which involved ITC mask and crown clustering, an alternative delineation definition was used. The delineation accuracy percentage represents the proportion of trees that were correctly identified and delineated by the algorithm relative to the reference dataset. A tree crown is considered correctly delineated if the Intersection over Union (IoU) between the predicted crown and the reference crown exceeds a predetermined threshold of 0.5. The mean Intersection over Union (mIoU) across all IoUs was adopted to measure the overall agreement between model-generated (Aseg) and reference (Aref) tree boundaries, providing a comprehensive view of the model’s performance in delineating ITC. mIoU can be calculated using the following equation.
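\[ \text{mIoU} = \frac{1}{N} \sum_{i=1}^{N} \frac{\left| A_{\text{seg},i} \cap A_{\text{ref},i} \right|}{\left| A_{\text{seg},i} \cup A_{\text{ref},i} \right|} \]

where N is the number of reference tree crowns, and A_seg,i and A_ref,i denote the model-generated and reference crown areas of tree i, respectively.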
5. Discussion
5.1. Performance on Coniferous, Deciduous, and Mixed Plots
The relationship between crown area and delineation accuracy varies across different forest types, as observed in the three validation plots. In the coniferous-dominated plot (Plot-1), where trees typically have smaller and more uniform crowns, the proposed two-stage deep learning method achieved high accuracy, reaching a delineation accuracy of 91.48% and an F1 score of 0.88. This outperformed both Mask R-CNN (90.03% accuracy, 0.87 F1) and lidR itcSegment (85.58% accuracy, 0.82 F1) within this study. The compact, well-defined crown shapes of coniferous trees enhance the model's ability to accurately distinguish individual trees, minimizing segmentation errors. The high precision (0.92) and recall (0.89) values indicate that the model effectively detects and delineates coniferous trees with minimal over-segmentation or omission errors. This superior performance in coniferous plots aligns with findings from previous studies utilizing high-resolution LiDAR data [38,39]. The proposed method shows a remarkable improvement, likely due to the enhanced ability of the two-stage approach to capture the typically regular crown shapes of coniferous trees. For instance, studies using fused terrestrial and UAV LiDAR reported F-scores around 0.78 for coniferous forests, while approaches using local maxima algorithms on UAV-derived Canopy Height Models in mixed conifer forests have shown F-scores around 0.86 [38]. Some deep learning studies, such as those using Mask R-CNN, have reported F1 scores up to 0.91, but often in specific contexts like plantations; this places our results among the higher performers, especially for coniferous stands, where accuracy is often higher due to distinct tree shapes [4].
In contrast, delineation accuracy declined in the mixed-wood (Plot-2) and deciduous plots (Plot-3), reaching 84.18% and 82.38%, respectively. This was slightly lower than the performance in the coniferous plot, a common trend in ITC delineation due to the complex and often overlapping crowns of deciduous trees. In the deciduous plot, Mask R-CNN achieved a comparable F1 score of 0.83, while itcSegment scored 0.80. The larger and more irregular crown areas in these plots complicate the segmentation process, as the model struggles to differentiate overlapping or closely spaced tree crowns. The deciduous plot exhibited the lowest precision (0.84), indicating a higher likelihood of false positives, likely due to the broader, less-defined tree crowns. Additionally, the fluctuating structure of deciduous trees, influenced by factors like seasonal foliage changes, introduces further uncertainty in delineation. These findings suggest that crown area is a critical factor affecting segmentation accuracy, with smaller, more defined crowns yielding better results [18]. To improve accuracy in mixed and deciduous forests, integrating additional data sources, such as spectral information or multi-temporal imagery, could help refine crown boundary detection and enhance model performance across diverse forest structures. Research using fused LiDAR reported F-scores around 0.8 for broadleaf forests [40], while other deep learning applications, like Mask R-CNN, have shown variable results in delineating unseen test trees, with an F1 score of 0.64 and a score of 0.74 for the tallest trees based on aerial RGB imagery in complex tropical forests [4]. These findings suggest that the deep learning methods evaluated in the present study perform well relative to established benchmarks for challenging deciduous environments. The mixed-wood plot (Plot-2) presented intermediate results, with the two-stage approach achieving 84.18% accuracy, outperforming Mask R-CNN (80.04%) and itcSegment (76.15%) by approximately 5.2% and 10.5% in relative terms, respectively. These results are similar to those reported in a comparable study using UAV-LiDAR data in mixed forests [41]. The improvement is most pronounced in this challenging mixed-forest environment, where the two-stage method's ability to leverage both 2D and 3D information proves particularly advantageous for distinguishing adjacent trees of different species and structures, making it a promising approach for operational forest inventory applications across diverse forest compositions.
5.2. Strengths and Limitations of the Proposed Method
The proposed method demonstrates strengths in 3D ITC delineation compared to the traditional itcSegment approach. As illustrated in Figure 7, the two-stage deep learning method delineates tree crowns in both horizontal and vertical profiles, addressing a critical limitation of conventional CHM-based approaches. The method's ability to distinguish individual trees in complex vertical arrangements is evident where the blue-colored tree is positioned beneath the green-colored tree in Figure 7(a1) yet segmented as a distinct entity. This capability represents a substantial advancement over itcSegment, which struggles with points at adjacent boundaries and under-canopy positions, as shown in Figure 7(b2). Furthermore, the proposed approach preserves the natural irregularity of crown boundaries, which is especially important for deciduous trees with large volumes and irregular extensions, contrasting with itcSegment's tendency to create artificially smooth crown boundaries, as shown in Figure 7(b1,b3). By using treetop regions as clustering references, the two-stage approach achieves more realistic and biologically relevant delineation results, aligning with actual forest structures rather than applying arbitrary geometric constraints.
While the two-stage method improves delineation accuracy by capturing vertical canopy structures more effectively, it still inherits the limitations of Mask R-CNN, particularly regarding boundary accuracy and small tree detection. Mask R-CNN occasionally produces imprecise crown boundaries and fails to detect smaller trees within dense canopies, errors that subsequently propagate to the second stage of our pipeline. This error accumulation is particularly problematic, as treetops missed in the initial detection phase cannot be recovered in the subsequent 3D segmentation stage, resulting in permanent omissions in the final delineation results. Additionally, CHM-based methods like Mask R-CNN face inherent challenges with deciduous trees, which often lack the pronounced height variations found in coniferous species. The relatively homogeneous crown surfaces of deciduous trees frequently lead to under-segmentation, as the subtle height transitions between adjacent crowns are insufficient for effective boundary delineation.
Although the two-stage approach outperforms both Mask R-CNN and itcSegment across all forest types, the performance gap between coniferous (91.48% accuracy) and deciduous forests (82.38% accuracy) indicates remaining challenges in complex canopy environments. Similar performance patterns have been observed in previous studies using high-resolution LiDAR data [4,27,41], where deciduous tree delineation typically yielded lower accuracy than coniferous trees due to complex crown structures. Additionally, the computational demands of the two-stage deep learning approach are considerably greater than those of traditional methods, potentially limiting its application in time-sensitive or resource-constrained scenarios.
Future work should focus on refining the initial detection stage to improve small tree recognition and enhance boundary definition for deciduous trees, potentially by incorporating texture features or spectral information to supplement the height-based differentiation currently employed. Additionally, post-processing techniques, such as applying a secondary refinement step using point cloud clustering, could help recover missed trees and improve recall in complex forest environments.
5.3. 2D and 3D Reference Datasets for ITC Delineation Using LiDAR Data
The development of both 2D and 3D reference datasets plays a crucial role in evaluating the accuracy of ITC delineation models, with each offering distinct advantages and limitations. However, to the best of our knowledge, these aspects have not been extensively discussed in the literature. The 2D reference dataset, primarily derived from the CHM, provides a simplified top-down view of the forest structure, making it useful for efficiently annotating individual tree crowns. However, the limited vertical information in CHM-based datasets often results in fuzzy ITC boundaries, particularly in areas where crowns overlap or trees exhibit irregular growth patterns [4,15]. This limitation is especially pronounced in mixed-wood and deciduous forests, where the lack of detailed height differentiation complicates accurate crown delineation. The crown boundaries of coniferous trees, characterized by well-defined treetops, are more easily identified using 2D datasets, whereas deciduous trees, with their broad and interconnected canopies, often exhibit segmentation inconsistencies. Despite these challenges, the 2D dataset remains a practical tool for rapid forest assessment and management, especially in coniferous-dominated areas.
In contrast, the 3D reference dataset, constructed from LiDAR point clouds, offers a more comprehensive representation of the forest's structure by capturing the full vertical profile of individual trees. The 3D dataset overcomes the limitations of 2D datasets by providing detailed structural information, enabling more precise forest management and analysis. Although creating a 3D dataset is time-consuming, it effectively captures the three-dimensional nature of each ITC, closely reflecting the true structure of individual trees in a forest. This level of detail is particularly beneficial for deciduous trees, where stem information is critical for accurately identifying tree numbers and delineating ITCs. The proposed methodology implemented a progressive approach beginning with 2D annotations to generate 3D training samples, followed by manual verification to produce final 3D reference data. This hybrid workflow reduces the burden of training data acquisition while maintaining annotation quality. The progression from 2D to 3D reference datasets establishes an accessible pathway for developing sophisticated 3D delineation algorithms, which is particularly crucial as deep learning applications in forestry continue to advance.
With the growing adoption of deep learning approaches in forestry applications, the importance of benchmark datasets cannot be overstated. While public access to our dataset would enhance research impact and reproducibility, the labor-intensive nature of manual tree crown delineation has limited our current validation to 1600 of the estimated 5500 trees in the study area. We intend to make the complete benchmark dataset, including all trees over the study area, publicly accessible online upon completion of validation work in the future, providing a valuable resource for the advancement of LiDAR-based ITC detection and delineation methods.
6. Conclusions
This study introduced a novel two-stage deep learning framework for improving individual tree crown (ITC) detection and delineation in mixed-wood forests. By integrating CHM-based treetop region segmentation using Mask R-CNN with LiDAR point cloud clustering via a specialized 3D U-Net architecture, our approach effectively addresses the limitations of both 2D and 3D methodologies. The first stage leverages Mask R-CNN's robust instance segmentation capabilities while maintaining computational efficiency, and the second stage refines crown boundaries in 3D space by incorporating the rich structural information available in LiDAR point clouds. Evaluated against manually delineated reference data, our approach outperforms established methods, including Mask R-CNN alone and the lidR itcSegment algorithm, achieving mIoU scores of 0.82 for coniferous plots, 0.81 for mixed-wood plots, and 0.79 for deciduous plots. This study demonstrates the great potential of the two-stage deep learning approach as a robust solution for 3D ITC delineation in mixed-wood forests.
Our work will contribute to the field by establishing a comprehensive benchmark dataset of 2D and 3D references for airborne LiDAR data in mixed-wood forests. This benchmark addresses a critical gap in the literature by providing high-quality training data that capture the complexity of heterogeneous forest environments.
The 3D and 2D ITC products generated through this research have significant practical applications in sustainable forest management, biodiversity assessment, and carbon stock estimation. By providing accurate individual tree-level information, forest managers can make more informed decisions regarding resource allocation, conservation strategies, and climate change mitigation efforts. As discussed above, future work should focus on refining the initial detection stage, improving small tree recognition and boundary definition for deciduous trees through texture or spectral features, and adding post-processing refinements, such as a secondary point cloud clustering step, to recover missed trees and improve recall in complex forest environments.