1. Introduction
Forest resources, as integral components of terrestrial ecosystems, play a pivotal role in safeguarding essential ecosystem functions, including biodiversity support, atmospheric regulation, and hydrological cycle homeostasis [1,2]. Consequently, detailed forest structural parameters provide the foundational data indispensable for the sustainable stewardship of these vital resources. Conventional mensuration techniques, however, are notably hampered by substantial costs, inherent temporal limitations, and challenges related to site accessibility [3]. The advent of light detection and ranging (LiDAR) technology, particularly terrestrial laser scanning (TLS), represents a paradigm shift in forestry applications [4,5]. These laser-based methodologies have emerged as powerful instruments for detailed forest assessment. Central to leveraging TLS data [6] for detailed forest structural parameters is the capacity for precise individual tree segmentation (ITS), which is indispensable for the accurate estimation of critical biophysical parameters, such as above-ground biomass, sequestered carbon, and taxonomic classification.
Individual tree segmentation (ITS) methodologies based on LiDAR have advanced into three principal paradigms: the canopy height model (CHM)-based method, the point-based method, and the deep learning-based method [7]. CHM-based methods first rasterize 3D point clouds into 2D CHM representations and then identify tree apexes through local maxima detection within the CHM [8,9]. Image segmentation algorithms are then employed to delineate canopy boundaries and accomplish individual tree segmentation. Hyyppä et al. [10] utilized fixed 3 × 3 search windows to detect local maxima as seed points for subsequent region growing segmentation. Li et al. [11] were the first to introduce point cloud density features to diagnose the accuracy of the initial segmentation obtained from traditional CHM segmentation; they further utilized density and height information to guide a three-dimensional morphological analysis of the canopy point cloud, thereby accurately correcting segmentation errors. CHM-based methods offer computationally efficient crown segmentation in forests but face inherent limitations from 3D to 2D rasterization [12], which causes substantial loss of 3D structural data. The point-based approach, which overcomes this limitation of the CHM-based approach, is primarily categorized into clustering algorithms and region growing methods [13,14,15]. Clustering approaches partition points within feature spaces analogous to 3D tree point clouds [16]. Morsdorf et al. [17] employed identified tree apexes as initial cluster seeds, utilizing k-means clustering for single-tree segmentation. In contrast to clustering methods, region growing operates through iterative rule-based merging and partitioning of point clusters. These methodologies integrate morphological and ecological priors into geometric criteria, thereby enriching semantic information for enhanced segmentation fidelity [18]. Liu et al. [19] pioneered a trunk growth methodology for individual tree segmentation, demonstrating efficacy in natural forest ecosystems. Wang et al. [20] developed an unsupervised segmentation framework leveraging supervoxel morphological features, achieving high segmentation precision in synthetic forest datasets. While point-based extraction directly processes 3D structural data, which can enable superior delineation of understory and subcanopy trees [21], it faces limitations in computational efficiency, accuracy degradation at scale, limited generalizability, and hyperparameter sensitivity during large-scale point cloud processing [22,23,24,25,26].
Building upon recent advancements in point cloud semantic segmentation and inspired by emerging developments in point cloud instance segmentation [27,28], researchers have begun exploring deep learning-based methods to address individual tree delineation [29,30,31]. Luo et al. [32] proposed a novel top-down approach that leverages neural networks to learn geometric feature distributions at crown edges, integrating tree center detection and region growing algorithms to achieve high-precision individual tree segmentation in complex urban environments. Hu et al. [33] enhanced the point Transformer architecture to precisely separate ground and non-tree points, subsequently combining watershed algorithms with hierarchical clustering to resolve crown occlusion challenges. Further advancing this domain, Chang et al. [34] presented a two-stage approach: semantic segmentation isolates tree points while filtering understory noise, followed by a YOLOv3 detector for trunk localization and multi-level adaptive clustering to effectively address small tree omission and dense canopy occlusion. These innovations mark a transition from isolated algorithmic approaches to model symbiosis, enabling robust tree instance delineation. Notably, while deep learning methods adopt 3D semantic segmentation frameworks [35], tree instance segmentation often relies on classical algorithms that frequently suffer from undersegmentation and oversegmentation, with accuracy depending heavily on point cloud density, parameter settings, and terrain noise. This highlights the critical need for automated, robust, and generalizable ITS methods.
Driven by the conceptual advancement of panoptic segmentation, researchers have increasingly focused on integrating semantic and instance segmentation within 3D point cloud processing frameworks [36,37,38,39]. This bottom-up segmentation paradigm enables deep learning networks to concurrently learn semantic features and instance discriminative representations, offering novel pathways toward fully automated individual tree segmentation [40,41,42,43]. Early studies predominantly adapted indoor scene instance segmentation methodologies. Jiang et al. [44] proposed PointGroup, which employs dual coordinate clustering in both raw and centroid offset spaces, coupled with ScoreNet for the quality assessment of candidate instances. Their innovation lies in leveraging inter-object spatial voids for instance grouping, a strategy with implications for separating adjacent trees in forest environments. Xiang et al. [45,46] proposed a novel urban landscape panoptic segmentation strategy to experimentally compare different 3D backbone networks and instance segmentation strategies for outdoor mobile mapping point clouds. Their experiments demonstrated that, regardless of the backbone, instance segmentation by clustering embedding features is better than using shifted coordinates. Nevertheless, these approaches primarily target urban tree segmentation, with limited attention to forest environments. Xiang et al. [47] further developed ForAINet, which integrates complementary strategies of 3D centroid prediction and discriminative embeddings. By employing a cylindrical block partitioning strategy to mitigate memory constraints in large-scale scenes, the segmentation back-end achieves an F1-score of over 85% for individual trees. Concurrently, Wielgosz et al. [48] introduced SegmentAnyTree, which enhances point cloud data through stochastic downsampling and employs a sparse convolutional network with a multi-task learning head to predict semantic labels and instance offset vectors. This model attained an mAP50 of 64.0% in Norwegian spruce forests. However, postprocessing steps involving the aggregation of fragmented tree projections risk introducing errors such as instance fragmentation. In contrast, Henrich et al. [49] proposed TreeLearn, which circumvents heuristic block merging by predicting offset vectors from points to tree bases, projecting points to their respective trunks, and clustering complete instances across the entire point cloud.
Despite these advancements, deep learning-based individual tree segmentation faces persistent challenges in point cloud processing: (1) Sparse and irregular point distributions complicate precise boundary delineation between adjacent crowns, particularly under dense canopy overlaps. (2) Overreliance on learnable offset vectors for displacement in center shift is inherently limited by point cloud sparsity, density differences, and morphological diversity across tree species, leading to erroneous point assignments. (3) Fragmentation and oversegmentation persist in clustering algorithms, especially for irregularly shaped or large trees, undermining segmentation fidelity.
To address these limitations, we propose an automated individual tree segmentation network, SPA-Net, designed to overcome inaccurate centroid localization and over-segmentation and to enhance segmentation accuracy. Our study makes four main contributions: (1) An efficient, automated deep learning-based network for individual tree segmentation from forest point clouds is developed, achieving excellent results in both plantation forests. (2) A sparse geometric proposal (SGP) module is designed that removes the reliance on offset vectors, avoids the challenges of offset prediction, and accelerates clustering through a three-tier cascaded strategy of medoid sampling, centripetal shrink, and topological connectivity clustering. (3) An affinity aggregation (AA) module is proposed that mitigates over-segmentation and significantly improves large tree segmentation coherence. (4) An SGP + AA approach replaces conventional offset/embedding branches and subsequent scoring/NMS filtering methods.
3. Methods
The SPA-Net network for individual tree segmentation initiates processing with an input data generator applying statistical outlier removal (SOR) and cylindrical sampling to large-scale point clouds. These sampled cylinders are subsequently fed into a 3D backbone network for deep per-point feature extraction. A semantic head then performs point-wise classification (tree and non-tree). An instance branch, designed to differentiate instances among tree points, comprises two key modules. The sparse geometric proposal (SGP) module generates initial instance proposals directly, eschewing traditional offset predictions. The affinity aggregation (AA) module then utilizes backbone features to assess inter-proposal relationships, merging fragmented proposals of the same tree and outputting locally consistent instance results per cylinder. Global consistency is achieved via the cylinder merging (CM) module, which integrates results across all cylinders. The CM module resolves overlap conflicts and merges cross-boundary fragments, assigning a globally unique instance ID to each tree instance. The final segmentation merges semantic labels and global instance IDs. The architecture of the SPA-Net network is depicted in Figure 4.
3.1. Input Data Generator
To eliminate conspicuous noise and statistical outliers from the forest point clouds, we apply the statistical outlier removal (SOR) function of the CloudCompare software (Version: 2.12.4), which prevents these invalid points from interfering with subsequent analyses and improves the reliability of the final results. Following this noise filtration, we implement a cylindrical sampling strategy to generate data subsets suitable for network training and processing. Specifically, a vertical cylindrical volume, defined by a radius of 4 m and centered on a designated point within the point cloud, is extracted to serve as an independent sampling unit. During training, the center point is determined by randomly sampling a training data point, with points belonging to rarer classes given a higher probability of being selected to ensure adequate coverage; specifically, the sampling probability is proportional to the inverse square root of the class frequency. In the testing phase, however, the center points are sampled regularly along a grid defined by the x and y coordinates using a fixed step size to ensure even coverage of the entire point cloud area. The center point is therefore not fixed but is determined either by weighted random sampling during training or by regular grid sampling during testing. All points encompassed within this cylinder constitute a single sample.
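The sampling scheme described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the implementation used for SPA-Net: the function names are hypothetical, points are plain (x, y, z) tuples, and the grid step is left as a free parameter because its value is not given in the text.

```python
import math
import random
from collections import Counter

def class_weights(labels):
    """Per-point sampling weight proportional to 1 / sqrt(class frequency)."""
    counts = Counter(labels)
    total = len(labels)
    return [1.0 / math.sqrt(counts[c] / total) for c in labels]

def sample_training_center(points, labels, rng):
    """Training: draw one cylinder center, favoring points of rarer classes."""
    return rng.choices(points, weights=class_weights(labels), k=1)[0]

def grid_centers(points, step):
    """Testing: regular (x, y) grid over the bounding box of the cloud."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    nx = int((max(xs) - min(xs)) // step) + 1
    ny = int((max(ys) - min(ys)) // step) + 1
    return [(min(xs) + i * step, min(ys) + j * step)
            for i in range(nx) for j in range(ny)]

def extract_cylinder(points, center, radius=4.0):
    """Indices of points whose horizontal distance to the center is <= radius."""
    cx, cy = center[0], center[1]
    r2 = radius * radius
    return [i for i, p in enumerate(points)
            if (p[0] - cx) ** 2 + (p[1] - cy) ** 2 <= r2]
```

Because the cylinder is vertical, only the x and y coordinates enter the distance test; the z coordinate is ignored, so every point above or below the center within the 4 m radius is kept.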
3.2. Backbone
Upon acquiring the subsampled point cloud representing a cylindrical subvolume, we employ a 3D U-Net network, implemented using the Minkowski Engine, as the backbone network to extract hierarchical point features [50]. This symmetric encoder–decoder structure, typically enhanced with skip connections, processes the input through multi-resolution sparse volumetric representations. At each hierarchical level, sparse 3D convolutions operate on the voxelized data to capture contextual relationships across varying spatial extents, ultimately yielding rich point-wise feature vectors F [51]. Finally, the semantic head and instance branch take the fused features F as input and output semantic labels and instance proposals.
3.3. Semantic Branch
The semantic branch performs semantic segmentation by assigning a specific label (tree or non-tree) to each point. The non-tree class encompasses points belonging to the ground or smaller understory vegetation, while the tree class includes all points belonging to actual trees. To achieve this, the point features F extracted from the U-Net backbone are fed into the semantic segmentation branch, a three-layer multilayer perceptron (MLP) that outputs per-point scores for the tree and non-tree classes. The class with the highest score is assigned as the point’s semantic label. The module is trained under the supervision of a cross-entropy loss function.
3.4. Sparse Geometric Proposal
Instance segmentation methods based on 3D center offsets have been widely adopted in previous studies. The core idea is to predict a vector for each point that points toward its instance center, ideally collapsing all points of an instance to a single location. However, accurately predicting these offsets is challenging in sparse LiDAR data. This difficulty is particularly problematic for neighboring trees, as inaccurate offset prediction leads to unreliable estimated centers and erroneous merging of distinct objects. Figure 5 illustrates this incorrect clustering of close instances. To address this critical issue, we propose an SGP module that replaces the problematic offset prediction branch.
The overall structure of our proposed SGP module is detailed in Figure 6.
The pseudocode depicting the specific implementation procedure of the SGP module is illustrated as follows (Algorithm 1):
Algorithm 1. The specific implementation procedure of the SGP module
Input: Ptree, Voxel_size, RL, L, RL/2
Output: Instance_IDs
// Part 1: Medoid Sampling (MS)
S ← ∅
Voxel_map ← Initialize_a_map()
for each point p in Ptree do
    voxel_coord ← floor(p.xyz / Voxel_size)
    Voxel_map[voxel_coord].append(p)
end for
Point_to_seed_map ← Initialize_a_map()
for each voxel_coord, points_in_voxel in Voxel_map do
    sseed ← Mean(points_in_voxel.xyz)
    S.append(sseed)
    for each point p in points_in_voxel do
        Point_to_seed_map[p] ← sseed
    end for
end for
// Part 2: Centripetal Shrink (CS)
Srefined ← S
G ← Build graph on Srefined with edges connecting points if distance < RL
Anorm ← Calculate normalized adjacency matrix from G
Define encoding function: f(v) ← sign(v) · log(1 + |v|)
Define decoding function: f⁻¹(venc) ← sign(venc) · (exp(|venc|) − 1)
for i ← 1 to L do
    Sencoded ← f(Srefined)
    Snext_encoded ← Anorm · Sencoded
    Srefined ← f⁻¹(Snext_encoded)
end for
// Part 3: Topological Connectivity Clustering (TCC)
Gfinal ← Build graph on Srefined with edges connecting points if distance < RL/2
Seed_Instance_IDs ← Connected_Components_Labeling(Gfinal)
Instance_IDs ← Initialize_array_for_all_points_in_Ptree
for each original point p in Ptree do
    sseed ← Point_to_seed_map[p]
    Instance_IDs[p] ← Seed_Instance_IDs[sseed]
end for
return Instance_IDs
Raw LiDAR point clouds are large and spatially heterogeneous, with density varying with sensor distance and between surfaces; this complicates analyses and makes full dataset processing computationally prohibitive. To address this, we employ a medoid sampling (MS) strategy in which the raw cloud is partitioned into uniform voxel blocks and, for each nonempty voxel, the average of all points contained within it is calculated and becomes the single seed point representing that voxel. This approach decouples sampling from point density, yields a spatially balanced set of representative seed points, and inherently links raw points to their seeds via the voxel structure, avoiding explicit proximity searches.
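A minimal sketch of this voxel-based seed generation is given below, following Part 1 of Algorithm 1, where each seed is the mean of the points in its voxel. The function name and the list-based return format are illustrative assumptions; a practical implementation would likely vectorize these loops.

```python
import math
from collections import defaultdict

def voxel_mean_seeds(points, voxel_size):
    """Return (seeds, point_to_seed): one mean seed per nonempty voxel.

    points        : list of (x, y, z) tuples
    seeds         : list of (x, y, z) voxel means
    point_to_seed : point_to_seed[i] is the index of the seed of points[i]
    """
    voxels = defaultdict(list)
    for idx, (x, y, z) in enumerate(points):
        key = (math.floor(x / voxel_size),
               math.floor(y / voxel_size),
               math.floor(z / voxel_size))
        voxels[key].append(idx)

    seeds = []
    point_to_seed = [0] * len(points)
    for seed_id, members in enumerate(voxels.values()):
        n = len(members)
        # The seed is the average of all points that fall in this voxel.
        seeds.append(tuple(sum(points[i][k] for i in members) / n
                           for k in range(3)))
        for i in members:
            point_to_seed[i] = seed_id
    return seeds, point_to_seed
```

The point-to-seed link comes directly from the voxel membership, so no nearest-neighbor search is needed to map instance labels from seeds back to raw points later on.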
Initial seed points, while providing a sparse representation, may suffer from localization inaccuracies that hinder precise instance delineation. As shown in Figure 6, we propose the centripetal shrink (CS) module to mitigate this issue; it iteratively refines seed positions toward tree centroids without relying on pre-defined offsets. The CS module takes as input a sparse set of seed point coordinates $S \in \mathbb{R}^{M \times 3}$, where M denotes the number of seed points. A graph structure that represents inter-seed point relationships is established and characterized by a pre-computed normalized adjacency matrix, $A_{norm} = D^{-1}A$. The matrix $A$ delineates connectivity (edges) among seed points (nodes), predicated upon their initial spatial proximity within a pre-defined radius $R_L$, and $D$ represents the corresponding degree matrix. After 10 iterations, the module yields a refined set of seed point coordinates $S_{refined}$.
The essence of the CS module resides in its iterative refinement loop, where the contraction process, driven by neighborhood averaging, is executed entirely within a transformed, encoded domain. This ensures that the encoding directly influences the geometric interpretation of proximity and centrality during each step of the iterative shrinking. We employ an encoding function $f$ and its inverse decoding function $f^{-1}$. These functions operate element-wise on each coordinate $v$ of a trivariate vector $\mathbf{v} = (v_x, v_y, v_z)$; these operations are specified as:
$$f(v) = \mathrm{sign}(v) \cdot \log(1 + |v|)$$
$$f^{-1}(v_{enc}) = \mathrm{sign}(v_{enc}) \cdot (\exp(|v_{enc}|) - 1)$$
where $v_{enc}$ is the encoded value of the coordinate component $v$, $f^{-1}(v_{enc})$ is the decoded value of the coordinate component, and $\exp(\cdot)$ is the exponential function.
The application of $f$ transforms the original Euclidean space into a non-linear encoded space where distances and, consequently, the notion of a neighborhood centroid, are rescaled. This logarithmic encoding paradigm compresses substantial coordinate differentials, thereby fostering more stable adjustments across heterogeneous arboreal scales and point cloud densities.
Given initial seed coordinates $S^{(0)}$, the module iteratively updates these positions for $l$ = 0 to L − 1. The current seed point coordinates $S^{(l)}$ are first mapped into the encoded space:
$$S_{enc}^{(l)} = f(S^{(l)})$$
where $S^{(l)}$ is the matrix of 3D coordinates of all M seed points at iteration $l$, and $S_{enc}^{(l)}$ is the matrix of encoded 3D coordinates of all seed points at iteration $l$. All subsequent operations for this iteration occur in this encoded space.
The iterative contraction is achieved by updating each encoded seed point to the average of the encoded coordinates of its connected neighbors in the graph. This operation inherently uses the encoded representations $S_{enc}^{(l)}$:
$$\tilde{S}_{enc}^{(l)} = A_{norm} \, S_{enc}^{(l)}$$
Specifically, for each seed point $p$, its new encoded coordinates become the mean of the encoded coordinates of its neighbors:
$$\tilde{s}_{enc,p}^{(l)} = \frac{1}{|N(p)|} \sum_{q \in N(p)} s_{enc,q}^{(l)}$$
where $\tilde{s}_{enc,p}^{(l)}$ is the 3D encoded coordinates of seed point p after averaging at iteration $l$, and $|N(p)|$ is the number of neighbors of seed point $p$ in the graph.
Crucially, because this averaging is performed on $S_{enc}^{(l)}$ (the encoded coordinates), the determination of the neighborhood’s geometric center and the subsequent pull on point $p$ are directly influenced by the non-linear properties of the encoding function $f$. Large spatial separations in the original space are compressed in the encoded space, modulating their influence during averaging. This results in an adaptive contraction process where the step size and direction are implicitly adjusted by the encoding. The iterative application of this encoded-space averaging causes connected components of the graph to contract towards their respective geometric centers as defined within this non-linear encoded domain.
The newly computed encoded coordinates $\tilde{S}_{enc}^{(l)}$, representing the contracted positions within the encoded domain, are subsequently transformed back to the original coordinate space for the next iteration or final output:
$$S^{(l+1)} = f^{-1}(\tilde{S}_{enc}^{(l)})$$
After L iterations, the module outputs the final refined seed point coordinates $S_{refined} = S^{(L)}$. This strategy of embedding the iterative graph-based contraction entirely within an encoded space imparts a non-linear sensitivity to the displacement adjustment mechanism. This intrinsic coupling ensures that the encoding is not a mere pre/post-processing step but an integral part of how spatial relationships are interpreted and acted upon during each phase of the iterative refinement. The design preserves the capacity for substantial adjustments while improving the precision of fine-grained refinements during convergence.
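The encoded-space contraction can be illustrated with a short, dependency-free sketch. It follows Part 2 of Algorithm 1 (encode, average over graph neighbors, decode, repeat), but several details are assumptions made for illustration: the function names are hypothetical, the neighbor graph is built once from the initial seeds with a brute-force search, and each seed is counted as its own neighbor so that isolated seeds remain well defined.

```python
import math

def encode(v):
    """f(v) = sign(v) * log(1 + |v|) for a single coordinate."""
    return math.copysign(math.log1p(abs(v)), v)

def decode(v_enc):
    """f^{-1}(v_enc) = sign(v_enc) * (exp(|v_enc|) - 1)."""
    return math.copysign(math.expm1(abs(v_enc)), v_enc)

def build_neighbors(seeds, radius):
    """Neighbor lists of the radius graph (each seed includes itself)."""
    return [[j for j, b in enumerate(seeds) if math.dist(a, b) < radius]
            for a in seeds]

def centripetal_shrink(seeds, radius, iterations=10):
    """Contract seeds toward local centers by averaging in encoded space."""
    nbrs = build_neighbors(seeds, radius)  # pre-computed once, then fixed
    current = [tuple(s) for s in seeds]
    for _ in range(iterations):
        enc = [tuple(encode(c) for c in s) for s in current]
        # Row-normalized adjacency times encoded coordinates:
        # every seed moves to the mean of its neighbors' encoded positions.
        averaged = [
            tuple(sum(enc[j][k] for j in nbrs[i]) / len(nbrs[i])
                  for k in range(3))
            for i in range(len(enc))
        ]
        current = [tuple(decode(c) for c in s) for s in averaged]
    return current
```

In this sketch, seeds that share a connected component collapse toward a common location after a few iterations, while a seed with no neighbors other than itself stays where it is.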
Subsequent to the centripetal shrink (CS), seed points pertaining to the same tree exhibit enhanced spatial compactness. We then employ a topological connectivity clustering (TCC) method that groups the converged seed points based on connectivity and aggregates them into preliminary instance proposals.
To this end, a new connectivity graph is constructed upon $S_{refined}$. Graph edges connect pairs of points $(s_i, s_j)$ if their Euclidean distance falls below a typically more stringent threshold $R_L/2$ that effectively dictates the spatial granularity of the generated primitive clusters. Subsequently, a connected components labeling (CCL) algorithm [52] is employed to partition this graph. This process assigns all mutually reachable seed points within the graph to the same unique connected component. Each connected component identified is designated as an initial tree instance candidate, denoted $C_i$, $i = 1, \dots, N_C$, where $N_C$ is the total number of proposals. Crucially, all original input points that were initially associated with the seed points constituting a single connected component $C_i$ are then assigned the corresponding unique candidate ID.
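The grouping step can be sketched with a small union-find routine. This is an illustrative stand-in for the connected components labeling algorithm cited above, not a reproduction of it; the function names are hypothetical and the pairwise distance loop is quadratic, which is acceptable only because the seed set is sparse.

```python
import math

def connected_components(seeds, radius):
    """Label seeds so that seeds linked by a chain of edges share one ID.

    An edge joins two seeds whose Euclidean distance is below `radius`
    (R_L / 2 in the text). IDs are 0, 1, 2, ... in order of first appearance.
    """
    parent = list(range(len(seeds)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(seeds)):
        for j in range(i + 1, len(seeds)):
            if math.dist(seeds[i], seeds[j]) < radius:
                parent[find(i)] = find(j)

    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(len(seeds))]

def propagate_labels(point_to_seed, seed_labels):
    """Give every original point the candidate ID of its seed."""
    return [seed_labels[s] for s in point_to_seed]
```

For example, four seeds on a line at x = 0, 0.4, 0.8, and 5 with a radius of 0.5 yield two candidates: the first three seeds are chained together, and the last forms its own component.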
3.5. Affinity Aggregation Module
The initial tree instance candidates generated by the SGP module, while locally coherent, may split a single real tree into multiple candidates due to factors such as point cloud sparsity, occlusion, and the complex structure of large trees. In order to restore instance integrity, we design an affinity aggregation (AA) module whose core idea is to learn to predict whether any two initial tree instances $C_i$ and $C_j$ should belong to the same final tree instance, and to perform a merger based on this prediction.
For each constituent point within a tree instance $C_i$, given its point set position $P_i$ and point-wise features $F_i$, we first employ a 1 × 1 convolutional layer to refine $F_i$, producing new point-wise features $F_i'$. Multi-layer perceptrons (MLPs) are used to enhance $F_i'$ by incorporating the position $P_i$. Finally, the resulting composite representation is aggregated via max-pooling to generate the global instance feature $G_i$, which can be calculated as follows:
$$F_i' = \mathrm{Conv}_{1 \times 1}(F_i), \qquad G_i = \mathrm{MaxPool}\big(\mathrm{MLP}(F_i', P_i)\big)$$
To address the lack of sufficient contextual information in the characteristics of a single instance proposal to judge its relationship with other proposals, we further apply the KNN-transformer [53,54] within the AA module to enhance the interactions among global instance features, as shown in Figure 7, thereby integrating potentially fragmented instances.
For each instance proposal $C_i$, its centroid $c_i$ is determined. To facilitate informative interaction between each instance and its pertinent surrounding counterparts, the K-nearest neighbors (KNN) algorithm uses these centroid locations to identify, for instance $C_i$, the K spatially closest candidate regions among the other instances. These neighbors collectively constitute the neighborhood $\mathcal{N}_K(i)$ of instance $C_i$. The importance of this step lies in its exclusive focus on locally relevant instances, thereby circumventing the computational complexity associated with global calculations.
For each instance candidate region $C_i$, a global instance feature $G_i$ is acquired. Analogously, for each adjacent candidate region $C_j$ in $\mathcal{N}_K(i)$, its corresponding global instance feature $G_j$ is likewise obtained. Subsequently, these global instance features are leveraged for feature interaction.
The global feature $G_i$ is linearly projected as a query vector $q_i$, while neighborhood features $G_j$ are projected as key vectors $k_j$ and value vectors $v_j$. Subsequently, the scaled dot-product similarity between the query vector $q_i$ and all key vectors $k_j$ is computed. The resulting similarity scores are then normalized using the Softmax function to yield attention weights. These weights are subsequently employed to compute a weighted sum of the corresponding value vectors $v_j$, resulting in an enhanced feature vector $\tilde{G}_i$ that integrates the neighborhood context. $\tilde{G}_i$ is calculated as follows:
$$\tilde{G}_i = \sum_{j} \mathrm{Softmax}_j\!\left(\frac{q_i * k_j}{\sqrt{d}}\right) v_j$$
where the sum runs over the K neighboring proposals $j$ of instance $i$, * denotes the dot product, and $d$ = 64 denotes the feature dimension.
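The neighbor selection and attention step can be sketched as follows. This is a simplified illustration under stated assumptions: the learned linear projections are omitted, so the query, keys, and values are passed in already projected, and the function names are hypothetical.

```python
import math

def knn_indices(centroids, i, k):
    """Indices of the k proposals whose centroids are closest to proposal i."""
    others = [j for j in range(len(centroids)) if j != i]
    others.sort(key=lambda j: math.dist(centroids[i], centroids[j]))
    return others[:k]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention of one query over its neighbors.

    query  : list of d floats (projected feature of proposal i)
    keys   : list of K lists of d floats (projected neighbor features)
    values : list of K lists of floats (projected neighbor features)
    """
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    return [sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))]
```

Restricting the keys and values to the K nearest proposals keeps the cost per instance proportional to K rather than to the total number of proposals in the cylinder.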
Following the acquisition of enhanced features $\tilde{G}_i$ and $\tilde{G}_j$, we construct a discriminative network structured as an MLP, which is designed to predict the affinity between any two candidate instances, $C_i$ and $C_j$, indicating their likelihood of originating from the same ground-truth object. It takes as input the enhanced features of both instances ($\tilde{G}_i$, $\tilde{G}_j$), the Euclidean distance between their centroids $d_{ij}^{c}$, and the minimum point-to-point distance between them $d_{ij}^{min}$. The resulting scalar value $a_{ij}$, after applying the Sigmoid activation function, represents the posterior probability P($C_i$, $C_j$) that $C_i$ and $C_j$ originate from the same tree, effectively quantifying this affinity.
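The training of this probabilistic predictor is supervised by a binary cross-entropy loss $L_{aff}$:
$$L_{aff} = -\sum_{(i,j)} \Big[ y_{ij} \log P(C_i, C_j) + (1 - y_{ij}) \log\big(1 - P(C_i, C_j)\big) \Big]$$
where $y_{ij}$ is set to 1 if $C_i$ and $C_j$ share the same instance ID; otherwise, it is set to 0. The instance ID for each proposal is determined by majority voting.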
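The supervision signal can be illustrated with a short sketch of majority voting and the binary cross-entropy over proposal pairs. The function names are hypothetical, the raw MLP outputs are supplied as a dictionary for simplicity, and the loss is averaged over pairs here; that normalization is an assumption, as is the clamping constant used to avoid taking the logarithm of zero.

```python
import math
from collections import Counter
from itertools import combinations

def majority_instance_id(point_gt_ids):
    """Ground-truth ID held by most points of a proposal (majority voting)."""
    return Counter(point_gt_ids).most_common(1)[0][0]

def sigmoid(x):
    """Logistic function mapping a raw score to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

def affinity_loss(logits, proposal_ids, eps=1e-7):
    """Mean binary cross-entropy over all proposal pairs (i, j) with i < j.

    logits       : dict mapping (i, j) to the raw affinity score a_ij
    proposal_ids : majority-voted ground-truth ID of each proposal
    """
    total, n = 0.0, 0
    for i, j in combinations(range(len(proposal_ids)), 2):
        y = 1.0 if proposal_ids[i] == proposal_ids[j] else 0.0
        p = min(max(sigmoid(logits[(i, j)]), eps), 1.0 - eps)
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
        n += 1
    return total / n if n else 0.0
```

A pair of fragments cut from the same tree receives the target 1, so minimizing the loss pushes their predicted affinity toward 1 and makes them candidates for merging.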
3.6. Cylinder Merging Module
After processing overlapping cylindrical subvolumes, the cylinder merging (CM) module, inspired by Xiang et al. [47], consolidates these local segmentations. It iteratively merges local tree instances into a globally consistent map by evaluating their spatial overlap with existing global instances. If the overlap exceeds a threshold, points are assigned to the existing global ID; otherwise, a new global ID is created. This yields a global instance label map for the final segmentation.
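The merging rule can be sketched as follows. This is a simplified illustration rather than the CM module itself: the overlap measure (the fraction of a local instance's points that already carry one global ID), the default threshold of 0.5, the first-assignment-wins conflict rule, and the function name are all assumptions.

```python
from collections import Counter

def merge_cylinders(cylinders, threshold=0.5):
    """Merge per-cylinder instances into one global point-to-ID map.

    cylinders : list of cylinders; each cylinder is a list of local
                instances; each instance is a set of global point indices.
    Returns a dict mapping point index to global instance ID.
    """
    global_ids = {}
    next_id = 0
    for instances in cylinders:
        for pts in instances:
            # Count which existing global IDs this local instance overlaps.
            votes = Counter(global_ids[p] for p in pts if p in global_ids)
            best_id, best_count = votes.most_common(1)[0] if votes else (None, 0)
            if best_id is not None and best_count / len(pts) > threshold:
                target = best_id        # enough overlap: reuse the global ID
            else:
                target = next_id        # otherwise open a new global ID
                next_id += 1
            for p in pts:
                global_ids.setdefault(p, target)  # earlier labels are kept
    return global_ids
```

With two overlapping cylinders, a tree cut by the cylinder boundary is seen twice; the second fragment shares most of its points with the first and therefore inherits its global ID instead of creating a duplicate.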
3.7. Evaluation Metrics
Since all of the methods compared in this work achieve a reasonable segmentation into tree and non-tree points, we do not evaluate this aspect; evaluation focuses solely on individual tree instance segmentation. In order to evaluate the accuracy of individual tree instance segmentation, we adopt the evaluation scheme of Xiang et al. [45]. The evaluation process commences by formally defining the ground-truth tree instance set as $\mathcal{G} = \{g_1, \dots, g_{N_{gt}}\}$, where $N_{gt}$ denotes the cardinality of ground-truth instances. Correspondingly, the predicted tree instance set is expressed as $\hat{\mathcal{G}} = \{\hat{g}_1, \dots, \hat{g}_{N_{pred}}\}$.
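For each pairwise combination of ground-truth instance $g_i$ and predicted instance $\hat{g}_j$, the point-wise intersection over union (IoU) metric is computed through the following formulation:
$$\mathrm{IoU}(g_i, \hat{g}_j) = \frac{|TP_{ij}|}{|TP_{ij}| + |FP_{ij}| + |FN_{ij}|}$$
where $TP_{ij}$, $FP_{ij}$, and $FN_{ij}$, respectively, denote the true positive, false positive, and false negative point sets derived from the spatial correspondence between $g_i$ and $\hat{g}_j$. For each ground truth tree, we determine the index $j^*(i)$ of the predicted tree $\hat{g}_{j^*(i)}$ with the highest IoU score:
$$j^*(i) = \arg\max_{j} \mathrm{IoU}(g_i, \hat{g}_j)$$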
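This pairwise approach ensures that each true tree instance is associated with a predicted instance, even in the presence of mergers or splits. In order to systematically compare the segmentation performance between the different methods, even if the number of trees matched according to a specific criterion varies, the coverage, which is defined as the average of the IoUs between all pairs of true and predicted values, is used as follows:
$$\mathrm{Cov} = \frac{1}{N} \sum_{i=1}^{N} \max_{j} \mathrm{IoU}(g_i, \hat{g}_j)$$
where $N$ is the number of ground-truth trees, $g_i$ is the i-th ground-truth tree, and the maximum is taken over all predicted trees $\hat{g}_j$.
For a more detailed performance evaluation, we also calculated the average precision, recall, and F1 as follows:
$$\mathrm{Prec} = \frac{TP}{TP + FP}, \qquad \mathrm{Rec} = \frac{TP}{TP + FN}, \qquad \mathrm{F1} = \frac{2 \cdot \mathrm{Prec} \cdot \mathrm{Rec}}{\mathrm{Prec} + \mathrm{Rec}}$$
where $TP$, $FP$, and $FN$ here denote the numbers of true positive, false positive, and false negative tree instances.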
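These metrics can be computed with a few set operations when each tree is represented as a set of point indices. The sketch below is illustrative only: the function names are hypothetical, and the rule that a ground-truth tree counts as a true positive when its best IoU exceeds 0.5 is an assumption, since the matching threshold is not stated in this section.

```python
def iou(gt, pred):
    """Point-wise intersection over union of two point-index sets."""
    union = len(gt | pred)
    return len(gt & pred) / union if union else 0.0

def best_iou(gt, preds):
    """Highest IoU between one ground-truth tree and any predicted tree."""
    return max((iou(gt, p) for p in preds), default=0.0)

def coverage(gts, preds):
    """Mean over ground-truth trees of their best IoU with a prediction."""
    return sum(best_iou(g, preds) for g in gts) / len(gts)

def precision_recall_f1(gts, preds, iou_threshold=0.5):
    """Instance-level scores; a match requires IoU above the threshold.

    With a threshold of at least 0.5, a prediction can match at most one
    ground-truth tree, so the number of matched ground-truth trees equals
    the number of matched predictions.
    """
    tp = sum(1 for g in gts if best_iou(g, preds) > iou_threshold)
    precision = tp / len(preds) if preds else 0.0
    recall = tp / len(gts) if gts else 0.0
    denom = precision + recall
    f1 = 2.0 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

Coverage rewards accurate boundaries even when every tree is detected, whereas precision and recall only register whether each tree was found, which is why both kinds of score are reported.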
3.8. Implementation Details
For SPA-Net, all experiments were performed using PyTorch 1.12.1 on a workstation equipped with an Intel (Santa Clara, CA, USA) i7-14700K processor, NVIDIA (Santa Clara, CA, USA) GeForce RTX 3090 GPUs (24 GB memory), and 512 GB of RAM. The system ran Ubuntu 22.04 with Python 3.9, CUDA 11.3, and cuDNN 8.2.1. Our PyTorch-based implementation utilized Minkowski Engine 0.5.4 for efficient sparse 3D convolutions. The AdamW algorithm was selected as the optimizer.
4. Results
4.1. Ablation Studies on SGP and AA Module
To understand how each part of the SGP and AA modules contributes to performance, we conducted an ablation study evaluating the components of the instance branch, namely medoid sampling (MS), centripetal shrink (CS), topological connectivity clustering (TCC), and the AA module, with the results in Table 3.
In Table 3, √ indicates that the module is enabled and × indicates that it has been removed. The full SPA-Net network (MS + CS + TCC + AA) served as the baseline, achieving 95.8% Prec, 96.3% Rec, and 92.9% Cov. The importance of the centripetal shrink (CS) within SGP was tested by comparing the full SGP (MS + CS + TCC) with a version lacking CS (MS + TCC). Removing CS substantially lowered Prec by 2.2%, Rec by 2.5%, and Cov by 4.0% (to 86.5%). This shows CS is vital for optimizing initial instance seed locations before TCC grouping.
Effective center shifting thus leads to better clustering by reducing both tree segmentation errors and incorrect point assignments between trees. In summary, these experiments confirm the benefits of both the CS component within SGP and the AA module.
4.2. Ablation on Clustering Algorithms
In this study, we perform a comprehensive ablation study of clustering algorithms, systematically comparing the performance of our approach with established clustering algorithms including MeanShift, DBSCAN, and HDBSCAN [55,56,57]. To adapt these clustering algorithms for instance segmentation tasks, all baseline methods are equipped with an offset prediction head that consists of three consecutive linear layers to estimate per-point offsets towards instance centroids.
As shown in Table 4, we found that guiding standard clustering algorithms like MeanShift or DBSCAN with predicted offsets improved results, whereas refining point coordinates using center shift prior to clustering was notably more effective.
However, our proposed SGP module, which does not rely on separate offset prediction, proved superior. SGP alone reached 95.7% Prec and 90.5% Cov, surpassing the optimized offset-guided approach using HDBSCAN by 1.7% in Prec and 1% in Cov. This indicates that the integrated grouping process of SGP is more effective than the pipeline of offset prediction followed by refinement and standard clustering for generating high-accuracy instance proposals. The final AA module further refines these SGP proposals, enabling the complete SPA-Net network to achieve an overall performance of 95.8% Prec, 96.3% Rec, and 92.9% Cov.
We also conducted a direct quantitative comparison between our SGP module and the DS module [55], using the BaiMa dataset. SGP demonstrated superior segmentation accuracy overall, surpassing DS with 1.4% higher Prec and 1.3% higher Cov. Critically, as shown in Table 5, SGP also proved significantly more efficient, executing over thirty times faster than DS.
Observations suggest that while DS can effectively cluster points for individual instances, potentially boosting Rec under favorable initial offset conditions, its strong grouping mechanism may sometimes introduce boundary imperfections. This can result in compromised Prec and Cov compared to the results achieved by the SGP methodology.
4.3. Comparison with 3D Offset
To validate that our SPA-Net architecture can effectively segment individual trees without relying on traditional 3D center offset prediction, we conducted controlled experiments. These compared four architectural variants: a baseline using learned 3D offsets, our SGP module alone, the offset baseline refined by AA, and the complete SPA-Net. The experimental results are shown in
Table 6.
We compared SGP against the learned 3D offset approach directly, without AA refinement. In this comparison, the SGP module demonstrated superior performance over the baseline 3D offset method. It achieved 1.5% higher Prec and 1.3% higher Cov, albeit with a marginal 0.4% deficit in Rec. This highlights the strength of SGP in producing more accurate initial candidates without relying on direct offset regression. Furthermore, the foundational role of the AA refinement stage was substantiated. Adding AA to the conventional 3D offset architecture substantially improved performance over the offset-only baseline, most notably increasing Cov by 2.2% and Rec by 0.7%.
Finally, the integration of SGP and AA within the full SPA-Net architecture yielded the highest performance across all metrics. Crucially, this complete SGP plus AA configuration significantly outperformed the 3D offset baseline. SPA-Net surpassed the refined offset approach by 3.6% in Cov, 1.9% in Prec, and 1.2% in Rec. This supports our claim that the SPA-Net network, using SGP for proposals followed by AA for refinement, represents a more effective paradigm than 3D offset-based approaches, even when these are refined by AA.
4.4. Comparison with the Existing Network
To compare with advanced ITS methods, we systematically evaluated SPA-Net against state-of-the-art techniques on the BaiMa Plantation Forest dataset and the Hung-Tse Lake dataset. The results confirm the effectiveness of SPA-Net. On the BaiMa dataset, our method achieved 95.8% Prec, 96.3% Rec, 92.9% Cov, and 96.0% F1.
The corresponding metrics on the Hung-Tse Lake dataset were 92.6% Prec, 94.8% Rec, 88.8% Cov, and 93.7% F1. The comparison results for individual tree segmentation on both datasets are shown in
Table 7.
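As a quick consistency check on the reported figures, the F1 values can be recomputed from Prec and Rec. The sketch below assumes F1 is the standard harmonic mean of precision and recall; the function name `f1_score` is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    return 2 * precision * recall / (precision + recall)

# (Prec, Rec, reported F1) in percent, taken from the text above.
reported = {
    "BaiMa": (95.8, 96.3, 96.0),
    "Hung-Tse Lake": (92.6, 94.8, 93.7),
}
for name, (prec, rec, f1) in reported.items():
    print(f"{name}: computed F1 = {f1_score(prec, rec):.1f}%, reported = {f1}%")
```

Under this assumption, both recomputed values agree with the reported F1 scores to one decimal place.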
Evaluation on the BaiMa dataset demonstrates that SPA-Net consistently surpasses the strongest baseline, TreeLearn, as shown in
Figure 8, across all reported metrics. Specifically, our method yielded improvements of 0.7% in Prec, 0.1% in Rec, 0.6% in Cov, and 0.4% in F1 over TreeLearn. Performance levels attained by the leading methodologies on the relatively sparse BaiMa dataset were exceptionally high, with the top five methods exceeding 93% F1.
Consistent with advancements in point cloud analysis, deep learning approaches [
58,
59,
60] significantly outperformed predominantly rule-based and algorithmic methods on the BaiMa dataset. This highlights the advantage of learned feature representations for robust segmentation, even in sparse point clouds. The segmentation results of different methods are shown in
Figure 9.
To provide a cross-domain validation of our architecture, we benchmarked SPA-Net against PointGroup, an influential method developed for large-scale indoor scenes like ScanNet v2 and S3DIS. Given that the structured geometry of these indoor environments differs significantly from that of complex outdoor forests, this experiment tests the robustness of our forestry-focused approach against a leading model from a different application domain. The results are presented in
Table 8.
SPA-Net significantly outperforms the classic PointGroup method, which reached only a 92.3% F1 score. PointGroup’s limitations are expected because it was originally designed for structured indoor scenes; challenges in forestry applications, such as irregular tree crowns and point cloud sparsity, hinder its ability to predict offset vectors, resulting in lower Prec (91.5%) and Cov (85.8%).
5. Discussion
5.1. Performance on Plantation Forest and Wetland Forest
The empirical results from the BaiMa Plantation Forest and Hung-Tse Lake TLS datasets substantiate the effectiveness of SPA-Net’s design choices across differing environmental contexts. On the BaiMa dataset, which is likely characterized by the more uniform tree structures typical of plantations, SPA-Net achieved 95.8% Prec, 96.3% Rec, 92.9% Cov, and 96.0% F1, as shown in
Table 7. This performance level, surpassing a suite of contemporary ITS methods [
25], suggests that SPA-Net’s offset-free SGP proposal module and AA module remain effective even where tree delineation might be considered relatively straightforward due to stand regularity. In contrast, on the Hung-Tse Lake dataset, which presents greater structural complexity, specific challenges of wetland species, and denser understory, SPA-Net still achieved robust results, again leading the compared methods. The consistent leading performance across these two distinct TLS environments demonstrates the adaptability of the SPA-Net network.
Despite SPA-Net’s excellent performance, its accuracy is still influenced by forest complexity. The model’s accuracy decreased when moving from the relatively uniform BaiMa plantation forest to the more complex Hung-Tse Lake wetland forest. This indicates that extreme crown overlap and stand density remain a challenge. A potential failure case, particularly for large or structurally complex trees, occurs when the Affinity Aggregation (AA) module fails to merge all fragments belonging to the same tree. This situation can arise in cases of severe occlusion or unusual tree morphology. This implies that while factors like crown morphology and stand density, known to influence segmentation accuracy, do affect absolute performance [
29], as shown in
Figure 9, architectural features of SPA-Net provide a more consistent ability to resolve individual trees compared to other state-of-the-art deep learning and traditional approaches in these specific TLS-scanned environments.
5.2. Comparison with Existing Methods
The SPA-Net network introduced in this study offers a distinct approach to individual tree segmentation (ITS) compared with other contemporary deep learning networks. Our hypothesis, which is validated by the ablation studies, was that an SGP module could achieve superior accuracy by moving beyond the direct 3D offset prediction that is common in methods like TreeLearn, ForAINet, and SegmentAnyTree [
47,
48,
49] for initial proposal generation. While TreeLearn and ForAINet have significantly advanced ITS, their reliance on accurate offset prediction can be a limiting factor in dense or complex stands [
45,
46]. The SGP module, with its novel sampling-shifting-grouping paradigm, directly generates initial candidates from point geometry. This contrasts with the aforementioned networks, which typically predict offsets for all points towards a learned instance center before clustering [
44]. The SGP mechanism of refining seed points towards emergent geometric centers within a non-linearly encoded space proved more robust for delineating tree instances in our TLS datasets. As shown in
Table 6, compared to the 3D offset method, the SGP module increased Prec by 1.5% and Cov by 1.3%, although Rec slightly decreased by 0.4%. Furthermore, the Affinity Aggregation (AA) module provides a sophisticated, learned refinement stage. Unlike the scoring and Non-Maximum Suppression (NMS) [
50] mechanisms prevalent in PointGroup-based architectures for pruning redundant proposals, the AA module learns inter-proposal affinities to actively merge fragmented segments.
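The difference between pruning and merging can be made concrete with a small sketch of affinity-driven merging. This is only an illustration under stated assumptions: the pairwise affinity scores are hypothetical inputs (in SPA-Net they are learned by the AA module), and the fixed threshold and union-find grouping are illustrative choices, not the published procedure.

```python
def merge_by_affinity(num_proposals, affinities, threshold=0.5):
    """Merge proposals whose pairwise affinity exceeds a threshold.

    affinities: dict mapping (i, j) proposal index pairs to a score in [0, 1].
    The scores, threshold, and union-find grouping are illustrative assumptions.
    Returns one group label per proposal.
    """
    parent = list(range(num_proposals))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps trees shallow
            i = parent[i]
        return i

    for (i, j), score in affinities.items():
        if score > threshold:
            ri, rj = find(i), find(j)
            if ri != rj:
                parent[max(ri, rj)] = min(ri, rj)

    return [find(i) for i in range(num_proposals)]

# Four hypothetical proposals: 0, 1, and 2 are fragments of one tree; 3 is another tree.
affinities = {(0, 1): 0.9, (1, 2): 0.8, (2, 3): 0.1, (0, 3): 0.05}
groups = merge_by_affinity(4, affinities)
print(groups)  # [0, 0, 0, 3]
```

Whereas NMS would discard lower-scoring overlapping proposals, this kind of merging keeps every fragment and assigns it to a tree, which is the behaviour the AA module is designed to learn.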
The architecture of SPA-Net follows a coarse-to-fine design with a clear division of labor. We do not view SGP and AA merely as two independent modules, but rather consider their combination as a unique instance branch. The effectiveness of this symbiotic design is quantitatively validated in the ablation study presented in
Table 3. Through a comparison with
Table 7, we demonstrate that this combined SGP + AA paradigm is more robust than traditional methods based on offset prediction, clustering, and ScoreNet/NMS (such as ForAINet and TreeLearn). Combining the SGP and AA modules decomposes the complex segmentation task into two more easily learnable subtasks: geometric proposal generation and contextual relationship aggregation. As a result, the method shows a lower risk of erroneous merging and under-segmentation when processing large trees with sparse point clouds and complex morphology.
5.3. Future Work
Our primary design and validation focus for SPA-Net was TLS point cloud data of forests. Future research could explore the adaptation of the SGP and AA principles to other 3D sensing modalities, such as dense ALS or MLS data, and further investigate the robustness of SPA-Net across an even broader range of forest types. Improving the model’s computational efficiency is another important next step. The KNN-transformer in the AA module is computationally expensive, so we plan to explore lightweight alternatives such as linear attention to reduce this load and improve scalability without a significant loss in performance. As an initial step in assessing generalization to other sensing modalities, we conducted experimental verification using an airborne laser scanning (ALS) dataset [
61,
62] acquired over a primeval mixed forest in Northeast China, specifically selecting 11 natural mixed-forest subcompartments within the Shangganling Xishui Forest Farm. Each experimental site was established as a 50 m × 50 m square standard quadrat (0.25 hectares in area), encompassing diverse stand structures and topographic characteristics.
Compared to TLS, the point density of ALS data is significantly lower, especially in the trunk and understory regions, often resulting in a severe lack of point cloud data for the tree structure, particularly at the trunk base. The top-down acquisition perspective of ALS data causes dense canopies to severely occlude the underlying trunks and branches, rendering segmentation algorithms that rely on complete tree morphology ineffective.
Offset-based methods, such as TreeLearn, typically require the prediction of an offset vector from each point to a tree’s center (e.g., the trunk base). In ALS data where trunk base points are missing, such predictions become unreliable and ill-posed, often leading to failure. In contrast, SPA-Net’s SGP module is offset-free; it aggregates seed points toward local, emergent geometric centers via centripetal shrink. This means it does not depend on a global, pre-defined target (like the trunk) but instead adapts to the locally available point cloud geometry. Building on this, the AA module further enhances adaptability to varying data quality by learning to merge fragmented instances caused by data sparsity or occlusion. Therefore, our method requires no modification for ALS data.
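The idea of moving seeds towards locally emergent centres, rather than towards a predefined target such as the trunk base, can be sketched with a flat-kernel mean-shift update. This is a simplified analogue and not the SGP module itself, which operates in a learned, non-linearly encoded space; the function `shift_seeds`, the radius, and the synthetic crown points are all assumptions made for illustration.

```python
import math

def shift_seeds(points, seeds, radius, iterations=10):
    """Repeatedly move each seed to the mean of the points within `radius`.

    Flat-kernel mean shift, used here only as an analogue of centripetal shifting.
    """
    seeds = [tuple(s) for s in seeds]
    for _ in range(iterations):
        updated = []
        for s in seeds:
            nbrs = [p for p in points if math.dist(p, s) <= radius]
            if nbrs:
                s = tuple(sum(c) / len(nbrs) for c in zip(*nbrs))
            updated.append(s)
        seeds = updated
    return seeds

# Hypothetical crown-only points for two trees (no trunk-base returns, as in ALS data).
points = [(0.0, 0.0, 10.0), (1.0, 0.0, 10.0), (0.0, 1.0, 10.0), (1.0, 1.0, 10.0),
          (6.0, 0.0, 12.0), (7.0, 0.0, 12.0), (6.0, 1.0, 12.0), (7.0, 1.0, 12.0)]
seeds = [points[0], points[3], points[5]]
shifted = shift_seeds(points, seeds, radius=2.0)
print(shifted)  # the first two seeds converge to the same crown centre
```

Because the update uses only the points that are actually present, the two seeds on the first crown converge to the same centre even though no trunk points exist, which mirrors the argument for why an offset-free scheme can tolerate missing trunk bases.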
For natural mixed forests, as shown in
Figure 10, our method maintains its performance advantage despite the increased environmental complexity.
Amidst these difficulties, as shown in
Table 9, our method achieves the best overall results among the compared methods, with a Prec of 53.1%, Rec of 92.8%, Cov of 49.7%, and F1 of 67.4%. SegmentAnyTree performed second best, reaching a slightly higher Rec of 93.0% and a Cov of 49.0%. The relative success of these deep learning models highlights their capacity to learn robust features suited for complex geometries, with SPA-Net benefiting from its SGP + AA architecture and SegmentAnyTree from its diverse training approach simulating ALS conditions. Conversely, methods like Treeiso struggled significantly with precision and coverage because they rely on trunk-base features that are occluded in ALS data. Other approaches, including Lidar360, also faced challenges with accurate delineation or data fragmentation in this complex scenario.
When directly transferred to a new scene, SPA-Net still achieves the best overall performance among the compared methods. This demonstrates the generality and potential of our method’s design. While these results highlight the generalization potential of advanced deep learning frameworks, they also reinforce that accurate individual tree segmentation from ALS data of natural mixed forests with complex, multi-layered canopies remains a significant hurdle.
We also recognize the limitations of the current research. Although SPA-Net generalizes to ALS data, its performance is lower than on TLS data, which indicates that optimization for specific data sources is still necessary.
Future research should focus on extending SPA-Net to more diverse data scenarios, such as unmanned aerial vehicle LiDAR (UAV-LiDAR) data and more complex forest ecosystems like tropical rainforests.
By training and fine-tuning on a wider variety of datasets, we hope to build a truly cross-platform and cross-regional universal single-tree segmentation model. This would provide more powerful technical support for precision forestry management and carbon stock estimation on a global scale.