Article

TreeSeg-Net: An End-to-End Instance Segmentation Network for Leaf-Off Forest Point Clouds Using Global Context and Spatial Proximity

1 College of Information Technology, Jilin Agricultural University, Changchun 130118, China
2 College of Land Science and Technology, China Agricultural University, Beijing 100193, China
3 Faculty of Agronomy, Jilin Agricultural University, Changchun 130118, China
4 Department of Biology, University of Columbia Okanagan, Kelowna, BC V1V 1V7, Canada
5 College of Forestry and Grassland Science, Jilin Agricultural University, Changchun 130118, China
* Authors to whom correspondence should be addressed.
Plants 2026, 15(4), 525; https://doi.org/10.3390/plants15040525
Submission received: 15 January 2026 / Revised: 2 February 2026 / Accepted: 5 February 2026 / Published: 7 February 2026

Abstract

Forest ecosystems play a pivotal role in maintaining the balance of the global carbon cycle and conserving biodiversity. High-density point clouds derived from unmanned aerial vehicle (UAV) structure from motion (SfM) and multi-view stereo (MVS) technologies offer a cost-effective solution for data acquisition. These technologies have become efficient tools for facilitating precision forest resource management and extracting individual tree structural parameters. However, in complex forest scenarios during the leaf-off season, canopies exhibit unstructured branch network morphologies due to the absence of leaf occlusion, and adjacent crowns are heavily interlaced. Consequently, existing segmentation methods struggle to overcome challenges associated with fuzzy boundaries and instance adhesion. To address these challenges, this study proposes TreeSeg-Net, an end-to-end instance segmentation network designed to precisely separate individual trees directly from raw point clouds. The network incorporates a global context attention module (GCAM) to capture long-range feature dependencies, thereby compensating for the limitations of sparse convolution in perceiving global information. Simultaneously, a spatial proximity weighting module (SPWM) is designed. By introducing geometric center constraints and a distance penalty mechanism, this module effectively mitigates under-segmentation issues caused by the feature similarity of adjacent branches in high-canopy-density environments. Experimental results demonstrate that TreeSeg-Net achieves an average precision (AP) of 97.2% in instance segmentation tasks and a mean intersection over union (mIoU) of 99.7% in semantic segmentation tasks. Compared to mainstream networks, the proposed method exhibits superior segmentation accuracy, providing an efficient and automated technical solution for precise resource inventory in complex forest environments.

1. Introduction

Forest ecosystems, as core components of the Earth’s biosphere, play an indispensable role in maintaining ecological balance, mitigating climate change, conserving biodiversity, and ensuring sustainable forest management [1,2,3]. To achieve precision forest resource management and ecological assessment, the extraction of structural parameters at the individual tree scale is a central task in forest inventory [4]. Within this context, high-precision individual tree segmentation (ITS) serves as a critical step, providing foundational data support for extracting key phenotypic parameters such as tree height, diameter at breast height (DBH), and biomass [5,6]. UAV-based SfM and MVS technologies, offering a cost-effective solution, enable the reconstruction of high-density forest point clouds through overlapping imagery [7,8]. However, in complex forest scenarios, stably and accurately isolating individual tree instances from these point clouds remains one of the primary challenges in current 3D forest analysis [9,10].
Addressing the ITS task, early works primarily relied on canopy height models (CHMs) and 2D image processing techniques, utilizing algorithms such as local maxima detection [11], watershed segmentation [12], and graph cuts [13] to identify and delineate individual tree crowns [14]. While these methods exhibit high efficiency in regular stands, their segmentation accuracy is often limited in areas with homogeneous tree heights or severe canopy interlacing, as 2D projection inevitably compromises detailed 3D structural information [15,16,17]. To mitigate information loss, an increasing number of studies have shifted towards performing segmentation directly on 3D point clouds, identifying individual trees through clustering, region growing, or geometric feature-based strategies [18,19,20]. Nevertheless, in real-world forest scenarios characterized by significant morphological variations and mutual canopy occlusion, these methods—reliant on explicit rules and geometric assumptions—still struggle to maintain robust performance across diverse stand conditions.
In recent years, driven by the rapid advancements of deep learning in computer vision, researchers have increasingly applied these techniques to 3D point cloud segmentation tasks [21,22,23]. Distinct from traditional rule-based or geometric feature-dependent methods, deep neural networks can automatically learn complex shape and texture features directly from raw point clouds, thereby exhibiting superior adaptability to forest environments characterized by diverse canopy morphologies and severe occlusion [24]. Deep learning-based approaches for ITS generally fall into two categories: proposal-based frameworks and proposal-free approaches. Proposal-based methods typically follow a detect-then-segment paradigm, locating potential tree regions before predicting instance masks. By explicitly modeling potential instance regions, these methods achieve precise object localization and boundary cropping in complex scenes. Representative works include SGPN [25], 3D-MPA [26], 3D-SIS [27], and PointRCNN [28], which follows a two-stage detection pipeline. Conversely, proposal-free methods circumvent the candidate generation process. They usually rely on point-level semantic prediction combined with clustering or center regression in a feature embedding space to directly recover instance structures. These approaches offer advantages in computational efficiency and scalability, as seen in PointGroup [29], SoftGroup [30], and CPSeg [31].
Furthermore, several studies have incorporated forestry priors or geometric constraints, integrating deep learning with physically meaningful grouping strategies. For instance, Xia et al. [32] combined RandLA-Net with MeanShift to extract individual trees from point clouds. Henrich et al. [33], in their TreeLearn framework, achieved instance separation in an offset space by predicting offset vectors pointing to the trunk base. Addressing mobile laser scanning (MLS) data, Jiang et al. [34] developed a segmentation strategy combining cylindrical convolution with dynamic shifting to resolve segmentation challenges caused by canopy overlap and occlusion in complex urban scenarios. Additionally, while Sun et al. [35] proposed a sparse 3D U-Net, Huo et al. [3] improved upon it by introducing multi-head attention mechanisms. By enhancing global and multi-scale feature capture through subspace projection, this method effectively achieved ITS on smartphone-acquired point clouds. Collectively, these studies demonstrate that combining point-level predictions based on deep features with spatial clustering holds significant promise for processing irregular and structurally complex forest point clouds.
Despite the significant progress of existing deep learning methods in point cloud processing, two core challenges persist. First, the inherent locality of sparse convolution limits the network’s ability to comprehend the global structure of an entire tree. Second, clustering or mask generation mechanisms based solely on feature similarity fail to effectively distinguish adjacent branches that are spatially entangled but indistinguishable in feature space. To address these issues, this study posits two research hypotheses: (1) incorporating long-range dependency modeling can compensate for the limited receptive field of sparse convolution, and (2) introducing explicit geometric constraints can effectively differentiate spatially entangled instances. Driven by these hypotheses, the primary aim of this study is to develop TreeSeg-Net, an end-to-end instance segmentation framework integrating global perception and geometric constraints. Obviating the need for complex post-processing steps or manual intervention, the model takes raw forest point clouds as input and automatically outputs ITS results. Comprehensive experiments were conducted on SfM point cloud datasets generated by consumer-grade UAVs to validate the method’s effectiveness under complex stand conditions. To achieve this objective, the specific contributions are summarized as follows:
  • A novel end-to-end instance segmentation network, TreeSeg-Net, is proposed for complex forest scenarios. The network integrates an improved sparse 3D U-Net with a transformer decoder.
  • A GCAM and an SPWM are designed. The GCAM captures long-range feature dependencies, compensating for the limitations of sparse convolution in global information perception. The SPWM introduces geometric center constraints and a distance penalty mechanism to address the boundary fuzziness and instance adhesion caused by feature similarity among adjacent canopies in high-density environments.
  • Through comparative analysis with various mainstream point cloud segmentation networks and ablation studies, the proposed method proves to be highly effective, providing an efficient and economical technical solution for forest resource inventory.

2. Materials and Methods

2.1. Study Area and Data Acquisition

The experiment was conducted in October 2024 in Tongzhou District, Beijing, China (39.82° N, 116.87° E). The geographical location of the study site is illustrated in Figure 1A. The experimental area covers a total extent of approximately 4.36 hectares and features flat and open terrain with an average elevation of approximately 20 m. The region belongs to a warm temperate continental semi-humid monsoon climate zone, characterized by distinct seasons, an annual average temperature of 12.7 °C, and annual average precipitation of 445.6 mm. The stand type is an artificial plantation, with Populus tomentosa Carr. as the dominant species. As shown in Figure 1B, the trees are arranged relatively neatly. Since data acquisition occurred during the early leaf-off season, the canopy had shed its leaves, presenting a complex, unstructured branch network morphology (Figure 1B). Figure 1C provides a side view of the forest point cloud, showcasing the vertical structure of the stand. Additionally, the acquired point cloud exhibits a high average density of 1339 points/m², providing rich geometric details for individual tree segmentation.
To acquire high-quality forest point clouds containing fine trunk textures and complete branching structures, a DJI Mavic 3M (DJI, Shenzhen, China) UAV platform was employed for image data collection. A cross-circle oblique (CCO) flight path (Figure 2) was adopted to capture canopy images [36]. The flight altitude of the CCO path was set to 47 m above the canopy top, with a flight speed maintained at 5 m/s. The gimbal pitch angle was fixed at 45°. This flight plan consisted of multiple intersecting circular paths, with 35 waypoints set for each circle and an overlap rate of 50% between adjacent circles, resulting in a total of 537 multi-view raw images. Simultaneously, a standard nadir flight path was used to obtain orthophotos of the canopy. The nadir flight altitude was set to 80 m, with the speed also maintained at 5 m/s. The forward and side overlap rates were set to 80% and 70%, respectively, yielding 180 images. All flight missions were executed under meteorological conditions with uniform lighting and low wind speeds to ensure image clarity.

2.2. Data Preprocessing

Mosaicking of raw images and generation of 3D point clouds were performed using Agisoft Metashape Professional (Agisoft LLC, St. Petersburg, Russia, version 2.1.0). The software operates based on the standard SfM-MVS pipeline: first, the SfM algorithm estimates internal and external camera parameters to construct a sparse point cloud; subsequently, the MVS algorithm further refines the model to generate a dense 3D representation. To ensure point cloud density and geometric accuracy, parameters for both photo alignment and dense cloud generation were set to “High Quality”.
To construct a standardized dataset suitable for TreeSeg-Net, the raw point clouds underwent a series of normalization procedures. First, the statistical outlier removal (SOR) algorithm was employed to eliminate discrete noise generated during the reconstruction process. Subsequently, the cloth simulation filter (CSF) algorithm was utilized to classify the point cloud into ground and vegetation categories (Figure 3A). To address the inherent class imbalance where ground points significantly outnumbered tree points, a random downsampling strategy was applied to the ground category to achieve a balanced distribution. Afterwards, height normalization was performed on the processed point cloud based on the extracted digital elevation model (DEM) to mitigate the interference of terrain undulation on tree height features. Finally, given the massive data volume of the entire plot, the normalized point cloud was partitioned into blocks of 10 m by 10 m using a sliding window strategy with overlap (Figure 3B). This approach was adopted to mitigate boundary effects and increase data diversity, resulting in a total of 146 blocks, where each block contained approximately four to five trees. The larger sub-region, consisting of 111 blocks, was randomly split into training and validation sets at an approximate ratio of 8 to 2. Meanwhile, the smaller sub-region of 35 blocks served exclusively as the independent test set to evaluate model generalization.
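The overlapping-block partitioning step can be sketched as follows. This is a minimal NumPy version: `block_size` matches the stated 10 m blocks, while the `stride` value (which controls the overlap between adjacent blocks) is an illustrative assumption, since the paper does not report it.

```python
import numpy as np

def partition_blocks(points, block_size=10.0, stride=8.0):
    """Partition a height-normalized point cloud (N, 3) into overlapping
    XY blocks via a sliding window. stride < block_size yields overlap;
    the stride value here is illustrative, not from the paper."""
    xy_min = points[:, :2].min(axis=0)
    xy_max = points[:, :2].max(axis=0)
    blocks = []
    x0 = xy_min[0]
    while x0 < xy_max[0]:
        y0 = xy_min[1]
        while y0 < xy_max[1]:
            mask = (
                (points[:, 0] >= x0) & (points[:, 0] < x0 + block_size) &
                (points[:, 1] >= y0) & (points[:, 1] < y0 + block_size)
            )
            if mask.sum() > 0:
                blocks.append(points[mask])
            y0 += stride
        x0 += stride
    return blocks
```

Because the stride is smaller than the block size, points near block borders appear in more than one block, which mitigates boundary effects and augments the training data.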

2.3. Overall Architecture

In leaf-off forest scenarios, severely interlaced branches and highly similar geometric features present significant segmentation challenges. Addressing these, existing clustering-based methods often struggle to delineate clear boundaries within complex skeletal structures. Therefore, drawing inspiration from Organ3DNet [37], this study designed TreeSeg-Net. The network follows the mask prediction paradigm, discarding the indirect approach of mapping point clouds to a feature space for clustering. Instead, it directly learns and predicts binary masks representing individual tree instances in an end-to-end manner. As shown in Figure 4, the main architecture of TreeSeg-Net comprises two core modules: the sparse convolutional feature network and the transformer decoder. The network takes raw point cloud coordinates P as input and, through feature encoding and decoding interactions, synchronously outputs point-level semantic classes Po and instance masks M.
The sparse convolutional feature network primarily functions as the encoder. Addressing the characteristics of extremely high sparsity and non-uniform distribution inherent in forest point cloud data, this module adopts a sparse convolutional backbone based on the U-Net structure. As shown in Figure 4A, this network extracts multi-scale features F0 to F4, ranging from local geometric textures to high-level semantics, through multi-level downsampling and upsampling operations. However, standard sparse convolution focuses mainly on local neighborhood information, making it difficult to capture the macrostructure of large-scale forest plots. To resolve this limitation, TreeSeg-Net introduces a GCAM at the end of the backbone network. The GCAM establishes long-range feature dependencies and fuses global context information into local features, thereby enhancing the encoder’s feature representation capabilities in environments with complex occlusion.
The Transformer decoder is responsible for parsing the deep features extracted by the encoder into specific individual tree instances. This module first initializes a set of learnable instance queries, which are then progressively optimized through a multi-level cascading approach. As illustrated in Figure 4B, the decoding process alternates between a query refinement module (QRM) and the SPWM. The QRM utilizes self-attention and cross-attention mechanisms to facilitate information interaction among queries and between queries and pixel-level features. To address the insufficient distinctiveness of original mask modules when processing overlapping canopies, this study designed the SPWM. Unlike traditional methods that rely solely on feature similarity, the SPWM introduces explicit geometric spatial constraints during the decoding phase. This forces the network to balance feature consistency with spatial proximity, effectively resolving the issue of branch entanglement between adjacent trees.

2.4. Sparse Convolutional Feature Network

Point cloud data in forest scenarios are typically massive, unordered, and highly sparse in 3D space [38]. Directly employing traditional voxelized 3D convolutional networks would result in a significant waste of computational resources by processing many empty voxels. Furthermore, such an approach would fail to meet the real-time requirements of high-throughput phenotypic analysis. To balance computational efficiency with feature extraction accuracy, this study adopts a sparse 3D convolutional network built on the Minkowski Engine as the backbone architecture. This backbone follows the classic U-Net design paradigm, achieving deep abstraction and fusion of multi-scale features through an encoder–decoder structure.
Before entering the network, the raw point cloud coordinates P are voxelized and quantized into a sparse tensor. The encoder stage consists of a series of stacked sparse convolutional layers and pooling operations. As the network depth increases, the spatial resolution of the feature maps gradually decreases, while the channel dimension increases, transforming low-level geometric textures into abstract high-level semantic features. The decoder stage progressively restores spatial resolution via transposed convolution and fuses high-resolution features from the corresponding encoder levels into the decoding path through skip connections. This design effectively mitigates the loss of spatial information during downsampling, enabling the network to output hierarchical features at five scales, from F0 to F4 (Figure 4A). These multi-scale features not only preserve the details necessary for describing trunks and branches but also contain semantic information to distinguish different vegetation levels.
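The quantization step that precedes sparse tensor construction can be sketched as follows. This is a minimal NumPy stand-in (the actual pipeline uses the Minkowski Engine's quantization); the 0.15 voxel size matches the training setting reported in Section 3.1.

```python
import numpy as np

def voxelize(points, voxel_size=0.15):
    """Quantize raw coordinates (N, 3) to integer voxel indices and
    collapse points that fall into the same voxel. Returns the unique
    occupied voxel indices and, for each input point, the index of the
    voxel it maps to."""
    coords = np.floor(points / voxel_size).astype(np.int64)
    uniq, inverse = np.unique(coords, axis=0, return_inverse=True)
    return uniq, inverse
```

Only occupied voxels are stored, which is what allows sparse convolution to skip the overwhelmingly empty space in forest scenes.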

Global Context Attention Module (GCAM)

Although sparse convolutional networks perform well in processing large-scale point clouds, the receptive field of their convolution kernels is inherently restricted to local neighborhoods. In forest plots with high canopy density, individual tree crowns often occupy large spatial ranges, and there is severe branch interlacing between adjacent trees. Relying solely on local features makes it difficult for the network to perceive the structure of the entire tree at a macroscopic level. This often leads to the erroneous fragmentation of a single tree into multiple instances or the confusion of adhered parts of adjacent trees. To overcome this limitation, TreeSeg-Net introduces a GCAM at the neck of the backbone network. The core objective of the GCAM is to establish long-range feature dependencies, enabling the network to perceive the global distribution patterns of the entire plot before aggregating local features. As shown in Figure 5C, the GCAM is a lightweight and efficient attention branch.
Specifically, for a given input feature tensor $F_{in} \in \mathbb{R}^{N \times C}$, where N is the number of voxels and C is the number of channels, the module first compresses the information from the spatial dimension into a channel descriptor $z \in \mathbb{R}^{1 \times C}$ via global average pooling. For the c-th channel, its global statistic $z_c$ is calculated as follows:
$$z_c = \frac{1}{N} \sum_{i=1}^{N} F_{in}^{c}(i)$$
where $F_{in}^{c}(i)$ represents the response value of the input feature at the c-th channel and the i-th voxel. Subsequently, to capture non-linear interactions between channels, a gating mechanism is employed to adaptively generate a channel weight vector $s \in \mathbb{R}^{C}$ and recalibrate the original features:
$$F_{out} = \sigma\left(W_2\,\delta\left(W_1 z\right)\right) \times F_{in}$$
where δ denotes the ReLU activation function, σ denotes the Sigmoid activation function, and W1 and W2 are the learnable weight parameters of the MLP. Through this mechanism, the GCAM explicitly models the correlations between channels. It adaptively enhances those feature channels that possess critical discriminative power for distinguishing individual tree instances—such as the texture features at the junction of trunks and crowns—while simultaneously suppressing environmental noise.
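A minimal NumPy sketch of this squeeze-and-gate recalibration follows. The reduced hidden width of the gating MLP (the shapes of W1 and W2) is an assumption, since the paper does not report a reduction ratio.

```python
import numpy as np

def gcam(F_in, W1, W2):
    """Sketch of the GCAM recalibration described above: global average
    pooling over voxels, a two-layer gating MLP (ReLU then Sigmoid),
    and channel-wise rescaling.
    F_in: (N, C) voxel features; W1: (C_r, C); W2: (C, C_r)."""
    z = F_in.mean(axis=0)                  # (C,) global statistic z_c
    h = np.maximum(W1 @ z, 0.0)            # delta: ReLU bottleneck
    s = 1.0 / (1.0 + np.exp(-(W2 @ h)))    # sigma: Sigmoid gate in (0, 1)
    return F_in * s                        # recalibrated features
```

Because the gate s is shared across all voxels of a channel, the recalibration injects a plot-level statistic into every local feature at negligible cost.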

2.5. Transformer Decoder

Following the extraction and enhancement of global context features by the sparse convolutional backbone, the Transformer decoder parses these high-dimensional features into specific individual tree instances. As illustrated in Figure 4B, the decoder of TreeSeg-Net adopts a mask classification paradigm based on set prediction. Unlike traditional pipelines that perform point-wise classification followed by clustering, this module utilizes a set of learnable instance queries to represent potential tree objects. During inference, these queries function as soft anchors to probe individual tree features within the scene. The decoder is composed of multiple cascaded decoding layers, employing an iterative optimization strategy to progressively refine segmentation accuracy. In each layer, instance queries first enter the QRM, where they aggregate multi-scale features and contextual information via attention mechanisms to optimize their feature representation. Subsequently, the updated queries are fed into the SPWM to generate instance mask predictions and classification probabilities for the current level. This design enables the network to infer both the category semantics and geometric shapes of all trees in the scene in parallel. Operating within an end-to-end framework, it eliminates the need for complex post-processing steps.

2.5.1. Query Refinement Module (QRM)

The QRM leverages attention mechanisms to aggregate context information extracted by the backbone into instance queries, thereby updating their feature representations (Figure 5B). This module accepts initialized queries or output vectors from the preceding layer as input and progressively optimizes the query state through three internal sub-layers. First, a self-attention mechanism establishes communication channels among instance queries. This process builds global dependencies between queries, prompting different queries to differentiate and focus on distinct individual tree targets, thereby suppressing redundant predictions where multiple queries respond to the same target at the feature level. Subsequently, a masked cross-attention mechanism drives the interaction between queries and the voxel features output by the backbone. Distinct from global full-attention mechanisms, this step employs the binary mask predicted in the preceding hierarchy as an attention bias, restricting the queries to aggregate features solely from their corresponding foreground regions. This spatial constraint strategy filters out background noise and interference from neighboring trees, ensuring that query vectors focus on the local fine-grained textures of the target canopy. Finally, a feed-forward network (FFN) performs non-linear transformations on the aggregated features to complete the state update for the current layer.
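The masked cross-attention step can be sketched as follows. This is a single-head NumPy illustration; the handling of an all-empty mask is simplified (practical implementations typically fall back to full attention in that case).

```python
import numpy as np

def masked_cross_attention(Q, K, V, mask):
    """Sketch of the QRM's masked cross-attention: attention logits are
    computed between instance queries and voxel features, and positions
    outside a query's previously predicted binary mask are suppressed so
    each query aggregates features only from its foreground region.
    Q: (K_q, D) queries; K, V: (N, D) voxel features; mask: (K_q, N) bool."""
    logits = Q @ K.T / np.sqrt(Q.shape[1])
    logits = np.where(mask, logits, -1e9)   # attention bias from the mask
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    w = w / w.sum(axis=1, keepdims=True)    # softmax over voxels
    return w @ V
```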

2.5.2. Spatial Proximity-Weighted Module (SPWM)

Transforming high-dimensional query vectors into precise 3D spatial masks constitutes the final step of individual tree segmentation. Existing methods, such as Organ3DNet, typically rely solely on the dot-product similarity between queries and features to generate masks. However, in forests with high canopy density, the crowns of adjacent trees are interlaced, and the local geometric textures of branches and leaves are highly similar. Consequently, relying exclusively on feature similarity makes it difficult to delineate clear boundaries in overlapping regions. To address this, this study designs the SPWM (Figure 5A), which introduces an explicit geometric center constraint on top of semantic matching.
The core architecture of the SPWM comprises dual-stream prediction paths. In addition to the conventional semantic mask generation branch, we introduce an additional lightweight center regression head. This branch consists of a two-layer MLP designed to map high-dimensional instance queries into 3D physical space. For the k-th instance query and the i-th point in space, their joint score Sk,i is defined as a weighted combination of a semantic consistency term and a geometric penalty term:
$$S_{k,i} = Q_k \times F_i^{T} - \lambda \times D(P_i, C_k)$$
where Qk denotes the k-th instance query, Fi represents the feature vector of the i-th point, and λ is a balancing coefficient used to control the intensity of the geometric constraint. D is a distance penalty function utilized to suppress outliers far from the instance center. To eliminate scale differences across different forest plots, the normalized instance center Ck is first predicted, and its normalized Euclidean distance to the point cloud coordinate Pi is calculated:
$$D(P_i, C_k) = \left\| \frac{P_i - P_{min}}{P_{max} - P_{min}} - C_k \right\|_2^2$$
where Pmin and Pmax represent the minimum and maximum values of the input point cloud across three dimensions, respectively. $C_k \in [0,1]^3$ is the geometric centroid predicted from the query vector via the regression branch, constrained by a Sigmoid function to ensure numerical stability. It is worth noting that this geometric constraint exhibits a hierarchical, iterative nature. As the decoder layers deepen, the query vector Qk gradually incorporates richer contextual information. Consequently, its predicted center Ck progressively approaches the true physical centroid of the tree from an initial random position, thereby enabling the geometric penalty term D to generate a more precise spatial attenuation field. This mechanism forces the network to prioritize retaining spatially proximal points while maintaining semantic consistency, effectively severing branch adhesions between adjacent trees and achieving precise separation at the instance level.
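The joint scoring above can be sketched as follows. The value of the balancing coefficient λ is an illustrative assumption (the paper does not report it), and the cloud is assumed to span a non-zero extent in every dimension.

```python
import numpy as np

def spwm_scores(Q, F, C, P, lam=1.0):
    """Sketch of the SPWM joint score: semantic dot-product similarity
    minus a lambda-weighted squared distance penalty between min-max
    normalized point coordinates and the predicted instance centers.
    Q: (K, D) queries; F: (N, D) point features; C: (K, 3) centers in
    [0, 1]^3; P: (N, 3) raw coordinates. Returns (K, N) scores."""
    p_min, p_max = P.min(axis=0), P.max(axis=0)
    P_norm = (P - p_min) / (p_max - p_min)   # normalize to [0, 1]^3
    sem = Q @ F.T                            # (K, N) semantic similarity
    # squared Euclidean distance of each point to each predicted center
    d = ((P_norm[None, :, :] - C[:, None, :]) ** 2).sum(axis=-1)
    return sem - lam * d
```

Thresholding or taking the per-point argmax over these scores yields the binary instance masks; the distance term is what down-weights spatially distant points whose branch features mimic the target tree.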

2.6. Statistical Analysis

To evaluate the segmentation performance of TreeSeg-Net on forest point clouds, this study stratified the assessment process into two dimensions: semantic segmentation and instance segmentation. All metrics were derived from a point-wise comparison between the predicted results and the manually annotated ground truth. In terms of semantic segmentation, precision, recall, F1-score, and IoU were employed to assess the model’s accuracy in distinguishing tree foregrounds from background noise. Specifically, the F1-score comprehensively reflects classification robustness, while IoU measures the degree of geometric spatial overlap between the predicted and ground truth point sets. The formulas for these metrics are as follows:
$$Prec = \frac{TP}{TP + FP}$$
$$Rec = \frac{TP}{TP + FN}$$
$$F1 = \frac{2 \times Prec \times Rec}{Prec + Rec}$$
$$IoU = \frac{TP}{TP + FP + FN}$$
where TP, FP, and FN represent the number of correctly identified, false positive, and false negative tree points, respectively. To evaluate the model’s capability in isolating individual trees from complex forest environments, this study utilized average precision (AP), mean precision (mPrec), mean recall (mRec), mean coverage (mCov), and mean weighted coverage (mWCov). Average precision serves as the primary metric for instance segmentation. It is defined as the mean AP averaged over IoU thresholds from 0.50 to 0.95. Additionally, the specific metrics AP50 and AP25 were calculated at IoU thresholds of 0.50 and 0.25, respectively. These metrics first calculate independent scores for each semantic category (e.g., ground, trees) and subsequently take their arithmetic mean to perform a global evaluation. Notably, mCov and mWCov intuitively quantify the extent to which predicted instances restore the true tree crown shapes. The definitions of these metrics are as follows:
$$AP = \int_{0}^{1} Prec(Rec)\, dRec$$
$$mPrec = \frac{1}{C} \sum_{i=1}^{C} Prec_i$$
$$mRec = \frac{1}{C} \sum_{i=1}^{C} Rec_i$$
$$mCov = \frac{1}{M} \sum_{k=1}^{M} \max_{j} IoU(G_k, P_j)$$
$$mWCov = \sum_{k=1}^{M} w_k \max_{j} IoU(G_k, P_j)$$
where C is the number of semantic categories, and Preci and Reci denote the precision and recall of the i-th class, respectively. M represents the total number of ground truth trees in the plot, while Gk and Pj denote the point sets of the k-th ground truth tree and the j-th predicted instance, respectively. A detection is considered correct when the IoU between the predicted instance and the ground truth exceeds 0.5. wk indicates the proportion of points in the k-th tree relative to the total number of tree points in the entire plot.
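The semantic and coverage metrics above can be computed as sketched below, with ground-truth and predicted instances represented as sets of point indices.

```python
def semantic_metrics(tp, fp, fn):
    """Point-wise semantic metrics: precision, recall, F1, and IoU
    from true positive, false positive, and false negative counts."""
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    f1 = 2 * prec * rec / (prec + rec)
    iou = tp / (tp + fp + fn)
    return prec, rec, f1, iou

def coverage(gt_instances, pred_instances, weighted=False):
    """mCov / mWCov: for each ground-truth tree, take the best IoU over
    all predicted instances; average uniformly (mCov) or weight each
    tree by its share of the plot's tree points (mWCov). Instances are
    Python sets of point indices."""
    total = sum(len(g) for g in gt_instances)
    score = 0.0
    for g in gt_instances:
        best = max(len(g & p) / len(g | p) for p in pred_instances)
        if weighted:
            score += (len(g) / total) * best   # w_k * best IoU
        else:
            score += best / len(gt_instances)  # uniform mean
    return score
```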

3. Results

3.1. Platform Configuration and Network Structure

The model proposed in this study was implemented in the PyTorch 2.2.0 deep learning framework and trained and tested on the Ubuntu 22.04 operating system. The experimental hardware platform consisted of two Intel Xeon Gold 6246R CPUs (@3.40 GHz) and one NVIDIA Quadro RTX 8000 GPU with 48 GB of VRAM, equipped with 128 GB of system memory. The software environment included Python 3.10.9, CUDA 11.8, and cuDNN 8.7.0 to accelerate computations. During the training phase, the AdamW optimizer was employed with an initial learning rate set to 2 × 10−4, adjusted using the OneCycleLR policy. The model training was conducted for 300 epochs with a batch size of 10. The voxelization size for the input point cloud was set to 0.15. Table 1 details the network configuration of TreeSeg-Net, covering the specific kernel size, stride, and channel dimensions for each stage.
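A simplified sketch of the one-cycle learning-rate schedule named above follows. PyTorch's `OneCycleLR` is more configurable; the `pct_start` warm-up fraction and the initial-LR divisor `div` used here are PyTorch defaults, not values reported in the paper, and the final annealing target is simplified to zero.

```python
import math

def one_cycle_lr(step, total_steps, max_lr=2e-4, pct_start=0.3, div=25.0):
    """Cosine-annealed one-cycle schedule: warm up from max_lr/div to
    max_lr over the first pct_start fraction of steps, then anneal back
    down for the remainder. max_lr matches the stated initial learning
    rate of 2e-4."""
    warm = pct_start * total_steps
    start_lr = max_lr / div
    if step < warm:                       # warm-up phase
        t = step / warm
        return start_lr + (max_lr - start_lr) * (1 - math.cos(math.pi * t)) / 2
    t = (step - warm) / (total_steps - warm)
    return max_lr * (1 + math.cos(math.pi * t)) / 2   # annealing phase
```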

3.2. Semantic Segmentation Results

Table 2 presents the semantic segmentation results of TreeSeg-Net on the test set. The results indicate that TreeSeg-Net achieved robust accuracy across both semantic categories, reaching an mIoU of 99.70% and an average F1-score of 99.85%. Specifically, the extraction of vegetation points demonstrated extremely high precision, with the IoU for the tree category reaching 99.55% and a recall rate as high as 99.86%. These metrics suggest that the network effectively separates complex unstructured branch networks from the terrain background, providing high-quality input data for subsequent instance segmentation tasks.
Figure 6 illustrates the comparison between the ground truth (GT) and the predicted results. As observed in the figure, manual annotations contain errors at the interface between tree roots and the ground, where parts of the tree trunk bases were incorrectly classified as ground points. This inaccuracy arises from the difficulty of achieving perfectly precise manual delineation on undulating terrain. In contrast, the predictions of TreeSeg-Net corrected these errors, completely preserving the trunk roots within the tree category. This demonstrates that the model learned the morphological features of trees rather than overfitting the errors present in the labels. Consequently, the segmentation details at the tree bases were actually more accurate than the manually annotated GT.

3.3. Instance Segmentation Results

Table 3 presents the results of TreeSeg-Net on the instance segmentation task. Compared to the baseline, TreeSeg-Net demonstrated significant performance improvements in the extraction of tree instances. The baseline achieved an AP of 0.825 for the tree category, whereas TreeSeg-Net elevated this metric to 0.972. Regarding coverage metrics, the mWCov of our model reached 0.988, indicating that the predicted instance masks exhibit a high degree of overlap with the real trees in terms of weighted volume. These data reflect that the model can effectively perform individual tree separation when processing unstructured forest scenes.
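The coverage metrics average, over the ground-truth instances, the best IoU achieved by any predicted instance, with mWCov weighting each instance by its point count. A minimal sketch on toy point-index sets (illustrative, not the study's evaluation code):

```python
def coverage(gt_instances, pred_instances):
    """Compute mCov and mWCov for instance segmentation.

    gt_instances / pred_instances: lists of point-index sets. For each
    ground-truth instance, take the best IoU against any prediction;
    mCov averages these, while mWCov weights each by the ground-truth
    instance's point count, so large trees dominate the score.
    """
    best_ious, sizes = [], []
    for g in gt_instances:
        best = max((len(g & p) / len(g | p) for p in pred_instances),
                   default=0.0)
        best_ious.append(best)
        sizes.append(len(g))
    mcov = sum(best_ious) / len(best_ious)
    mwcov = sum(b * s for b, s in zip(best_ious, sizes)) / sum(sizes)
    return mcov, mwcov

# Toy example: one perfect match, one partial match.
mcov, mwcov = coverage([{1, 2, 3, 4}, {5, 6}], [{1, 2, 3, 4}, {5, 7}])
```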
Figure 7 visualizes the instance segmentation results of TreeSeg-Net, where distinct colors denote separate individual tree instances distinguished by the model. From a holistic perspective, TreeSeg-Net accurately delineated closely adjacent trees, exhibiting no significant signs of under-segmentation or over-segmentation. A comparison of the local region on the right side of the figure reveals that the manual GT annotations exhibited fragmentation when processing fine branches, erroneously labeling branches of the same tree as two discontinuous parts. In contrast, leveraging the spatial continuity of the point cloud, TreeSeg-Net correctly merged these fragmented branches into the main trunk instance, thereby preserving the structural integrity of the individual tree. This result suggests that the model, to a certain extent, mitigated the issue of instance discontinuity present in manual annotations. Although such corrections deviate from the annotated GT, the quantitative metrics remained high because the overall overlap between predictions and annotations was exceptionally large.

3.4. Ablation Studies

To validate the effectiveness of the core modules within TreeSeg-Net and their contributions to the overall model performance, ablation experiments were conducted on the test set. The results are presented in Table 4. This section primarily analyzes the specific performance of GCAM and SPWM in the instance segmentation task.
The baseline model comprises solely the sparse convolutional backbone and a fundamental Transformer decoder. As indicated by the data in the table, the baseline model already exhibited high accuracy in semantic segmentation, achieving an mIoU of 99.50%. This is attributed to the robust feature extraction capabilities of sparse convolution when processing point cloud data. However, in the more challenging instance segmentation task, its AP and precision were 82.50% and 68.50%, respectively, indicating that relying on the basic architecture leaves room for improvement when handling complex forest scenes.
Upon introducing GCAM to the baseline model, all instance segmentation metrics showed varying degrees of improvement. The AP increased from 82.50% to 86.30%, and AP50 rose to 87.60%. This result suggests that by integrating long-range contextual dependencies, GCAM enhanced the model’s feature perception capabilities for large-scale forest scenes, thereby strengthening the decoder’s discrimination of instances. Similarly, the independent introduction of SPWM brought significant performance gains. Compared to the baseline, AP increased to 89.30%, with mCov and mWCov reaching 95.10% and 95.70%, respectively. This demonstrates that the geometric center constraint and distance penalty mechanism introduced in SPWM effectively suppressed outlier noise, rendering the predicted instance masks spatially more compact and continuous.
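Per the module description, GCAM recalibrates feature channels via global average pooling and a small MLP. A minimal, framework-free sketch of this squeeze-and-gate pattern (the weight shapes and ReLU/sigmoid activations are illustrative assumptions, not the exact GCAM design):

```python
import math

def gcam_recalibrate(features, w1, w2):
    """Sketch of channel recalibration in the spirit of GCAM.

    features: N points x C channels (lists of lists). Global average
    pooling squeezes the whole scene into one C-vector; a tiny
    two-layer MLP (w1: C x H with ReLU, w2: H x C with sigmoid)
    produces per-channel gates in (0, 1); every point's features are
    rescaled by those gates, injecting global context into local
    descriptors.
    """
    n, c = len(features), len(features[0])
    h = len(w1[0])
    pooled = [sum(f[j] for f in features) / n for j in range(c)]
    hidden = [max(0.0, sum(pooled[j] * w1[j][k] for j in range(c)))
              for k in range(h)]
    gates = [1.0 / (1.0 + math.exp(-sum(hidden[k] * w2[k][j]
                                        for k in range(h))))
             for j in range(c)]
    return [[f[j] * gates[j] for j in range(c)] for f in features]
```

Because the gates are computed from a scene-wide pool, every voxel's features are modulated by information far outside the sparse convolution's local receptive field.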
The TreeSeg-Net model, which integrates both GCAM and SPWM, achieved the optimal experimental results. Compared to the baseline model, the AP of TreeSeg-Net rose to 97.20%, instance precision significantly improved from 68.50% to 88.80%, and recall reached 99.20%. This substantial performance boost demonstrates a favorable synergistic effect between GCAM and SPWM in terms of feature enhancement and geometric constraints. Furthermore, it is worth noting that the fluctuations in semantic segmentation metrics across all variants were minimal, indicating that the model maintained robustness in semantic category recognition while enhancing instance segmentation capabilities.

3.5. Comparison with Other Networks

To evaluate the efficacy of TreeSeg-Net in complex forest scenes, comparative experiments were conducted on the same dataset against four mainstream point cloud segmentation networks: PointGroup [29], SoftGroup [30], OneFormer3D [39], and Organ3DNet [37]. All comparative models were trained based on the official source code under recommended parameter settings and terminated upon the full convergence of the training loss curves. Table 5 lists the quantitative evaluation results for each model on semantic and instance segmentation tasks.
In the semantic segmentation task, OneFormer3D achieved the highest mIoU of 99.94%, indicating the advantage of Transformer-based architectures in point cloud semantic feature extraction. TreeSeg-Net achieved an mIoU of 99.70%, validating the effectiveness of the sparse convolutional backbone combined with the GCAM module in extracting complex branch features of deciduous forests. In contrast, while PointGroup achieved an instance segmentation AP of 91.87%, its semantic segmentation mIoU was only 87.20%, lower than the method proposed in this study.
The results in Table 5 indicate that TreeSeg-Net outperformed the other four networks across the vast majority of evaluation metrics. Regarding AP, which reflects segmentation quality, TreeSeg-Net reached 97.20%, surpassing PointGroup by 5.33 percentage points. This margin highlights the significant advantage of TreeSeg-Net when processing unstructured forest scenes. In comparison, SoftGroup achieved an AP of 82.20%, suggesting that relying solely on semantic features and bottom-up clustering strategies is inadequate for handling complex interlaced branch structures. OneFormer3D achieved the highest recall of 100.00%; however, its AP was only 86.11%, and its AP50 was also lower than that of our method. This suggests that while OneFormer3D detected nearly all tree targets, the quality of the generated masks was inferior, prone to redundant predictions or confidence ranking biases. A similar trend appeared with Organ3DNet: despite a recall of 93.10%, its precision was only 68.50%. Conversely, TreeSeg-Net maintained a high precision of 88.80% while upholding an extremely high recall, achieving the best AP50 value of 97.30%. This is attributed to the SPWM module, which imposes explicit geometric spatial constraints, effectively suppressing under-segmentation and over-segmentation phenomena caused by feature similarity.
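The precision-recall trade-off discussed here follows from how predicted instances are matched to ground truth. A simplified greedy matching sketch (illustrative only; full AP additionally ranks predictions by confidence and integrates precision over recall):

```python
def match_instances(pred, gt, iou_thr=0.5):
    """Greedy matching of predicted to ground-truth instances.

    pred / gt: lists of point-index sets. A prediction counts as a
    true positive if it overlaps a still-unmatched ground-truth
    instance with IoU >= iou_thr; precision and recall then follow
    directly from the true-positive count.
    """
    matched, tp = set(), 0
    for p in pred:
        for i, g in enumerate(gt):
            if i in matched:
                continue
            if len(p & g) / len(p | g) >= iou_thr:
                matched.add(i)
                tp += 1
                break
    return tp / len(pred), tp / len(gt)

# Three predictions, two real trees: one spurious prediction lowers
# precision while recall stays perfect -- the OneFormer3D pattern.
precision, recall = match_instances(
    [{1, 2, 3}, {4, 5}, {9}], [{1, 2, 3}, {4, 5, 6}])
```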

4. Discussion

High-precision individual tree segmentation (ITS) serves as the foundation for extracting phenotypic parameters and estimating carbon stocks from complex forest point clouds. This study investigated the viability of an end-to-end instance segmentation approach for complex forest scenarios by creating a high-density point cloud dataset derived from UAV SfM technology. Although UAV SfM provides a low-cost data source, achieving automated, high-precision individual segmentation in high-density deciduous stands remains a formidable challenge [40,41,42]. Existing ITS methods fall primarily into traditional clustering-based algorithms and data-driven deep learning algorithms [43,44,45]. While traditional region-growing or watershed algorithms perform well in sparse stands, they often suffer from over-segmentation or under-segmentation when dealing with complex overlapping crowns because they over-rely on local geometric features [46].
In contrast, deep learning methods demonstrate robust feature extraction capabilities. The fundamental innovation of TreeSeg-Net lies in the incorporation of GCAM and SPWM within a sparse 3D U-Net architecture. Specifically, SPWM amplifies the feature disparity between the crown center and its edge regions by incorporating spatial distance information [47]. This mechanism effectively prevents adjacent trees with highly similar branch textures from being erroneously merged, which notably improves the model’s performance in scenarios characterized by severe canopy interlacing. Simultaneously, GCAM establishes long-range dependencies between feature points via attention mechanisms, compensating for the limited local receptive field of the sparse convolutional backbone [48,49]. By addressing point cloud discontinuity caused by sparse leaves or mutual occlusion, this module integrates spatially discrete local features into a holistic entity, ensuring the geometric structural integrity and continuity of tree instances.
When compared to mainstream clustering-based networks such as PointGroup and SoftGroup, our approach demonstrates substantial improvements in boundary delineation and instance separation. This advantage primarily stems from the end-to-end mask prediction paradigm adopted by TreeSeg-Net [50,51]. Existing methods often suffer from under-segmentation in dense stands because they rely on bottom-up clustering, which struggles when the semantic features of adjacent branches are indistinguishable [21,52]. In contrast, our SPWM imposes a distance penalty that forces the network to prioritize spatially proximal points, thereby establishing clear segmentation boundaries even in overlapping regions.
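The distance penalty described here can be sketched as a Gaussian decay applied to the raw feature affinity (the function, the penalty form, and the `sigma` parameter are illustrative assumptions, not the exact SPWM formulation):

```python
import math

def spwm_score(feature_sim, point, center, sigma=1.0):
    """Distance-penalized affinity in the spirit of SPWM (a sketch).

    feature_sim: raw feature similarity between a point and an
    instance query. The penalty decays with the Euclidean distance to
    the instance's predicted geometric center, so a point whose
    features match two interlaced crowns equally well is assigned to
    the spatially closer one.
    """
    d = math.dist(point, center)
    return feature_sim * math.exp(-(d * d) / (2.0 * sigma * sigma))

# Two candidate tree centers with identical feature similarity:
# the nearer center wins after the spatial penalty.
near = spwm_score(0.9, (0.0, 0.0, 0.0), (0.5, 0.0, 0.0))
far = spwm_score(0.9, (0.0, 0.0, 0.0), (3.0, 0.0, 0.0))
```

This is precisely the mechanism that keeps adjacent trees with highly similar branch textures from being merged: feature similarity alone cannot break the tie, but spatial proximity can.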
However, it is acknowledged that point clouds reconstructed via SfM algorithms are inherently limited by lighting conditions and shooting angles. As observed in our study, this often results in missing trunk information in the lower canopy compared to LiDAR data, which restricts the direct application of the model for diameter at breast height extraction. In contrast, airborne or terrestrial LiDAR can provide more complete vertical structures but comes with high costs and limited portability [53]. Furthermore, the current model training relies primarily on a single type of forest data. The generalization capability of the model when facing mountainous scenes with more complex species compositions or dramatic terrain undulations remains to be further validated through transfer learning or domain adaptation techniques [54,55].

5. Conclusions

To address the challenges of ITS in complex forest environments, this paper proposes the TreeSeg-Net instance segmentation model. Built upon a sparse 3D U-Net backbone, the model integrates targeted improvement modules to adapt to the unstructured features of deciduous forest point clouds. The results validate the research hypotheses posited in this study: the integration of GCAM effectively compensates for the limited receptive field of sparse convolution, enhancing whole-tree structural perception, while the results with SPWM confirm that explicit geometric spatial constraints are essential for resolving the boundary adhesion caused by the similar physical features of adjacent crowns in high-density stands. Experimental results demonstrate that TreeSeg-Net outperforms current mainstream individual tree segmentation methods, achieving an AP of 97.2% and an mWCov of 98.8%. Compared to mainstream deep learning methods such as PointGroup and OneFormer3D, TreeSeg-Net exhibits higher precision and robustness in handling canopy overlap, identifying fine boundaries, and minimizing over-segmentation and under-segmentation errors. These findings substantiate the critical roles of global context enhancement and geometric spatial constraints in improving segmentation performance. In summary, the end-to-end network constructed in this study realizes the precise extraction of individual tree instances from forest point clouds without manual intervention, providing strong technical support for efficient forest resource inventory and refined management.

Author Contributions

X.X.: Conceptualization, Software, Investigation, Resources, Supervision. R.Z.: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Resources, Data curation, Writing—original draft, Writing—review and editing, Visualization, Supervision. S.X.: Investigation, Data curation. J.L.: Software, Validation, Visualization. X.Z. (Xinyue Zhang): Formal analysis, Investigation. L.C.: Investigation, Resources. H.Y.: Investigation, Resources, Supervision. Y.M.: Conceptualization, Methodology, Formal analysis, Investigation, Data curation, Validation, Project administration. J.Z.: Resources, Conceptualization, Supervision, Project administration, Funding acquisition. X.Z. (Xiyang Zhao): Investigation, Resources, Supervision. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Jilin Agricultural University High-Level Researcher Grant [JLAUHLRG20102006]; the Jilin Provincial Department of Human Resources and Social Security [No. 201020012]; and the 111 Project, Northeast Advantageous Characteristic Resources and Health Food Discipline Innovation Introduction Base [No. D23007].

Data Availability Statement

Data and code are available upon request.

Acknowledgments

The authors would like to acknowledge the anonymous reviewers for their valuable comments and the members of the editorial team for their careful proofreading.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Liang, J.; Crowther, T.W.; Picard, N.; Wiser, S.; Zhou, M.; Alberti, G.; Schulze, E.-D.; McGuire, A.D.; Bozzato, F.; Pretzsch, H. Positive biodiversity-productivity relationship predominant in global forests. Science 2016, 354, aaf8957. [Google Scholar] [CrossRef]
  2. Seidl, R.; Thom, D.; Kautz, M.; Martin-Benito, D.; Peltoniemi, M.; Vacchiano, G.; Wild, J.; Ascoli, D.; Petr, M.; Honkaniemi, J. Forest disturbances under climate change. Nat. Clim. Change 2017, 7, 395–402. [Google Scholar] [CrossRef]
  3. Huo, L.; Chen, Z.; Dai, L.; Wang, D.; Zhao, X. Research on Single-Tree Segmentation Method for Forest 3D Reconstruction Point Cloud Based on Attention Mechanism. Forests 2025, 16, 1192. [Google Scholar] [CrossRef]
  4. Yao, W.; Krzystek, P.; Heurich, M. Tree species classification and estimation of stem volume and DBH based on single tree extraction by exploiting airborne full-waveform LiDAR data. Remote Sens. Environ. 2012, 123, 368–380. [Google Scholar] [CrossRef]
  5. Yang, B.; Dai, W.; Dong, Z.; Liu, Y. Automatic forest mapping at individual tree levels from terrestrial laser scanning point clouds with a hierarchical minimum cut method. Remote Sens. 2016, 8, 372. [Google Scholar] [CrossRef]
  6. Dersch, S.; Schöttl, A.; Krzystek, P.; Heurich, M. Towards complete tree crown delineation by instance segmentation with Mask R–CNN and DETR using UAV-based multispectral imagery and lidar data. ISPRS Open J. Photogramm. Remote Sens. 2023, 8, 100037. [Google Scholar] [CrossRef]
  7. Gobbi, B.; Van Rompaey, A.; Gasparri, N.I.; Vanacker, V. Forest degradation in the Dry Chaco: A detection based on 3D canopy reconstruction from UAV-SfM techniques. For. Ecol. Manag. 2022, 526, 120554. [Google Scholar] [CrossRef]
  8. Ghasemi, M.; Latifi, H.; Iranmanesh, Y. Geometry-based point cloud fusion of dual-layer UAV photogrammetry and a modified unsupervised generative adversarial network for 3D tree reconstruction in semi-arid forests. Comput. Electron. Agric. 2025, 239, 111024. [Google Scholar] [CrossRef]
  9. Yu, X.; Hyyppä, J.; Holopainen, M.; Vastaranta, M. Comparison of area-based and individual tree-based methods for predicting plot-level forest attributes. Remote Sens. 2010, 2, 1481–1495. [Google Scholar] [CrossRef]
  10. Han, X.; Liu, C.; Zhou, Y.; Tan, K.; Dong, Z.; Yang, B. WHU-Urban3D: An urban scene LiDAR point cloud dataset for semantic instance segmentation. ISPRS J. Photogramm. Remote Sens. 2024, 209, 500–513. [Google Scholar] [CrossRef]
  11. Marasigan, R.; Festijo, E.; Juanico, D.E. Mangrove crown diameter measurement from airborne lidar data using marker-controlled watershed algorithm: Exploring performance. In Proceedings of the 2019 IEEE 6th International Conference on Engineering Technologies and Applied Sciences (ICETAS), Kuala Lumpur, Malaysia, 20–21 December 2019; IEEE: Washington, DC, USA, 2019. [Google Scholar]
  12. Yin, D.; Wang, L. Individual mangrove tree measurement using UAV-based LiDAR data: Possibilities and challenges. Remote Sens. Environ. 2019, 223, 34–49. [Google Scholar] [CrossRef]
  13. Williams, J.; Schönlieb, C.-B.; Swinfield, T.; Lee, J.; Cai, X.; Qie, L.; Coomes, D.A. 3D segmentation of trees through a flexible multiclass graph cut algorithm. IEEE Trans. Geosci. Remote Sens. 2019, 58, 754–776. [Google Scholar] [CrossRef]
  14. Chen, X.; Jiang, K.; Zhu, Y.; Wang, X.; Yun, T. Individual Tree Crown Segmentation Directly from UAV-Borne LiDAR Data Using the PointNet of Deep Learning. Forests 2021, 12, 131. [Google Scholar] [CrossRef]
  15. Shen, X.; Cao, L.; Chen, D.; Sun, Y.; Wang, G.; Ruan, H. Prediction of Forest Structural Parameters Using Airborne Full-Waveform LiDAR and Hyperspectral Data in Subtropical Forests. Remote Sens. 2018, 10, 1729. [Google Scholar] [CrossRef]
  16. Vega, C.; Hamrouni, A.; El Mokhtari, S.; Morel, J.; Bock, J.; Renaud, J.P.; Bouvier, M.; Durrieu, S. PTrees: A point-based approach to forest tree extraction from lidar data. Int. J. Appl. Earth Obs. Geoinf. 2014, 33, 98–108. [Google Scholar] [CrossRef]
  17. Pang, Y.; Wang, W.; Du, L.; Zhang, Z.; Liang, X.; Li, Y.; Wang, Z. Nyström-based spectral clustering using airborne LiDAR point cloud data for individual tree segmentation. Int. J. Digit. Earth 2021, 14, 1452–1476. [Google Scholar] [CrossRef]
  18. Hyyppä, J.; Kelle, O.; Lehikoinen, M.; Inkinen, M. A segmentation-based method to retrieve stem volume estimates from 3-D tree height models produced by laser scanners. IEEE Trans. Geosci. Remote Sens. 2001, 39, 969–975. [Google Scholar] [CrossRef]
  19. Mongus, D.; Žalik, B. An efficient approach to 3D single tree-crown delineation in LiDAR data. ISPRS J. Photogramm. Remote Sens. 2015, 108, 219–233. [Google Scholar] [CrossRef]
  20. Tao, S.; Wu, F.; Guo, Q.; Wang, Y.; Li, W.; Xue, B.; Hu, X.; Li, P.; Tian, D.; Li, C.; et al. Segmenting tree crowns from terrestrial and mobile LiDAR data by exploring ecological theories. ISPRS J. Photogramm. Remote Sens. 2015, 110, 66–76. [Google Scholar] [CrossRef]
  21. Zhao, W.; Yan, Y.; Yang, C.; Ye, J.; Yang, X.; Huang, K. Divide and Conquer: 3D Point Cloud Instance Segmentation With Point-Wise Binarization. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023. [Google Scholar]
  22. Chen, W.; Hu, X.; Chen, W.; Hong, Y.; Yang, M. Airborne LiDAR Remote Sensing for Individual Tree Forest Inventory Using Trunk Detection-Aided Mean Shift Clustering Techniques. Remote Sens. 2018, 10, 1078. [Google Scholar] [CrossRef]
  23. Xu, X.; Li, J.; Zhou, J.; Feng, P.; Yu, H.; Ma, Y. Three-Dimensional Reconstruction, Phenotypic Traits Extraction, and Yield Estimation of Shiitake Mushrooms Based on Structure from Motion and Multi-View Stereo. Agriculture 2025, 15, 298. [Google Scholar] [CrossRef]
  24. Zhong, Y.; Liu, S.; Sun, H. A 3D point cloud instance segmentation network for extracting individual trees from complex forest scenes. Comput. Electron. Agric. 2026, 242, 111333. [Google Scholar] [CrossRef]
  25. Wang, W.; Yu, R.; Huang, Q.; Neumann, U. Sgpn: Similarity group proposal network for 3d point cloud instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018. [Google Scholar]
  26. Engelmann, F.; Bokeloh, M.; Fathi, A.; Leibe, B.; Nießner, M. 3d-mpa: Multi-proposal aggregation for 3d semantic instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  27. Hou, J.; Dai, A.; Nießner, M. 3D-SIS: 3D Semantic Instance Segmentation of RGB-D Scans. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  28. Shi, S.; Wang, X.; Li, H. Pointrcnn: 3d object proposal generation and detection from point cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Washington, DC, USA, 2019. [Google Scholar]
  29. Jiang, L.; Zhao, H.; Shi, S.; Liu, S.; Fu, C.-W.; Jia, J. Pointgroup: Dual-set point grouping for 3d instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  30. Vu, T.; Kim, K.; Luu, T.M.; Nguyen, T.; Yoo, C.D. Softgroup for 3d instance segmentation on point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022. [Google Scholar]
  31. Li, E.; Razani, R.; Xu, Y.; Liu, B. Cpseg: Cluster-free panoptic segmentation of 3d lidar point clouds. arXiv 2021, arXiv:2111.01723. [Google Scholar]
  32. Xia, K.; Li, C.; Yang, Y.; Deng, S.; Feng, H. Study on Single-Tree Extraction Method for Complex RGB Point Cloud Scenes. Remote Sens. 2023, 15, 2644. [Google Scholar] [CrossRef]
  33. Henrich, J.; van Delden, J.; Seidel, D.; Kneib, T.; Ecker, A.S. TreeLearn: A deep learning method for segmenting individual trees from ground-based LiDAR forest point clouds. Ecol. Inform. 2024, 84, 102888. [Google Scholar] [CrossRef]
  34. Jiang, T.; Liu, S.; Zhang, Q.; Xu, X.; Sun, J.; Wang, Y. Segmentation of individual trees in urban MLS point clouds using a deep learning framework based on cylindrical convolution network. Int. J. Appl. Earth Obs. Geoinf. 2023, 123, 103473. [Google Scholar] [CrossRef]
  35. Sun, J.; Qing, C.; Tan, J.; Xu, X. Superpoint transformer for 3d scene instance segmentation. Proc. AAAI Conf. Artif. Intell. 2023, 37, 2393–2401. [Google Scholar] [CrossRef]
  36. Xiao, S.; Ye, Y.; Fei, S.; Chen, H.; Zhang, B.; Li, Q.; Cai, Z.; Che, Y.; Wang, Q.; Ghafoor, A.; et al. High-throughput calculation of organ-scale traits with reconstructed accurate 3D canopy structures using a UAV RGB camera with an advanced cross-circling oblique route. ISPRS J. Photogramm. Remote Sens. 2023, 201, 104–122. [Google Scholar] [CrossRef]
  37. Li, D.; Huang, J.; Zhao, B.; Wen, W. Organ3DNet: A deep network for segmenting organ semantics and instances from dense plant point clouds. Artif. Intell. Agric. 2026, 16, 342–364. [Google Scholar] [CrossRef]
  38. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  39. Kolodiazhnyi, M.; Vorontsova, A.; Konushin, A.; Rukhovich, D. Oneformer3d: One transformer for unified point cloud segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024. [Google Scholar]
  40. Qiu, P.; Wang, D.; Zou, X.; Yang, X.; Xie, G.; Xu, S.; Zhong, Z. Finer resolution estimation and mapping of mangrove biomass using UAV LiDAR and worldview-2 data. Forests 2019, 10, 871. [Google Scholar] [CrossRef]
  41. Li, Y.; Xie, D.; Wang, Y.; Jin, S.; Zhou, K.; Zhang, Z.; Li, W.; Zhang, W.; Mu, X.; Yan, G. Individual tree segmentation of airborne and UAV LiDAR point clouds based on the watershed and optimized connection center evolution clustering. Ecol. Evol. 2023, 13, e10297. [Google Scholar] [CrossRef]
  42. Irlan, I.; Adzkia, U.; Suhartono, S.; Meliani, M.; Jenos, A.S.; Bimantara, T.; A, C. Individual Tree Segmentation in Tropical Natural Forest Based on Point Cloud Generated from UAV RGB Image. J. Wasian 2025, 12, 27–42. [Google Scholar] [CrossRef]
  43. You, H.; Liu, Y.; Lei, P.; Qin, Z.; You, Q. Segmentation of individual mangrove trees using UAV-based LiDAR data. Ecol. Inform. 2023, 77, 102200. [Google Scholar] [CrossRef]
  44. Dai, W.; Yang, B.; Dong, Z.; Shaker, A. A new method for 3D individual tree extraction using multispectral airborne LiDAR point clouds. ISPRS J. Photogramm. Remote Sens. 2018, 144, 400–411. [Google Scholar] [CrossRef]
  45. Fu, Y.; Niu, Y.; Wang, L.; Li, W. Individual-Tree Segmentation from UAV–LiDAR Data Using a Region-Growing Segmentation and Supervoxel-Weighted Fuzzy Clustering Approach. Remote Sens. 2024, 16, 608. [Google Scholar] [CrossRef]
  46. Liu, Y.; Chen, D.; Fu, S.; Mathiopoulos, P.T.; Sui, M.; Na, J.; Peethambaran, J. Segmentation of Individual Tree Points by Combining Marker-Controlled Watershed Segmentation and Spectral Clustering Optimization. Remote Sens. 2024, 16, 610. [Google Scholar] [CrossRef]
  47. Wang, C.; Zhao, C. Enhancing 3D point cloud learning with geometric-aware dynamic graph convolution and Transformer networks. J. Supercomput. 2026, 82, 50. [Google Scholar] [CrossRef]
  48. Feng, M.; Zhang, L.; Lin, X.; Gilani, S.Z.; Mian, A. Point attention network for semantic segmentation of 3D point clouds. Pattern Recognit. 2020, 107, 107446. [Google Scholar] [CrossRef]
  49. Li, Z.; Gao, P.; You, K.; Yan, C.; Paul, M. Global Attention-Guided Dual-Domain Point Cloud Feature Learning for Classification and Segmentation. IEEE Trans. Artif. Intell. 2024, 5, 5167–5178. [Google Scholar] [CrossRef]
  50. Yang, B.; Wang, J.; Clark, R.; Hu, Q.; Wang, S.; Markham, A.; Trigoni, N. Learning Object Bounding Boxes for 3D Instance Segmentation on Point Clouds. In Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019. [Google Scholar]
  51. Royen, R.; Denis, L.; Munteanu, A. Protoseg: A prototype-based point cloud instance segmentation method. arXiv 2024, arXiv:2410.02352. [Google Scholar]
  52. Kolodiazhnyi, M.; Vorontsova, A.; Konushin, A.; Rukhovich, D. Top-down beats bottom-up in 3d instance segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024. [Google Scholar]
  53. Wang, Y.; Zhang, Z. Segment Any Leaf 3D: A Zero-Shot 3D Leaf Instance Segmentation Method Based on Multi-View Images. Sensors 2025, 25, 526. [Google Scholar] [CrossRef] [PubMed]
  54. Rizaldy, A.; Fassnacht, F.E.; Afifi, A.J.; Jiang, H.; Gloaguen, R.; Ghamisi, P. Label-Efficient 3D Forest Mapping: Self-Supervised and Transfer Learning for Individual, Structural, and Species Analysis. arXiv 2025, arXiv:2511.06331. [Google Scholar]
  55. Dubrovin, I.; Fortin, C.; Kedrov, A. An open dataset for individual tree detection in UAV LiDAR point clouds and RGB orthophotos in dense mixed forests. Sci. Rep. 2024, 14, 21938. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Overview of the study area: (A) Schematic diagram of the geographical location. (B) Field photograph showing the environment above the canopy within the sample plot. (C) Side view of the forest point cloud.
Figure 2. (A) Schematic diagram of the data acquisition strategy. (B) Bird’s-eye view of the CCO flight path, with an overlap rate of 50% between circles.
Figure 3. Schematic diagram of the data preprocessing workflow: (A) Filtering process: raw point clouds are denoised via the SOR algorithm, followed by ground point removal using the CSF algorithm to extract vegetation points. (B) Dataset samples for training, where each sample contains several trees and surrounding ground points. (C) The complete plot point cloud.
Figure 4. Overall architecture of TreeSeg-Net: (A) Sparse convolutional feature network. (B) Transformer decoder.
Figure 5. Structural diagram of the core modules in TreeSeg-Net: (A) SPWM introduces a geometric center prediction branch to impose spatial distance penalties. (B) QRM utilizes mask cross-attention mechanisms to aggregate features. (C) GCAM performs feature channel recalibration via global average pooling and an MLP.
Figure 6. Visual comparison of semantic segmentation results, illustrating the contrast between the GT and the predictions generated by TreeSeg-Net.
Figure 7. Visualization of instance segmentation results, where different colors represent distinct individual tree instances.
Table 1. Detailed architecture configuration of TreeSeg-Net.
| Stage | Module/Component | Operation | Kernel Size | Stride | Out Channels |
|---|---|---|---|---|---|
| Stem | Input Conv | MinkConv | 5 × 5 × 5 | 1 | 32 |
| | Downsample | MinkConv | 2 × 2 × 2 | 2 | 32 |
| Encoder | Stage 1 | ResBlock × 2 | 3 × 3 × 3 | 1 | 32 |
| | Downsample | MinkConv | 2 × 2 × 2 | 2 | 64 |
| | Stage 2 | ResBlock × 3 | 3 × 3 × 3 | 1 | 64 |
| | Downsample | MinkConv | 2 × 2 × 2 | 2 | 128 |
| | Stage 3 | ResBlock × 4 | 3 × 3 × 3 | 1 | 128 |
| | Feature Refine | GCAM | Global Pool | – | 128 |
| | Downsample | MinkConv | 2 × 2 × 2 | 2 | 256 |
| | Stage 4 | ResBlock × 6 | 3 × 3 × 3 | 1 | 256 |
| | Feature Refine | GCAM | Global Pool | – | 256 |
| Decoder | Upsample 4 | MinkTranspose | 2 × 2 × 2 | 2 | 256 |
| | Stage 5 | ResBlock × 2 | 3 × 3 × 3 | 1 | 256 |
| | Upsample 5 | MinkTranspose | 2 × 2 × 2 | 2 | 128 |
| | Stage 6 | ResBlock × 2 | 3 × 3 × 3 | 1 | 128 |
| | Upsample 6 | MinkTranspose | 2 × 2 × 2 | 2 | 96 |
| | Stage 7 | ResBlock × 2 | 3 × 3 × 3 | 1 | 96 |
| | Upsample 7 | MinkTranspose | 2 × 2 × 2 | 2 | 96 |
| | Stage 8 | ResBlock × 2 | 3 × 3 × 3 | 1 | 96 |
| | Feature Refine | GCAM | Global Pool | – | 96 |
| Transformer | Query Refine | QRM | – | – | 128 |
| Head | Instance Branch (Mask) | MLP | – | – | 128 (Embed) |
| | Instance Branch (Geometric) | MLP | – | – | 3 (x, y, z) |
| | Semantic Branch | MLP | – | – | 3 (Class) * |
* The output channel is set to 3 to include the ‘no-object’ class required by the Transformer architecture.
Table 2. Quantitative comparison of semantic segmentation performance between the baseline and TreeSeg-Net.
| Method | Class | Precision (%) | Recall (%) | F1-Score (%) | IoU (%) |
|---|---|---|---|---|---|
| Baseline | Ground | 99.95 | 99.87 | 99.91 | 99.82 |
| | Tree | 99.64 | 99.86 | 99.75 | 99.50 |
| | Average | 99.80 | 99.87 | 99.83 | 99.66 |
| TreeSeg-Net | Ground | 99.95 | 99.89 | 99.92 | 99.84 |
| | Tree | 99.70 | 99.86 | 99.78 | 99.55 |
| | Average | 99.83 | 99.87 | 99.85 | 99.70 |
Table 3. Performance evaluation of instance segmentation between the baseline and TreeSeg-Net.
| Method | Class | AP | AP50 | AP25 | mCov | mWCov |
|---|---|---|---|---|---|---|
| Baseline | Ground | 0.988 | 0.988 | 0.988 | 0.998 | 0.998 |
| | Tree | 0.825 | 0.842 | 0.855 | 0.895 | 0.908 |
| | Average | 0.907 | 0.915 | 0.922 | 0.947 | 0.953 |
| TreeSeg-Net | Ground | 0.999 | 0.999 | 0.999 | 0.998 | 0.998 |
| | Tree | 0.972 | 0.973 | 0.973 | 0.981 | 0.988 |
| | Average | 0.986 | 0.986 | 0.986 | 0.990 | 0.993 |
Table 4. Ablation study of TreeSeg-Net.
All values in %. Columns 2–5 report semantic segmentation; columns 6–11 report instance segmentation.

| Method | Sem. Prec | Sem. Rec | Sem. F1 | Sem. IoU | AP | AP50 | Inst. Prec | Inst. Rec | mCov | mWCov |
|---|---|---|---|---|---|---|---|---|---|---|
| Baseline | 99.64 | 99.86 | 99.75 | 99.50 | 82.50 | 84.20 | 68.50 | 93.10 | 89.50 | 90.80 |
| +GCAM | 99.57 | 99.88 | 99.73 | 99.46 | 86.30 | 87.60 | 70.50 | 94.80 | 92.20 | 92.80 |
| +SPWM | 99.55 | 99.89 | 99.72 | 99.44 | 89.30 | 90.60 | 72.00 | 96.70 | 95.10 | 95.70 |
| TreeSeg-Net | 99.70 | 99.86 | 99.78 | 99.55 | 97.20 | 97.30 | 88.80 | 99.20 | 98.10 | 98.80 |
Table 5. Comparison with other mainstream methods.

| Method | mIoU (%) | AP (%) | AP50 (%) | AP25 (%) | Rec (%) | Prec (%) |
|---|---|---|---|---|---|---|
| PointGroup | 87.20 | 91.87 | - | - | - | - |
| SoftGroup | 98.80 | 82.20 | 94.40 | 90.10 | 91.20 | - |
| OneFormer3D | 99.94 | 86.11 | 94.64 | 93.49 | 100.00 | 89.47 |
| Organ3DNet | 99.66 | 82.50 | 84.20 | 85.50 | 93.10 | 68.50 |
| TreeSeg-Net | 99.70 | 97.20 | 97.30 | 97.30 | 99.20 | 88.80 |

Share and Cite

MDPI and ACS Style

Xu, X.; Zhang, R.; Xiao, S.; Li, J.; Zhang, X.; Cao, L.; Yu, H.; Ma, Y.; Zhang, J.; Zhao, X. TreeSeg-Net: An End-to-End Instance Segmentation Network for Leaf-Off Forest Point Clouds Using Global Context and Spatial Proximity. Plants 2026, 15, 525. https://doi.org/10.3390/plants15040525


