1. Introduction
Semantic segmentation of individual tree-scale light detection and ranging (LiDAR) point clouds is a fundamental technique for precision forest management, analysis of ecological interaction mechanisms, and forest carbon stock assessment [1]. Traditional forestry surveys rely heavily on manual visual interpretation or ground-based measurements, which are labor-intensive, time-consuming, and inadequate for covering large and complex terrains [2]. In recent years, the rapid advancement of unmanned aerial vehicle (UAV)-based multi-sensor technologies has opened up new opportunities for forest resource surveys. By synchronously acquiring high-resolution RGB imagery and LiDAR point clouds, researchers are now able to capture forest structural information with greater completeness and detail [3,4]. However, existing supervised point cloud segmentation approaches remain heavily dependent on labor-intensive annotated datasets. The high cost of annotation and limited generalization across different forest environments constrain their applicability in dynamic forest monitoring [5]. Meanwhile, 2D image segmentation methods based on deep learning have demonstrated strong transferability across domains [6,7,8]. Nonetheless, efficiently transferring 2D semantic labels to the 3D point cloud domain remains a significant open challenge.
Current individual tree segmentation methods can be broadly categorized into traditional geometry-based algorithms and deep learning-based models. Traditional approaches, such as the watershed algorithm based on canopy height models (CHMs) [9,10] and voxel-based region growing methods [11,12], do not require annotated data. However, they are highly sensitive to canopy overlap and morphological diversity, often resulting in over-segmentation or under-segmentation. With the advent of deep learning, PointNet [13] pioneered end-to-end point cloud classification, but its global feature extraction mechanism overlooks local geometric relationships, leading to frequent misclassifications in dense canopy segmentation tasks. PointNet++ [14] addressed this limitation by hierarchically aggregating local neighborhood features, yet its supervised training paradigm still demands large amounts of labeled data and remains sensitive to variations in point cloud density. RandLA-Net [15] improves computational efficiency through random sampling and local feature enhancement, but its recall rate under dense canopies falls below 40%, which is insufficient for high-precision forestry applications. Moreover, supervised models exhibit limited generalization across forest types (e.g., coniferous vs. broadleaf forests), with significant performance degradation when applied to novel environments [16]. This issue is particularly critical in dynamically changing forest ecosystems.
Recent studies have explored multimodal fusion to enhance segmentation robustness. For instance, the FuseSeg framework [17] integrates RGB and LiDAR data via early-stage feature fusion, while the PMF (Perception-Aware Multi-Sensor Fusion) method [18] employs a dual-stream network to align multimodal features within the camera coordinate system. However, these approaches are primarily designed for rigid urban objects and struggle to handle the flexible deformation and canopy occlusion commonly observed in forested environments. Individual tree segmentation in forest scenes exhibits distinct characteristics. Unlike the regular structures of rigid urban objects, forest canopies vary significantly across tree species (for example, conical spruces versus umbrella-shaped oaks), necessitating strong model generalization capabilities [19]. In addition, multi-layered canopies lead to sparse point cloud representations of understory trees, as LiDAR pulse penetration in coniferous forests is typically limited to 30–50%, further complicating data processing. Dynamic factors such as seasonal variations in leaf area also pose challenges to model robustness, requiring traditional supervised methods to be frequently retrained to adapt to shifts in data distribution [16].
To reduce the reliance on manual annotations, extensive research has explored unsupervised and cross-modal transfer techniques [20,21,22,23]. For example, the Segment Anything Model (SAM) [21] achieves zero-shot generalization through prompt-based mechanisms. However, when applied directly to LiDAR point clouds, it suffers from fragmented segmentation results due to the lack of spatial continuity in the 3D domain. Wang et al. [22] proposed an unsupervised method for forest point cloud instance segmentation, which achieved promising performance on benchmark datasets. However, it still struggles to effectively interpret LiDAR point cloud information in large-scale and complex forested areas. Weakly supervised methods such as SFL-Net (Slight Filter Learning Network) [23] leverage sparse annotations to guide model training, yet still require at least 20% of the data to be labeled, making it difficult to completely eliminate human intervention.
The widespread adoption of UAV-based multi-sensor systems (LiDAR + RGB) has opened new avenues for cross-modal semantic transfer. RGB imagery provides rich color and texture cues that assist in species-level tree discrimination [24], while LiDAR point clouds offer precise 3D geometric information [25]. However, multimodal fusion presents several significant challenges. First, dense matching point clouds generated from UAV imagery (structure-from-motion point clouds) often exhibit scale discrepancies and coordinate offsets when compared to LiDAR point clouds, with direct registration errors reaching the meter level. Second, when 2D semantic labels from single-view images are projected into 3D space, canopy occlusion leads to incomplete label coverage. Finally, existing fusion frameworks (e.g., FuseSeg [17]) must concurrently process high-resolution imagery and large-scale point clouds, resulting in high computational complexity that hampers real-time performance, an essential requirement for operational forestry monitoring. To tackle these issues, this paper proposes a cross-modal semantic transfer framework for individual tree point cloud segmentation. By leveraging the synergistic analysis of UAV imagery and LiDAR data, the framework aims to provide a low-cost yet high-accuracy solution for large-scale forest monitoring.
To address the aforementioned challenges, this study proposes a cross-modal semantic transfer framework for individual tree point cloud segmentation in forest environments. The innovation lies in the integration of multimodal data advantages with optimized computational efficiency. First, we introduce a novel model, Multi-Source Feature Fusion Network (MSFFNet), which performs instance-level segmentation of individual trees based on UAV imagery. Second, a two-stage registration strategy is designed: a coarse alignment between dense matching point clouds and LiDAR data is achieved using the Two-stage Consensus Filtering (TCF) algorithm, followed by local refinement through the Iterative Closest Point (ICP) algorithm, effectively resolving spatial inconsistencies across data sources. Third, a semantic probability field is constructed by integrating multi-view geometry with the Expectation-Maximization (EM) algorithm, addressing inconsistencies in tree instance masks across different views (e.g., over-segmentation or misclassification). Finally, based on the semantic probability field and 3D point cloud, semantic association between 2D pixels and 3D points is established by jointly considering point cloud geometric features and semantic confidence. This enables the transfer of semantic labels from the 2D image domain to the 3D point cloud domain, completing the cross-modal semantic label mapping process.
The main contributions of this study are summarized as follows:
- (1)
This study presents a cross-modal semantic transfer framework for forest scenes, which leverages UAV imagery and LiDAR data to transfer semantic information from the 2D image domain to the 3D point cloud domain. This approach addresses the critical limitation of supervised point cloud segmentation methods that rely heavily on manual annotations.
- (2)
This study develops a label fusion algorithm based on multi-view spatial consistency, which effectively resolves semantic mapping conflicts caused by inconsistent instance masks of the same tree across multiple views. The proposed method ensures semantic consistency across views and preserves global intra-class geometric coherence.
- (3)
To evaluate the performance of the proposed cross-modal semantic transfer framework for individual tree point cloud segmentation, we conducted comprehensive experiments on two UAV-LiDAR datasets, demonstrating its effectiveness in comparison with several state-of-the-art methods.
The structure of this paper is organized as follows: Section 2 describes the data collection and preprocessing procedures. Section 3 presents the proposed method in detail, along with the experimental evaluation metrics. Section 4 reports and analyzes the experimental results of the proposed framework. Finally, Section 5 concludes the paper with a summary of the main findings.
3. Proposed Method
For a forest scene with LiDAR point clouds and UAV imagery, our objective is to automatically transfer 2D semantic labels—obtained from a pretrained image segmentation network—into the 3D point cloud domain.
Figure 2 illustrates the overall framework of the proposed method. We present a cross-modal label transfer approach tailored for individual tree semantic segmentation in forest environments. The framework takes RGB imagery and LiDAR point clouds as input and predicts instance-level tree crown semantic labels using the proposed segmentation model. The following subsections provide a detailed description of each module within the framework. As shown in Figure 2, the proposed method consists of three main stages: (1) 2D instance segmentation of individual tree crowns; (2) registration between aerial imagery and LiDAR point clouds; and (3) cross-modal semantic transfer based on a semantic probability field.
3.1. 2D Instance Segmentation of Individual Tree Canopies
Existing 2D instance segmentation networks [31,32,33] perform well on general datasets (such as MS-COCO [34] and PASCAL VOC [35]), but struggle to adapt to the unique complexity of forest scenes, for example, significant scale differences, dense forest spatial heterogeneity, and complex background interference caused by terrain undulations and canopy interlacing. To address this challenge, this study proposes a novel instance segmentation network named MSFFNet, which generates high-precision single-tree canopy segmentation masks through a multi-scale feature interaction fusion mechanism. To support model training and validation, we constructed a scene-adaptive annotated dataset (see Section 2.2), which is based on real forest imagery collected by a drone remote sensing platform and manually annotated by professionals to achieve precise annotation of single-tree canopy instances across the entire study area.
The architecture of MSFFNet is illustrated in Figure 3. It comprises three primary stages: feature extraction, region of interest (ROI) generation, and instance mask prediction. In the feature extraction stage, MSFFNet emphasizes deep interaction among multi-source data to enhance hierarchical feature integration. Specifically, MSFFNet adopts a dual-branch parallel design that interactively fuses complementary information from multimodal inputs, with each branch dedicated to capturing modality-specific features. To address the computational challenges posed by high-resolution UAV imagery, both branches employ SegFormer as the backbone network. This design leverages sequence reduction mechanisms and implicit positional encoding to model long-range contextual dependencies while significantly reducing parameter overhead. It not only overcomes the inherent computational bottlenecks of Transformer-based architectures in handling high-resolution inputs but also mitigates the sensitivity of conventional methods to image resolution variability. Its efficient feature extraction aligns well with the multiscale structural characteristics of forest canopies, offering an ideal solution for individual tree segmentation in complex terrains. Considering the substantial geometric and semantic differences between RGB images and CHM data, a multi-level feature fusion (MLFF) module is introduced between the two branches. This module facilitates deep cross-modal feature fusion and improves the model's capability in crown detection and boundary delineation by integrating hierarchical representations across modalities. In the second stage, the fused multi-level feature maps are input into a region proposal network (RPN) to generate initial ROIs. In the final stage, deformable ROI alignment is applied to adaptively match each candidate region with the shared feature maps, enabling precise extraction of enhanced crown representations. The architecture simultaneously decodes three key outputs: crown class confidence, bounding box geometry, and pixel-wise instance masks, thus achieving joint optimization of geometric structure and semantic understanding.
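To make the dual-branch design concrete, the following is a minimal PyTorch-style sketch of the feature-extraction stage, not the authors' implementation: the SegFormer backbones are replaced by a lightweight convolutional stand-in, and the MLFF module is reduced to a per-level concatenation-and-projection placeholder; all class names and channel sizes are illustrative.

```python
# Minimal sketch (not the released MSFFNet code): dual-branch feature extraction
# with per-level cross-modal fusion. The SegFormer backbone is replaced by a
# lightweight stand-in; MLFF is a placeholder fusion block.
import torch
import torch.nn as nn


class BackboneStub(nn.Module):
    """Stand-in for a SegFormer-style hierarchical encoder (4 feature levels)."""
    def __init__(self, in_ch, dims=(32, 64, 128, 256)):
        super().__init__()
        self.stages = nn.ModuleList()
        prev = in_ch
        for d in dims:
            self.stages.append(nn.Sequential(
                nn.Conv2d(prev, d, 3, stride=2, padding=1),
                nn.BatchNorm2d(d), nn.ReLU(inplace=True)))
            prev = d

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        return feats  # multi-level features


class SimpleMLFF(nn.Module):
    """Placeholder multi-level feature fusion: concatenation + 1x1 conv per level."""
    def __init__(self, dims=(32, 64, 128, 256)):
        super().__init__()
        self.fuse = nn.ModuleList(nn.Conv2d(2 * d, d, 1) for d in dims)

    def forward(self, rgb_feats, chm_feats):
        return [f(torch.cat([a, b], dim=1))
                for f, a, b in zip(self.fuse, rgb_feats, chm_feats)]


class DualBranchExtractor(nn.Module):
    def __init__(self):
        super().__init__()
        self.rgb_branch = BackboneStub(in_ch=3)   # RGB imagery
        self.chm_branch = BackboneStub(in_ch=1)   # canopy height model
        self.mlff = SimpleMLFF()

    def forward(self, rgb, chm):
        fused = self.mlff(self.rgb_branch(rgb), self.chm_branch(chm))
        return fused  # fed to an RPN / mask head in the full pipeline


if __name__ == "__main__":
    model = DualBranchExtractor()
    rgb = torch.randn(1, 3, 256, 256)
    chm = torch.randn(1, 1, 256, 256)
    print([f.shape for f in model(rgb, chm)])
```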
As illustrated in Figure 4, the MLFF module integrates the texture and spectral information from RGB imagery with the spatial topological features from the CHM. While RGB data provides rich visual and appearance cues of the canopy, the CHM offers precise geometric structural representations, together forming a complementary multi-source data advantage. Deep fusion of these heterogeneous features enables the network to generate more accurate instance masks of individual tree crowns by improving crown boundary discrimination and reducing false positives and missed detections. As shown in Figure 4, the MLFF module consists of two main components: feature alignment and feature interaction. The feature interaction component further includes channel feature interaction (CFI) and spatial feature interaction (SFI), enabling effective cross-modal information exchange and enhancing the discriminative power of the network.
To mitigate the impact of spatial misalignment between multi-branch and multi-level feature maps during feature interaction, we designed a Feature Alignment Module (FAM), as shown in Figure 4a. This module employs deformable convolution to achieve cross-branch spatial consistency correction, thereby providing higher-quality and more robust fused features for subsequent channel-wise and spatial feature interactions. Specifically, the feature maps from the dual branches are independently fed into deformable convolution blocks, each consisting of a fixed 3 × 3 deformable convolution kernel, batch normalization, and ReLU activation, for adaptive spatial alignment. The aligned feature maps are then concatenated and passed through a 3 × 3 convolution layer to predict a semantic offset field, which dynamically refines the spatial correspondence between modalities. This mechanism forms a robust foundation for downstream cross-modal feature fusion and interaction.
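A minimal sketch of such an alignment block is given below, assuming torchvision's DeformConv2d. For simplicity, the offset field is predicted from the concatenated branch features and shared by both deformable convolutions, which is one possible reading of the design rather than the exact released implementation.

```python
# Sketch of a feature alignment block (assumed structure, not the released FAM):
# a shared offset field is predicted from both modalities and used by 3x3
# deformable convolutions to spatially align each branch before fusion.
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d


class FeatureAlignmentModule(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # 3x3 kernel -> 2 * 3 * 3 = 18 offset channels
        self.offset_pred = nn.Conv2d(2 * channels, 18, kernel_size=3, padding=1)
        self.deform_a = DeformConv2d(channels, channels, kernel_size=3, padding=1)
        self.deform_b = DeformConv2d(channels, channels, kernel_size=3, padding=1)
        self.bn_relu_a = nn.Sequential(nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        self.bn_relu_b = nn.Sequential(nn.BatchNorm2d(channels), nn.ReLU(inplace=True))

    def forward(self, feat_a, feat_b):
        # Predict a semantic offset field from the concatenated modalities.
        offsets = self.offset_pred(torch.cat([feat_a, feat_b], dim=1))
        aligned_a = self.bn_relu_a(self.deform_a(feat_a, offsets))
        aligned_b = self.bn_relu_b(self.deform_b(feat_b, offsets))
        return aligned_a, aligned_b


if __name__ == "__main__":
    fam = FeatureAlignmentModule(channels=64)
    a, b = torch.randn(1, 64, 64, 64), torch.randn(1, 64, 64, 64)
    out_a, out_b = fam(a, b)
    print(out_a.shape, out_b.shape)
```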
To achieve deep interaction between geometric and attribute features from multimodal data, the fused features output by the FAM are further processed by the CFI and SFI mechanisms. These modules allow multiple data streams within the network to emphasize each other's complementary information while aligning feature responses across modalities, thereby reducing the semantic and distributional discrepancies between them. As illustrated in Figure 4b, the CFI module consists of global average pooling, global max pooling, and a multi-layer perceptron (MLP). Specifically, the two input feature maps each undergo both global average pooling and global max pooling to obtain channel descriptor tensors, which capture the compressed global spatial information of each channel and enhance the channel-wise feature response. The resulting four feature vectors are concatenated and passed through an MLP to promote deep interaction and fusion of multimodal feature representations. As shown in Figure 4c, the SFI module comprises a convolutional layer followed by a sigmoid activation function. After the channel-wise interaction, the two aligned feature maps are concatenated and passed through a 1 × 1 convolution layer to generate a projection tensor that strengthens spatial responses across the fused feature maps and facilitates spatial-level interaction of multimodal features.
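The following sketch illustrates the CFI and SFI operations described above (pooling plus MLP channel weighting, and a 1 × 1 convolution with sigmoid for spatial weighting); the reduction ratio and the way the learned weights are applied back to the two feature maps are assumptions.

```python
# Minimal sketch (assumed layer choices) of channel and spatial feature interaction:
# global average/max pooling + MLP for CFI, 1x1 convolution + sigmoid for SFI.
import torch
import torch.nn as nn


class ChannelFeatureInteraction(nn.Module):
    def __init__(self, channels, reduction=4):
        super().__init__()
        # MLP over the four concatenated pooled descriptors (2 maps x 2 poolings)
        self.mlp = nn.Sequential(
            nn.Linear(4 * channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, 2 * channels), nn.Sigmoid())

    def forward(self, feat_a, feat_b):
        b, c, _, _ = feat_a.shape
        pooled = torch.cat([
            feat_a.mean(dim=(2, 3)), feat_a.amax(dim=(2, 3)),
            feat_b.mean(dim=(2, 3)), feat_b.amax(dim=(2, 3))], dim=1)
        weights = self.mlp(pooled).view(b, 2 * c, 1, 1)
        w_a, w_b = weights[:, :c], weights[:, c:]
        return feat_a * w_a, feat_b * w_b   # channel-reweighted features


class SpatialFeatureInteraction(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.proj = nn.Conv2d(2 * channels, 1, kernel_size=1)

    def forward(self, feat_a, feat_b):
        # 1x1 projection of the concatenated maps -> spatial attention map
        attn = torch.sigmoid(self.proj(torch.cat([feat_a, feat_b], dim=1)))
        return feat_a * attn, feat_b * attn


if __name__ == "__main__":
    a, b = torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32)
    a, b = ChannelFeatureInteraction(64)(a, b)
    a, b = SpatialFeatureInteraction(64)(a, b)
    print(a.shape, b.shape)
```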
3.2. Aerial Imagery and LiDAR Point Cloud Registration
Due to sensor system errors (such as camera exposure delay, calibration residuals, and insufficient GPS/IMU observation quality control), there is typically a significant positional deviation between directly georeferenced aerial imagery and ALS data [36]. This study designed a two-stage 3D–3D registration strategy to ensure the accuracy of cross-modal semantic transfer. First, a dense matching point cloud (MPC) is generated from the UAV imagery, followed by noise filtering to remove outliers and obtain an optimized 3D point set. Then, the TCF algorithm is employed to achieve coarse alignment between the MPC and ALS data. Finally, an improved ICP algorithm is applied to perform sub-pixel-level fine registration, addressing spatial inconsistencies across modalities and enhancing the spatial coupling precision between imagery and LiDAR point clouds. This provides a geometrically consistent foundation for subsequent semantic label transfer. The proposed registration framework consists of two main stages: coarse registration and fine registration.
- (1)
Coarse registration of point clouds based on the TCF algorithm
The Random Sample Consensus (RANSAC) algorithm is a widely used method for coarse point cloud registration [37]. Its core principle involves iteratively selecting a minimal set of randomly sampled point correspondences (typically three non-collinear pairs) to estimate a candidate rigid transformation consisting of a rotation matrix R and a translation vector T. An inlier verification mechanism is then applied by evaluating whether the Euclidean distances between the transformed source points and the target points fall below a predefined threshold. The main advantage of RANSAC lies in its robustness to outliers in feature matching. By leveraging random sampling and consistency checking, the algorithm can tolerate more than 50% incorrect correspondences while still estimating a globally optimal coarse transformation. This provides a reliable initialization for subsequent fine registration, helping to avoid convergence to local minima and improving overall registration success rates.
To address the computational inefficiency and mismatch risks of conventional RANSAC algorithms in point cloud registration, especially in the presence of outliers arising from acquisition discrepancies between the MPC and ALS data, this study introduces the TCF approach to optimize coarse registration [38], thereby providing a high-quality initialization for subsequent fine alignment. Traditional RANSAC relies on purely random sampling during its hypothesis generation phase, resulting in exponentially increasing iterations and limited robustness under non-rigid deformation, as its distance-based inlier verification lacks geometric adaptability. In contrast, the proposed TCF method enhances both efficiency and accuracy through a dimension-reduction strategy, as formulated in Equation (1). In the first stage, a length-constrained single-point sampling scheme is employed to eliminate large-scale outliers. In the second stage, a two-point sampling strategy evaluates angular consistency to retain only high-confidence correspondences. Finally, the transformation parameters, including the rotation matrix R and translation vector T, are robustly estimated using a scale-adaptive Cauchy Iterative Reweighted Least Squares (IRLS) solver. The workflow of the TCF algorithm is shown in Figure 5.
$$N = \frac{\log(1 - p)}{\log\left(1 - (1 - \varepsilon)^{s}\right)} \qquad (1)$$

where $s$ denotes the sampling dimension, $\varepsilon$ denotes the proportion of outliers, $N$ denotes the number of iterations, and $p$ denotes the success probability, which is generally set to 0.99.
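As a quick illustration of Equation (1), the snippet below evaluates the required iteration count for different sampling dimensions, showing why the one-point and two-point sampling stages of TCF need far fewer hypotheses than classical three-point RANSAC (the outlier ratio used here is illustrative).

```python
# Required RANSAC iterations from Equation (1): lowering the sampling
# dimension s (as in the two-stage TCF sampling) sharply reduces the count.
import math

def ransac_iterations(outlier_ratio: float, s: int, success_prob: float = 0.99) -> int:
    """N = log(1 - p) / log(1 - (1 - epsilon)^s)."""
    inlier_prob_all = (1.0 - outlier_ratio) ** s
    return math.ceil(math.log(1.0 - success_prob) / math.log(1.0 - inlier_prob_all))

if __name__ == "__main__":
    for s in (3, 2, 1):  # classic 3-point sampling vs. 2-point and 1-point stages
        print(s, ransac_iterations(outlier_ratio=0.7, s=s))
```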
- (2)
Point cloud fine registration based on the ICP algorithm
With the initial transformation parameters estimated by the TCF algorithm, the MPC and ALS point clouds can be roughly aligned. However, due to differences in features and scale between the MPC and ALS point clouds, obtaining high-precision registration results still faces several challenges: (1) the feature distributions of the MPC and ALS data are uneven, with systematic biases in local regions (especially tree trunks, understory vegetation, and the ground), leading to local misalignment; (2) scale uncertainty (scale drift) in the MPC causes minor scale differences between the MPC and ALS data; and (3) residual noise and outliers remaining after TCF, as well as registration errors caused by differences in sensor perspectives and occlusions, degrade alignment accuracy.
To address local deformations and subtle residual deviations after coarse registration, and to further refine the rigid transformation parameters, this study employs the ICP algorithm to optimize the alignment between MPC and ALS point clouds. The ICP algorithm iteratively estimates the optimal rigid transformation by minimizing the Euclidean distance between corresponding point pairs, considering six degrees of freedom (three translations and three rotations), as defined by the objective function in Equation (2). During each iteration, the algorithm dynamically suppresses or removes outliers and searches for nearest neighbors across the entire scene, thereby improving the global spatial consistency. This process results in a registration outcome with significantly enhanced accuracy and robustness.
$$E(R, T) = \frac{1}{N} \sum_{i=1}^{N} \left\| q_i - \left(R\, p_i + T\right) \right\|^{2} \qquad (2)$$

where $N$ represents the number of corresponding point pairs, and $q_i$ and $p_i$ represent the corresponding points in the two point clouds.
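For reference, a fine-registration step of this kind can be sketched with Open3D's point-to-point ICP as a stand-in for the improved ICP described above; the file paths, correspondence distance, and iteration cap below are placeholders.

```python
# Illustrative fine registration with Open3D's point-to-point ICP (a stand-in
# for the improved ICP described above). File paths and the initial transform
# T_coarse (from coarse registration) are placeholders.
import numpy as np
import open3d as o3d

def refine_with_icp(mpc_path: str, als_path: str, T_coarse: np.ndarray,
                    max_corr_dist: float = 0.5) -> np.ndarray:
    source = o3d.io.read_point_cloud(mpc_path)   # image-derived MPC
    target = o3d.io.read_point_cloud(als_path)   # ALS LiDAR point cloud
    result = o3d.pipelines.registration.registration_icp(
        source, target, max_corr_dist, T_coarse,
        o3d.pipelines.registration.TransformationEstimationPointToPoint(),
        o3d.pipelines.registration.ICPConvergenceCriteria(max_iteration=100))
    return result.transformation  # refined 4x4 rigid transform

if __name__ == "__main__":
    T0 = np.eye(4)  # replace with the TCF coarse-registration result
    T = refine_with_icp("mpc.ply", "als.ply", T0)
    print(T)
```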
The results of coarse and fine registration of the point clouds are shown in Figure 6. After coarse alignment with the TCF algorithm, the MPC and ALS point clouds were roughly aligned; however, local deviations remained, as highlighted in the zoomed-in views. Subsequent fine registration with the ICP algorithm produced significantly more accurate results. As shown in Figure 6b, both global and local views indicate no noticeable misalignment between the MPC and ALS point clouds. In key regions such as tree canopies, understory trunks, and ground surfaces, the proposed two-stage registration strategy effectively eliminated prominent local deviations. These results demonstrate that the proposed approach achieves accurate alignment between MPC and ALS point clouds in complex forest environments, exhibiting high registration precision.
3.3. Cross-Modal Semantic Transfer
In Section 3.1, the proposed MSFFNet model is used for instance segmentation of tree crowns from multi-view UAV imagery. However, in multi-view images, multiple instance masks from different viewpoints may correspond to the same physical tree crown. To accurately segment individual trees in the 3D point cloud, it is essential to identify 2D masks that belong to the same tree and associate them with their corresponding regions in 3D space. This task is challenging due to segmentation errors in the 2D domain, such as tree crowns being correctly segmented in some views but over-segmented or falsely detected in others. Examples of such inconsistencies are illustrated in Figure 7.
To address this challenge, this study combines multi-view geometry with an expectation–maximization (EM) algorithm based on a semantic probability model to construct a semantic probability field for multi-view instance masks and LiDAR point clouds, thereby resolving inconsistencies in instance masks for the same tree across different viewpoints. The algorithm iteratively optimizes the association probabilities between masks and 3D point clouds within the EM framework, explicitly modeling the uncertainty in segmentation results. Let $M_{i,k}$ denote the kth local instance mask in image $I_i$. We iterate over all masks $M_{i,k}$ in all images and create an independent instance container $C_{i,k}$ for each mask, where $C_{i,k}$ denotes the container for the kth mask instance in image $I_i$. The algorithm primarily consists of five steps (the workflow is shown in Figure 8): initial semantic probability association, expectation step, maximization step, iteration control, and instance generation.
- (1)
Initial semantic probability association
First, we establish an initial semantic probability association between 3D points and 2D masks. For each 3D point $P_j$, we use the point-to-image mapping relationship $\pi$ provided by the SfM system to determine all mask instances $M_{i,k}$ that cover $P_j$, calculate the number of covering viewpoints $V_j$, and establish the association between points and masks based on $V_j$ as a uniform distribution over the covering masks:

$$p^{(0)}(M_{i,k} \mid P_j) = \frac{1}{V_j} \qquad (3)$$
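A minimal sketch of this initialization is shown below; the `visible_masks` dictionary standing in for the SfM point-to-image mapping is an assumed data structure.

```python
# Sketch of the initial association (Equation (3)): each 3D point starts with a
# uniform probability over the instance masks that cover it.
from collections import defaultdict

def init_probabilities(visible_masks):
    """visible_masks: dict {point_id: list of (image_id, mask_id) covering the point}."""
    prob = defaultdict(dict)
    for pt, masks in visible_masks.items():
        v = len(masks)                      # number of covering viewpoints V_j
        for m in masks:
            prob[pt][m] = 1.0 / v           # uniform initial distribution
    return prob

if __name__ == "__main__":
    demo = {0: [("img1", 2), ("img2", 5)], 1: [("img1", 2)]}
    print(init_probabilities(demo))
```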
- (2)
Expectation step
In the expectation step of the iterative optimization, we introduce multi-view spatial consistency constraints for probability re-estimation based on the semantic probability distribution of the 3D point–mask associations at the current round $t$. The core objective of this step is to solve the misassociation problem caused by the initial uniform distribution by integrating the spatial compatibility of the 3D scene and the 2D images. Specifically, we design two key geometric factors (a visibility verification factor $v_{j,k}$ and a spatial proximity weight $w_{j,k}$) to jointly construct the semantic probability update weights, where $v_{j,k}$ represents the visibility judgment of the 3D point $P_j$, which is mainly determined through the mapping relationship $\pi$ of the SfM system; $w_{j,k}$ represents the spatial consistency between the 3D point and the 2D mask; $W$ and $H$ represent the width and height of the image, respectively; $\pi_i(P_j)$ represents the projection point of $P_j$ in the image $I_i$; and $c_{i,k}$ represents the centroid of the kth mask in the image $I_i$. The physical meaning of Formula (6) is that the closer the 3D point is to the center of the mask, the higher the spatial consistency.
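The re-weighting can be sketched as follows; the exact functional form of the proximity weight (here a distance to the mask centroid normalized by the image diagonal) and the binary visibility factor are assumptions consistent with the description above.

```python
# Sketch of the E-step re-weighting: probabilities are re-estimated with a
# visibility factor (0/1 from the SfM mapping) and a spatial proximity weight
# based on the distance between the projected point and the mask centroid.
import math

def proximity_weight(proj_xy, centroid_xy, width, height):
    diag = math.hypot(width, height)
    dist = math.hypot(proj_xy[0] - centroid_xy[0], proj_xy[1] - centroid_xy[1])
    return max(0.0, 1.0 - dist / diag)      # closer to the mask center -> higher weight

def e_step(prob, visibility, projections, centroids, width, height):
    """Re-weight and renormalize each point's mask-association distribution."""
    new_prob = {}
    for pt, masks in prob.items():
        weights = {}
        for m, p in masks.items():
            vis = 1.0 if visibility.get((pt, m), False) else 0.0
            w = proximity_weight(projections[(pt, m)], centroids[m], width, height)
            weights[m] = p * vis * w
        total = sum(weights.values())
        new_prob[pt] = ({m: w / total for m, w in weights.items()}
                        if total > 0 else dict(masks))   # keep old distribution if all zero
    return new_prob
```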
- (3)
Maximization step
Based on the geometrically optimized semantic probabilities, and in order to eliminate redundant canopy masks from different perspectives (i.e., cases where the same canopy is divided into multiple fragmented instances in different images), we perform hierarchical clustering of mask instances using a semantic probability distribution similarity metric.
For each pair of mask instances $(M_a, M_b)$, we calculate the association probability distribution vectors $\mathbf{v}_a$ and $\mathbf{v}_b$ over the corresponding 3D points. The cosine similarity between the two distributions is calculated as:

$$S(M_a, M_b) = \frac{\mathbf{v}_a \cdot \mathbf{v}_b}{\lVert \mathbf{v}_a \rVert \, \lVert \mathbf{v}_b \rVert} \qquad (7)$$
The principle of the hierarchical clustering algorithm is presented in Algorithm 1. The algorithm first constructs a similarity matrix $S$ and finds the mask instance pair $(M_a, M_b)$ with the highest similarity. It then uses a threshold $\tau$ to determine whether to merge the two mask instances. Compared with fixed-threshold segmentation, hierarchical clustering can adaptively handle differences in tree crown size: large tree crowns may be merged at lower similarity (to accommodate distribution diversity), while small tree crowns require strict matching.
Algorithm 1 Hierarchical clustering algorithm
Input: Similarity matrix $S$, mask instance set $\mathcal{M}$
Initialize: cluster set $\mathcal{C} \leftarrow \mathcal{M}$, merge threshold $\tau$
1: while $|\mathcal{C}| > 1$ do
2:  Find the mask instance pair $(M_a, M_b)$ with the maximum similarity
3:  $s_{\max} \leftarrow S(M_a, M_b)$
4:  if $s_{\max} \geq \tau$ then
5:   Merge $M_a$ and $M_b$ into a single instance
6:   Update $S$ and $\mathcal{C}$
7:  else
8:   break
9:  end if
10: end while
Output: Mask instance merge results $\mathcal{C}$
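A compact sketch of this greedy merging procedure is given below; the threshold value and the way merged association vectors are combined (by summation) are illustrative assumptions.

```python
# Sketch of the greedy merging in Algorithm 1: repeatedly merge the most
# similar pair of mask instances while the cosine similarity of their point
# association distributions exceeds the threshold tau.
import numpy as np

def cosine_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def merge_masks(dist_vectors, tau=0.6):
    """dist_vectors: dict {mask_id: 1D array of association probabilities over 3D points}."""
    clusters = {m: [m] for m in dist_vectors}          # each mask starts as its own cluster
    vecs = {m: np.asarray(v, dtype=float) for m, v in dist_vectors.items()}
    while len(clusters) > 1:
        ids = list(clusters)
        pairs = [(cosine_sim(vecs[a], vecs[b]), a, b)
                 for i, a in enumerate(ids) for b in ids[i + 1:]]
        s_max, a, b = max(pairs)
        if s_max < tau:                                # best pair below threshold: stop
            break
        clusters[a].extend(clusters.pop(b))            # merge b into a
        vecs[a] = vecs[a] + vecs.pop(b)                # combine association evidence
    return list(clusters.values())

if __name__ == "__main__":
    demo = {"m1": [0.9, 0.1, 0.0], "m2": [0.8, 0.2, 0.0], "m3": [0.0, 0.1, 0.9]}
    print(merge_masks(demo))
```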
- (4)
Iteration control
After the hierarchical clustering and merging, since each instance merge changes the association structure between 3D points and masks, it is necessary to dynamically evaluate the optimization process. We use Formula (8) to reallocate the association semantic probabilities between 3D points and 2D masks, and use Formula (9) to calculate the probability change $\Delta^{(t)}$. If $\Delta^{(t)}$ falls below a convergence threshold or the maximum number of iterations is reached, the iteration stops; otherwise, the algorithm returns to step (2) and continues the iterative optimization.
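A minimal sketch of this stopping rule, with an assumed convergence threshold and iteration cap, is shown below.

```python
# Sketch of the iteration-control check: total absolute change of the
# association probabilities between consecutive iterations, compared against
# an assumed convergence threshold and iteration cap.
def probability_change(prob_prev, prob_curr):
    delta = 0.0
    for pt, masks in prob_curr.items():
        prev = prob_prev.get(pt, {})
        for m, p in masks.items():
            delta += abs(p - prev.get(m, 0.0))
    return delta

def should_stop(prob_prev, prob_curr, iteration, eps=1e-3, max_iter=20):
    return probability_change(prob_prev, prob_curr) < eps or iteration >= max_iter
```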
- (5)
Instance generation
Finally, after completing multiple rounds of expectation–maximization iterative optimization, we decouple the soft probabilistic associations into hard spatial partitions, achieving semantic transfer from a continuous probability field to discrete 3D point cloud instances and ultimately outputting 3D tree canopy instance segments with spatial continuity and semantic consistency.
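The final hard assignment can be sketched as a simple maximum a posteriori decision over each point's association distribution; the mapping from masks to merged clusters is an assumed input.

```python
# Sketch of the final instance generation: each 3D point is assigned to the
# merged mask cluster with the highest posterior probability, yielding
# discrete tree-crown instances.
def generate_instances(prob, mask_to_cluster):
    """prob: {point_id: {mask_id: probability}}; mask_to_cluster: {mask_id: cluster_id}."""
    labels = {}
    for pt, masks in prob.items():
        best_mask = max(masks, key=masks.get)          # maximum a posteriori mask
        labels[pt] = mask_to_cluster.get(best_mask, best_mask)
    return labels   # point_id -> tree instance id
```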
This study integrates multi-view geometry with an EM algorithm based on semantic probability modeling to construct a semantic probability field that aligns multi-view instance masks with LiDAR point clouds. Through multi-stage iterative optimization, the proposed method achieves robust semantic transfer from 2D tree crown masks to 3D point clouds, while addressing inconsistencies in instance segmentation across different viewpoints. The algorithm first establishes initial probabilistic correspondences using SfM projections. It then introduces multi-view spatial consistency constraints—such as visibility validation and spatial proximity weighting—to dynamically refine the probability distribution. A hierarchical clustering approach based on distributional similarity is employed to adaptively merge fragmented instance masks belonging to the same tree, with convergence controlled by monitoring the evolution of the probability field. Finally, a maximum a posteriori decision rule is applied to convert the continuous semantic probability field into discrete 3D tree crown instances, thereby enabling structured semantic-to-point-cloud migration and providing a quantifiable 3D topological foundation for individual tree segmentation.
3.4. Implementation and Evaluation Metrics
All experiments in this study were conducted on a computer equipped with an NVIDIA GeForce RTX 3060 GPU (Nvidia Corporation, Santa Clara, CA, USA) and an Intel i5-12600KF CPU (Intel Corporation, Santa Clara, CA, USA). The training of the 2D single-tree instance segmentation network was performed in the Python 3.7 programming environment with the PyTorch (version 1.12.0) deep learning framework, using the AdamW optimizer (initial learning rate of 0.001). The batch size and number of epochs were set to 8 and 60, respectively, with a poly learning rate decay strategy. To mitigate the impact of limited training data and accelerate convergence, the model was pre-trained on the ImageNet dataset [39]. The pre-trained weights were then used to initialize the MSFFNet network. For the single-tree instance segmentation task in forest scenes, the number of classification categories was set to 1, and the confidence threshold was set to 0.7. Running the complete individual tree segmentation pipeline on our 40 m × 107 m and 30 m × 70 m benchmark datasets (roughly 3.2 million points) took approximately 190 min. This includes MSFFNet model training (180 min), MSFFNet inference (5 min), point cloud registration (1 min), and semantic mapping (4 min). The runtime scales approximately linearly with the area of the study region. As research code, our implementation is not optimized for minimal runtime or resource usage.
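For reproducibility, the stated optimizer and schedule can be set up as follows; the decay power and the stand-in model are assumptions, since only the optimizer, initial learning rate, batch size, epoch count, and poly decay are specified above.

```python
# Training setup matching the stated hyper-parameters (AdamW, lr 0.001, batch
# size 8, 60 epochs, poly decay); the model object and decay power are placeholders.
import torch
from torch.optim.lr_scheduler import LambdaLR

def build_optimizer_and_scheduler(model, epochs=60, base_lr=1e-3, power=0.9):
    optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr)
    # Poly decay: lr = base_lr * (1 - epoch / epochs) ** power
    scheduler = LambdaLR(optimizer, lambda e: (1 - e / epochs) ** power)
    return optimizer, scheduler

if __name__ == "__main__":
    model = torch.nn.Conv2d(3, 1, 3)          # stand-in for MSFFNet
    opt, sched = build_optimizer_and_scheduler(model)
    for epoch in range(60):
        # ... training loop over batches of size 8 ...
        opt.step()
        sched.step()
```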
To validate the effectiveness of the proposed MSFFNet network for 2D single-tree instance segmentation in complex mountainous scenes, we use overall accuracy, precision, recall, F1-score, and intersection-over-union (IoU) as evaluation metrics. The formulas are as follows:

$$\mathrm{OA} = \frac{TP + TN}{TP + TN + FP + FN}, \quad \mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN}$$

$$\mathrm{F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \quad \mathrm{IoU} = \frac{TP}{TP + FP + FN}$$

where TP, TN, FP, and FN represent true positives, true negatives, false positives, and false negatives, respectively.
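These metrics follow directly from the confusion-matrix counts, as in the short helper below.

```python
# Computing the 2D evaluation metrics from TP, TN, FP, FN counts.
def segmentation_metrics(tp, tn, fp, fn):
    oa = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return {"OA": oa, "Precision": precision, "Recall": recall, "F1": f1, "IoU": iou}

if __name__ == "__main__":
    print(segmentation_metrics(tp=850, tn=9000, fp=90, fn=60))
```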
In addition, to validate the effectiveness of the proposed cross-modal label transfer method, we use segmentation accuracy (SAC), omission error (OME), and commission error (COE) as evaluation metrics for 3D single-tree segmentation. These three metrics are widely used in the evaluation of single-tree segmentation from 3D point clouds [40]. Their calculation is based on $N_{c}$, the number of correctly segmented tree points; $N_{o}$, the number of omitted (unsegmented) tree points; $N_{e}$, the number of incorrectly segmented tree points; and $N_{gt}$, the number of ground-truth tree points.
5. Discussion
This study proposes a cross-modal semantic transfer framework for individual tree segmentation in complex forest environments, combining UAV imagery and LiDAR data to achieve 3D point cloud parsing without manual annotation. The effectiveness of the method was validated on two subtropical forest datasets, where the MSFFNet model achieved the best performance in 2D instance segmentation (IoU: 87.60%), and the semantic transfer framework significantly outperformed existing methods in 3D point cloud segmentation (SAC: 0.92, OME: 0.18, COE: 0.15). These results demonstrate the framework's ability to address key challenges in forestry remote sensing, such as high annotation costs and cross-domain generalization limitations. Its outstanding performance stems from the synergy of three core components: (1) MSFFNet integrates hierarchical features from RGB and CHM data through deformable alignment and cross-modal interaction (the MLFF module), enhancing boundary discrimination in dense canopies; an ablation study confirmed its critical role, as removing the MLFF module caused a notable IoU decrease. (2) The two-stage registration (TCF + ICP) resolves spatial inconsistencies between MPC and LiDAR point clouds, reducing alignment errors from meter-level to sub-pixel accuracy. This ensures geometric consistency in cross-modal mapping, which is particularly important for understory trees where LiDAR returns are sparse. (3) The semantic probability association combines multi-view geometry with EM optimization to address segmentation inconsistencies across viewpoints. By modeling spatial proximity and visibility constraints, the framework achieves robust label fusion, reducing omission errors by 31% compared to PDE-Net.
The proposed framework has practical value for forest management. By transferring 2D semantic information into 3D point clouds, it enables efficient extraction of tree structural parameters (e.g., crown volume, tree height) for biodiversity monitoring, timber yield estimation, carbon stock assessment, and ecological interaction analysis. Its runtime (roughly 190 min for 3.2 million points) scales linearly with area, indicating potential for large-scale monitoring.
However, several limitations remain. In crown instance segmentation, challenges such as crown overlap, texture similarity among tree species, forest type variations, and differing climatic conditions still prevent MSFFNet from extracting some smaller crowns. The method also does not handle understory tree segmentation, a persistent challenge for 2D image-based forest surveys. In point cloud registration, segmentation accuracy depends on registration precision; while end-to-end deep learning approaches could avoid this dependency, they incur much higher computational costs and require large annotated forest point cloud datasets, which are currently scarce. Furthermore, in scenarios involving diverse sensor types and point cloud generation methods, unifying coordinate systems remains an open research topic.
The proposed 2D–3D semantic mapping strategy efficiently transfers per-crown instance semantics into 3D space, avoiding the heavy annotation burden required by deep learning-based approaches. Nevertheless, the method has not yet been tested in deciduous or tropical broadleaf forests with seasonal foliage variation. Future work will focus on two directions: (1) Optimizing runtime through lightweight network design and parallel computing; (2) Integrating topological constraints to improve segmentation of overlapping crowns.