1. Introduction
The accelerated digital transformation of the construction industry has positioned Building Information Modeling (BIM) as a central infrastructure element for the full life-cycle management of buildings [
1]. During the construction phase, BIM is expected to serve as a digital twin of the site, enabling real-time progress monitoring, quality inspection, and safety analysis [
1]. However, the inherent dynamism of construction environments and the limitations of manual measurement often impede the synchronization of BIM models with as-built site conditions, resulting in a persistent “virtual–real discrepancy” [
2,
3].
Three-dimensional (3D) point cloud scanning offers rich geometric and spatial information [
4,
5], but construction-site point clouds often suffer from severe occlusion, clutter, varying point density, and noise [
6]. These limitations lead to blurred boundaries, missing data, and structural ambiguity, making reliable semantic segmentation particularly challenging [
7].
Deep learning has advanced point cloud semantic segmentation with architectures such as PointNet [
8], PointNet++ [
9], DGCNN [
10], and PCT [
11]. However, their performance in real construction scenes is constrained by the scarcity of large-scale annotated 3D datasets [
12]. Public datasets such as BIM-Net [
13] contain only a few dozen files and about 90 million points, far from sufficient for training generalized models.
To overcome this limitation, we propose a BIM-guided Virtual-to-Real (V2R) framework for component-level point cloud semantic segmentation, relying exclusively on BIM-generated supervision. Leveraging the HELIOS++ [
14] simulator, we curated an extensive synthetic point cloud (SPC) dataset comprising 132 virtual scans and several billion points, surpassing the scale of existing BIM-derived datasets by nearly two orders of magnitude. The dataset (~200 GB) includes component-level labels and spans diverse scanning settings, viewpoints, and densities.
Domain gaps between synthetic and real point clouds—arising from noise characteristics, point sparsity, occlusion, and surface reflectance—are addressed through both data-level and feature-level adaptation. A learnable PointAugment module [
15] injects realistic distortions and occlusions into virtual scans, while channel normalization and weighted residual fusion enhance cross-domain feature alignment.
The segmentation model adopts a dual-branch fusion architecture: PointNet++ captures local geometric patterns, and PCT extracts global semantic relationships. This integration enables accurate boundary preservation and enhances robustness in geometrically complex areas.
This study aims to achieve zero-real-annotation component-level segmentation of real construction site point clouds without any target-scene fine-tuning. Experiments on a representative high-rise residential floor and the multi-scene BIM-Net benchmark demonstrate that the proposed framework achieves strong performance—including 70.89% overall accuracy, 53.14% mean IoU, 69.67% mean accuracy, 54.75% FWIoU, and a 59.66% Cohen’s Kappa—along with high per-component IoU. These results, spanning both self-acquired site scans and a public benchmark, indicate stable BIM-to-reality transfer across different scanning conditions and scene layouts within similar residential contexts. Confusion matrix analyses show clear differentiation among walls, beams, floors, and slabs, with residual confusion mainly in the “other” class due to missing auxiliary components in virtual scenes.
Unlike prior BIM-derived segmentation studies that still rely on real annotations or fine-tuning on target scans, this work formulates an annotation-free BIM-to-reality transfer pipeline and demonstrates that a model trained only on procedurally generated BIM-based SPC can achieve stable component-level segmentation on multiple real construction floors. The key enabler is a coupled design of scalable BIM-driven synthetic supervision (SPC) and dual-level domain bridging (learnable distortion plus feature alignment) integrated into a global–local fusion network, which collectively reduces the synthetic–real discrepancy without any real labels.
The main contributions of this work are as follows:
- (i)
Annotation-free BIM-to-real formulation. We explicitly formulate a zero-real-annotation component-level segmentation setting for construction point clouds and provide an end-to-end BIM→SPC→model→real inference pipeline, avoiding any real-scene fine-tuning or pseudo-label bootstrapping.
- (ii)
Scalable BIM-driven synthetic supervision. We build a large-scale, reproducible BIM-derived SPC training corpus (132 scans, several billion points) via HELIOS++ [
14] with multi-parameter virtual scanning, enabling controllable density/viewpoint variations and component-level labels at negligible manual cost.
- (iii)
Dual-level domain bridging via global–local fusion. We propose a global–local Fusion network integrating PointNet++ and PCT, featuring learnable point cloud distortion (PointAugment) and feature-level alignment (channel normalization and weighted residual fusion). Its performance is validated on real-world scenes and the BIM-Net benchmark, where it consistently outperforms competitive baselines.
3. Methodology
At a high level, the proposed method follows a simple virtual-to-real workflow. First, a Building Information Model (BIM) is constructed to provide accurate geometric and semantic descriptions of building components. Second, the BIM model is used to generate large-scale synthetic point clouds through virtual LiDAR scanning with HELIOS++ [
14], producing fully labeled training data without manual annotation. Third, the synthetic point clouds are partitioned into local blocks using a sliding-window strategy and used to train a Fusion segmentation network that combines local geometric features and global contextual information. Finally, the trained model is directly applied to real-world LiDAR scans for component-level semantic segmentation. This enables the semantic knowledge learned from virtual BIM-derived data to be effectively transferred to real construction scenes.
3.1. Overall Framework
This study develops a complete workflow covering data generation, model training, and result evaluation to support the task of semantic understanding from BIM to real point clouds. As shown in
Figure 1, the framework consists of three main stages: BIM model conversion and synthetic point cloud construction, deep model training, and semantic prediction on real point clouds.
In the data preparation stage, a Building Information Model is first created in Autodesk Revit, where components such as walls/columns, beams, slabs, and ceilings are assigned unified class codes and exported as
OBJ files. HELIOS++ [
14] is then used to perform multi-parameter virtual LiDAR scanning under different acquisition settings to generate semantic point clouds of varying densities. Each scanned point is automatically associated with a component-level semantic label according to its corresponding BIM instance. Subsequently, a unified preprocessing pipeline is applied, including format conversion, invalid point removal, and sliding-window partitioning, producing
.npy data directly usable for training.
In the model training stage, the SPCs are used to train a segmentation network that fuses local geometric features with global semantic information. This stage focuses on network design, augmentation strategies, and optimization for training stability. Details of the model branches, fusion mechanisms, and loss functions are presented in subsequent sections and are not elaborated here.
After training, the model is applied to real LiDAR scans for component-level semantic prediction. The evaluation examines overall accuracy (OA), mean Intersection-over-Union (mIoU), frequency-weighted IoU (FWIoU), Cohen’s Kappa, and other quantitative metrics, and uses visualizations to assess model performance across different regions and component types.
Overall, the workflow integrates BIM geometry and semantic information with virtual scanning technology and deep segmentation models, forming a complete pipeline tailored for real construction environments with strong reproducibility and engineering applicability.
3.2. Data Preparation Stage
An overview of the entire data preparation pipeline is illustrated in
Figure 2. The workflow employs a sequential, modular architecture that encompasses BIM creation, automated viewpoint planning, and virtual LiDAR scanning, followed by synthetic data generation and sliding-window partitioning to prepare datasets for subsequent training and evaluation. This figure provides a global reference for the individual steps detailed in the following subsections.
This stage is designed to bridge BIM models and real construction point clouds in a controlled manner by using BIM as a source of complete geometric and semantic priors instead of relying on limited real annotations. Virtual LiDAR scanning enables systematic control of point density, viewpoints, and coverage, while automated viewpoint planning ensures consistent visibility across scenes. Sliding-window partitioning with overlap is adopted to stabilize local geometric learning and reduce the impact of density imbalance in large-scale scenes. Together, these choices yield a scalable and domain-relevant synthetic training corpus, allowing the segmentation model to focus on structural semantics and supporting annotation-free virtual-to-real transfer.
This study refers to the typical building model in the public BIM-Net dataset [
13] and uses a high-rise residential building in Guangzhou as the prototype. A complete BIM was created in Autodesk Revit, where the primary components (walls/columns, beams, slabs, ceilings, etc.) were assigned unified class codes and semantic labels. The geometric structure and component configuration were based on actual construction blueprints to ensure the virtual scene accurately represents the complexity and spatial characteristics of the real building.
Following the BIM-Net convention, different floors or structural zones extracted from BIM models are assigned unique scene identifiers (e.g., 1px, 7y3, ac2, vvo), each corresponding to a distinct building–floor combination. As summarized in
Table 1, the proposed framework is trained exclusively on 132 synthetic SPC scenes generated from 22 BIM models under six virtual scanning configurations, while a subset of 16 real-world BIM-Net scenes is reserved solely for validation and testing. All real scenes originate from different buildings and floors, with no structural overlap between scenes, ensuring a clear separation between training and evaluation data and supporting reproducible cross-scene performance assessment.
In addition, the virtual LiDAR scanner stations were not manually specified. Instead, their locations (visualized as green points in
Figure 3) were generated using an automated viewpoint planning strategy inspired by the VF-Plan framework proposed by Xiong et al. [
48]. This method evaluates geometric visibility, occlusion, and coverage requirements to produce an optimized set of scanner poses. The automatically planned stations ensure that the synthetic LiDAR captures the building components with coverage and completeness comparable to practical terrestrial laser scanning.
In the virtual scanning phase, ground-based LiDAR simulation was implemented using the HELIOS++ platform. The scanners were evenly arranged according to the building layout to capture comprehensive and complementary point cloud data. To increase the diversity of the SPC and simulate different equipment and acquisition conditions, six typical parameter combinations were designed based on Zou et al.’s study [
49] on the balance between scanning efficiency and point cloud density in construction scenarios (
Table 2). By adjusting the pulse frequency and scanning frequency, the point cloud density and resolution were controlled. Other parameters (scan angle, pitch angle range, and rotation speed) were kept constant to ensure the experiment’s controllability and reproducibility. Following the “optimal density range” principle outlined in the literature, we selected a point density range close to the best density for inclusion in the training set, ensuring the expression of geometric details while avoiding redundant sampling and thereby improving the model’s generalization across different density scenarios.
The raw point clouds output by HELIOS++ are stored in .XYZ format, containing spatial coordinates, reflectance intensity, and component IDs. Semantic labels are automatically generated based on BIM category information, mapping components such as walls/columns, beams, ceilings, floors, and others to labels 1–5, enabling point-level semantic annotation without manual intervention.
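A minimal sketch of this automatic labeling step is shown below; the .XYZ column layout and the component-ID lookup table are hypothetical stand-ins for the actual Revit/HELIOS++ export, not the paper's exact implementation:

```python
import numpy as np

# Illustrative mapping from BIM component categories to label IDs 1-5;
# the real component-ID -> category lookup comes from the Revit export.
CATEGORY_TO_LABEL = {
    "wall_column": 1,
    "beam": 2,
    "ceiling": 3,
    "floor": 4,
    "other": 5,
}

def label_points(xyz_rows, component_to_category):
    """xyz_rows: (N, 5) array of [x, y, z, intensity, component_id].
    Returns an (N, 6) array with a semantic label appended;
    unknown component IDs fall back to the 'other' class."""
    labels = np.array([
        CATEGORY_TO_LABEL[component_to_category.get(int(cid), "other")]
        for cid in xyz_rows[:, 4]
    ], dtype=np.int64)
    return np.column_stack([xyz_rows, labels])
```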
Compared with the BIM-Net dataset, which contains just over twenty files and 90.4 million points in total, our synthetic dataset is significantly larger in scale. In total, 132 SPC files were generated, comprising several billion points—nearly two orders of magnitude more than BIM-Net—with each file containing tens of millions of points on average and an overall data volume of roughly 200 GB. Such a large-scale dataset provides a far more comprehensive representation of structural geometry and ensures sufficient diversity and density to support high-capacity deep learning models and reliable cross-scene generalization studies.
To help the model learn more stable geometric structures at the local scale, both virtual and real point clouds undergo spatial partitioning using a sliding-window approach during a unified preprocessing stage. The partitioning starts at the corner of the scene’s bounding box, with each window covering a fixed 2D projection area. The window slides along the X-axis with a stride of half the window width, achieving approximately 50% spatial overlap.
Figure 4 shows a top-down view of this partitioning approach: the left side shows the coverage of a single window, while the right side illustrates the overlapping relationship between two adjacent windows. Blue points represent the original point cloud, the light-colored rectangles indicate the corresponding window ranges, and black or dashed borders indicate the window positions.
This overlapping local partitioning approach helps maintain geometric continuity across blocks, improving the stability of boundary-region samples and mitigating class imbalance caused by uneven point density. The sliced data are normalized into a uniform array structure, facilitating batch loading and model training. During inference, predictions are first obtained independently for each sliding window. For points that appear in multiple overlapping windows, the final semantic label is determined by aggregating the predicted class probabilities across all corresponding windows and assigning the class with the highest averaged probability. This simple probability-based fusion strategy ensures consistent predictions in overlapping regions and avoids boundary artifacts when reconstructing the full-scene segmentation.
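The sliding-window partitioning and the probability-based overlap fusion described above can be sketched as follows (a simplified NumPy version; the window size, stride, and data structures are illustrative assumptions):

```python
import numpy as np

def window_origins(points, window=2.0, stride=1.0):
    """Return the XY origins of sliding windows covering the scene's
    bounding box (stride = window/2 gives ~50% overlap)."""
    x0, y0 = points[:, 0].min(), points[:, 1].min()
    x1, y1 = points[:, 0].max(), points[:, 1].max()
    xs = np.arange(x0, x1 + 1e-9, stride)
    ys = np.arange(y0, y1 + 1e-9, stride)
    return [(x, y) for x in xs for y in ys]

def fuse_window_probs(num_points, window_results, num_classes):
    """Average per-point class probabilities over all windows containing
    the point, then take the argmax. window_results is a list of
    (point_indices, per-point probability array) pairs."""
    acc = np.zeros((num_points, num_classes))
    cnt = np.zeros(num_points)
    for idx, probs in window_results:
        acc[idx] += probs
        cnt[idx] += 1
    acc[cnt > 0] /= cnt[cnt > 0, None]
    return acc.argmax(axis=1)
```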
3.3. Model Architecture Design
There are significant differences between SPC and real scanned point clouds in terms of noise levels, point density distribution, and occlusion conditions, which directly cause a shift in their semantic spaces. Existing point cloud segmentation models trained under fully supervised or single-domain settings are often insufficient to handle such discrepancies, as they tend to overfit domain-specific feature distributions and exhibit limited robustness under cross-domain deployment. To address this issue, we propose a segmentation framework driven by both local geometric encoding and global semantic modeling, motivated by the observation that purely local or purely global representations alone cannot adequately compensate for the combined effects of noise, density variation, and structural occlusion encountered in real construction scenes. The framework aims to maintain stable recognition performance in real construction environments while relying entirely on BIM-generated virtual data for training.
The overall architecture of the model is shown in
Figure 1, consisting of two branches: PointNet++ and PCT. PointNet++ extracts local geometric features through hierarchical sampling and neighborhood aggregation mechanisms, capturing the boundary structures, contact relationships, and fine-scale forms of components. These features exhibit high consistency between virtual and real point clouds. Meanwhile, the PCT branch, centered around the self-attention mechanism, establishes long-range point dependencies through multi-head feature interaction, forming a global understanding of the building components’ overall layout and topological relationships. This global semantic modeling plays a crucial role in resolving feature breaks caused by varying scan densities, viewpoint changes, and local occlusions.
During the feature fusion phase, the network aligns the outputs of both branches into a unified semantic space and adjusts the relative contributions of local and global features through channel normalization and weighted residual connections. This mechanism adaptively mitigates distribution differences between virtual and real point clouds, preventing one branch from being overemphasized during training and reducing the accumulation of cross-domain bias. The fused features are then mapped linearly, and the classification head outputs point-level semantic labels for component recognition in real-world scenes.
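A minimal NumPy sketch of the channel-normalization and weighted residual fusion idea is given below; the fixed scalar weight `alpha` and the residual back to the local branch are simplifying assumptions, since the paper's actual fusion weights are learned:

```python
import numpy as np

def channel_norm(feat, eps=1e-5):
    """Normalize each feature channel to zero mean / unit variance over
    the points of a block (a simple stand-in for the paper's channel
    normalization)."""
    mu = feat.mean(axis=0, keepdims=True)
    sigma = feat.std(axis=0, keepdims=True)
    return (feat - mu) / (sigma + eps)

def weighted_residual_fusion(local_feat, global_feat, alpha=0.5):
    """Fuse local (PointNet++) and global (PCT) features; alpha balances
    the two branches, and a residual connection preserves the local
    geometric signal."""
    l = channel_norm(local_feat)
    g = channel_norm(global_feat)
    fused = alpha * l + (1.0 - alpha) * g
    return fused + l  # residual back to the local branch (assumption)
```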
Let $F \in \mathbb{R}^{B \times N \times 2E}$ denote the concatenated fusion feature, with $F = [F_{\mathrm{local}}; F_{\mathrm{global}}]$ and $F_{\mathrm{local}}, F_{\mathrm{global}} \in \mathbb{R}^{B \times N \times E}$. A lightweight coarse classifier produces point-wise coarse logits $Z \in \mathbb{R}^{B \times N \times C}$, which are converted to coarse probabilities $P = \mathrm{softmax}(Z / \tau)$. Here, $B$ denotes the batch size, $N$ the number of points per sliding window, $E$ the feature dimension of each branch, $C$ the number of semantic classes, and $\tau$ a temperature parameter.
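The conversion from coarse logits to temperature-scaled probabilities can be written compactly; this is a generic softmax-with-temperature sketch, not the paper's exact classifier head:

```python
import numpy as np

def temperature_softmax(logits, tau=1.0):
    """Convert logits of shape (B, N, C) into probabilities
    P = softmax(Z / tau); smaller tau sharpens the distribution."""
    z = logits / tau
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)
```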
By combining the local geometric stability of PointNet++ with the global semantic consistency of PCT, this structure establishes a transferable feature representation between virtual training data and real application scenarios. This enables the model to maintain strong cross-domain generalization ability, even when real annotations are not available.
3.4. Training and Optimization Strategies
The model training employs distributed multi-GPU parallelism and automatic mixed-precision computation to enhance efficiency while ensuring numerical stability. The AdamW optimizer is used, combined with OneCycleLR learning rate scheduling, and the overall training process follows the setup of mainstream deep learning frameworks. To alleviate training bias caused by class imbalance in the building point clouds, we introduce a class-weighting mechanism during the data sampling phase and use an Exponential Moving Average (EMA) of the model parameters during optimization to smooth parameter updates, thereby improving convergence stability and generalization.
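A minimal sketch of the EMA parameter smoothing follows; the decay value 0.999 is an assumed default, as the paper does not state it here, and the dictionary-of-scalars parameter format is for illustration only:

```python
class EMA:
    """Exponential moving average of model parameters:
    shadow <- decay * shadow + (1 - decay) * current."""

    def __init__(self, params, decay=0.999):
        self.decay = decay
        # Initialize shadow weights from the current parameters.
        self.shadow = {k: float(v) for k, v in params.items()}

    def update(self, params):
        d = self.decay
        for k, v in params.items():
            self.shadow[k] = d * self.shadow[k] + (1.0 - d) * float(v)
```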
For the loss function, we adopt a combination of cross-entropy and Focal Loss to balance classification stability and the ability to identify hard samples. The standard cross-entropy loss $\mathcal{L}_{\mathrm{CE}}$ is defined as:

$$\mathcal{L}_{\mathrm{CE}} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c} \log p_{i,c}$$

To enhance the model’s focus on low-confidence samples, Focal Loss is introduced:

$$\mathcal{L}_{\mathrm{FL}} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c} \,(1 - p_{i,c})^{\gamma} \log p_{i,c}$$

where the focusing parameter $\gamma$ has shown stable performance across multiple experiments. The final loss is defined as:

$$\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \lambda \,\mathcal{L}_{\mathrm{FL}}$$

where $\lambda$ balances the two terms.
This combination maintains good classification performance under noise interference and sample scarcity conditions.
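Under the assumption that the two losses are simply summed with a balancing weight (the paper's exact combination rule and parameter values are not spelled out here), the joint objective can be sketched as:

```python
import numpy as np

def ce_focal_loss(probs, labels, gamma=2.0, lam=1.0, eps=1e-8):
    """Combined cross-entropy + Focal Loss on per-point class
    probabilities. probs: (N, C), labels: (N,) integer class ids.
    gamma and lam are illustrative values, not the paper's settings."""
    p_true = probs[np.arange(len(labels)), labels]
    ce = -np.log(p_true + eps)                   # cross-entropy term
    focal = ((1.0 - p_true) ** gamma) * ce       # down-weight easy points
    return (ce + lam * focal).mean()
```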
In terms of data augmentation, in addition to conventional geometric perturbations such as random rotation, scaling, translation, and jittering, we further introduce an adaptive augmentation network based on PointAugment. This network uses a joint constraint of a geometric consistency loss and a prediction consistency loss to adaptively generate samples that better match the feature distribution of real scans, significantly enhancing domain adaptation between virtual and real point clouds. To increase the model’s robustness to local structures, we also introduce a probabilistic spatial occlusion mechanism (occlusion ratio of 0.2) during training to prevent the model from over-relying on prominent region features.
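The probabilistic spatial occlusion can be sketched as below; the region-selection rule (dropping the points nearest a randomly chosen seed) is an assumption, since the paper does not specify how occluded regions are shaped:

```python
import numpy as np

def random_spatial_occlusion(points, ratio=0.2, rng=None):
    """Drop a contiguous spatial region containing roughly `ratio` of
    the points, simulating local occlusion during training."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(points)
    k = int(n * ratio)
    if k == 0:
        return points
    seed = points[rng.integers(n), :3]                 # random seed point
    dist = np.linalg.norm(points[:, :3] - seed, axis=1)
    drop = np.argsort(dist)[:k]                        # k nearest neighbors
    keep = np.setdiff1d(np.arange(n), drop)
    return points[keep]
```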
For completeness, the main training hyperparameters are summarized here, while additional low-level implementation details are provided in the
Appendix A and
Appendix B.
For hyperparameter settings, the initial learning rate and weight decay are set as detailed in Appendix A; the batch size is 8, gradient accumulation uses 4 steps, training runs for 100 epochs with a warm-up phase over the first 15 epochs, the maximum number of points per sample is 8192, and coordinate normalization is enabled. With the combined effect of these optimization strategies, the model achieves stable convergence and a significant cross-domain segmentation performance improvement across multiple scenarios.
3.5. Model Validation and Performance Analysis
3.5.1. Visualization and Qualitative Analysis
During the inference phase, the preprocessed test samples (.npy) are input into the trained Fusion model for component-level semantic prediction. The network performs forward propagation on the feature vector of each point, outputs class probabilities, and applies Softmax normalization. The final label is obtained by taking the argmax of the probabilities, resulting in a segmentation output that aligns with the spatial arrangement of the input point cloud. All inference is conducted under fixed hardware and random-seed conditions, without additional data augmentation or post-processing, ensuring objective and reproducible evaluation.
To visually present the recognition results, a unified color mapping is applied to five target categories: walls/columns (blue), beams (dark green), ceilings (green), floors (orange), and others (red). The predicted results are then projected back into 3D space and compared with the ground truth labels. Additionally, key regions, such as beam-column connections, wall-panel junctions, floor boundaries, and wall corners, are selected for local zoomed-in visualizations to examine boundary continuity, semantic consistency, and detail fidelity under complex geometries and occlusion conditions.
3.5.2. Quantitative Evaluation Metrics
To comprehensively evaluate the overall performance of the models in the multi-class semantic recognition task, five commonly used metrics are selected for comparison: OA, mIoU, mean Accuracy (mAcc), FWIoU, and Cohen’s Kappa coefficient ($\kappa$). Let $C$ denote the number of categories, and let $TP_i$, $FP_i$, and $FN_i$ represent the true positives, false positives, and false negatives of class $i$, respectively. The number of samples in class $i$ is $n_i$, and the total number of samples is $N$. The metrics are defined as follows.
Overall Accuracy (OA) is given by:

$$\mathrm{OA} = \frac{\sum_{i=1}^{C} TP_i}{N}$$
The Intersection-over-Union (IoU) for class $i$ and the mean IoU (mIoU) are defined as:

$$\mathrm{IoU}_i = \frac{TP_i}{TP_i + FP_i + FN_i}, \qquad \mathrm{mIoU} = \frac{1}{C}\sum_{i=1}^{C} \mathrm{IoU}_i$$
Mean Accuracy (mAcc) measures the average recall across classes and is unaffected by category-frequency imbalance. It is defined as:

$$\mathrm{mAcc} = \frac{1}{C}\sum_{i=1}^{C} R_i, \qquad R_i = \frac{TP_i}{TP_i + FN_i}$$

where $R_i$ denotes the recall of class $i$. mAcc reflects the model’s ability to recognize fine-grained categories and is particularly suitable for building point clouds, where category imbalance and long-tail classes are common.
Considering class imbalance, the Frequency-Weighted IoU (FWIoU) is defined as:

$$\mathrm{FWIoU} = \sum_{i=1}^{C} \frac{n_i}{N}\,\mathrm{IoU}_i$$
To quantify the agreement between predictions and ground truth, Cohen’s Kappa coefficient is further adopted:

$$\kappa = \frac{p_o - p_e}{1 - p_e}$$

where $p_o$ denotes the observed agreement (i.e., OA) and $p_e$ denotes the expected chance agreement derived from the marginal distributions of predicted and true labels.
For consistent visual comparison across metrics, all evaluation indicators are normalized to the range $[0, 1]$ using max normalization:

$$\hat{v} = \frac{v}{v_{\max}}$$

where $v$ is the original metric value and $\hat{v}$ is the normalized result. This normalization is used solely for visualization and does not affect the original metric calculations or evaluation conclusions.
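All five metrics can be computed directly from a confusion matrix; the following sketch follows the standard definitions above:

```python
import numpy as np

def segmentation_metrics(cm):
    """Compute OA, mIoU, mAcc, FWIoU, and Cohen's kappa from a C x C
    confusion matrix with cm[i, j] = #points of true class i predicted
    as class j."""
    cm = cm.astype(float)
    total = cm.sum()
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    iou = tp / (tp + fp + fn)
    freq = cm.sum(axis=1) / total            # per-class frequency n_i / N
    oa = tp.sum() / total
    p_e = (cm.sum(axis=1) * cm.sum(axis=0)).sum() / total ** 2
    return {
        "OA": oa,
        "mIoU": iou.mean(),
        "mAcc": (tp / cm.sum(axis=1)).mean(),
        "FWIoU": (freq * iou).sum(),
        "kappa": (oa - p_e) / (1.0 - p_e),
    }
```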
Based on the above evaluation framework, we conduct systematic comparisons of multiple network models across multi-parameter scanning settings and multi-floor structural scenarios, enabling a comprehensive analysis of their overall recognition performance, cross-domain generalization, and robustness.
3.5.3. Component-Specific Evaluation Metrics
Building on the overall performance metrics, this study further analyzes the model’s recognition ability at the component level to reveal performance differences across categories. For the five components—walls/columns (Wall/column), beams (Beam), ceilings (Ceiling), floors (Floor), and others (Other)—the IoU of each category is calculated. The definition of component-level IoU is consistent with $\mathrm{IoU}_i$ in Equation (6):

$$\mathrm{IoU}_i = \frac{TP_i}{TP_i + FP_i + FN_i}$$
To facilitate a horizontal comparison of model performance across component categories, we apply the same max normalization as in Equation (11) to rescale the component-level IoU:

$$\widehat{\mathrm{IoU}}_i = \frac{\mathrm{IoU}_i}{\mathrm{IoU}_{\max}}$$
Compared to the overall mIoU, the component-level IoU more directly reflects the model’s robustness across different structural features. In real-world construction scenarios, planar components (e.g., floors, ceilings) with regular shapes and high point cloud density are typically easier to recognize, while elongated components (e.g., beams, columns) are more prone to misclassification due to occlusion, blurred edges, and scale variation. Analyzing component-level metrics not only reveals the advantages and disadvantages of different feature modeling mechanisms (such as global Transformer and local geometric convolution) across component categories but also helps verify the effectiveness of the BIM-guided fusion strategy proposed in this study for recognizing complex geometric structures.
3.5.4. Classification Error and Confusion Analysis
To further analyze the recognition bias and confusion between component categories, we construct a confusion matrix $M \in \mathbb{N}^{C \times C}$, where $M_{ij}$ denotes the number of points from true class $i$ that are predicted as class $j$. By normalizing the matrix along rows and columns, we obtain the recall and precision distributions for each class, which are used to evaluate the model’s within-class recognition and inter-class differentiation.
Furthermore, to quantify the intensity of class-level confusion, we compute the correlation matrix based on the normalized confusion matrix. This matrix evaluates the systematic confusion relationships between different categories. It reflects the statistical correlation of categories in prediction error patterns and helps identify clusters of classes that are frequently confused with each other. This analysis reveals typical sources of errors when the model encounters spatially similar structures, closely related geometric features, or uneven point cloud density.
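A compact NumPy sketch of this analysis is given below; computing the correlation matrix via `np.corrcoef` over the row-normalized error patterns is one plausible realization of the paper's correlation analysis, not necessarily its exact formulation:

```python
import numpy as np

def confusion_analysis(cm):
    """Row-normalize (recall) and column-normalize (precision) the
    confusion matrix, and compute a class-correlation matrix from the
    row-normalized error patterns."""
    cm = cm.astype(float)
    recall = cm / cm.sum(axis=1, keepdims=True)      # rows sum to 1
    precision = cm / cm.sum(axis=0, keepdims=True)   # columns sum to 1
    corr = np.corrcoef(recall)  # correlation between per-class patterns
    return recall, precision, corr
```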
Confusion analysis provides an intuitive understanding of the level of confusion between component categories. In building point clouds, differences in morphology and spatial distribution between components are significant. Planar components, such as floors and ceilings, have regular geometry, while elongated components like beams and columns are complex in shape and have blurry boundaries, often leading to misclassification when neighborhood features are unclear. Additionally, occlusion, noise, and local feature degradation further exacerbate inter-class confusion.
In summary, by combining the analysis of the confusion matrix and the correlation matrix, we can systematically uncover the error patterns between specific categories. This provides a foundation for subsequent feature optimization, spatial relationship constraints, or hierarchical semantic prior designs. By integrating BIM topological structure information, we can further explore reducing inter-class confusion using the spatial dependencies between components, thereby improving the consistency and robustness of segmentation results.
4. Experimental Results and Analysis
4.1. Experimental Design and Ablation Rationale
Although the proposed Fusion model integrates multiple feature modeling mechanisms, its experimental evaluation inherently provides a clear module-level ablation analysis. Specifically, the Fusion architecture is constructed by combining two representative paradigms in point cloud semantic segmentation: local geometric modeling (PointNet++) and global contextual modeling (PCT).
Accordingly, reporting the performance of PointNet++, PCT, and their fused variant under an identical training and inference configuration already constitutes a core architectural ablation. This comparison directly reflects three complementary settings: local-only modeling, global-only modeling, and global–local cooperative modeling. By analyzing the performance differences among these settings, the contribution of the fusion design can be quantitatively isolated without introducing additional artificial ablation variants.
In addition, PointAugment [
15] is not treated as a novel algorithmic module in this study, but as a standardized data augmentation strategy [
50,
51]. To ensure fair comparison, it is consistently applied to all baseline methods and model variants, rather than being selectively enabled.
The effectiveness of PointAugment has been systematically validated in prior work and widely adopted in recent point cloud segmentation studies. Therefore, it is regarded as part of the unified experimental setup rather than an independent variable requiring separate ablation. This design choice allows the experimental analysis to focus on the architectural contribution of the proposed Fusion framework itself.
4.2. Component Recognition Results
To visually assess the semantic segmentation performance of each model in real construction scenarios, we selected typical floor point cloud segments and compared the prediction results of PointNet [
8], PointNet++ [
9], DGCNN [
10], PCT [
11], and the proposed Fusion model, against manually labeled ground truth (GT).
Figure 5 shows the input point cloud, the outputs of each model, and zoomed-in regions for visualization. The color map corresponds to five component categories: Wall/column, Beam, Ceiling, Floor, and Other.
From overall appearance to local details, the performance differences across models in component boundaries, complex geometric regions, and occlusion areas are evident. Traditional models like PointNet and DGCNN still exhibit large misclassifications or noisy spots even in relatively simple ceiling and floor regions, indicating limited stability in scenarios with variations in point density and simpler spatial structures. PCT improves semantic consistency in large-scale continuous regions, but the distinction between “wall/column-other” categories remains unclear, with some wall-adjacent objects or noise points erroneously classified into the wall region, leading to fuzzy local boundaries.
The relatively best-performing PointNet++ demonstrates stable recognition ability for major structural components but still exhibits some misclassification and under-segmentation in complex ceiling-floor junctions and the Other category, particularly in areas with large viewpoint changes or sparse point clouds.
In contrast, the Fusion model exhibits more stable boundary continuity and semantic consistency in most scenes. In regions such as beam-column connections, wall-panel intersections, and complex ceiling structures, Fusion’s predictions are often closer to the ground truth, with a noticeable reduction in noise points. In real construction sites, where non-structural components (e.g., furniture, curtains, temporary material piles) are abundant, the Fusion model also shows better recall for the Other category and handles common cross-domain issues in real point clouds, such as uneven density, occlusion, and reflectance changes. This indicates that the collaborative mechanism between the global Transformer and local PointNet++ branches provides strong domain adaptation capability for complex components and dynamic environments.
The above results indicate that the architecture combining global and local features helps alleviate the inter-class confusion issues caused by BIM virtual data training, resulting in more stable component recognition in real-world scenarios.
4.3. Overall Performance Metrics Comparison
On the complete dataset, the proposed Fusion model achieves an overall accuracy of 70.89%, a mean IoU of 53.14%, a mean accuracy of 69.67%, a FWIoU of 54.75%, and a Cohen's κ of 59.66%. These results demonstrate the model's strong recognition capability, class separability, and robustness under diverse scanning conditions.
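All five metrics can be derived from a single class confusion matrix. The following sketch (illustrative only; the paper's exact implementation is not shown) computes them with NumPy, using `conf[i, j]` to count points of ground-truth class `i` predicted as class `j`:

```python
import numpy as np

def segmentation_metrics(conf):
    """Compute OA, mIoU, mAcc, FWIoU, and Cohen's kappa from a confusion
    matrix `conf` (rows = ground truth, columns = prediction)."""
    conf = conf.astype(np.float64)
    total = conf.sum()
    tp = np.diag(conf)                 # correctly classified points per class
    gt = conf.sum(axis=1)              # ground-truth points per class
    pred = conf.sum(axis=0)            # predicted points per class
    union = gt + pred - tp

    oa = tp.sum() / total                               # overall accuracy
    iou = tp / np.maximum(union, 1e-12)                 # per-class IoU
    miou = iou.mean()                                   # mean IoU
    macc = (tp / np.maximum(gt, 1e-12)).mean()          # mean class accuracy
    fwiou = ((gt / total) * iou).sum()                  # frequency-weighted IoU
    pe = (gt * pred).sum() / total**2                   # chance agreement
    kappa = (oa - pe) / (1.0 - pe)                      # Cohen's kappa
    return oa, miou, macc, fwiou, kappa
```

For a balanced two-class example such as `conf = [[8, 2], [1, 9]]`, OA and mAcc both equal 0.85 and κ equals 0.70.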
Figure 6 further provides a detailed comparison of the five normalized performance metrics (OA, mIoU, mAcc, FWIoU, and κ) across 16 scanning scenes for the five representative semantic segmentation models: PointNet, PointNet++, DGCNN, PCT, and the proposed Fusion model. All metrics are normalized following Equation (11), enabling consistent comparison across scenes and models.
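A minimal sketch of this per-scene, per-metric normalization and of counting leading entries, under the assumption that Equation (11) divides each scene–metric entry by the best value across models (the exact form used in the paper may differ):

```python
import numpy as np

def max_normalize(scores):
    """Max-normalize a (models, scenes, metrics) array so that the best
    model in each scene-metric cell maps to 1.0 (cf. Equation (11);
    this assumed form may differ from the paper's exact definition)."""
    best = scores.max(axis=0, keepdims=True)
    return scores / np.maximum(best, 1e-12)

def leading_entries(norm, model_idx):
    """Count scene-metric cells where the given model ties the maximum,
    as in the '73 of 80 combinations' statistic."""
    return int((norm[model_idx] >= norm.max(axis=0) - 1e-12).sum())
```

With five models, 16 scenes, and five metrics, each model is evaluated over 16 x 5 = 80 scene–metric cells, so a count of 73 corresponds to leading in more than 90% of entries.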
A statistical examination of the normalized matrix shows that the Fusion model achieves the highest value in 73 of 80 scene–metric combinations, outperforming all other models in more than 90% of the evaluation entries. This result indicates that the Fusion model maintains consistently high performance across a wide range of scenes and evaluation metrics. Such consistency suggests stable accuracy and reliable geometric discrimination under varying structural layouts, point densities, and scanning viewpoints.
Among the baseline methods, PointNet++ and DGCNN exhibit the most stable behavior, whereas PointNet and PCT show noticeable performance drops in more challenging scenes such as “1px,” “s9h,” and “skl,” where point sparsity or complex component geometry increases class ambiguity. These variations are most prominent in mIoU and mAcc, which are more sensitive to class imbalance and local structural complexity.
In contrast, the Fusion model consistently achieves the highest or near-highest normalized scores across almost all scene–metric combinations. Its integration of global contextual modeling, local geometric encoding, and category-aware attention strengthens its robustness to occlusion, sampling unevenness, and viewpoint variability.
The heatmap comparison demonstrates that the proposed Fusion model delivers superior cross-scene stability and discriminative capability. Its 73 leading entries, together with the strong performance observed on the complete dataset, highlight the effectiveness of the fusion strategy and its suitability for real construction-site semantic segmentation tasks.
Beyond normalized visualization, the original metric values reveal that the proposed Fusion model achieves consistent and non-trivial absolute improvements over the strongest baseline across multiple scenes. For example, on the 759 scene, Fusion improves mIoU from 51.1% (PointNet++) to 55.5%, corresponding to a +4.4 percentage-point gain, which is the largest absolute mIoU improvement observed among all evaluated scenes. Meanwhile, Cohen's κ increases from 54.9 to 58.2 (+3.3 pp), indicating substantially improved agreement between predictions and ground truth. These absolute gains confirm that the dominance observed in the normalized heatmap reflects meaningful performance improvements rather than a visualization artifact.
4.4. Component-Level Performance Comparison
Building upon the overall performance evaluation, we further analyze the Fusion model’s recognition capability at the component level across different categories. At the component level, the Fusion model generally attains the highest or near-highest IoU values across most scene–component combinations, while maintaining consistently competitive performance in the remaining cases. This pattern indicates stable component discrimination under diverse structural layouts and scanning conditions, without relying on scene-specific tuning.
Figure 7 presents the normalized IoU performance of five major components—Wall/Column, Beam, Ceiling, Floor, and Other—across 16 scenes. For consistent cross-scene comparison, all IoU values were normalized using the max-normalization method defined in Equation (11). It is important to note that in some scenes, beam components do not exist (e.g., certain floor scans do not include beam elements). As a result, all models achieve an IoU of 0 for the Beam category in those scenes, which reflects the absence of the component rather than a model performance issue.
From the cross-scene patterns, the Fusion model maintains relatively high normalized scores across most component categories, demonstrating strong consistency. For core structural components such as Wall/Column and Floor, the model typically reaches higher normalized IoU values. This advantage can be attributed to its dual-branch architecture: the global Transformer branch captures broader spatial dependencies, while the local PointNet++ branch provides fine-grained geometric cues, jointly improving recognition stability.
For categories more susceptible to noise, occlusion, or uneven sampling—such as Ceiling and Other—the Fusion model continues to exhibit robust performance. Compared with PointNet and PCT, it demonstrates better adaptability in ambiguous regions or areas with strong local geometric variation, especially near wall-panel junctions, beam–column connections, and other structurally complex interfaces, where it preserves more balanced boundary predictions.
For elongated components such as Beams, the Fusion model consistently achieves higher normalized scores in scenes where beam elements are present. The performance of PointNet and PCT fluctuates considerably in this category, whereas the Fusion model remains stable, suggesting that its joint global–local geometric representation is particularly suitable for components with scale variability and local sparsity.
The component-level heatmap collectively indicates that the Fusion model delivers strong adaptability across categories, stable cross-scene performance, and clear advantages for components with geometry-sensitive characteristics. Subsequent sections further examine misclassification patterns and structural distinctions through confusion matrix analysis and feature visualization.
4.5. Classification Error and Confusion Relationship Analysis
To further understand the model’s discriminative capability across different component categories, we construct an inter-class feature correlation matrix based on the prediction results of each scene, and visualize it in a circular (ring-shaped) layout as shown in
Figure 8. Higher correlation values typically indicate that categories are closer in feature space and thus more prone to confusion, whereas lower correlations imply clearer decision boundaries learned by the model.
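One plausible way to construct such an inter-class correlation matrix, assuming per-point feature embeddings are available, is to correlate per-class centroid features. This is a hypothetical construction for illustration, not necessarily the paper's exact definition:

```python
import numpy as np

def class_correlation(features, labels, num_classes):
    """Pearson correlation between per-class mean feature vectors.
    `features` is an (N, D) array of point embeddings and `labels` an (N,)
    array of predicted classes. Classes absent from the scene receive a
    zero centroid (and thus an undefined correlation)."""
    centroids = np.stack([
        features[labels == c].mean(axis=0) if np.any(labels == c)
        else np.zeros(features.shape[1])
        for c in range(num_classes)
    ])
    return np.corrcoef(centroids)
```

Under this construction, category pairs with nearly identical centroids (correlation near 1) would appear as the dark, confusion-prone regions in the ring map, while well-separated pairs yield low or negative correlations.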
From the overall distribution, the class-wise correlation patterns are generally consistent across models in most scenes. The Fusion model, however, tends to exhibit lower cross-category correlations among multiple component groups (e.g., Wall/Column, Beam, Ceiling, Floor), indicating that the Fusion model produces clearer inter-class separation in the feature space, thereby reducing the likelihood of mutual misclassification.
Local dark regions can still be observed in certain scenes—such as between Beam–Floor or Wall–Other—suggesting that these categories remain feature-similar under specific conditions, which can increase classification difficulty. For example, in scenes such as 1px and s9h, the Beam–Floor pair exhibits noticeably higher correlation values in Figure 8. Visual inspection reveals that these errors mainly occur near beam–slab junctions, where elongated beam elements are partially occluded or sparsely sampled and appear locally planar. In practice, this leads to beams being partially absorbed into floor regions, which could result in missing or fragmented beam components in downstream scan-to-BIM reconstruction and quantity take-off workflows.
This phenomenon mainly stems from two factors. First, certain component pairs are naturally adjacent in building structures, making their local geometric features difficult to separate under partial occlusion or sparse sampling. Second, the data characteristics of virtual and real scenes differ substantially. Virtual scenes contain relatively simple component types and lack elements commonly present in real construction sites, such as curtains, temporary obstructions, or other non-structural objects. A representative case can be observed in scenes such as skl and sn8, where the Wall/Column–Other correlation is relatively high. In these scenes, non-structural objects (e.g., temporary materials, stacked equipment, or furniture) are often located close to wall surfaces and share similar vertical geometry. As a result, the model occasionally assigns these objects to the Wall/Column category. From a scan-to-BIM perspective, this leads to spurious wall extensions or false-positive wall regions, which may degrade as-built model accuracy and affect subsequent spatial analysis.
From a cross-scene perspective, the correlation patterns vary rather than being uniform, reflecting differences in scanning viewpoints, point density, and local occlusion conditions. The Fusion model maintains relatively balanced inter-class correlation distributions in most scenes, suggesting that the complementarity between its global Transformer branch and local geometric feature extraction modules enables stable category discrimination under diverse scanning conditions.
Overall, the correlation ring map highlights the Fusion model’s strengths in inter-class feature separability, cross-scene stability, and handling of geometrically adjacent component regions. At the same time, the highlighted Beam-Floor and Wall-Other confusion cases provide concrete insights into the remaining failure modes, indicating that future improvements should focus on enhancing boundary awareness at structural junctions and enriching non-structural component representations to further reduce ambiguity in real-world construction environments.
5. Discussion
To facilitate semantic understanding from BIM to real-world point clouds, this study develops an integrated framework encompassing virtual LiDAR scanning, sliding-window preprocessing, and Fusion model training. When trained solely on BIM-generated synthetic data, the Fusion model achieves robust transfer performance across diverse real-world scenes, consistently outperforming classical baselines in metrics such as mIoU and Cohen's κ. This effectiveness stems from our dual-level strategy to mitigate domain gaps through learnable PointAugment and weighted feature alignment. Nevertheless, practical BIM-reality discrepancies—such as construction tolerances, design intent deviations, and unmodeled site elements—remain inherent challenges that can introduce semantic ambiguity at structural junctions. Quantifying the impact of these geometric inconsistencies is essential for enhancing the reliability of the proposed framework in high-fidelity automated construction monitoring.
When virtual point clouds exhibit sufficient structural and geometric consistency with real scenes, models trained solely on synthetic data can achieve competitive performance on real scans, as reported in prior studies [
25,
52,
53]. The experimental results in this work, obtained under the BIM-derived construction site scenario, align well with these findings and further confirm that virtual scanning–generated point clouds can encode transferable component-level semantics, supporting effective virtual-to-real semantic transfer despite inevitable domain discrepancies.
Compared with conventional point cloud training strategies that rely on extensive on-site LiDAR acquisition and labor-intensive manual annotation, the proposed framework substantially reduces the dependency on repeated field scanning and real data labeling. By leveraging BIM-derived synthetic point clouds as the sole source of supervision, effective model training can be conducted prior to or independently of large-scale real data collection. From a practical perspective, the computational cost of synthetic data generation remains moderate: even under the highest configuration settings adopted in this study, generating a complete synthetic point cloud for a single BIM model requires no more than 30 min, which is significantly more efficient than repeated on-site scanning and manual data preparation. Moreover, the proposed framework does not introduce a noticeable increase in training or inference resource requirements compared with representative deep learning–based point cloud segmentation methods, as both training and block-wise inference follow standard GPU-based pipelines.
Despite the absence of real annotations, the proposed Fusion model consistently achieves competitive performance across multiple scenes and floors, outperforming representative models such as PointNet, PointNet++, DGCNN, and PCT under the same training and inference protocol. This indicates that the proposed architectural modules provide robust and transferable feature representations that generalize well across different structural layouts and scanning conditions, rather than being tailored to a specific scene or acquisition setup.
From the perspective of model architecture, the Fusion model combines the local geometric encoding of PointNet++ with the global self-attention modeling of PCT. The overall metric heatmap shows that this structure typically achieves high normalized scores across different scanning densities and viewpoint combinations. Component-level IoU statistics also reflect that the Fusion model maintains balanced recognition performance in most scenes, particularly for key components such as walls/columns, floors, and ceilings. It is evident that the local branch plays a role in fine-scale areas, such as beam-column connections and wall-panel junctions, while the global branch helps maintain semantic consistency under conditions of complex occlusion and viewpoint changes. The synergy between these two branches positively impacts the model’s stable performance under cross-domain conditions.
This observation is consistent with existing literature showing that hybrid architectures integrating local neighborhood features with global contextual modeling tend to outperform single-branch networks under varying point density, occlusion, and viewpoint conditions. Compared with purely local models, which are more sensitive to sparsity, and purely global attention-based models, which may overlook fine geometric details, the Fusion model exhibits a more balanced behavior in complex construction environments.
The inter-class feature correlation map provides another perspective from the feature space. Overall, the Fusion model exhibits lower cross-category correlation among multiple component groups, suggesting that it maintains a degree of category separation in the feature space. However, in some scenes a higher local correlation is observed between the Wall/Column and Other categories, resulting in a small amount of mutual misclassification. This may be due to the simplified component types in virtual scanning scenes: BIM models mainly consist of structural components, while non-structural objects in real scenes, such as curtains, furniture, and temporary materials, are typically classified as Other and are often placed adjacent to walls, making their geometric distributions similar. This blurs the feature boundary between the two categories in the real domain; because the virtual domain lacks these components, the model could not adequately learn this semantic difference during training.
The “Other” class in this study is defined as a broad residual category covering non-structural and temporary objects not explicitly represented in the BIM models, such as furniture, temporary materials, and site installations. This coarse definition introduces high intra-class variability and ambiguous geometric characteristics, which complicate the interpretation of misclassification patterns, particularly near dominant structural components like walls and columns. Similar confusion patterns between structural components and heterogeneous “Other” categories have been widely reported in real-world indoor and construction point cloud benchmarks, indicating that this issue reflects a common challenge in component-level semantic segmentation rather than a limitation specific to the proposed framework.
From a scalability perspective, the current framework demonstrates favorable behavior in handling large-scale point clouds through sliding-window partitioning and block-wise inference, which enables practical deployment on scenes containing billions of points. However, as scene scale and structural heterogeneity further increase, such as in industrial facilities or large public infrastructures, the computational cost associated with dense window sampling and multi-branch feature extraction may grow substantially. Moreover, highly heterogeneous scenes that involve diverse component types, complex functional layouts, and frequent temporary installations may introduce semantic patterns that are insufficiently represented in the current BIM-derived synthetic training set.
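A minimal sketch of the sliding-window partitioning and block-wise inference described above, with illustrative block size and stride (not the paper's actual configuration) and a hypothetical `predict_fn` returning per-point class logits:

```python
import numpy as np

def sliding_blocks(points, block=2.0, stride=1.0):
    """Yield index arrays for overlapping XY windows over a point cloud
    of shape (N, 3+). Block size and stride are illustrative values."""
    xy_min = points[:, :2].min(axis=0)
    xy_max = points[:, :2].max(axis=0)
    for x0 in np.arange(xy_min[0], xy_max[0] + 1e-6, stride):
        for y0 in np.arange(xy_min[1], xy_max[1] + 1e-6, stride):
            mask = ((points[:, 0] >= x0) & (points[:, 0] < x0 + block) &
                    (points[:, 1] >= y0) & (points[:, 1] < y0 + block))
            if mask.any():
                yield np.nonzero(mask)[0]

def blockwise_predict(points, predict_fn, num_classes):
    """Accumulate per-block logits over overlapping windows, then take
    the per-point argmax. `predict_fn` stands in for the trained model."""
    logits = np.zeros((len(points), num_classes))
    for idx in sliding_blocks(points):
        logits[idx] += predict_fn(points[idx])
    return logits.argmax(axis=1)
```

Because overlapping windows vote on each point, boundary points near block edges receive predictions from several contexts, which smooths block-seam artifacts at the cost of processing each point more than once.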
The above analysis highlights several directions for further improvement in this study. First, the synthetic point cloud generation process remains partially idealized, particularly with respect to material properties, small ancillary components, and temporary site facilities, which may introduce domain discrepancies in the “Other” category and certain elongated components. To address this issue, future work will focus on refining the semantic class taxonomy by decomposing both structural and non-structural elements into more fine-grained categories. For example, MEP-related components such as pipelines, as well as common furniture items (e.g., beds, tables, and cabinets), will be explicitly modeled rather than being uniformly absorbed into the “Other” class. Supported by richer BIM representations or auxiliary annotations, such semantic refinement is expected to reduce intra-class ambiguity, improve component-level discrimination, and enhance the diagnostic value of segmentation results in complex indoor construction environments.
Second, the empirical evaluation in this study is primarily based on self-acquired scans from a specific high-rise residential building, complemented by multiple scenarios from the BIMNet benchmark. Although multi-view scanning, sliding-window partitioning, and the inclusion of public datasets were employed to increase data diversity, the coverage of building typologies, structural systems, and construction stages remains limited. Consequently, while robust performance has been demonstrated within the evaluated scenarios, further validation is required to assess the generalization of the proposed framework to other building types, such as industrial or commercial facilities, which is left for future investigation.
6. Conclusions
This study demonstrates that BIM-generated synthetic point clouds can effectively support component-level semantic segmentation in real construction environments. Without using any real annotations, the Fusion model trained solely on SPC achieves 70.89% overall accuracy, 53.14% mean IoU, 69.67% mean accuracy, 54.75% FWIoU, and 59.66% Cohen's κ, confirming the feasibility of annotation-free synthetic supervision for real-world scene understanding within the evaluated residential building scenarios and the BIM-Net benchmark.
The large-scale SPC dataset (132 scans, ~ points) provides extensive geometric diversity and scanning variability. Across both scene-level and component-level evaluations, the Fusion model consistently exhibits leading or near-leading performance, reflecting stable semantic discrimination under varying floors, viewpoints, and scanning conditions in the tested residential buildings. This robustness highlights the effectiveness of the proposed global–local fusion strategy in mitigating domain discrepancies between synthetic and real point clouds.
Overall, the proposed framework exhibits strong potential for practical deployment in large-scale point cloud–based construction applications. Its extension to highly heterogeneous or industrial-scale environments, however, will require careful consideration of computational efficiency and enhanced semantic coverage beyond the current BIM-derived synthetic training set.
While the results underscore robust performance within residential settings, the framework’s generalization to more heterogeneous typologies—such as industrial or commercial complexes—is presently constrained by the semantic diversity and completeness of the underlying BIM representations. This limitation warrants further investigation to establish the framework’s broader applicability across the AEC (Architecture, Engineering, and Construction) industry. Addressing this limitation will require not only larger synthetic datasets but also richer semantic modeling of functional components and construction-specific objects that vary significantly across different building types.
Based on the current analysis, future work can be further advanced in the following directions: (i) constructing a richer BIM–Reality joint dataset by incorporating non-structural components such as windows, curtains, furniture, and MEP pipelines, to better narrow the semantic gap between the virtual and real domains; (ii) further refining the semantic class taxonomy by decomposing the coarse “Other” category into more fine-grained structural and non-structural component classes (e.g., MEP pipelines and common furniture), so as to reduce intra-class ambiguity and improve component-level recognition; (iii) introducing adversarial, contrastive, or self-supervised cross-domain alignment strategies at the model level to enhance adaptation to feature distribution shifts; (iv) developing hierarchical or coarse-to-fine inference strategies to reduce computational overhead and improve scalability in extremely large scenes; (v) exploring multi-modal fusion (e.g., images, depth maps, textures) to improve the distinguishability of geometrically similar components; (vi) embedding the segmentation results from virtual to real into the BIM dynamic updating process for automatic change detection, component status representation, and progress tracking, contributing to the formation of a closed-loop system for construction monitoring.