1. Introduction
Aging buildings represent a major challenge in the global construction landscape, with a substantial share urgently requiring retrofitting and upgrades to comply with current performance and safety standards [1]. A significant portion of the built environment lacks reliable as-built documentation, a problem especially prevalent in aging residential and commercial buildings where original design drawings are often missing, inaccurate, or outdated [2]. The absence of precise as-built information creates serious challenges for building simulations, cost estimations, and renovation planning. To address this, more accurate and up-to-date digital models can be created. The practical applications of these models include energy retrofitting, facility management, post-disaster reconstruction, façade renovation, solar panel placement, and interior redesign, all of which rely on sufficiently detailed digital representations of existing buildings. These models not only enhance operational efficiency in facility management but also play a vital role in improving energy performance and supporting sustainability goals [3]. Without them, these processes become time-consuming, error-prone, and costly, often requiring extensive on-site measurements and manual reconstruction of building geometry.
In addition, the construction sector accounts for nearly 40% of global energy use and carbon emissions, and retrofitting the existing building stock is essential to achieving decarbonization and sustainability targets [4]. However, many aging buildings remain undocumented or poorly represented in digital form, limiting the ability of owners and facility managers to plan retrofits, monitor performance, and implement data-driven maintenance. Automating the generation of building models from imagery offers a practical pathway to accelerate digital transformation across the industry. It enables small firms and public agencies, often constrained by cost and technical expertise, to access up-to-date spatial data for energy analysis, facility management, and retrofit and renovation planning. This underscores the importance of digital workflows, especially when integrated with Artificial Intelligence (AI) techniques, which contribute not only to technical innovation but also to societal goals of reducing emissions, extending building life cycles, and improving safety and occupant comfort through better-informed decision-making [5].
To address these challenges and enhance the accessibility of digital modeling solutions, Scan-to-BIM has emerged as a workflow for capturing the geometry of physical structures using 3D scanning technologies and creating intelligent models within BIM software [6,7]. This approach is valuable for generating as-built BIM models of aging or repurposed buildings that lack original design documents [8]. The conventional Scan-to-BIM workflow consists of acquiring 3D point cloud (PC) data, analyzing and classifying building elements, and manually converting the segmented data into as-built BIM models [9,10,11]. However, despite its widespread adoption, this process remains labor-intensive and difficult to standardize [12]. As a result, recent research has increasingly focused on automating the segmentation and classification of PCs using Deep Neural Network (DNN) algorithms to extract semantically segmented features, along with their corresponding geometry, toward a fully automated Scan-to-BIM workflow [12].
Although these approaches have advanced the Scan-to-BIM reconstruction process, several challenges remain. First, the degree of automation remains constrained, as the generation of high-quality 3D semantic and BIM models still depends on manual or semi-automated tasks in existing approaches [10,11,13]. Second, current modeling techniques do not fully utilize the semantic information embedded in PCs, as segmentation is performed after PC generation, missing valuable priors that could improve integration and geometric consistency [14,15]. Third, PCs, an essential data source for developing BIM models [16], are typically obtained through range-based techniques such as laser scanning (LiDAR) or image-based methods such as digital photogrammetry [17]. Despite their effectiveness, these methods present challenges in terms of cost, efficiency, and the need for post-processing [18,19]. At their current stage, both methods require post-processing of the generated point cloud to perform semantic segmentation of building elements and extract the geometric information necessary for BIM modeling.
With the advancement of AI techniques, post-processing of PCs has improved classification and segmentation accuracy; however, it continues to pose challenges to automation in both range-based and image-based methods. Common issues include noise in raw data, structural incompleteness due to occlusions in scans [20], and difficulties detecting components on reflective or textureless surfaces [21]. In image-based methods, projection-based label transfer can lead to misalignment errors [22]. Moreover, both approaches involve high computational effort and manual intervention. A further limitation lies in the strong dependency on the quality and domain relevance of training data. Widely used 3D segmentation datasets such as S3DIS [23] exhibit non-uniform sampling, noise, and missing regions, which, despite enabling high point-wise semantic accuracy, can degrade object-level reconstruction fidelity [24]. Likewise, although popular for general segmentation tasks, 2D datasets such as ADE20K [25] and NYU Depth V2 [26] lack the architectural specificity and spatial coherence necessary for reliable BIM modeling [27].
For this reason, and to maximize the potential of PCs, this study introduces a novel method that integrates Neural Radiance Fields (NeRF) and vision-language models to generate structured, segmented, and color-labeled PCs directly from images, after which the geometry of building elements is extracted and the BIM model is generated automatically. The proposed method addresses the identified challenges by (1) automating the workflow, (2) embedding semantic labels during reconstruction to eliminate misalignment, avoid post-processing, reduce computational overhead, and mitigate scan-related issues, (3) removing dependence on LiDAR or photogrammetry to reduce cost and setup complexity, and (4) bypassing the limitations of domain-specific 3D and 2D datasets by leveraging direct image-based labeling through large pretrained models. The proposed method is also experimentally evaluated using metrics that assess spatial precision, geometric consistency, and reconstruction performance across both interior and exterior datasets to validate its effectiveness for automated BIM generation.
This paper is organized as follows. Section 2 reviews existing studies and identifies the research gaps. Section 3, Methodology, outlines the integrated workflow for combining image segmentation, NeRF-based 3D reconstruction, and automated BIM modeling. Section 4, Results, provides quantitative evaluations of segmentation accuracy, spatial alignment, and reconstruction quality using multiple NeRF models. Section 5, Discussion, reflects on the implications of the findings, highlights the methodological contributions, and identifies remaining challenges and future directions. Finally, Section 6, Conclusion, summarizes the key outcomes and practical advantages of the proposed method.
2. Literature Review
The automated Scan-to-BIM process has significantly evolved through the integration of DNN-based methods over the past decade. This evolution has been driven by the need to process large-scale PCs efficiently, enhance semantic segmentation, and improve topological consistency in as-built BIM reconstruction. The following sections review the existing studies and identify research gaps, as well as key contributions that have advanced the Scan-to-BIM workflow through improvements in geometry-based, hybrid, and deep learning approaches.
2.1. Traditional Geometry-Based Approaches
Early Scan-to-BIM automation research focused on geometry-based techniques, using planar segmentation, topology rules, and shape detection to extract structural elements from PCs. Tang et al. [28] provided an early comprehensive review by classifying methods into geometric modeling, object recognition, and relationship modeling, while also identifying challenges such as manual intervention, noise, and occlusions. Xiong et al. [29] extended these efforts with a two-phase methodology that combined planar segmentation and voxelization to detect walls, floors, and ceilings, with visibility reasoning and shape estimation for semantic enrichment of detections.
As BIM adoption increased, semi-automated approaches emerged to reduce manual effort by combining rule-based methods with user intervention. Jung et al. [8] introduced a hybrid approach that integrated Random Sample Consensus (RANSAC)-based segmentation, grid-based filtering, and boundary tracing to extract architectural elements with subsequent human refinement. Volk et al. [10] reviewed the technological limitations of BIM adoption and highlighted that while laser scanning, photogrammetry, and automated object recognition have improved PC processing, the conversion of unstructured PC data into semantically rich BIM models remains a challenge.
Subsequent methods have integrated machine learning or optimization algorithms to improve segmentation and topology in automated Scan-to-BIM workflows. For example, Croce et al. [30] proposed a semi-automatic approach that applies machine learning techniques, specifically Random Forest, for semantic segmentation and classification of architectural elements in 3D point clouds; they utilized Rhino and Grasshopper to reconstruct parametric models for H-BIM applications. Ochmann et al. [31] proposed a volumetric multi-story reconstruction method formulated as an integer linear programming problem, using RANSAC-based plane detection to identify structural surfaces and Markov clustering to group spatially related elements while enforcing geometric and topological constraints. Bassier and Vergauwen [32] proposed an unsupervised method that detects different wall axis types (straight, curved, and polyline-based) and reconstructs wall connections using clustering, geometric feature extraction, and topology reconstruction to generate IFC-compliant BIM models. Rausch and Haas [33] proposed an automated parametric approach for updating the shape and pose of BIM elements using PCs; their dyna-BIM method applies genetic algorithms and simulated annealing to align as-designed BIMs with as-built data. Additionally, Perez-Perez et al. [34] enhanced segmentation in complex environments by integrating Support Vector Machines for semantic classification with AdaBoost for geometric labeling and probabilistic graphical models. Their method refined the detection of planar and non-planar features while preserving semantic consistency throughout the automated reconstruction.
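Several of the approaches above use RANSAC-style plane fitting as their core geometric primitive. The following is a minimal, generic sketch of that step in Python (not any cited author's implementation): each iteration hypothesizes a plane from three random points and keeps the hypothesis that explains the most points within a distance threshold.

```python
import random
from typing import List, Tuple

Point = Tuple[float, float, float]

def fit_plane_ransac(points: List[Point], n_iters: int = 200,
                     dist_thresh: float = 0.02, seed: int = 0):
    """Return (a, b, c, d) for the plane ax + by + cz + d = 0 with the
    most inliers, plus the inlier indices."""
    rng = random.Random(seed)
    best_plane, best_inliers = None, []
    for _ in range(n_iters):
        p1, p2, p3 = rng.sample(points, 3)
        # Two edge vectors spanning the candidate plane
        u = tuple(p2[i] - p1[i] for i in range(3))
        v = tuple(p3[i] - p1[i] for i in range(3))
        # Plane normal = u x v
        n = (u[1] * v[2] - u[2] * v[1],
             u[2] * v[0] - u[0] * v[2],
             u[0] * v[1] - u[1] * v[0])
        norm = (n[0] ** 2 + n[1] ** 2 + n[2] ** 2) ** 0.5
        if norm < 1e-9:  # degenerate (collinear) sample, skip
            continue
        a, b, c = n[0] / norm, n[1] / norm, n[2] / norm
        d = -(a * p1[0] + b * p1[1] + c * p1[2])
        inliers = [i for i, p in enumerate(points)
                   if abs(a * p[0] + b * p[1] + c * p[2] + d) < dist_thresh]
        if len(inliers) > len(best_inliers):
            best_plane, best_inliers = (a, b, c, d), inliers
    return best_plane, best_inliers
```

In a Scan-to-BIM context, the segmented wall or floor points would be passed in as `points`, and the fitting would be repeated on the remaining points to peel off successive planar surfaces.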
2.2. Deep Learning-Based Approaches
Recent advancements in PC processing have greatly improved the accuracy and efficiency of 3D reconstruction and segmentation, benefiting Scan-to-BIM applications. Large-scale annotated datasets such as ScanNet [35] have enabled better deep learning models for scene understanding. Methods such as PointNet [36], PointCNN [37], and PointNeXt [38] have refined feature extraction and classification, SEGCloud [39] improved segmentation accuracy, and PointNetLK [40] advanced PC registration. These developments laid the foundation for deep learning in Scan-to-BIM workflows to reduce manual effort and increase automation in as-built modeling.
Huan et al. [41] proposed GeoRec, a DNN for geometry-enhanced semantic 3D reconstruction. The model integrates a geometry extractor with deep learning to improve layout estimation, camera pose recovery, and object detection via three modules: room layout, object detection, and object reconstruction. Tang et al. [42] developed a hybrid approach combining deep learning with morphological operations and RANSAC-based plane detection. Their method classifies PCs into thirteen semantic categories, refines spatial relationships through Markov Random Field optimization, and uses grammar-based modeling to generate IFC-compliant BIM models.
Recent studies have further advanced Scan-to-BIM automation by enhancing DNN models, segmentation, and connectivity detection. Campagnolo et al. [43] developed a fully automated pipeline using DNN-based instance segmentation with BIM-Net++, a lightweight voxel-based Convolutional Neural Network (CNN) designed for semantic segmentation that identifies architectural elements such as walls, floors, and roofs. Their method refines segmentation via RANSAC for planar elements and Density-Based Spatial Clustering of Applications with Noise (DBSCAN) for non-planar elements before BIM reconstruction. Wu et al. [44] presented FLKPP, a prototype that integrates neural networks with architectonic grammar for improved segmentation and reconstruction. Their method uses KPConv for 3D semantic segmentation, floor-layer preprocessing, and DNN-based line detection to generate 2D floor plan grids before reconstructing BIM elements such as walls, doors, and columns. Drobnyi et al. [45] proposed a deep geometric neural network for connectivity detection, modeling spatial relationships through planar segmentation, region growing, and proximity-based clustering with a PointNeXt-based model for edge classification to automate digital twin construction. Mahmoud et al. [46] introduced a framework that first applies a deep learning model for PC segmentation, followed by room clustering using DBSCAN. Line detection via RANSAC extracts walls and major structural elements, while a Dynamo-based algorithm automates the parametric reconstruction of structured components such as walls and floors. The framework achieved high accuracy in semantic segmentation and geometric reconstruction, which enhances the automation of Scan-to-BIM workflows.
2.3. Image-Based Approaches
In addition to the range-based Scan-to-BIM methods previously discussed, recent studies have investigated image-based workflows as cost-effective and accessible alternatives to LiDAR-based approaches. These methods typically utilize 2D RGB images captured with handheld devices or Unmanned Aerial Vehicles (UAVs). Semantic segmentation is performed using DNNs, after which 3D PCs are generated using photogrammetric techniques such as Structure-from-Motion (SfM) or Multi-View Stereo (MVS).
Han et al. [47] developed an indoor reconstruction method using LiDAR sensors and image-based MVS data. They applied DeepLabv3 for semantic segmentation, trained on the Cityscapes dataset and their own annotations, and their system reconstructed walls, floors, and ceilings. Similarly, Pantoja-Rosero et al. [20] aimed to automate the generation of Level of Detail 3 (LOD3) building models, focusing on masonry structures, using SfM and DNN-based semantic segmentation. Their method combined SfM, to generate sparse PCs and camera poses, with TernausNet deep learning models to segment façade openings. PolyFit was then used to create LOD2 models, which were upgraded to LOD3 by triangulating segmented openings into 3D space. The findings demonstrate that the pipeline effectively reconstructs LOD3 models.
In infrastructure-focused work, Saovana et al. [48] introduced Point cloud Classification based on image-based Instance Segmentation (PCIS). This approach utilizes digital images processed through CNNs to generate 2D masks, which are transformed into 3D masks using camera parameters from SfM. These masks classify PCs by projecting rays from camera positions through the masks to the PCs. The findings demonstrate that PCIS achieves high accuracy, with an F1-score of 0.96 for one-class classification and 0.83 for six-class classification. Puliti et al. [49] introduced a method combining infrared thermography and SfM to detect subsurface defects in building envelopes. They applied a segmentation algorithm using temperature-based sliding windows and edge detection to identify thermal anomalies. The method achieved an IoU of 78% for the handheld IR camera test and 83% for the UAV-based IR camera test, along with an F1-score of 0.87 for both tests, demonstrating high accuracy in automated damage detection.
Studies have also addressed both exterior and interior BIM reconstruction. For example, Yang et al. [50] proposed an image-based approach for automatic as-built BIM generation by reconstructing 3D facades and identifying surface materials. Using images from uncalibrated cameras, PCs are created via SfM, segmented with RANSAC, and analyzed through semantic reasoning to detect elements such as walls and windows. Wong et al. [22] proposed an image-based Scan-to-BIM method for interior building reconstruction using handheld phone imagery. Their method integrates photogrammetry, semantic segmentation, projection, and geometry-based refinement. Frames were processed using SfM and MVS to create dense PCs, followed by RANSAC and DBSCAN for structural surface detection. YOLOv8, trained on the HBD dataset, was used to segment building components in 2D images. These semantic masks were projected onto the 3D PC using a pinhole camera model. Weighted voting improved label consistency, and boundary refinements addressed windows and doors. The final data were exported to Revit for automated BIM generation, achieving 100% recognition accuracy and a geometric error of 0.056 m.
2.4. Neural Radiance Field 3D Reconstruction
While existing image-based Scan-to-BIM methods have shown promising results, they typically depend on photogrammetry to reconstruct point clouds and then apply semantic labels through projection or manual mapping. This two-step process is often vulnerable to issues such as occlusions, noisy or incomplete geometry, and inaccurate label alignment, especially in complex indoor scenes with clutter, reflective surfaces, or textureless regions. To address these limitations, this research explores NeRF, which eliminates the need for intermediate point clouds and projection by directly learning a continuous volumetric representation from posed 2D images. NeRF enables a more integrated and robust workflow for view synthesis and geometry reconstruction and offers higher fidelity and resilience in the challenging environments common to indoor BIM applications.
NeRF, introduced by Mildenhall et al. [51], marked a significant breakthrough in photorealistic view synthesis and 3D scene reconstruction. NeRF represents a static scene as a continuous volumetric function that maps five-dimensional inputs, comprising a 3D spatial location and a 2D viewing direction, to a color and volume density. This function is learned through a fully connected Multilayer Perceptron (MLP), which estimates view-dependent radiance and differential opacity at each sampled coordinate. By casting rays from virtual cameras through the scene and integrating predicted values using differentiable volume rendering, NeRF can generate novel views with remarkable visual fidelity from a sparse set of posed RGB images. It is capable of reconstructing fine geometry and complex lighting effects more efficiently than traditional mesh-based or voxel-based techniques. Unlike prior approaches that rely on discretized representations or require ground-truth 3D geometry, NeRF can be trained using only 2D images and camera intrinsics, which makes it a versatile tool for image-based scene reconstruction.
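The rendering model of [51] can be stated compactly: the expected color C(r) of a camera ray r(t) = o + td is the transmittance-weighted integral of the predicted radiance c and density σ along the ray:

```latex
C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma(\mathbf{r}(t))\,\mathbf{c}(\mathbf{r}(t), \mathbf{d})\,dt,
\qquad
T(t) = \exp\!\left(-\int_{t_n}^{t} \sigma(\mathbf{r}(s))\,ds\right)
```

In practice this integral is approximated by numerical quadrature over stratified samples along each ray, which is what keeps the whole pipeline end-to-end differentiable with respect to the MLP weights.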
Building upon this foundation, advancements have been made to adapt NeRF for practical applications. Instant-NGP, proposed by Müller et al. [52], reduces training time and memory usage through multi-resolution hash encoding and a compact, fully fused MLP, enabling near-real-time reconstruction. To improve robustness in real-world scenes with camera pose noise and variable lighting conditions, Nerfacto was introduced as part of the Nerfstudio framework by Tancik et al. [53]. It combines pose refinement, proposal sampling, and hash-based density fields to achieve stable reconstructions even on noisy or incomplete datasets. Based on the 3D Gaussian Splatting technique introduced by Kerbl et al. [54], Splatfacto replaces volumetric grids and MLPs with differentiable 3D Gaussians. This representation supports direct optimization over Gaussian attributes such as position, opacity, and anisotropy, achieves competitive visual quality, and supports real-time rendering through tile-based rasterization.
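The rendering formulation of [54] helps explain this efficiency: 3D Gaussian Splatting replaces per-ray volume integration with front-to-back alpha blending of the projected, depth-sorted Gaussians overlapping a pixel, so each pixel color is:

```latex
C = \sum_{i \in \mathcal{N}} \mathbf{c}_i\, \alpha_i \prod_{j=1}^{i-1} \left(1 - \alpha_j\right)
```

where each α_i combines a Gaussian's learned opacity with its projected 2D footprint. Because this is a rasterization rather than a per-ray integral, it maps directly onto tile-based GPU pipelines, which underlies the rendering-speed results reported in Section 4.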
These advancements highlight NeRF’s emerging potential as a foundation for the next generation of Scan-to-BIM methods, offering an alternative to conventional workflows that rely on physical scanning devices for geometric data acquisition. With recent improvements in speed, robustness, and output quality, NeRF models such as Instant-NGP, Nerfacto, and Splatfacto may facilitate BIM generation based on image-derived information. However, despite its technical potential, the integration of NeRF into Scan-to-BIM workflows remains largely unexplored.
Despite ongoing advancements, current Scan-to-BIM studies still follow a two-stage process: first generating PCs using scanning devices such as LiDAR or photogrammetry, then applying semantic segmentation in a separate post-processing step using deep learning models on the PCs. The multiple tasks involved fragment the process, increasing processing time, adding complexity, and constraining the level of automation achievable. This study aims to offer a workflow that avoids or minimizes these limitations through an automated image-to-BIM method, presented in the following sections.
4. Results
This section presents the simulation results of the proposed image-to-BIM method, covering both the performance of the NeRF models and the accuracy of BIM generation. Subsections highlight key findings on camera pose recovery, computational efficiency, training time, and rendering performance. The final stages focus on PC quality for geometry extraction and the precision of the generated BIM models. Results are structured to reflect technical performance and practical utility across interior and exterior datasets.
4.1. Camera Pose Recovery Analysis
The impact of image data size on camera pose estimation was examined to determine the minimum data volume required for stable NeRF initialization. As shown in Figure 11, the percentage of successfully recovered poses increases significantly with the number of input frames. With 50 frames, COLMAP recovered just 4% of camera poses. The recovery rate gradually improves, reaching 36% at 100 frames and 55.5% at 200. A sharp increase occurs between 200 and 250 frames, where pose recovery exceeds 92%. Beyond this point, performance plateaus, with recovery stabilizing between 94% and 98% from 250 to 550 frames. Notably, Nerfstudio's default setting for frame extraction from video datasets is approximately 300 frames, which, based on these results, offers sufficiently high pose coverage for reliable NeRF reconstruction in many cases. In this study, 550 frames were used to ensure near-complete pose recovery for the exterior scene and to support the generation of dense and accurate 3D models.
This trend emphasizes the importance of spatial image overlap in achieving robust feature matching. Visual gaps between consecutive views at lower frame counts limit COLMAP's ability to identify consistent key points and reduce pose estimation success. As the number of frames increases, the likelihood of overlapping content improves. The observed jump in performance beyond 250 frames suggests that spatial coverage became sufficient for this particular case. However, the optimal threshold may vary depending on scene complexity, camera motion, and lighting conditions.
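The recovery percentages above can be computed directly from COLMAP's text export: in `images.txt`, comment lines start with `#` and every registered image occupies two lines (a pose line followed by a 2D-point line). The helper below (the function name is ours, not part of COLMAP) counts registered images against the number of extracted frames:

```python
def pose_recovery_rate(images_txt: str, n_extracted_frames: int) -> float:
    """Percentage of extracted frames that COLMAP registered.

    `images_txt` is the content of COLMAP's images.txt: '#' lines are
    comments, and each registered image contributes exactly two data
    lines (pose line: IMAGE_ID QW QX QY QZ TX TY TZ CAMERA_ID NAME,
    then a line of its observed 2D points).
    """
    data_lines = [ln for ln in images_txt.splitlines()
                  if ln.strip() and not ln.lstrip().startswith("#")]
    n_registered = len(data_lines) // 2
    return 100.0 * n_registered / n_extracted_frames
```

For example, a run where 2 of 50 extracted frames were registered would report a 4% recovery rate, matching the lowest data point in Figure 11.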
4.2. Computational Resource Efficiency
System resource usage was analyzed to compare the computational demands of the evaluated NeRF models during training. Figure 12 shows that CPU usage was highest for Nerfacto, which utilized an average of 52.9% of the CPU. Instant-NGP showed moderate usage at 31.4%, while Splatfacto maintained the lowest CPU load with an average of 5.3%. This highlights that Nerfacto's volumetric pipeline and scene encoding require more CPU resources, whereas Splatfacto, driven primarily by GPU-based rasterization, places minimal demand on the CPU.
GPU utilization patterns, shown in Figure 13, paint a different picture. Splatfacto demonstrated the highest average GPU usage at 81.7%, maintaining a high level of utilization throughout training. Instant-NGP followed with 66.9%, while Nerfacto averaged 45.8%. These values reflect the models' varying reliance on GPU-intensive operations, with Splatfacto showing strong dependence on GPU rendering via Gaussian Splatting.
Figure 14 combines both RAM and GPU memory usage across the training period. GPU memory consumption was lowest for Splatfacto, averaging 2.5 GB, followed by Instant-NGP at 3.3 GB and Nerfacto at 5.6 GB. In terms of RAM usage, Splatfacto again showed the smallest footprint with an average of 3.7 GB. Instant-NGP used approximately 6.7 GB, while Nerfacto maintained the highest RAM usage at around 11 GB throughout the training process. These observations highlight Splatfacto’s system and GPU memory efficiency, making it well-suited for resource-constrained environments.
The results show that Splatfacto is the most efficient model, consistently combining high GPU utilization with the lowest CPU, system RAM, and GPU memory demands. Its high GPU load reflects efficient exploitation of modern hardware optimized for parallel processing and real-time rendering. This efficiency stems from its GPU-based Gaussian splatting approach, which offloads the bulk of computation to the GPU and minimizes CPU and memory overhead. In contrast, Nerfacto delivers high-quality volumetric reconstructions but places the greatest strain on CPU and system memory, consistent with the complexity of its architecture. Instant-NGP offers a balanced profile, achieving faster training with moderate usage of all system resources. These findings inform the selection of a NeRF model, considering hardware availability, training efficiency, and application-specific objectives. A deeper exploration of how these computational trade-offs relate to reconstruction accuracy and practical deployment is presented in Section 4.4.
4.3. Training Time Analysis
Training duration was benchmarked across NeRF models to assess time efficiency under consistent hardware and data conditions. As illustrated in Figure 15, Splatfacto consistently achieved the fastest training times across all tested configurations. Its training duration ranged from 410 to 439 s, averaging 425.8 s overall. This performance is directly attributed to its reliance on GPU-accelerated Gaussian splatting, which optimizes the rendering process. In contrast, Nerfacto required longer to complete training, with times ranging from 682 to 694 s and an average of 688.8 s. The higher training time reflects the complexity of Nerfacto's volumetric rendering pipeline and its greater reliance on CPU and memory resources, as previously discussed in Section 4.2. Instant-NGP, while recognized for its fast convergence and efficiency in the NeRF literature, showed the longest training times in this evaluation. With durations ranging from 736 to 762 s and an average of 749.4 s, it lagged behind Splatfacto and Nerfacto under identical hardware.
Overall, the comparison highlights Splatfacto as the most time-efficient model in this setup, offering rapid training regardless of input size. Interestingly, the number of input frames had minimal impact on the overall training time for any of the evaluated models; each model maintained a consistent training duration despite varying dataset sizes.
4.4. Reconstruction Quality and Rendering Performance
Reconstruction fidelity and rendering speed were evaluated to compare the visual quality and throughput efficiency of the NeRF models. As shown in Figure 16, Splatfacto consistently achieves the highest PSNR values across all frame counts, ranging from 17.79 at 150 frames to 22.88 at 550 frames. Instant-NGP follows with values between 17.16 and 18.95, while Nerfacto performs the weakest, improving gradually from 15.63 to 18.52. This indicates that Splatfacto produces sharper and less noisy reconstructions, especially as scene coverage increases.
Similarly, Figure 17 shows that Splatfacto leads in SSI, progressing from 0.69 to 0.86 as the number of frames increases. Instant-NGP improves steadily from 0.65 to 0.73, while Nerfacto again lags behind, rising from 0.62 to 0.68. These results highlight Splatfacto's stronger ability to preserve structural fidelity and perceptual quality in the rendered outputs.
The LPIPS evaluation, presented in Figure 18, shows that Splatfacto achieves the lowest perceptual error across all input sizes. As frames increase from 150 to 550, LPIPS values for Splatfacto consistently decrease from 0.31 to 0.15, reflecting a closer perceptual match between the reconstructed and reference images. Instant-NGP shows moderate performance, with values decreasing from 0.44 to 0.34. Nerfacto shows relatively higher LPIPS values, starting at 0.47 and decreasing to 0.35 as the number of frames increases, ultimately approaching the performance of Instant-NGP at larger input sizes. These results indicate that Splatfacto produces more visually faithful reconstructions, especially in scenarios with richer input data.
Beyond visual quality, the rendering performance results in Table 1 show a clear computational advantage for Splatfacto. It processes approximately 71,295,630 rays per second and generates 137.924 frames per second, significantly outperforming Nerfacto, which reaches 838,423 rays and 1.616 frames per second, and Instant-NGP, which handles 341,203 rays and 0.654 frames per second. This high efficiency comes from Splatfacto's Gaussian splatting architecture, which avoids volumetric sampling and neural field queries, allowing much faster rendering.
The results highlight the trade-offs among the evaluated models. Splatfacto delivers the highest reconstruction quality and rendering speed, making it a strong choice for scenarios that prioritize both fidelity and efficiency. Nerfacto produces reasonably good visual quality but requires more CPU and memory resources and renders more slowly due to its volumetric processing pipeline. Instant-NGP shows moderate visual quality and computational demand, but falls short of Splatfacto in rendering speed and output precision.
4.5. Practical PC Quality for Geometry Extraction
To evaluate each model's suitability for geometric feature extraction, the structural clarity and consistency of the reconstructed PCs were analyzed. Figure 19 presents the PCs extracted from the reconstructed interior scene using each NeRF model, illustrating both front and rear views of the classroom. These color-labeled outputs visually demonstrate each reconstruction's structural completeness and spatial clarity. Among the three models, the PC generated by Nerfacto exhibits the most consistent geometry and cleanest surface representation. Wall boundaries, furniture contours, and door/window regions appear sharp and continuous, providing a stable foundation for geometric segmentation and facilitating accurate alignment in downstream processing. The Instant-NGP output retains a recognizable scene layout but introduces soft edges and localized noise, particularly near object boundaries. While core architectural features remain identifiable, the reduced edge sharpness may introduce spatial ambiguity during plane fitting or clustering, affecting the precision of wall, window, and door extraction.
Although Splatfacto excels in reconstruction quality, training speed, and rendering efficiency, its integration into geometry extraction workflows presents practical limitations. Since Splatfacto reconstructs scenes using 3D Gaussian splats and lacks native PC export in Nerfstudio, this study employed the 3DGS-to-PC framework by Stuart and Pound [
60] to convert splats into PCs. This framework samples points from Gaussians based on their volume, filters outliers using Mahalanobis distance, and recalculates colors from rendered images to improve visual accuracy. However, the converted outputs, shown in
Figure 19, contain structural noise, such as floating clusters, blurred edges, and irregular densities that obscure object boundaries and disrupt spatial consistency. Although Splatfacto achieves high PSNR and SSI scores, these irregularities in the PCs hinder accurate wall, window, and door classification and extraction within the image-to-BIM workflow.
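The volume-based sampling and Mahalanobis-distance filtering described above can be illustrated with a short sketch that samples points from a single splat and discards low-probability tails. This is a simplified stand-in for the 3DGS-to-PC framework, not its actual implementation; the threshold and sample count are assumed values:

```python
import numpy as np

def sample_gaussian_points(mean, cov, n_points=100, max_mahalanobis=2.0):
    """Sample points from one 3D Gaussian splat and discard outliers
    whose Mahalanobis distance from the mean exceeds the threshold."""
    rng = np.random.default_rng(0)
    pts = rng.multivariate_normal(mean, cov, size=n_points)
    cov_inv = np.linalg.inv(cov)
    diff = pts - mean
    # Squared Mahalanobis distance of each sampled point
    d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
    return pts[np.sqrt(d2) <= max_mahalanobis]

# Example: an elongated splat stretched along the x-axis
mean = np.array([0.0, 0.0, 0.0])
cov = np.diag([0.04, 0.01, 0.01])
kept = sample_gaussian_points(mean, cov, n_points=500)
print(len(kept))  # most samples survive; extreme tails are removed
```

Scaling `n_points` with the Gaussian's volume, as the framework does, concentrates points on large splats; the floating clusters noted above arise when splats themselves sit away from true surfaces, which per-splat filtering cannot correct.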
4.6. Automated BIM Creation Results and Evaluation
The final phase of the proposed method was assessed through the generation of BIM models from reconstructed scenes. This subsection presents the results on semantic accuracy, spatial alignment, and geometric precision, based on comparisons between the generated output and ground-truth annotations. As shown in
Figure 6, the reconstruction process using segmented images enables precise label transfer and results in a successfully reconstructed 3D scene in which building elements such as walls, windows, and doors are preserved across the scene with their predefined colors. Building on this labeled reconstruction,
Figure 7 displays the exported color-labeled PCs, which serve as the input for subsequent processing. Geometric features are then extracted through wall plane fitting and object clustering, as illustrated in
Figure 8. These extracted elements are automatically serialized and imported into Autodesk Revit to generate the final BIM model, shown in
Figure 10. Together, these stages demonstrate the method’s ability to maintain semantic integrity, capture accurate geometry, and produce a structured and automation-ready BIM output directly from 2D imagery.
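The wall plane fitting step in this pipeline can be pictured with a minimal RANSAC sketch over a labeled point cluster. This is an illustrative stand-in, not the study’s actual extraction code; the distance threshold and iteration count are assumed values:

```python
import numpy as np

def fit_plane_ransac(points, n_iters=200, dist_thresh=0.02, seed=0):
    """Fit a dominant plane (e.g., a wall) to a point cluster via RANSAC.
    Returns (unit normal n, offset d) with n . p + d ~ 0 for inliers."""
    rng = np.random.default_rng(seed)
    best_inliers, best_model = None, None
    for _ in range(n_iters):
        p0, p1, p2 = points[rng.choice(len(points), 3, replace=False)]
        normal = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(normal)
        if norm < 1e-9:          # degenerate (collinear) sample
            continue
        normal /= norm
        d = -normal @ p0
        dist = np.abs(points @ normal + d)
        inliers = dist < dist_thresh
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers, best_model = inliers, (normal, d)
    return best_model, best_inliers

# Synthetic wall: points near the plane x = 1, plus random clutter
rng = np.random.default_rng(1)
wall = np.column_stack([np.full(200, 1.0) + rng.normal(0, 0.005, 200),
                        rng.uniform(0, 5, 200), rng.uniform(0, 3, 200)])
clutter = rng.uniform(0, 5, (40, 3))
model, inliers = fit_plane_ransac(np.vstack([wall, clutter]))
```

Because the PCs here are already color-labeled by class, plane fitting can be run per class cluster, which is what makes the subsequent serialization step straightforward.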
The workflow’s performance is evaluated using three critical criteria: (1) semantic accuracy in detecting windows and doors, (2) geometric precision evaluated using perimeter comparisons, and (3) spatial fidelity assessed through area-based overlap metrics.
Table 2 presents the door and window counts from the semantic feature extraction analysis of both datasets. The proposed method correctly identified all architectural elements: 248 windows and 4 doors in Dataset 1, and 4 windows and 2 doors in Dataset 2. This perfect match with the ground truth highlights the reliability of the element detection process and confirms the method’s ability to maintain semantic precision across different scene types.
To contextualize these results,
Table 3 compares door and window extraction performance against previous Scan-to-BIM studies that employed PC-based and image-based approaches. Across references [
22,
46,
61,
62], reported accuracies for door and window instance detection range between 96 and 100 percent. Because “accuracy” is defined differently across these studies, it should be clarified that here detection accuracy refers to the successful identification of an element. The proposed image-to-BIM method achieves results consistent with these benchmarks, matching the highest reported accuracies in element detection without reliance on laser scanning or PC input. In these benchmark studies, the number of detected windows and doors ranged from 0 to 28, whereas the proposed method successfully detected up to 248 windows without any error.
Further geometric evaluation was conducted using perimeter accuracy. As shown in
Table 4, the proposed method demonstrated a strong ability to maintain geometric consistency between the reconstructed models and ground-truth buildings. For Dataset 1, the predicted perimeter was 100.38 m, compared to the actual perimeter of 100.32 m, resulting in a minimal deviation of 0.06 m and an accuracy of 99.94%. For Dataset 2, the predicted perimeter was 37.83 m versus an actual value of 37.27 m, with an error of 0.56 m and an accuracy of 98.49%.
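These perimeter figures follow from a simple relation, accuracy = 100 × (1 − |predicted − actual| / actual), which can be checked directly (the Dataset 2 value differs from the reported 98.49% only in the final rounding):

```python
def perimeter_accuracy(predicted, actual):
    """Percent accuracy from the absolute perimeter deviation (metres)."""
    deviation = abs(predicted - actual)
    return deviation, 100.0 * (1.0 - deviation / actual)

dev1, acc1 = perimeter_accuracy(100.38, 100.32)  # Dataset 1
dev2, acc2 = perimeter_accuracy(37.83, 37.27)    # Dataset 2
print(round(dev1, 2), round(acc1, 2))  # 0.06 99.94
print(round(dev2, 2), round(acc2, 2))  # 0.56 98.5 (reported as 98.49%)
```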
Table 5 compares the proposed method’s perimeter accuracy with previously published results. In the benchmark study [
46], perimeter deviations ranged from 0.019 to 0.929 m, corresponding to accuracies between 99.30 and 99.90 percent. The proposed image-to-BIM approach achieves comparable geometric precision.
Table 6 presents the spatial accuracy evaluation based on area comparisons. For Dataset 1, the predicted building footprint aligns closely with the reference geometry, achieving a precision of 0.994, a recall of 0.997, an F1 score of 0.996, and an IoU of 0.992. The Dataset 2 reconstruction yields a predicted area of 83.06 m² and an intersection of 80.88 m², resulting in a precision of 0.974, a recall of 0.995, an F1 score of 0.984, and an IoU of 0.969. These metrics confirm the pipeline’s ability to preserve spatial accuracy and generate BIM models that consistently reflect real-world dimensions.
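These area-based metrics can be reproduced from the reported areas. Note that the ground-truth footprint area for Dataset 2 is not stated in the text, so the value used below (81.29 m²) is back-calculated from the published recall and should be treated as an assumption:

```python
def area_metrics(pred_area, gt_area, intersection):
    """Precision, recall, F1, and IoU computed from footprint areas (m^2)."""
    precision = intersection / pred_area
    recall = intersection / gt_area
    f1 = 2 * precision * recall / (precision + recall)
    iou = intersection / (pred_area + gt_area - intersection)
    return precision, recall, f1, iou

# Dataset 2: predicted 83.06 m^2, intersection 80.88 m^2.
# The 81.29 m^2 ground-truth area is back-calculated from the
# reported recall of 0.995 (an assumption, not a published value).
p, r, f1, iou = area_metrics(83.06, 81.29, 80.88)
print(round(p, 3), round(r, 3), round(f1, 3), round(iou, 3))
# 0.974 0.995 0.984 0.969
```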
4.7. Comparative Analysis with Scan-to-BIM Literature
To contextualize the performance of the proposed method, a comparative analysis was conducted against several recent AI-enhanced Scan-to-BIM approaches, as summarized in
Table 7. The comparison demonstrates that the proposed method achieves state-of-the-art performance on both Dataset 1 and Dataset 2. Across the selected benchmark studies, evaluated metric values fall within the range of 0.93–0.999, and the proposed method performs at a comparable or higher level of spatial accuracy than traditional Scan-to-BIM approaches. The slightly higher accuracy observed on Dataset 1 can be attributed to the superior input quality provided by the UAV’s stabilized 4K camera and structured flight path.
Unlike conventional Scan-to-BIM studies, the present method provides a more efficient and less expensive solution. The conventional approach relies on costly LiDAR or laser-scanning hardware as well as PC processing and CNN-based PC segmentation networks such as RandLA-Net or Mask R-CNN. In contrast, the proposed method performs detection and segmentation directly on RGB imagery using open-source vision–language models (YOLO and SAM) combined with NeRF or Gaussian Splatting for 3D reconstruction, which is cost-free and removes the need for PC segmentation.
4.8. Comparative Analysis with NeRF-Based Methods
Recent NeRF-based reconstruction studies have introduced varying levels of automation; however, their workflows remain distinct from the method presented in this study. NeRF-to-BIM [
65] demonstrated an initial attempt to couple NeRF reconstruction with BIM creation, where Instant-NeRF produced PCs later segmented by PointNeXt to classify structural elements such as beams and columns. Although this work verified the potential of NeRF-generated data for BIM applications, it required intensive post-processing and manual refinement. Harbingers of NeRF-to-BIM [
66] extended this to a three-stage process combining NeRF reconstruction, fine-tuned PointNeXt segmentation on synthetic BIM-derived datasets, and BIM generation; however, it still relied on labeled PCs and separate training steps. Similarly, Li et al. [
67] applied Nerfacto to reconstruct building façades from UAV imagery, followed by PointNet++ segmentation of the NeRF-derived PCs. Montas-Laracuente et al. [
68] adopted Gaussian Splatting to generate high-fidelity meshes for heritage documentation but without semantic or BIM automation. In contrast, the proposed method explicitly advances this line of research by bypassing PC post-processing, embedding semantic information directly at the image level before NeRF training. This early-stage semantic integration allows class labels to propagate throughout the 3D reconstruction, resulting in PCs that are already color-segmented and labeled during generation. Consequently, precise geometry extraction can be performed without additional segmentation or manual refinement, thereby achieving a higher degree of automation and practical efficiency than previous NeRF-based approaches.
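The early-stage semantic integration described above amounts to overwriting segmented pixels with fixed class colors before NeRF training, so that labels propagate through reconstruction as appearance. A minimal sketch of this labeling step follows; the class colors and the `embed_labels` helper are illustrative assumptions, not the study’s actual code:

```python
import numpy as np

# Fixed RGB code per class; labels survive reconstruction as color.
CLASS_COLORS = {"wall": (255, 0, 0), "window": (0, 0, 255), "door": (0, 255, 0)}

def embed_labels(image, masks):
    """Overwrite segmented pixels with their class color.
    image: (H, W, 3) uint8 array; masks: {class_name: (H, W) bool}."""
    labeled = image.copy()
    for cls, mask in masks.items():
        labeled[mask] = CLASS_COLORS[cls]
    return labeled

# Toy 4x4 image with a 2x2 "window" region
img = np.zeros((4, 4, 3), dtype=np.uint8)
win = np.zeros((4, 4), dtype=bool)
win[1:3, 1:3] = True
out = embed_labels(img, {"window": win})
```

Because every training view carries the same color coding, the radiance field renders those colors consistently in 3D, and the exported PC is already class-segmented by color.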
4.9. Comparative Analysis of Evaluated NeRF Models
Since the proposed method relies on NeRF-based reconstruction, selecting an appropriate model is important.
Table 8 presents a detailed comparison of three NeRF models, evaluating them across visual fidelity, computational performance, and PC suitability for BIM extraction. Among them, Splatfacto delivers the highest reconstruction quality and speed, far surpassing the performance of Instant-NGP and Nerfacto. Despite its speed and visual quality, Splatfacto has practical limitations for Scan-to-BIM workflows. It does not natively export PCs, requiring an external conversion step that introduces structural noise and blurs object boundaries. As a result, the generated PC lacks the consistency needed for object extraction.
In contrast, while Nerfacto’s visual metrics are slightly lower than those of Splatfacto, it produces the most structured and spatially coherent PC outputs, making it more suitable for geometry extraction. Instant-NGP, while known for its speed in NeRF literature, shows longer training times in this evaluation and delivers moderate reconstruction quality. Its PCs exhibit soft edges and localized noise, which may hinder precise segmentation and alignment in BIM generation tasks. Nonetheless, it offers a balanced computational profile, with moderate resource efficiency. Overall, these findings suggest that Nerfacto is the most suitable model for geometry-aware applications, such as Scan-to-BIM, where spatial consistency is crucial.
5. Discussion
5.1. Key Contributions and Implications
The primary contribution of this work lies in offering a method for automated BIM generation from images using accessible hardware and open-source tools. It bridges critical gaps in the Scan-to-BIM literature by (1) demonstrating that BIM models can be generated from image data alone, eliminating reliance on scanning equipment by producing PCs directly from RGB images using NeRF volumetric rendering. This is achieved by integrating object-level semantic segmentation with NeRF-based 3D reconstruction in a unified method, allowing the system to learn spatial structure directly from labeled RGB inputs rather than relying on traditional photogrammetry or LiDAR-based scans; (2) demonstrating, for the first time, how semantic identity can be embedded into NeRF-based volumetric reconstruction and subsequently extracted as structured geometric data for BIM modeling. This removes the need for post-reconstruction segmentation by embedding semantic labels at the pixel level during preprocessing using computer vision-language techniques. These labels are preserved throughout the rendering process, eliminating projection-based mapping that might lead to misalignment and inconsistency, and ensuring that each building element retains its identity from image input to final BIM output; and (3) developing the workflow from image acquisition to BIM output without requiring domain-specific training data. The extracted geometry is serialized in a structured JSON format and directly imported into Revit, enabling automation of BIM model creation based on the building elements’ embedded semantic and geometric attributes.
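To make the serialization hand-off to Revit concrete, it might look like the following sketch; the JSON schema, field names, and values shown here are hypothetical illustrations, not the study’s actual format:

```python
import json

# Hypothetical serialization of extracted elements; the field names
# below are illustrative assumptions, not the study's actual schema.
elements = {
    "walls": [
        {"start": [0.0, 0.0], "end": [10.2, 0.0], "height": 3.0, "thickness": 0.2}
    ],
    "windows": [
        {"host_wall": 0, "offset": 2.5, "width": 1.2, "height": 1.5, "sill": 0.9}
    ],
    "doors": [
        {"host_wall": 0, "offset": 7.0, "width": 0.9, "height": 2.1}
    ],
}
payload = json.dumps(elements, indent=2)
# A Revit-side script (e.g., via the Revit API or Dynamo) would read this
# payload and place native wall, window, and door families accordingly.
```

Hosting openings by wall index and offset, as sketched here, mirrors how Revit treats windows and doors as wall-hosted families, which is what allows the import to produce native elements rather than generic geometry.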
Additionally, this study performs a comparative evaluation of three reconstruction methods, namely Instant-NGP, Nerfacto, and Splatfacto, in the context of Scan-to-BIM automation. It establishes criteria for selecting suitable models based on PC quality, rendering efficiency, and spatial fidelity, identifying Nerfacto as the most effective for geometry extraction. This may advance the application of NeRF models using the criteria beyond view synthesis and introduces them as viable tools for geometry-aware BIM modeling.
In practical terms, the proposed method transforms how existing buildings are digitized by eliminating the need for terrestrial scanning devices or tedious manual on-site surveying. Tasks that were once labor-intensive, equipment-dependent, and costly, such as site surveying and digital modeling, can benefit from widely accessible camera devices, expanding the practicality and scalability of BIM-based planning across the construction and facilities sector.
The framework is especially impactful for aging, undocumented, or under-maintained buildings, where original drawings are often unavailable. It facilitates rapid creation of digital twins for a wide range of practical applications, including renovation planning, space optimization, and exterior façade restoration. The resulting BIM model is generated directly within Revit using native building elements, which enables compatibility with existing BIM workflows. It functions not only as a geometric record but also as an operational asset for facility management tasks such as quantity take-offs, material tracking, and renovation cost estimation performed directly within the Revit environment. Furthermore, the model can be exported from Revit to open formats such as IFC, supporting interoperability with facility management and asset information systems commonly used in practice. In energy retrofit scenarios, the method provides reliable building geometry for tasks such as insulation upgrades, envelope improvements, and solar panel placement, where models can be used directly within Revit for energy analyses or exported to IDF or gbXML formats for integration with tools like EnergyPlus and OpenStudio. However, all these applications need to be tested and evaluated in future studies. The accessibility and speed of this approach could make it a compelling solution for practitioners seeking to modernize building data acquisition workflows without the burden of costly hardware or extensive manual effort.
5.2. Limitations and Future Research Direction
Despite its strengths, the study also has limitations. First, NeRF reconstruction quality remains sensitive to image completeness and clarity, and occlusions or inadequate image coverage can compromise model accuracy and completeness. Moreover, lighting variation, textureless or reflective surfaces, and the presence of dynamic objects during image capture may further reduce reconstruction fidelity by introducing inconsistencies in color, depth, and surface definition, which can ultimately lead to reconstruction failure. Second, the performance of the proposed image-to-BIM workflow was validated on two scenes, one exterior and one interior, as a proof of concept. Broader validation on buildings featuring intricate or unconventional architectural styles remains untested, which may affect generalizability. Future research should extend the evaluation to a wider range of typologies, including heritage buildings and complex geometries, to assess the scalability and robustness of the method across diverse architectural conditions.
Third, the present system focuses exclusively on core building elements such as walls, doors, and windows. To enhance the versatility and completeness of this method, future investigations should expand semantic classification and geometric reconstruction capabilities to other building components, including floors, ceilings, columns, stairs, and complex internal spatial layouts. Fourth, while this study successfully automates several fragmented components of the image-to-BIM workflow, including segmentation, reconstruction, element extraction, and BIM creation, future work should focus on developing an end-to-end unified framework that seamlessly integrates all stages into a single automated pipeline. Fifth, while Gaussian splatting methods such as Splatfacto showed the highest potential in reconstruction quality, the lack of native PC export in current implementations remains a limitation. Future work should address this gap to unlock the full utility of Gaussian-based reconstructions in geometry-driven BIM applications. Sixth, incorporating adaptive parameter adjustments within geometric algorithms could improve performance and flexibility across architectural typologies and conditions.
Seventh, a comprehensive ablation study should be conducted to evaluate the contribution of each component, including YOLO for detection, SAM for segmentation, NeRF and Gaussian Splatting for reconstruction, and to optimize their interaction within the framework. This will clarify each module’s impact on accuracy and efficiency and guide future improvements in the image-to-BIM workflow. Eighth, all experiments were conducted on a high-end RTX 4090 GPU, which significantly reduced training time. Future research should evaluate the framework’s scalability on mid-range GPUs and cloud-based computing platforms to ensure broader accessibility and assess performance trade-offs under different hardware configurations. Lastly, while the proposed method effectively creates BIM models for undocumented buildings, future research should focus on extending the framework to enable continuous maintenance and digital twin updates by allowing new image inputs to be reprocessed to detect and integrate physical changes within the existing building model.
6. Conclusions
This study introduced an automated image-to-BIM method that integrates vision-language segmentation, NeRF-based 3D reconstruction, and structured geometric modeling to generate semantically rich and spatially accurate BIM models from ordinary 2D images. Experimental validation demonstrates that the proposed method achieves strong semantic, spatial, and geometric performance while eliminating the need for traditional scanning tools such as LiDAR or photogrammetry. Its key novelty lies in unifying semantic segmentation and 3D reconstruction in a single workflow by embedding object labels during NeRF processing, thereby removing the need for post-processing of PCs and enabling a more efficient method for automatically generating BIM models.
Experimental validation confirms the method’s high performance across both semantic and geometric dimensions, within the limitations noted above. The system successfully detected all windows and doors across two distinct datasets: an exterior scene captured via UAV and an interior scene recorded with a handheld smartphone. In area-based evaluations, the method achieved a precision of 0.994 and an IoU of 0.992 for the exterior case, and a precision of 0.974 with an IoU of 0.969 for the interior case. Perimeter comparisons confirmed geometric consistency, with deviations of just 0.06 m and 0.56 m for the exterior and interior datasets, yielding accuracies of 99.94% and 98.49%, respectively. The proposed method consistently achieved superior performance in comparative evaluations against recent deep learning-based reconstruction methods, recording higher precision (up to 0.994), F1 scores (up to 0.996), and IoU values (up to 0.992) and outperforming existing approaches.
In evaluating three NeRF variants, Splatfacto demonstrated the highest image-based reconstruction quality, with PSNR of 22.88, SSI of 0.86, and LPIPS of 0.15, alongside industry-leading rendering speeds of 137.9 FPS and 71M rays per second. However, due to structural artifacts and density inconsistencies introduced during PC conversion, Splatfacto proved less effective for downstream geometric extraction. In contrast, Nerfacto, while slightly lower in visual quality, generated cleaner, more structured PCs that enabled more reliable element detection, making it a more suitable choice for geometry-aware BIM modeling within the proposed method.
By removing reliance on expensive scanning hardware and minimizing technical barriers, this method presents a scalable and accessible alternative for automated BIM generation. Its applicability spans a wide range of real-world use cases, including architectural documentation, renovation planning, energy retrofitting, facility management, and digital twin development. This work not only demonstrates a practical, low-cost pathway for digitizing existing buildings but also contributes to the theoretical advancement of image-driven reconstruction by establishing a new benchmark for semantic-aware NeRF-based Scan-to-BIM automation.