Large-Scale Point Cloud Completion Through Registration and Fusion of Object-Level Reconstructions
Abstract
1. Introduction
2. Related Works
3. Materials and Methods
3.1. Large-Scale 3D Reconstruction Using COLMAP
- SfM Stage: This stage recovers the camera poses and the sparse 3D structure of the scene. The core optimization is Bundle Adjustment, which jointly refines all camera poses and 3D point coordinates by minimizing the total reprojection error across all views (a standard form of this objective is shown after this list).
- MVS Stage: Following the accurate estimation of camera poses by SfM, the MVS stage generates a dense point cloud by computing and fusing depth maps from multiple neighboring views.
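For concreteness, the Bundle Adjustment objective minimized by SfM pipelines such as COLMAP can be written in the following standard form (the notation here is ours, not taken from the paper):

$$
\min_{\{\mathbf{R}_i,\,\mathbf{t}_i\},\,\{\mathbf{X}_j\}} \; \sum_{i} \sum_{j \in \mathcal{V}(i)} \rho\!\left( \left\lVert \pi\!\left( \mathbf{K}_i \left( \mathbf{R}_i \mathbf{X}_j + \mathbf{t}_i \right) \right) - \mathbf{x}_{ij} \right\rVert^{2} \right)
$$

where $\mathbf{R}_i, \mathbf{t}_i$ are the pose of camera $i$, $\mathbf{K}_i$ its intrinsics, $\mathbf{X}_j$ a 3D point, $\mathcal{V}(i)$ the set of points visible in camera $i$, $\mathbf{x}_{ij}$ the observed image location of point $j$ in camera $i$, $\pi(\cdot)$ the perspective projection, and $\rho(\cdot)$ a robust loss such as Huber or Cauchy.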
3.2. High-Fidelity Object Reconstruction with an Enhanced NeRF Model
- Incorporated depth maps as a geometric prior. In a standard NeRF, geometry is inferred through photometric consistency across multiple views, with the volume density (σ) integrated along each ray and the network optimized by minimizing photometric loss. However, this indirect approach can lead to scale ambiguity and local minima. In contrast, our method uses a self-supervised depth network to generate dense, metric-scale depth maps for each frame. Through a four-stage iterative pipeline, the depth rendered by the NeRF is used to refine the depth network, creating a feedback loop between depth estimation and NeRF optimization. This enables simultaneous convergence of absolute scale, relative geometry, and fine details, effectively eliminating the “floater” artifacts and geometric drift typical in standard NeRF.
- Multi-resolution Hash Encoding. Multi-resolution hash encoding efficiently represents 3D space using a multi-level feature grid, making it ideal for NeRF rendering. This approach captures both the overall shape (with low-resolution grids) and fine surface details (with high-resolution grids). Our method enhances this by adaptively optimizing the hash encoding using depth maps and object masks during training. In regions with sharp object edges or depth changes, high-resolution features are updated more strongly to capture precise boundaries. In contrast, low-resolution features maintain overall shape stability, preventing overfitting in areas with sparse views. This “local detail, global stability” approach works in synergy with depth and mask constraints, ensuring a unified segmentation boundary, depth, and NeRF density surface, leading to efficient, high-fidelity 3D object reconstruction.
- The model is supervised by a joint loss function with three components (a sketch of this composite loss follows this list). A standard NeRF relies solely on pixel color for supervision, with no explicit constraints on object boundaries or geometry. Our method instead introduces a composite loss with three components. The first is an absolute depth loss, which uses the generated depth maps as supervision to constrain NeRF's learning of object geometry. The second is a relative depth loss, which enforces correct depth ordering by sampling pixel pairs within the same instance region that are sufficiently separated in image space. The third is an instance mask loss, which uses a binary mask from SAMv2 to isolate the object from the background, ensuring that the loss is computed only over the target region and sharpening its boundary. Together these terms provide strong supervision with explicit geometric semantics, replacing weak RGB-only supervision with constraints that align the reconstructed surface with the true object edges at the sub-pixel level.
- We leveraged the Segment Anything v2 (SAMv2) model to obtain accurate object masks for NeRF reconstruction. Rather than coupling SAMv2 into the optimization loop, we execute its frozen image encoder and mask decoder once per frame before NeRF training begins, producing binary masks that are kept fixed throughout training. These masks are then used to weight the photometric, depth, and density losses pixel-wise, so gradients flow only into the NeRF MLP and its hash encoding while the SAMv2 parameters remain untouched. This lightweight strategy keeps the entire pipeline fully differentiable, yet enables the NeRF to focus on the object region and achieves precise alignment between the mask boundary and the reconstructed surface.
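A minimal sketch of how such a composite loss could be assembled for a batch of rays is shown below. The weighting coefficients, tensor names, and the hinge form of the relative-depth term are illustrative assumptions (the image-space separation constraint on pixel pairs is omitted for brevity); this is not the authors' released implementation:

```python
import torch
import torch.nn.functional as F

def composite_loss(rgb_pred, rgb_gt, depth_pred, depth_prior, acc_pred, mask,
                   n_pairs=1024, w_abs=0.1, w_rel=0.05, w_mask=0.01):
    """Illustrative joint loss over a batch of N rays.

    rgb_pred/rgb_gt: (N, 3) rendered and observed colors.
    depth_pred:      (N,) depth rendered by the NeRF.
    depth_prior:     (N,) metric-scale depth from the depth network.
    acc_pred:        (N,) accumulated opacity along each ray.
    mask:            (N,) binary SAMv2 mask (1 = object).
    All weights are hypothetical; the paper does not publish exact values.
    """
    in_mask = mask > 0.5

    # Photometric term, restricted to the object region by the SAMv2 mask.
    l_rgb = F.mse_loss(rgb_pred[in_mask], rgb_gt[in_mask])

    # (1) Absolute depth loss against the metric-scale depth prior.
    l_abs = F.l1_loss(depth_pred[in_mask], depth_prior[in_mask])

    # (2) Relative depth loss: random in-mask pixel pairs must preserve the
    # depth ordering given by the prior (hinge on the signed difference).
    idx = in_mask.nonzero(as_tuple=True)[0]
    i = idx[torch.randint(len(idx), (n_pairs,))]
    j = idx[torch.randint(len(idx), (n_pairs,))]
    order = torch.sign(depth_prior[i] - depth_prior[j])      # +1, 0, or -1
    l_rel = F.relu(-order * (depth_pred[i] - depth_pred[j])).mean()

    # (3) Instance mask loss: rendered opacity should match the binary mask.
    l_mask = F.binary_cross_entropy(acc_pred.clamp(1e-5, 1 - 1e-5), mask.float())

    return l_rgb + w_abs * l_abs + w_rel * l_rel + w_mask * l_mask
```

In this form the mask gates the photometric and depth terms while the opacity term pushes density toward the mask boundary, consistent with the frozen-SAMv2 design described above: gradients flow only into the NeRF MLP and its hash encoding.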
3.3. Evaluation Metrics
3.3.1. Registration Quality Metrics
- Fitness measures how well the source point cloud aligns with the target point cloud after a transformation is applied. It is the ratio of inlier correspondences (points that align well) to the total number of points in the target cloud; a higher Fitness score indicates better alignment (a computation sketch for all three metrics follows this list).
- RMSE calculates the root mean square error of the Euclidean distances between inlier points in the two clouds. It quantifies the accuracy of the alignment, where a lower RMSE value signifies better precision in matching the point clouds.
- Normal Consistency evaluates the smoothness and coherence of the surface by examining the alignment of normal vectors (directional vectors perpendicular to the surface) of neighboring points. A higher score reflects a smoother and more geometrically consistent surface, with fewer artifacts.
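Since the paper lists open3d 0.19.0 among its dependencies, Fitness and Inlier RMSE plausibly follow Open3D's `evaluate_registration` conventions; the normal-consistency helper below is our own minimal sketch of the stated definition, and the neighborhood sizes are assumptions:

```python
import numpy as np
import open3d as o3d

def registration_metrics(source, target, transform, dist_thresh=0.05):
    """Fitness (# inlier correspondences / # target points) and inlier RMSE,
    as reported by Open3D; dist_thresh is an assumed inlier distance."""
    result = o3d.pipelines.registration.evaluate_registration(
        source, target, dist_thresh, transform)
    return result.fitness, result.inlier_rmse

def normal_consistency(cloud, radius=0.1, max_nn=30, k=10):
    """Mean absolute cosine between each point's normal and the normals of
    its k nearest neighbours; closer to 1 means a smoother surface."""
    cloud.estimate_normals(
        o3d.geometry.KDTreeSearchParamHybrid(radius=radius, max_nn=max_nn))
    normals = np.asarray(cloud.normals)
    tree = o3d.geometry.KDTreeFlann(cloud)
    scores = []
    for i, p in enumerate(np.asarray(cloud.points)):
        _, idx, _ = tree.search_knn_vector_3d(p, k + 1)
        nbr = np.asarray(idx)[1:]                 # drop the query point itself
        scores.append(np.abs(normals[nbr] @ normals[i]).mean())
    return float(np.mean(scores))
```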
3.3.2. Completion Quality Metrics
- Geometric Smoothness quantifies the local geometric fidelity of the surface formed by the new points generated by the completion algorithm relative to the original point cloud. An ideal completion algorithm should produce smooth, natural geometry that transitions seamlessly into the adjacent original surface, rather than introducing noisy or jagged artifacts; this metric therefore measures the roughness of the completed surface and serves as a key proxy for its physical realism. We employ Surface Variation as the core computation, a technique based on covariance analysis of local neighborhoods. For each point in the completed region, we construct the covariance matrix of the points within its spherical neighborhood of radius $r$. The eigenvalues $\lambda_0 \le \lambda_1 \le \lambda_2$ of this matrix describe the variance of the neighborhood's point distribution along the three principal directions; the smallest eigenvalue, $\lambda_0$, corresponds to the variance along the normal direction of the point's local tangent plane. Surface Variation is thus defined as the normalized smallest eigenvalue, $\sigma(p) = \lambda_0 / (\lambda_0 + \lambda_1 + \lambda_2)$. Its value ranges from 0 to 1/3; a lower value signifies that the point's local neighborhood is highly coplanar, indicating that the completed surface is smoother and more physically plausible.
- Distribution Uniformity assesses the quality of the spatial distribution of a completed point cloud. The point set produced by a superior completion algorithm should be uniformly distributed, avoiding unnatural point clusters or sparse voids in localized regions, since such distributional artifacts can adversely affect downstream tasks like meshing and feature extraction. To quantify distribution uniformity, we compute the Local Density Variance (LDV) metric and further derive the Normalized Uniformity Error (NUE), thereby eliminating scene-scale and sampling-resolution biases and enabling comparable assessment across different objects and datasets.
- In this formulation, the local density at point $i$ is defined as $\rho_i = k / (\pi r_{i,k}^2)$: the disk enclosed by the $k$-th nearest neighbour of point $i$ is treated as the local neighbourhood, and the density is the ratio of the fixed point count $k$ to the disk area, so the smaller the radius $r_{i,k}$, the more compact the neighbourhood and the higher the density. The variance of all these local densities, $\sigma_\rho^2 = \mathrm{Var}(\{\rho_i\})$, yields the density fluctuation across the entire point cloud; a smaller $\sigma_\rho^2$ indicates smaller density differences among neighbourhoods and thus a more uniform spatial distribution. We adopt the Normalized Uniformity Error (NUE) as the final metric, defined as the ratio of the local-density variance of the completed region to that of the original large-scale scene, $\mathrm{NUE} = \sigma_{\rho,\mathrm{comp}}^2 / \sigma_{\rho,\mathrm{scene}}^2$. This dimensionless value directly reflects how consistently the completed point cloud fluctuates in local density compared with its surroundings. Empirically, NUE ≤ 1.0 indicates that the completed area is more uniform than the scene; 1.0 < NUE ≤ 1.3 is perceived as a seamless transition; and NUE > 2.0 signals noticeable density stripes, calling for parameter retuning or additional sampling. Through this normalization, quantitative uniformity comparison across different objects and datasets is achieved.
- To assess the Structural Plausibility of the completed geometry, we introduce the Primitive Fitting Score (PFS) metric. Methodologically, a robust estimation algorithm such as RANSAC (Random Sample Consensus) is applied iteratively to the set of newly generated points, $P_{\mathrm{new}}$, to segment instances of multiple geometric primitives. We define $P_{\mathrm{in}}$ as the union of all points identified as inliers to any of the successfully detected primitives, where $P_{\mathrm{in}} \subseteq P_{\mathrm{new}}$. The Primitive Fitting Score is then defined as the ratio of the cardinality of this inlier set to the total number of newly generated points: $\mathrm{PFS} = |P_{\mathrm{in}}| / |P_{\mathrm{new}}|$.
- Consequently, a higher score (ranging from 0 to 1, with values closer to 1.0 being better) provides strong evidence that the completion is dominated by regular, interpretable geometric structures, indicating that the algorithm has restored the object's expected shape with high structural fidelity. (A computational sketch of the three completion metrics follows this list.)
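The three completion metrics can be computed directly from the definitions above. The sketch below is a minimal illustration, not the authors' code: the neighbourhood radius, $k$, the RANSAC thresholds and stopping rule, and the restriction to plane primitives (the paper allows multiple primitive types) are all simplifying assumptions of ours.

```python
import numpy as np
import open3d as o3d

def surface_variation(cloud, radius=0.1):
    """Mean normalized smallest eigenvalue lambda0/(lambda0+lambda1+lambda2)
    over spherical neighbourhoods; in [0, 1/3], lower means smoother."""
    pts = np.asarray(cloud.points)
    tree = o3d.geometry.KDTreeFlann(cloud)
    vals = []
    for p in pts:
        n, idx, _ = tree.search_radius_vector_3d(p, radius)
        if n < 4:
            continue                                   # need a full 3D neighbourhood
        cov = np.cov(pts[np.asarray(idx)].T)
        eig = np.sort(np.linalg.eigvalsh(cov))         # lambda0 <= lambda1 <= lambda2
        vals.append(eig[0] / max(eig.sum(), 1e-12))
    return float(np.mean(vals))

def local_density_variance(cloud, k=8):
    """Variance of rho_i = k / (pi * r_ik^2), with r_ik the distance to the
    k-th nearest neighbour of point i."""
    pts = np.asarray(cloud.points)
    tree = o3d.geometry.KDTreeFlann(cloud)
    dens = []
    for p in pts:
        _, _, d2 = tree.search_knn_vector_3d(p, k + 1)  # d2: squared distances
        r2 = np.asarray(d2)[-1]                         # squared radius to k-th neighbour
        dens.append(k / (np.pi * max(r2, 1e-12)))
    return float(np.var(dens))

def nue(completed, scene, k=8):
    """Normalized Uniformity Error: LDV(completed region) / LDV(scene)."""
    return local_density_variance(completed, k) / local_density_variance(scene, k)

def primitive_fitting_score(new_points, dist=0.02, max_planes=5, min_support=50):
    """Simplified PFS: iterative RANSAC plane extraction only; the score is
    the fraction of newly generated points lying on a detected primitive."""
    cloud = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(new_points))
    total, inliers = len(new_points), 0
    for _ in range(max_planes):
        if len(cloud.points) < 3:
            break
        _, idx = cloud.segment_plane(distance_threshold=dist,
                                     ransac_n=3, num_iterations=1000)
        if len(idx) < min_support:                      # remaining support too weak
            break
        inliers += len(idx)
        cloud = cloud.select_by_index(idx, invert=True)  # fit next primitive on the rest
    return inliers / total
```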
4. Experiments
4.1. Datasets
4.2. Detailed Experimental Setup
4.2.1. Stage 1: Large-Scale Scene Reconstruction Using COLMAP
4.2.2. Stage 2: Object-Level Reconstruction with an Enhanced NeRF Model
4.2.3. Stage 3: Point Cloud Completion via Registration and Fusion
4.3. Ablation Experiments
5. Discussion
- In this work, we propose a novel algorithm that leverages high-fidelity, object-level reconstructions to markedly boost the accuracy of local objects within large-scale scenes. The core idea is to introduce a completion algorithm founded on high-fidelity data rather than on purely algorithmic inference. Specifically, an improved Neural Radiance Fields (NeRF) model was employed to generate metrically accurate, physically plausible data that were subsequently used to complete the missing geometry. This innovation guarantees that the resulting point cloud simultaneously attains superior geometric precision and physical realism, thereby satisfying the stringent accuracy demands of downstream tasks. From a scientific perspective, our contribution lies in tightly coupling large-scale scene reconstruction with object-level reconstruction, offering the field a new viewpoint and methodology. By fusing high-resolution object data into the global scene, the algorithm preserves fine local details without sacrificing an accurate description of overall geometry. This fusion not only elevates reconstruction accuracy but also enriches the level of detail, laying a solid foundation for understanding and analyzing complex environments. Moreover, the proposed algorithm exhibits strong generalizability. Relying on high-fidelity, physically based data as a powerful prior, the completion method performed consistently across diverse scenes and object types. Such cross-domain adaptability opens broad prospects for future applications, ranging from virtual and augmented reality to smart manufacturing, and underscores the long-term research value of our approach.
- Introducing Complementary Unsupervised Evaluation Metrics: We critically reviewed the limitations of existing geometry-based evaluation metrics and proposed a suite of complementary, unsupervised metrics (e.g., assessing geometric smoothness, uniformity, and structural plausibility). This provides a more scientific and comprehensive perspective for evaluating the physical realism and functional usability of point cloud completion results.
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Nistér, D.; Naroditsky, O.; Bergen, J. Visual odometry for ground vehicle applications. J. Field Robot. 2006, 23, 3–20.
- Mur-Artal, R.; Tardós, J.D. ORB-SLAM2: An Open-Source SLAM System for Monocular, Stereo, and RGB-D Cameras. IEEE Trans. Robot. 2017, 33, 1255–1262.
- Sucar, E.; Liu, S.; Ortiz, J.; Davison, A.J. iMAP: Implicit mapping and positioning in real-time. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 6229–6238.
- Bloesch, M.; Czarnowski, J.; Clark, R.; Leutenegger, S.; Davison, A.J. CodeSLAM—Learning a compact, optimisable representation for dense visual SLAM. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2560–2568.
- Guo, Z.C.; Forbes, J.R.; Barfoot, T.D. Marginalizing and Conditioning Gaussians onto Linear Approximations of Smooth Manifolds with Applications in Robotics. In Proceedings of the 2025 IEEE International Conference on Robotics and Automation (ICRA), Atlanta, GA, USA, 19–23 May 2025; pp. 2606–2612.
- Mildenhall, B.; Srinivasan, P.P.; Tancik, M.; Barron, J.T.; Ramamoorthi, R.; Ng, R. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. arXiv 2020, arXiv:2003.08934.
- Chen, A.; Xu, Z.; Geiger, A.; Yu, J.; Su, H. TensoRF: Tensorial radiance fields. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 333–350.
- Fridovich-Keil, S.; Yu, A.; Tancik, M.; Chen, Q.; Recht, B.; Kanazawa, A. Plenoxels: Radiance fields without neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5501–5510.
- Yang, Y.; Zhang, S.; Huang, Z.; Zhang, Y.; Tan, M. Cross-ray neural radiance fields for novel-view synthesis from unconstrained image collections. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 15901–15911.
- Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
- Qian, G.; Li, Y.; Peng, H.; Mai, J.; Hammoud, H.; Elhoseiny, M.; Ghanem, B. PointNeXt: Revisiting PointNet++ with improved training and scaling strategies. Adv. Neural Inf. Process. Syst. 2022, 35, 23192–23204.
- Yuan, W.; Khot, T.; Held, D.; Mertz, C.; Hebert, M. PCN: Point Completion Network. In Proceedings of the 2018 International Conference on 3D Vision (3DV), Verona, Italy, 5–8 September 2018.
- Dai, A.; Ritchie, D.; Bokeloh, M.; Reed, S.; Sturm, J.; Nießner, M. ScanComplete: Large-Scale Scene Completion and Semantic Segmentation for 3D Scans. arXiv 2017, arXiv:1712.10215.
- Mezghanni, M.; Boulkenafed, M.; Lieutier, A.; Ovsjanikov, M. Physically-aware generative network for 3D shape modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 9330–9341.
- Li, A.; Zimmer-Dauphinee, J.R.; Kalyanam, R.; Lindsay, I.; VanValkenburgh, P.; Wernke, S.; Aliaga, D. Self-Supervised Large Scale Point Cloud Completion for Archaeological Site Restoration. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 10–17 June 2025; pp. 11759–11768.
- Aiger, D.; Mitra, N.J.; Cohen-Or, D. 4-points congruent sets for robust pairwise surface registration. ACM Trans. Graph. 2008, 27, 1–10.
- Fan, H.; Su, H.; Guibas, L. A Point Set Generation Network for 3D Object Reconstruction from a Single Image. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
- Kazhdan, M.; Bolitho, M.; Hoppe, H. Poisson surface reconstruction. In Proceedings of the Fourth Eurographics Symposium on Geometry Processing, Cagliari, Sardinia, Italy, 26–28 June 2006.
- Hoppe, H.; DeRose, T.; Duchamp, T.; McDonald, J.; Stuetzle, W. Surface reconstruction from unorganized points. ACM SIGGRAPH Comput. Graph. 1992, 26, 71–78.
- Chang, A.X.; Funkhouser, T.; Guibas, L.; Hanrahan, P.; Huang, Q.; Li, Z.; Savarese, S.; Savva, M.; Song, S.; Su, H.; et al. ShapeNet: An Information-Rich 3D Model Repository. arXiv 2015, arXiv:1512.03012.
- Achlioptas, P.; Diamanti, O.; Mitliagkas, I.; Guibas, L. Learning Representations and Generative Models for 3D Point Clouds. arXiv 2017, arXiv:1707.02392.
- Smith, R.; Self, M.; Cheeseman, P. Estimating Uncertain Spatial Relationships in Robotics. Mach. Intell. Pattern Recognit. 1988, 5, 435–461.
- Cui, H.; Shen, S.; Gao, W.; Wang, Z. Progressive large-scale structure-from-motion with orthogonal MSTs. In Proceedings of the 2018 International Conference on 3D Vision (3DV), Verona, Italy, 5–8 September 2018; pp. 79–88.
- Zhu, S.; Zhang, R.; Zhou, L.; Shen, T.; Fang, T.; Tan, P.; Quan, L. Very large-scale global SfM by distributed motion averaging. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4568–4577.
- Ding, Y.; Zhu, Q.; Liu, X.; Yuan, W.; Zhang, H.; Zhang, C. KD-MVS: Knowledge distillation based self-supervised learning for multi-view stereo. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 630–646.
- Zhu, Q.; Min, C.; Wei, Z.; Chen, Y.; Wang, G. Deep learning for multi-view stereo via plane sweep: A survey. arXiv 2021, arXiv:2106.15328.
- Rumelhart, D.E.; Hinton, G.E.; Williams, R.J. Learning representations by back-propagating errors. Nature 1986, 323, 533–536.
- Rusu, R.B.; Blodow, N.; Beetz, M. Fast point feature histograms (FPFH) for 3D registration. In Proceedings of the 2009 IEEE International Conference on Robotics and Automation, Kobe, Japan, 12–17 May 2009; pp. 3212–3217.
- Fischler, M.A.; Bolles, R.C. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 1981, 24, 381–395.
- Shi, P.; Yan, S.; Xiao, Y.; Liu, X.; Zhang, Y.; Li, J. RANSAC back to SOTA: A two-stage consensus filtering for real-time 3D registration. IEEE Robot. Autom. Lett. 2024, 9, 11881–11888.
- Fan, Y.; Zhang, Q.; Tang, Y.; Liu, S.; Han, H. Blitz-SLAM: A semantic SLAM in dynamic environments. Pattern Recognit. 2022, 121, 108225.






| Item | Configuration |
|---|---|
| Interpreter | Python 3.12.2 |
| Dependencies | NumPy 2.2.5, open3d 0.19.0, OpenCV-Python 4.12.0.88, torch 2.4.0 |
| Operating System | Ubuntu 20.04 LTS |
| Training Platform | PyTorch 2.4.0 |
| Hardware Configuration | GPU NVIDIA RTX 4090 |
| Number of Training Epochs | 60 |
| Optimizer | Adam |
| Data Processing Batch Size | 24 |
| Learning Rate Schedule | WarmUp for the first 2 epochs, followed by cosine annealing, maximum learning rate 0.001 |
| Evaluation Metrics | House | Balloon | Sculpture |
|---|---|---|---|
| Fitness (0-1) ↑ | 0.9302 | 0.9916 | 0.9207 |
| Inlier RMSE (0-1) ↓ | 0.0852 | 0.0096 | 0.0348 |
| Normal Consistency (0-1) ↑ | 0.6023 | 0.6210 | 0.7948 |
| Evaluation Metrics | House | Balloon | Sculpture |
|---|---|---|---|
| Geometric Smoothness (0-1) ↓ | 0.1597 | 0.1897 | 0.1219 |
| Distribution Uniformity (NUE; ≤1.3 = seamless) ↓ | 1.17 | 1.29 | 1.21 |
| Structural Plausibility (0-1) ↑ | 0.9984 | 0.9954 | 0.9975 |
| Evaluation Metrics | Object | Without Scale Rectification | Without Cropping or Segmentation |
|---|---|---|---|
| Fitness (0-1) ↑ | House | 0.3083 | 0.4129 |
| Fitness (0-1) ↑ | Balloon | 0.0820 | 0.5876 |
| Fitness (0-1) ↑ | Sculpture | 0.3085 | 0.3621 |
| Inlier RMSE (0-1) ↓ | House | 0.2614 | 0.4424 |
| Inlier RMSE (0-1) ↓ | Balloon | 0.1810 | 0.1804 |
| Inlier RMSE (0-1) ↓ | Sculpture | 0.3724 | 0.0655 |
| Normal Consistency (0-1) ↑ | House | 0.4448 | 0.3056 |
| Normal Consistency (0-1) ↑ | Balloon | 0.2196 | 0.5695 |
| Normal Consistency (0-1) ↑ | Sculpture | 0.4834 | 0.7373 |

