VGGT-Geo: Probabilistic Geometric Fusion of Visual Geometry Grounded Transformer Priors for Robust Dense Indoor SLAM
Abstract
1. Introduction
- A deeply coupled framework combining generative priors with geometric optimization: We use VGGT to provide a warm-start initialization for the non-convex backend optimization, placing the solver within the basin of convergence. In parallel, we use dense DINOv3 features for fine-grained data association. This "coarse generation + fine optimization" dual-branch architecture addresses both the initialization difficulty of traditional methods and the poor geometric consistency of purely learning-based methods.
- A confidence-aware multi-modal backend: Targeting texture-less, structured indoor environments, we design a joint factor graph that incorporates sparse points, semantic line segments, and metric depth priors. The core idea is to use the confidence maps predicted by the foundation models to dynamically re-weight the residuals of each modality. This lets the system rely on line features to constrain the rotational degrees of freedom (e.g., in Manhattan worlds) and use MoGe-2 depth priors as adaptive anchors that suppress monocular scale drift.
- State-of-the-art (SOTA) performance and robust zero-shot generalization in challenging real-world scenarios: We evaluate on standard benchmarks (TUM, Replica) and on challenging self-collected scenes featuring long corridors and white walls. Verification against LiDAR ground truth shows that VGGT-Geo outperforms existing SOTA methods on most standard sequences and maintains centimeter-level trajectory consistency in out-of-distribution (OOD) scenarios, including scenes in which VGGT-SLAM loses tracking entirely.
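The confidence-based re-weighting described in the second contribution can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `fit_metric_scale` and `weighted_cost`, the toy depth values, and the choice to clip confidences to [0, 1] are all assumptions made for illustration. The sketch shows how a per-measurement confidence turns a metric depth prior into a soft anchor for the monocular scale, so that an unreliable prior barely influences the estimate.

```python
import numpy as np

def fit_metric_scale(slam_depth, prior_depth, confidence):
    """Closed-form minimizer of sum_i c_i * (s * d_i - m_i)**2 over the scale s."""
    d = np.asarray(slam_depth, dtype=float)
    m = np.asarray(prior_depth, dtype=float)
    c = np.clip(np.asarray(confidence, dtype=float), 0.0, 1.0)
    return float(np.sum(c * d * m) / np.sum(c * d * d))

def weighted_cost(residuals, confidences):
    """Total cost: confidence-weighted squared residuals summed over modalities."""
    total = 0.0
    for name, r in residuals.items():
        c = np.clip(np.asarray(confidences[name], dtype=float), 0.0, 1.0)
        total += float(np.sum(c * np.asarray(r, dtype=float) ** 2))
    return total

# Toy data (hypothetical): the true scale is 2 and the last prior is a gross outlier.
slam_depth = np.array([1.0, 2.0, 3.0, 4.0])
prior_depth = np.array([2.0, 4.0, 6.0, 100.0])
trusting = np.ones(4)                       # every prior fully trusted
cautious = np.array([1.0, 1.0, 1.0, 0.0])   # outlier down-weighted to zero

scale_trusting = fit_metric_scale(slam_depth, prior_depth, trusting)
scale_cautious = fit_metric_scale(slam_depth, prior_depth, cautious)
print(scale_trusting, scale_cautious)
```

With the outlier down-weighted, the estimate recovers the scale of 2, whereas uniform trust pulls it far toward the outlier. In a full factor graph the same weights would multiply the point, line, and depth residuals inside a joint cost such as `weighted_cost`.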
2. Related Work
2.1. Foundation Models for 3D Vision
2.2. Hybrid SLAM: Bridging Learning and Optimization
2.3. Geometric Priors and Uncertainty Modeling
3. Method
3.1. System Overview and Probabilistic Formulation
3.2. Generative Frontend: Initialization & Confidence-Aware Perception
- a. Holistic Initialization via VGGT
- b. Perceptual Confidence Map via DINOv3 Features
3.3. Construction of Geometric Constraints
3.3.1. Confidence-Aware Line Constraints
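As a point of reference for this subsection, the sketch below shows one common way a line constraint is written in point-line SLAM systems: the endpoints of a 3D segment are projected into the image and their signed pixel distances to the observed 2D line form the residual, here scaled by the square root of a confidence weight. This is an assumed, generic formulation rather than the paper's exact residual; the helper names, the pinhole intrinsics, and the pose convention (world-to-camera `R`, `t`) are illustrative.

```python
import numpy as np

def line_from_endpoints(p, q):
    """Homogeneous 2D line through pixels p and q, scaled so l @ [u, v, 1] is a pixel distance."""
    l = np.cross(np.append(p, 1.0), np.append(q, 1.0))
    return l / np.linalg.norm(l[:2])

def project(K, R, t, X):
    """Pinhole projection of a 3D world point X to pixel coordinates."""
    x = K @ (R @ X + t)
    return x[:2] / x[2]

def line_residual(K, R, t, P, Q, obs_p, obs_q, conf=1.0):
    """Signed distances of the projected endpoints P, Q to the observed line, confidence-weighted."""
    l = line_from_endpoints(obs_p, obs_q)
    r = np.array([
        l @ np.append(project(K, R, t, P), 1.0),
        l @ np.append(project(K, R, t, Q), 1.0),
    ])
    # sqrt(conf) so that the squared residual is weighted by conf
    return np.sqrt(conf) * r

# Hypothetical camera and segment, for illustration only.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.zeros(3)
P, Q = np.array([0.0, 0.0, 2.0]), np.array([1.0, 0.0, 2.0])

on_line = line_residual(K, R, t, P, Q, np.array([300.0, 240.0]), np.array([600.0, 240.0]))
shifted = line_residual(K, R, t, P, Q, np.array([300.0, 243.0]), np.array([600.0, 243.0]), conf=0.25)
print(on_line, shifted)
```

Because the residual only measures distance perpendicular to the observed line, it does not depend on where the detected endpoints fall along the segment, which is why such constraints remain usable when a line is partially occluded.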
3.3.2. Joint Optimization with Sparse Keypoints
3.3.3. Adaptive Depth Regularization with Metric Priors
3.4. Backend: Graph Construction and Optimization
4. Experiment
4.1. Experiments
4.1.1. Datasets
4.1.2. Baselines & Metrics
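The result tables report absolute trajectory error (ATE) and relative rotation error (RRE). The sketch below shows how these quantities are commonly computed; it is an assumption about the general recipe, not a reproduction of the evaluation protocol used in the paper. In particular, it aligns trajectories with a rigid transform (no scale), whereas monocular evaluations often use a similarity alignment instead. The function names are hypothetical.

```python
import numpy as np

def align_rigid(est, gt):
    """Least-squares rotation R and translation t mapping est (N x 3) onto gt (Kabsch)."""
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    H = (est - mu_e).T @ (gt - mu_g)          # 3 x 3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))    # guard against a reflection
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_g - R @ mu_e
    return R, t

def ate_rmse(est, gt):
    """RMSE of position errors after rigid alignment of the estimated trajectory."""
    R, t = align_rigid(est, gt)
    aligned = est @ R.T + t
    err = np.linalg.norm(aligned - gt, axis=1)
    return float(np.sqrt(np.mean(err ** 2)))

def rotation_error_deg(R_est, R_gt):
    """Geodesic angle, in degrees, between two rotation matrices."""
    R_delta = R_est.T @ R_gt
    cos_angle = np.clip((np.trace(R_delta) - 1.0) / 2.0, -1.0, 1.0)
    return float(np.degrees(np.arccos(cos_angle)))

# Synthetic check: a rotated and shifted copy of the ground truth has zero ATE.
rng = np.random.default_rng(0)
gt = rng.normal(size=(20, 3))
Rz = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])  # 90 degrees about z
est = gt @ Rz.T + np.array([1.0, 2.0, 3.0])
print(ate_rmse(est, gt), rotation_error_deg(np.eye(3), Rz))
```

Relative errors such as RTE and RRE are typically evaluated on the relative motion between pairs of frames rather than on absolute poses; `rotation_error_deg` would then be applied to those relative rotations.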
4.1.3. Implementation Details
4.2. Evaluation
4.2.1. Results on Self-Capture Datasets
4.2.2. Results on Public Datasets
- RGB-D TUM
- Replica
4.2.3. Results on Large Scene Datasets
4.3. Ablation Study
- w/o Depth Prior
- w/o Confidence
- w/o Lines
- w/o VGGT initialization
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Fehr, M.; Furrer, F.; Dryanovski, I.; Sturm, J.; Gilitschenski, I.; Siegwart, R.; Cadena, C. TSDF-based change detection for consistent long-term dense reconstruction and dynamic object discovery. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; pp. 5237–5244.
- Yan, L.; Hu, X.; Zhao, L.; Chen, Y.; Wei, P.; Xie, H. DGS-SLAM: A fast and robust RGBD SLAM in dynamic environments combined by geometric and semantic information. Remote Sens. 2022, 14, 795.
- Islam, Q.U.; Ibrahim, H.; Chin, P.K.; Lim, K.; Abdullah, M.Z. MVS-SLAM: Enhanced multiview geometry for improved semantic RGBD SLAM in dynamic environment. J. Field Robot. 2024, 41, 109–130.
- Mildenhall, B.; Srinivasan, P.P.; Tancik, M.; Barron, J.T.; Ramamoorthi, R.; Ng, R. Nerf: Representing scenes as neural radiance fields for view synthesis. Commun. ACM 2021, 65, 99–106.
- Kerbl, B.; Kopanas, G.; Leimkühler, T.; Drettakis, G. 3D Gaussian splatting for real-time radiance field rendering. ACM Trans. Graph. 2023, 42, 139.
- Wang, S.; Leroy, V.; Cabon, Y.; Chidlovskii, B.; Revaud, J. Dust3r: Geometric 3d vision made easy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 20697–20709.
- Wang, H.; Agapito, L. 3d reconstruction with spatial memory. arXiv 2024, arXiv:2408.16061.
- Liu, Y.; Dong, S.; Wang, S.; Yin, Y.; Yang, Y.; Fan, Q.; Chen, B. Slam3r: Real-time dense scene reconstruction from monocular rgb videos. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 16651–16662.
- Agarwal, S.; Snavely, N.; Seitz, S.M.; Szeliski, R. Bundle adjustment in the large. In Proceedings of the European Conference on Computer Vision, Heraklion, Crete, 5–11 September 2010; pp. 29–42.
- Nikoohemat, S.; Diakité, A.A.; Zlatanova, S.; Vosselman, G. Indoor 3D reconstruction from point clouds for optimal routing in complex buildings to support disaster management. Autom. Constr. 2020, 113, 103109.
- Wang, J.; Chen, M.; Karaev, N.; Vedaldi, A.; Rupprecht, C.; Novotny, D. Vggt: Visual geometry grounded transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 5294–5306.
- Zhang, Y.; Jin, R.; Zhou, Z.H. Understanding bag-of-words model: A statistical framework. Int. J. Mach. Learn. Cybern. 2010, 1, 43–52.
- Rublee, E.; Rabaud, V.; Konolige, K.; Bradski, G. ORB: An efficient alternative to SIFT or SURF. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 2564–2571.
- Mur-Artal, R.; Montiel, J.M.M.; Tardos, J.D. ORB-SLAM: A versatile and accurate monocular SLAM system. IEEE Trans. Robot. 2015, 31, 1147–1163.
- Wu, J.; Cui, Z.; Sheng, V.S.; Zhao, P.; Su, D.; Gong, S. A Comparative Study of SIFT and its Variants. Meas. Sci. Rev. 2013, 13, 122.
- Murai, R.; Dexheimer, E.; Davison, A.J. MASt3R-SLAM: Real-time dense SLAM with 3D reconstruction priors. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 16695–16705.
- Maggio, D.; Lim, H.; Carlone, L. Vggt-slam: Dense rgb slam optimized on the sl (4) manifold. arXiv 2025, arXiv:2505.12549.
- Coughlan, J.M.; Yuille, A.L. Manhattan world: Compass direction from a single image by bayesian inference. In Proceedings of the Seventh IEEE International Conference on Computer Vision, Corfu, Greece, 20–25 September 1999; Volume 2, pp. 941–947.
- Walker, H.M. Degrees of freedom. J. Educ. Psychol. 1940, 31, 253.
- Pumarola, A.; Vakhitov, A.; Agudo, A.; Sanfeliu, A.; Moreno-Noguer, F. PL-SLAM: Real-time monocular visual SLAM with points and lines. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; pp. 4503–4508.
- Engel, J.; Schöps, T.; Cremers, D. LSD-SLAM: Large-scale direct monocular SLAM. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 834–849.
- Wang, R.; Xu, S.; Dai, C.; Xiang, J.; Deng, Y.; Tong, X.; Yang, J. Moge: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 5261–5271.
- Gauvain, J.L.; Lee, C.H. Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains. IEEE Trans. Speech Audio Process. 1994, 2, 291–298.
- Xu, Y.; Xu, W.; Cheung, K.-Y.K.; Tu, Z. LETR: Line Transformers for Joint End-to-End Line Segment Detection and Description. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021.
- Schubert, D.; Goll, T.; Demmel, N.; Usenko, V.; Stückler, J.; Cremers, D. The TUM VI benchmark for evaluating visual-inertial odometry. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 1680–1687.
- Straub, J.; Whelan, T.; Ma, L.; Chen, Y.; Wijmans, E.; Green, S.; Engel, J.J.; Mur-Artal, R.; Ren, C.; Verma, S.; et al. The replica dataset: A digital replica of indoor spaces. arXiv 2019, arXiv:1906.05797.
- Knapitsch, A.; Park, J.; Zhou, Q.Y.; Koltun, V. Tanks and temples: Benchmarking large-scale scene reconstruction. ACM Trans. Graph. 2017, 36, 78.
- Fisher, A.; Cannizzaro, R.; Cochrane, M.; Nagahawatte, C.; Palmer, J.L. ColMap: A memory-efficient occupancy grid mapping framework. Robot. Auton. Syst. 2021, 142, 103755.
- Zhang, Z.; Scaramuzza, D. A tutorial on quantitative trajectory evaluation for visual (-inertial) odometry. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 7244–7251.
- Siméoni, O.; Vo, H.V.; Seitzer, M.; Baldassarre, F.; Oquab, M.; Jose, C.; Khalidov, V.; Szafraniec, M.; Yi, S.; Ramamonjisoa, M.; et al. Dinov3. arXiv 2025, arXiv:2508.10104.
- Zheng, C.; Xu, W.; Zou, Z.; Hua, T.; Yuan, C.; He, D.; Zhou, B.; Liu, Z.; Lin, J.; Zhu, F.; et al. Fast-livo2: Fast, direct lidar-inertial-visual odometry. IEEE Trans. Robot. 2024, 41, 326–346.










| Methods | Scene 1 ATE (cm) | Scene 1 RTE (cm) | Scene 1 RRE (°) | Scene 2 ATE (cm) | Scene 2 RTE (cm) | Scene 2 RRE (°) |
|---|---|---|---|---|---|---|
| MASt3R-SLAM | 7.8 | 2.3 | 6.3 | 6.1 | 4.9 | 5.6 |
| SLAM3R | 12.6 | 6.9 | 12.9 | 9.2 | 7.3 | 7.9 |
| VGGT-SLAM | Track Fail | Track Fail | Track Fail | Track Fail | Track Fail | Track Fail |
| Ours | 4.1 | 1.2 | 0.79 | 4.7 | 1.3 | 1.21 |

| Methods | Scene 1 RMSE (cm) | Scene 2 RMSE (cm) |
|---|---|---|
| MASt3R-SLAM | 12.4 | 15.7 |
| SLAM3R | 10.9 | 16.8 |
| VGGT-SLAM | Fail | Fail |
| Ours | 5.2 | 4.9 |

| Methods | Desk | Desk2 | Room | xyz | Longoffice |
|---|---|---|---|---|---|
| MASt3R-SLAM | 3.51 | 5.52 | 1.13 | 2.01 | 4.92 |
| SLAM3R | 4.13 | 5.65 | 1.44 | 3.41 | 7.93 |
| VGGT-SLAM | 2.55 | 4.16 | 1.01 | Fail | 2.72 |
| Ours | 2.43 | 3.71 | 0.82 | 1.42 | 1.21 |

| Methods | Office0 | Office1 | Office2 | Office3 | Room0 | Room1 | Room2 |
|---|---|---|---|---|---|---|---|
| MASt3R-SLAM | 1.67 | 2.31 | 1.79 | 3.45 | 4.01 | 3.61 | 3.95 |
| SLAM3R | 4.65 | 5.23 | 4.23 | 5.35 | 3.19 | 3.12 | 3.15 |
| VGGT-SLAM | 6.34 | 9.12 | 12.13 | 9.19 | 10.78 | 12.45 | 7.19 |
| Ours | 1.98 | 2.26 | 1.65 | 2.18 | 2.87 | 3.01 | 3.56 |

| Scene | Metric | MASt3R-SLAM | SLAM3R | Ours |
|---|---|---|---|---|
| Barn | F-score | 0.765 | 0.686 | 0.813 |
| Barn | Mean Dist (cm) | 9.17 | 7.19 | 5.17 |
| Caterpillar | F-score | 0.698 | 0.623 | 0.745 |
| Caterpillar | Mean Dist (cm) | 10.12 | 11.54 | 8.23 |
| Truck | F-score | 0.654 | 0.589 | 0.712 |
| Truck | Mean Dist (cm) | 12.45 | 14.21 | 9.56 |

| Method | Depth | Confidence | Line Constraint | Init | Scene 1 (Office) | Scene 3 (Corridor) |
|---|---|---|---|---|---|---|
| A (w/o Depth Prior) | × | √ | √ | VGGT | 0.082 | Track Fail |
| B (w/o Confidence) | √ | × | √ | VGGT | 0.065 | 0.154 |
| C (w/o Lines) | √ | √ | × | VGGT | 0.058 | 0.092 |
| D (Trad. Init) | √ | √ | √ | Vel | 0.081 | 0.186 |
| Ours | √ | √ | √ | VGGT | 0.041 | 0.068 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Published by MDPI on behalf of the International Society for Photogrammetry and Remote Sensing. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Qin, K., Li, J., Zlatanova, S., Wu, H., Wu, H., Gao, Y., Zhou, D., Li, Y., Shen, S., Qu, X., Zhang, Z., Yang, B., & Xu, S. (2026). VGGT-Geo: Probabilistic Geometric Fusion of Visual Geometry Grounded Transformer Priors for Robust Dense Indoor SLAM. ISPRS International Journal of Geo-Information, 15(2), 85. https://doi.org/10.3390/ijgi15020085

