DP-AMF: Depth-Prior–Guided Adaptive Multi-Modal and Global–Local Fusion for Single-View 3D Reconstruction
Abstract
1. Introduction
- Depth-Prior Multi-Modal Fusion: We use the pre-trained MARIGOLD diffusion-based depth estimator [14] to generate high-fidelity depth priors, which are concatenated with RGB features to alleviate the ill-posed nature of single-view reconstruction.
- Adaptive Global–Local Feature Fusion: Our encoder processes ResNet-based local features and DINO-ViT global features in parallel, then merges them with pixel-wise depth priors through a learnable fusion module that dynamically re-weights each modality according to texture and occlusion cues (a minimal sketch follows this list).
- Significant Performance Improvements: In fair comparisons with other implicit reconstruction baselines, DP-AMF outperforms existing methods on Chamfer Distance (CD), F-Score, and Normal Consistency (NC) [15,16,17], achieving higher reconstruction quality and detail fidelity in complex scenes.
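To make the adaptive fusion concrete, the sketch below gives a minimal PyTorch module in the spirit of the second contribution: local CNN features, a global ViT CLS embedding, and a pixel-wise depth prior are projected to a common width and blended with per-pixel softmax gates. The module name, channel sizes, and gating design here are illustrative assumptions, not the exact DP-AMF architecture.

```python
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    """Illustrative gated fusion of local, global, and depth features.

    A sketch only: channel widths and the gating head are assumptions,
    not the authors' exact design.
    """

    def __init__(self, c_local=256, c_global=768, c_depth=1, c_out=256):
        super().__init__()
        # Project each modality to a common channel width.
        self.proj_local = nn.Conv2d(c_local, c_out, kernel_size=1)
        self.proj_global = nn.Linear(c_global, c_out)  # CLS token -> c_out
        self.proj_depth = nn.Conv2d(c_depth, c_out, kernel_size=3, padding=1)
        # Gating head predicts per-pixel weights over the three modalities.
        self.gate = nn.Sequential(
            nn.Conv2d(3 * c_out, 64, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 3, kernel_size=1),
        )

    def forward(self, f_local, cls_global, depth_prior):
        # f_local: (B, c_local, H, W) ResNet features
        # cls_global: (B, c_global) DINO-ViT CLS embedding
        # depth_prior: (B, 1, H, W) depth map from the prior network
        B, _, H, W = f_local.shape
        l = self.proj_local(f_local)
        g = self.proj_global(cls_global)[:, :, None, None].expand(-1, -1, H, W)
        d = self.proj_depth(depth_prior)
        # Per-pixel softmax weights adapt to texture/occlusion cues.
        w = torch.softmax(self.gate(torch.cat([l, g, d], dim=1)), dim=1)
        return w[:, 0:1] * l + w[:, 1:2] * g + w[:, 2:3] * d

if __name__ == "__main__":
    fuse = AdaptiveFusion()
    out = fuse(torch.randn(2, 256, 32, 32), torch.randn(2, 768),
               torch.randn(2, 1, 32, 32))
    print(out.shape)  # torch.Size([2, 256, 32, 32])
```

Because the gate is predicted per pixel, textured regions can lean on local CNN features while occluded or textureless regions lean on the global embedding and the depth prior.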
2. Related Work
2.1. Multi-Modal Single-View Reconstruction
2.2. Global–Local Feature Extraction
3. Methodology
3.1. Depth Prior Generation
3.2. Feature Extraction and Fusion
3.3. Two-Stage Training Objectives
4. Experiments
4.1. Datasets
4.2. Experimental Setup
5. Results and Analysis
5.1. Compared Experiments
5.2. Ablation Experiments
6. Discussion
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Choy, C.B.; Xu, D.; Gwak, J.; Chen, K.; Savarese, S. 3D-R2N2: A Unified Approach for Single and Multi-view 3D Object Reconstruction. arXiv 2016, arXiv:1604.00449.
- Godard, C.; Aodha, O.M.; Brostow, G.J. Unsupervised Monocular Depth Estimation with Left-Right Consistency. arXiv 2017, arXiv:1609.03677.
- Perwez, U.; Yamaguchi, Y.; Ma, T.; Dai, Y.; Shimoda, Y. Multi-scale GIS-synthetic hybrid approach for the development of commercial building stock energy model. Appl. Energy 2022, 323, 119536.
- Li, Q.; Zhao, B.; Wang, X.; Yang, G.; Chang, Y.; Chen, X.; Chen, B.M. Autonomous building material stock estimation using 3D modeling and multilayer perceptron. Sustain. Cities Soc. 2025, 130, 106522.
- Palliwal, A.; Song, S.; Tan, H.T.W.; Biljecki, F. 3D city models for urban farming site identification in buildings. Comput. Environ. Urban Syst. 2021, 86, 101584.
- Li, Q.; Yang, G.; Bian, C.; Long, L.; Wang, X.; Gao, C.; Wong, C.L.; Huang, Y.; Zhao, B.; Chen, X.; et al. Autonomous design framework for deploying building integrated photovoltaics. Appl. Energy 2025, 377, 124760.
- Sun, L.; Jiang, Y.; Guo, Q.; Ji, L.; Xie, Y.; Qiao, Q.; Huang, G.; Xiao, K. A GIS-based multi-criteria decision making method for the potential assessment and suitable sites selection of PV and CSP plants. Resour. Conserv. Recycl. 2021, 168, 105306.
- Li, Q.; Long, L.; Li, X.; Yang, G.; Bian, C.; Zhao, B.; Chen, X.; Chen, B.M. Life cycle cost analysis of circular photovoltaic façade in dense urban environment using 3D modeling. Renew. Energy 2025, 238, 121914.
- Wang, H.; Zhang, G.; Cao, H.; Hu, K.; Wang, Q.; Deng, Y.; Gao, J.; Tang, Y. Geometry-Aware 3D Point Cloud Learning for Precise Cutting-Point Detection in Unstructured Field Environments. J. Field Robot. 2025, 42, e22567.
- Brinatti Vazquez, G.D.; Lacapmesure, A.M.; Martínez, S.; Martínez, O.E. SUPPOSe 3Dge: A Method for Super-Resolved Detection of Surfaces in Volumetric Fluorescence Microscopy. J. Opt. Photonics Res. 2024, 1, 2350.
- Ranftl, R.; Bochkovskiy, A.; Koltun, V. MiDaS: High-Quality Depth Estimation with Minimal Training Data. In Proceedings of the European Conference on Computer Vision (ECCV), Virtual, 23–28 August 2020.
- Wang, Q.; Zhang, H.; Lin, M. Single-view 3D Scene Reconstruction with High-fidelity Shape and Texture. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 876–886.
- Oechsle, M.; Mescheder, L.; Niemeyer, M.; Strauss, T.; Geiger, A. Texture Fields: Learning Texture Representations in Function Space. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019.
- Ke, B.; Obukhov, A.; Huang, S.; Metzger, N.; Daudt, R.C.; Schindler, K. Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation. arXiv 2024, arXiv:2312.02145.
- Fan, H.; Su, H.; Guibas, L. A Point Set Generation Network for 3D Object Reconstruction from a Single Image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 605–613.
- Knapitsch, A.; Park, J.; Zhou, Q.Y.; Koltun, V. Tanks and Temples: Benchmarking Large-Scale Scene Reconstruction. ACM Trans. Graph. 2017, 36, 1–13.
- Smith, J.; Wang, L.; Lee, D. Normal Consistency for Surface Reconstruction in 3D Modeling. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018.
- Xie, H.; Yao, H.; Sun, X.; Zhou, S.; Zhang, S. Pix2Vox: Context-Aware 3D Reconstruction From Single and Multi-View Images. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 2690–2698.
- Sun, B.; Jiang, P.; Kong, D.; Shen, T. IV-Net: Single-view 3D volume reconstruction by fusing features of image and recovered volume. Vis. Comput. 2023, 39, 6237–6247.
- Groueix, T.; Fisher, M.; Kim, V.G.; Russell, B.C.; Aubry, M. AtlasNet: A Papier-Mâché Approach to Learning 3D Surface Generation. arXiv 2018, arXiv:1802.05384.
- Shen, Q.; Yang, X.; Wang, X. Anything-3D: Towards Single-view Anything Reconstruction in the Wild. arXiv 2023, arXiv:2304.10261.
- Ranftl, R.; Lasinger, K.; Hafner, D.; Schindler, K.; Koltun, V. Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 1623–1637.
- Ranftl, R.; Bochkovskiy, A.; Koltun, V. Vision Transformers for Dense Prediction. arXiv 2021, arXiv:2103.13413.
- Kim, T.; Lee, J.; Lee, K.T.; Choe, Y. Single-View 3D Reconstruction Based on Gradient-Applied Weighted Loss. J. Electr. Eng. Technol. 2024, 19, 4523–4535.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual, 3–7 May 2021.
- Yang, W.J.; Wu, C.C.; Yang, J.F. Residual Vision Transformer and Adaptive Fusion Autoencoders for Monocular Depth Estimation. Sensors 2025, 25, 80.
- Caron, M.; Touvron, H.; Misra, I.; Jégou, H.; Mairal, J.; Bojanowski, P.; Joulin, A. Emerging Properties in Self-Supervised Vision Transformers. arXiv 2021, arXiv:2104.14294.
- Fu, H.; Cai, B.; Gao, L.; Zhang, L.X.; Wang, J.; Li, C.; Zeng, Q.; Sun, C.; Jia, R.; Zhao, B.; et al. 3D-FRONT: 3D Furnished Rooms with Layouts and Semantics. In Proceedings of the International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021.
- Sun, X.; Wu, J.; Zhang, X.; Zhang, Z.; Zhang, C.; Xue, T.; Tenenbaum, J.B.; Freeman, W.T. Pix3D: Dataset and Methods for Single-Image 3D Shape Modeling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018.
- Liu, H.; Zheng, Y.; Chen, G.; Cui, S.; Han, X. Towards High-Fidelity Single-view Holistic Reconstruction of Indoor Scenes. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022.
- Nie, Y.; Han, X.; Guo, S.; Zheng, Y.; Chang, J.; Zhang, J.J. Total3DUnderstanding: Joint Layout, Object Pose and Mesh Reconstruction for Indoor Scenes from a Single Image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 55–65.
- Zhang, C.; Cui, Z.; Zhang, Y.; Zeng, B.; Pollefeys, M.; Liu, S. Holistic 3D Scene Understanding from a Single Image with Implicit Representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021.
- Boulch, A.; Marlet, R. POCO: Point Convolution for Surface Reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 6302–6314.
- Huang, J.; Gojcic, Z.; Atzmon, M.; Litany, O.; Fidler, S.; Williams, F. Neural Kernel Surface Reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 4369–4379.
- Li, Q.; Yang, G.; Gao, C.; Huang, Y.; Zhang, J.; Huang, D.; Zhao, B.; Chen, X.; Chen, B.M. Single drone-based 3D reconstruction approach to improve public engagement in conservation of heritage buildings: A case of Hakka Tulou. J. Build. Eng. 2024, 87, 108954.
- Liu, Y.; Chen, J. Research on the Conservation of Historical Buildings Based on Digital 3D Reconstruction. Procedia Comput. Sci. 2023, 228, 593–600.
- Shanti, Z.; Al-Tarazi, D. Virtual Reality Technology in Architectural Theory Learning: An Experiment on the Module of History of Architecture. Sustainability 2023, 15, 16394.
- Whiton, R.; Chen, J.; Johansson, T.; Tufvesson, F. Urban Navigation with LTE using a Large Antenna Array and Machine Learning. In Proceedings of the 2022 IEEE 95th Vehicular Technology Conference (VTC2022-Spring), Helsinki, Finland, 19–22 June 2022; pp. 1–5.
- Zhang, Y.; Nakajima, T. Exploring the Design of a Mixed-Reality 3D Minimap to Enhance Pedestrian Satisfaction in Urban Exploratory Navigation. Future Internet 2022, 14, 325.
- Long, L.; Gan, Z.; Liu, Z.; Zhao, B.; Li, Q. MSD-Det: Masonry structures damage detection dataset for preventive conservation of heritage. J. Cult. Herit. 2025, 73, 358–370.
- Yang, G.; Zhao, B.; Zhang, J.; Wen, J.; Li, Q.; Lei, L.; Chen, X.; Chen, B. Det-Recon-Reg: An Intelligent Framework Toward Automated UAV-Based Large-Scale Infrastructure Inspection. IEEE Trans. Instrum. Meas. 2025, 74, 1–16.
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. arXiv 2021, arXiv:2103.00020.
Training Stage | λ₁ | λ₂ | λ₃ | λ₄
---|---|---|---|---
Stage 1 | 1.0 | 0 | 0 | 0
Stage 2 (3D-FRONT) | 1.0 | 0→0.1 (epochs 30–80) | 0→0.01 (epochs 30–80) | 0→0.1 (epochs 30–80)
Stage 2 (Pix3D) | 1.0 | 0→0.1 (epochs 50–150) | 0→0.01 (epochs 50–150) | 0→0.1 (epochs 50–150)
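The arrows above denote loss weights ramped from zero to their target values over the stated epoch windows (the four loss terms are labeled generically λ₁–λ₄ here). A minimal sketch, assuming a linear ramp since the exact curve is not specified:

```python
def ramp_weight(epoch: int, start: int, end: int, target: float) -> float:
    """Linearly ramp a loss weight from 0 to `target` between `start` and `end`.

    Assumed linear schedule; only the endpoints are given in the table.
    """
    if epoch < start:
        return 0.0
    if epoch >= end:
        return target
    return target * (epoch - start) / (end - start)

# e.g. Stage 2 on 3D-FRONT: lambda_2 ramps 0 -> 0.1 over epochs 30-80
assert ramp_weight(30, 30, 80, 0.1) == 0.0
assert abs(ramp_weight(55, 30, 80, 0.1) - 0.05) < 1e-9
assert ramp_weight(80, 30, 80, 0.1) == 0.1
```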
Dataset | Train | Validation | Test |
---|---|---|---|
3D-FRONT | 22,103 (74.5%) | 2550 (8.6%) | 5006 (16.8%) |
Pix3D | 6931 (55.6%) | 2778 (22.3%) | 2762 (22.1%) |
Item | Configuration |
---|---|
System | Ubuntu 20.04, NVIDIA RTX 3090 |
Framework | PyTorch + CUDA |
Training Stages | 2 (Stage 1: Geometry only; Stage 2: Full reconstruction) |
Epochs | 200 (3D-FRONT), 300 (Pix3D)
Optimizer | Adam, LR = |
Batch Size | 96 (3D-FRONT), 128 (Pix3D)
Image Resolution | |
CNN Backbones | ResNet-32, ResNet-18 (ImageNet pre-trained) |
ViT Backbone | DINO ViT-B/16 (768-dim CLS token)
Depth Prior | MARIGOLD diffusion pretrained model |
Point Sampling | N = 64 samples/ray |
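As context for the Depth Prior row, priors of this kind can be precomputed offline for every training image. The sketch below assumes the MarigoldDepthPipeline shipped in recent Hugging Face diffusers releases, plus an illustrative checkpoint id and file paths; it is not the authors' exact preprocessing script.

```python
import numpy as np
import torch
from diffusers import MarigoldDepthPipeline
from diffusers.utils import load_image

# Assumed checkpoint id; recent diffusers releases ship this pipeline.
pipe = MarigoldDepthPipeline.from_pretrained(
    "prs-eth/marigold-depth-lcm-v1-0", torch_dtype=torch.float16
).to("cuda")

image = load_image("example_rgb.png")   # hypothetical input frame
result = pipe(image)                    # affine-invariant depth, values in [0, 1]
# result.prediction layout varies by diffusers version; squeeze to (H, W).
depth = np.asarray(result.prediction).squeeze()
np.save("example_depth.npy", depth)     # cache; later concatenated with RGB features
```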
Metric | Method | Bed | Chair | Sofa | Table | Desk | Nightstand | Cabinet | Bookshelf | Mean
---|---|---|---|---|---|---|---|---|---|---
CD ↓ (7.64%) | MGN | 15.48 | 11.67 | 8.72 | 20.90 | 17.59 | 17.11 | 13.13 | 10.21 | 14.07
 | LIEN | 16.81 | 41.40 | 9.51 | 35.65 | 26.63 | 16.78 | 11.70 | 11.70 | 28.52
 | InstPIFu | 18.17 | 14.06 | 7.66 | 23.25 | 33.33 | 11.73 | 6.04 | 8.03 | 14.46
 | SSR | 13.12 | 12.05 | 6.47 | 19.32 | 28.45 | 11.87 | 6.18 | 7.23 | 13.08
 | Ours | 12.05 | 10.89 | 5.94 | 18.47 | 26.87 | 10.15 | 5.27 | 5.82 | 12.08
F-Score ↑ (2.81%) | MGN | 46.81 | 57.49 | 64.61 | 49.80 | 46.82 | 47.91 | 54.18 | 54.55 | 55.64
 | LIEN | 44.28 | 31.61 | 61.40 | 43.22 | 37.04 | 50.76 | 69.21 | 55.33 | 45.63
 | InstPIFu | 47.85 | 59.08 | 67.60 | 56.43 | 48.49 | 57.14 | 73.32 | 66.13 | 61.32
 | SSR | 52.13 | 62.47 | 69.21 | 60.34 | 52.78 | 60.12 | 75.45 | 68.09 | 62.25
 | Ours | 54.21 | 64.37 | 71.08 | 62.01 | 54.66 | 62.45 | 76.12 | 69.30 | 64.00
NC ↑ (5.88%) | MGN | 0.829 | 0.758 | 0.819 | 0.785 | 0.711 | 0.833 | 0.802 | 0.719 | 0.787
 | LIEN | 0.822 | 0.793 | 0.803 | 0.755 | 0.701 | 0.814 | 0.801 | 0.747 | 0.786
 | InstPIFu | 0.799 | 0.782 | 0.846 | 0.804 | 0.708 | 0.844 | 0.841 | 0.790 | 0.810
 | SSR | 0.832 | 0.803 | 0.849 | 0.814 | 0.709 | 0.861 | 0.828 | 0.806 | 0.813
 | Ours | 0.834 | 0.812 | 0.861 | 0.822 | 0.729 | 0.869 | 0.848 | 0.815 | 0.824
Metric | Method | Bed | Bookcase | Chair | Desk | Sofa | Table | Tool | Wardrobe | Misc | Mean
---|---|---|---|---|---|---|---|---|---|---|---
CD ↓ (4.84%) | MGN | 22.91 | 33.61 | 56.47 | 33.95 | 9.27 | 81.19 | 94.70 | 10.43 | 137.50 | 44.32
 | LIEN | 11.18 | 29.61 | 40.01 | 65.36 | 10.54 | 146.13 | 29.63 | 4.88 | 144.06 | 51.31
 | InstPIFu | 10.90 | 7.55 | 32.44 | 22.09 | 8.13 | 45.82 | 10.29 | 1.29 | 47.31 | 24.65
 | SSR | 6.31 | 7.21 | 26.23 | 28.63 | 5.68 | 43.87 | 8.29 | 2.07 | 35.03 | 21.79
 | Ours | 6.05 | 6.92 | 25.51 | 27.73 | 5.52 | 42.10 | 7.98 | 1.93 | 34.12 | 20.83
F-Score ↑ (2.74%) | MGN | 34.69 | 28.42 | 35.67 | 65.36 | 51.15 | 17.05 | 57.16 | 52.04 | 10.41 | 36.20
 | LIEN | 37.13 | 15.51 | 25.70 | 26.01 | 49.71 | 21.16 | 5.85 | 59.46 | 11.04 | 31.45
 | InstPIFu | 54.99 | 62.26 | 35.30 | 47.30 | 56.54 | 37.51 | 64.24 | 94.62 | 27.03 | 45.62
 | SSR | 68.78 | 66.69 | 55.18 | 42.49 | 71.22 | 51.93 | 65.38 | 91.84 | 46.92 | 59.71
 | Ours | 69.45 | 67.12 | 56.23 | 43.78 | 72.04 | 53.87 | 66.05 | 92.30 | 48.31 | 61.35
NC ↑ (5.85%) | MGN | 0.737 | 0.592 | 0.525 | 0.633 | 0.756 | 0.794 | 0.531 | 0.809 | 0.563 | 0.659
 | LIEN | 0.706 | 0.514 | 0.591 | 0.581 | 0.775 | 0.619 | 0.506 | 0.844 | 0.481 | 0.646
 | InstPIFu | 0.782 | 0.646 | 0.547 | 0.758 | 0.753 | 0.796 | 0.639 | 0.951 | 0.580 | 0.683
 | SSR | 0.825 | 0.689 | 0.693 | 0.776 | 0.866 | 0.835 | 0.645 | 0.960 | 0.599 | 0.778
 | Ours | 0.831 | 0.696 | 0.702 | 0.781 | 0.871 | 0.842 | 0.652 | 0.965 | 0.610 | 0.791
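For reference, the CD and F-Score values in the two tables above follow the standard point-sampling definitions: CD averages nearest-neighbor distances between predicted and ground-truth surface samples in both directions, and F-Score is the harmonic mean of precision and recall at a distance threshold; NC analogously averages the absolute cosine similarity between matched surface normals. A minimal NumPy/SciPy sketch, where the threshold `tau` and any scale normalization are assumptions (conventions vary across papers):

```python
import numpy as np
from scipy.spatial import cKDTree

def chamfer_and_fscore(pred_pts, gt_pts, tau=0.01):
    """Symmetric Chamfer Distance and F-Score between two point sets.

    pred_pts, gt_pts: (N, 3) and (M, 3) arrays of surface samples.
    tau: distance threshold for F-Score (convention varies by paper).
    """
    d_pred = cKDTree(gt_pts).query(pred_pts)[0]   # pred -> gt distances
    d_gt = cKDTree(pred_pts).query(gt_pts)[0]     # gt -> pred distances
    chamfer = d_pred.mean() + d_gt.mean()
    precision = (d_pred < tau).mean()
    recall = (d_gt < tau).mean()
    fscore = 2 * precision * recall / max(precision + recall, 1e-8)
    return chamfer, fscore
```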
Model | Params (M) | GFLOPs | ΔParams (M) | ΔGFLOPs
---|---|---|---|---
SSR | 36.29 | 147.44 | – | – |
DP-AMF | 36.48 | 215.85 | +0.19 | +68.41 |
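The Params column can be reproduced with a straightforward count over trainable tensors, as in the sketch below; GFLOPs are typically measured with a profiler (e.g., fvcore or ptflops) at the training input resolution. The toy model here is a hypothetical stand-in.

```python
import torch.nn as nn

def count_params_m(model: nn.Module) -> float:
    """Trainable parameter count in millions, matching the Params column."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

if __name__ == "__main__":
    toy = nn.Linear(512, 512)              # stand-in for the full DP-AMF model
    print(f"{count_params_m(toy):.2f} M")  # 0.26 M
```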
Depth Module | Global Encoder | CD ↓ | F-Score ↑ | NC ↑ |
---|---|---|---|---
MARIGOLD | DINO-ViT | 16.23 | 59.97 | 0.806 |
MARIGOLD | ViT | 17.87 (+1.64) | 59.11 (−0.86) | 0.789 (−0.017) |
MARIGOLD | ✗ | 19.00 (+2.77) | 59.24 (−0.73) | 0.767 (−0.036) |
MiDaS | DINO-ViT | 17.48 (+1.25) | 59.85 (−0.12) | 0.791 (−0.015) |
✗ | DINO-ViT | 21.08 (+4.85) | 56.22 (−3.75) | 0.778 (−0.028) |
✗ | ✗ | 24.15 (+7.92) | 54.99 (−4.98) | 0.770 (−0.036) |