Article

DP-AMF: Depth-Prior–Guided Adaptive Multi-Modal and Global–Local Fusion for Single-View 3D Reconstruction

Luoxi Zhang, Chun Xie and Itaru Kitahara
1 Doctoral Program in Empowerment Informatics, University of Tsukuba, 1-1-1 Tennodai, Tsukuba 3058577, Japan
2 Center for Computational Science, University of Tsukuba, 1-1-1 Tennodai, Tsukuba 3058577, Japan
* Authors to whom correspondence should be addressed.
J. Imaging 2025, 11(7), 246; https://doi.org/10.3390/jimaging11070246
Submission received: 15 June 2025 / Revised: 13 July 2025 / Accepted: 16 July 2025 / Published: 21 July 2025
(This article belongs to the Section AI in Imaging)

Abstract

Single-view 3D reconstruction remains fundamentally ill-posed, as a single RGB image lacks scale and depth cues, often yielding ambiguous results under occlusion or in texture-poor regions. We propose DP-AMF, a novel Depth-Prior–Guided Adaptive Multi-Modal and Global–Local Fusion framework that integrates high-fidelity depth priors—generated offline by the MARIGOLD diffusion-based estimator and cached to avoid extra training cost—with hierarchical local features from ResNet-32/ResNet-18 and semantic global features from DINO-ViT. A learnable fusion module dynamically adjusts per-channel weights to balance these modalities according to local texture and occlusion, and an implicit signed-distance field decoder reconstructs the final mesh. Extensive experiments on 3D-FRONT and Pix3D demonstrate that DP-AMF reduces Chamfer Distance by 7.64%, increases F-Score by 2.81%, and boosts Normal Consistency by 5.88% compared to strong baselines, while qualitative results show sharper edges and more complete geometry in challenging scenes. DP-AMF achieves these gains without substantially increasing model size or inference time, offering a robust and effective solution for complex single-view reconstruction tasks.

1. Introduction

Single-view 3D reconstruction holds significant value in virtual reality, augmented reality, and robotic navigation, especially when multi-view images are unavailable or data acquisition is limited [1,2]. Furthermore, 3D reconstruction is widely needed across various engineering and applied domains, such as building stock estimation [3,4], urban analysis [5], building-integrated photovoltaics (BIPV) design [6], solar potential analysis [7,8], precision agriculture [9] and biomedical imaging [10].
However, inferring 3D geometry from a single RGB image is inherently ill-posed; the image lacks direct depth information, and in cluttered or heavily occluded environments, 2D features are often corrupted by noise, making it difficult to recover hidden regions and fine-grained structures accurately.
Existing methods can be broadly categorized into two approaches: (1) relying solely on CNNs or Vision Transformers (ViTs) to extract RGB features before reconstructing 3D structure via implicit or explicit decoders; (2) first using a monocular depth estimator (e.g., MiDaS [11]) to generate a depth map, and then performing joint reconstruction. Pure RGB-based approaches often produce incomplete geometry or lose fine details in texture-deficient or occluded regions [12]. Meanwhile, traditional monocular depth networks generate depth maps with limited global consistency and poor edge sharpness, which can exacerbate reconstruction errors. Despite recent advances, two critical challenges remain largely unresolved: (i) existing depth priors (e.g., MiDaS, DPT) are limited by noisy boundaries and poor global coherence, hindering precise geometry estimation; (ii) fixed fusion methods, such as simple concatenation, cannot effectively adapt the importance of RGB, depth, and global semantic features according to local texture and occlusion conditions, leading to suboptimal performance in complex scenarios. Clearly addressing these gaps is essential to improving the robustness and quality of single-view reconstruction.
In addition, realistic surface appearance is crucial for downstream applications. Oechsle et al. [13] propose Texture Fields, a continuous 3D function representation that decouples texture from mesh discretization and regresses per-point color values, enabling high-frequency detail reconstruction in implicit models. This line of work suggests future extensions where geometry and texture are inferred jointly in a unified framework.
To further improve single-view reconstruction accuracy, it is necessary to introduce higher-fidelity depth priors and achieve a dynamic balance between global and local information during feature fusion.
To this end, we propose the DP-AMF framework, Depth-Prior-Guided Adaptive Multi-Modal and Global–Local Fusion. First, we employ the publicly released pre-trained weights of the MARIGOLD diffusion-based depth estimator to generate high-fidelity depth maps [14]. These depth priors provide reliable spatial cues without extra training overhead. Next, we design an adaptive fusion encoder that concatenates local features extracted by ResNet, global features from DINO-ViT, and the depth priors. A learnable channel-wise weighting module automatically determines which source to rely on at each location: emphasizing RGB details in texture-rich areas and depending more on depth priors and global context in texture-poor or heavily occluded regions. The fused features are then passed to an implicit signed distance field (SDF) decoder to generate the final mesh. While DP-AMF focuses on enhancing geometric fidelity, future integration of methods like Texture Fields [13] could further endow reconstructed meshes with high-quality textures.
In fair comparisons against other implicit reconstruction baselines, DP-AMF demonstrates superior performance on the 3D-FRONT and Pix3D datasets: Chamfer Distance (CD) is reduced by 3.5%, F-Score increases by 2.6%, and Normal Consistency (NC) improves by 0.9% [15,16,17]. Qualitative results further show sharper edges and more complete geometry under heavy occlusion. DP-AMF thus significantly enhances single-view reconstruction accuracy and detail fidelity.
The main contributions of this work are:
  • Depth-Prior Multi-Modal Fusion: We use the pre-trained MARIGOLD diffusion-based depth estimator to generate high-fidelity depth priors [14], which are concatenated with RGB features to alleviate the ill-posed nature of single-view reconstruction.
  • Adaptive Global–Local Feature Fusion: Our encoder processes ResNet-based local features and DINO-ViT global features in parallel, then merges them with pixel-wise depth priors via a learnable fusion module that dynamically adjusts information weights according to texture and occlusion.
  • Significant Performance Improvements: We validate the effectiveness of our approach in fair comparisons with other implicit reconstruction baselines, demonstrating that DP-AMF outperforms existing methods on key metrics (CD, F-Score, NC) [15,16,17] and achieves higher reconstruction quality and detail fidelity in complex scenes.
The rest of this paper is organized as follows. Section 2 reviews prior work on single-view 3D reconstruction and related texture modeling. Section 3 introduces our DP-AMF framework, including depth prior generation, feature extraction, and the adaptive fusion module. Section 4 describes the evaluation protocol; in Section 4.1 we detail the 3D-FRONT and Pix3D datasets, and in Section 4.2 we specify implementation and training settings. Section 5 presents results and analysis; in Section 5.1 we compare against state-of-the-art baselines, and in Section 5.2 we conduct ablation studies to isolate each component’s effect. Finally, Section 6 discusses limitations, potential applications, and future work.

2. Related Work

2.1. Multi-Modal Single-View Reconstruction

Early single-view 3D reconstruction methods relied solely on RGB features, mapping images to volumetric or point-based representations via CNNs or Vision Transformers (ViTs), such as Pix2Vox [18], IV-Net [19], and AtlasNet [20]. However, in texture-sparse or heavily occluded scenarios, these RGB-only approaches often produce incomplete geometry or lose fine details [12,21]. Subsequent work introduced monocular depth priors (e.g., MiDaS [11], DPT [22,23]) to compensate for missing depth information. For example, Kim et al. [24] used gradients from MiDaS depth maps to enhance local reconstruction accuracy. Yet these methods frequently suffer from limited global consistency in the depth maps or over-reliance on depth, which can suppress RGB-driven texture details.
In contrast, we adopt the pre-trained MARIGOLD diffusion-based depth estimator [14] to generate high-fidelity depth maps. Diffusion priors provide stronger robustness and detail fidelity than MiDaS or other monocular depth networks. Within our network, each depth map is encoded into a high-dimensional feature and then propagated to 3D sample points via linear interpolation, supplying reliable geometric constraints at every point without diminishing RGB-driven texture expressions.

2.2. Global–Local Feature Extraction

In single-view reconstruction, pure CNNs (e.g., ResNet [25]) excel at capturing local texture but struggle to encode global shape priors, while pure ViTs [26] model global context effectively but often lack fine edge details. To address this trade-off, prior works have fused both. Yang et al. [27] concatenated ResNet and ViT features before feeding them into an implicit decoder. However, fixed concatenation or simple addition fails to adaptively balance local and global information in texture-rich versus occluded regions.
Our DP-AMF encoder proceeds in three stages. First, it extracts shallow features via a CNN; next, these features are fed in parallel to DINO-ViT [28] for global semantics and to ResNet for deeper local details; finally, the three streams—global, local, and depth priors—are fused by a learnable channel-wise weighting module that dynamically allocates importance based on texture density and occlusion. Ablation studies demonstrate that this adaptive fusion outperforms fixed concatenation or simple addition in preserving both global contours and local details.

3. Methodology

Figure 1 depicts the overall DP-AMF pipeline. We address two main challenges in single-view 3D reconstruction: (1) the lack of reliable depth cues from a single RGB image, and (2) the need to balance fine local details with global context. Accordingly, we introduce (i) a depth-prior branch using a diffusion-based estimator to supply high-fidelity geometric guidance, and (ii) an adaptive fusion encoder that integrates local, global, and depth features in a data-driven manner. In this section, we explain our design choices and their motivations step by step.

3.1. Depth Prior Generation

Single-view reconstruction is ill-posed because an RGB image alone lacks scale and depth information, especially in texture-poor or occluded regions. Prior works used MiDaS or DPT to generate depth maps, but those networks often yield noisy edges and inconsistent global structure under challenging conditions. To obtain a more reliable depth prior, we adopt the MARIGOLD diffusion-based depth estimator [14]. Diffusion priors have shown superior robustness to occlusions and lighting variations compared to earlier monocular depth methods.
Concretely, given an input image $x$, MARIGOLD first encodes $x$ into a latent $z^{(x)}$ via a VAE encoder. We initialize the depth latent $z^{(d)}_{T}$ with Gaussian noise and iteratively denoise it conditioned on $z^{(x)}$:

$$z^{(d)}_{t-1} = G_{\theta}\big(z^{(d)}_{t},\, t,\, z^{(x)}\big), \qquad t = T, T-1, \ldots, 1,$$

where $G_{\theta}$ is the learned denoiser. After $T$ steps, the final latent $z^{(d)}_{0}$ is decoded into a depth map:

$$\hat{d} = D\big(z^{(d)}_{0}\big),$$

with $D$ the VAE decoder. We then apply a $3 \times 3$ convolution to convert $\hat{d}$ (of size $H \times W$) into a multi-channel depth feature $F_{\text{depth}} \in \mathbb{R}^{H' \times W' \times C_d}$. This convolution both increases the representational capacity (from 1 channel to $C_d$ channels) and downsamples to match the resolution $H' \times W'$ of the later feature maps.
To reduce training and inference overhead, all MARIGOLD depth maps are computed once offline using the pre-trained model and cached; they are neither re-generated nor fine-tuned during subsequent training or inference.
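The caching step can be sketched as follows; `estimate_depth` is a hypothetical wrapper around the released pre-trained MARIGOLD pipeline, and the cache layout is illustrative rather than the project's actual one.

```python
import os
import torch
from PIL import Image

@torch.no_grad()
def cache_depth_priors(image_paths, estimate_depth, cache_dir="depth_cache"):
    """Run the frozen MARIGOLD estimator once per image and store the result.

    `estimate_depth` is assumed to wrap the released pre-trained MARIGOLD
    pipeline and to return an HxW torch tensor for a PIL image; it is never
    updated, so the cached maps stay fixed for all later training runs.
    """
    os.makedirs(cache_dir, exist_ok=True)
    for path in image_paths:
        out_path = os.path.join(cache_dir, os.path.basename(path) + ".pt")
        if os.path.exists(out_path):
            continue  # already cached: no extra cost at training time
        depth = estimate_depth(Image.open(path).convert("RGB"))
        torch.save(depth.cpu(), out_path)

def load_depth_prior(image_path, cache_dir="depth_cache"):
    # Training and inference only read the cached map; MARIGOLD is never re-run.
    return torch.load(os.path.join(cache_dir, os.path.basename(image_path) + ".pt"))
```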
By freezing $F_{\text{depth}}$, i.e., not fine-tuning MARIGOLD, we ensure stable, high-quality depth guidance without extra training cost. In practice, when sampling a 3D point $p_i$ along the camera ray,

$$p_i = o + t_i d, \qquad t_i \sim \mathrm{Uniform}(t_{\min}, t_{\max}),$$

we project $p_i$ onto the image plane using $\pi(p_i)$ and fetch the corresponding depth feature by bilinear interpolation:

$$F_{\text{depth}}(p_i) = \mathrm{Bilinear}\big(F_{\text{depth}}, \pi(p_i)\big).$$
This assignment gives each 3D sample point a robust geometric cue that significantly improves geometry estimation under occlusion or low texture.
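A minimal sketch of this per-point lookup is shown below, assuming a standard pinhole projection $\pi$ with intrinsics K and camera-frame sample points; the projection convention and tensor shapes are assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F

def sample_depth_features(feat_depth, points_cam, K):
    """Bilinearly sample F_depth at the projections pi(p_i) of 3D sample points.

    feat_depth : (1, C_d, H', W') depth feature map produced by the 3x3 conv
    points_cam : (P, 3) points p_i = o + t_i * d in camera coordinates (z > 0)
    K          : (3, 3) pinhole intrinsics (assumed projection convention)
    returns    : (P, C_d) per-point depth features
    """
    _, c_d, h, w = feat_depth.shape
    # Pinhole projection pi(p): (u, v) = (fx * x / z + cx, fy * y / z + cy)
    proj = points_cam @ K.T
    uv = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)
    # Normalize pixel coordinates to [-1, 1], the range expected by grid_sample
    grid = torch.stack([2.0 * uv[:, 0] / (w - 1) - 1.0,
                        2.0 * uv[:, 1] / (h - 1) - 1.0], dim=-1)
    sampled = F.grid_sample(feat_depth, grid.view(1, 1, -1, 2),
                            mode="bilinear", align_corners=True)  # (1, C_d, 1, P)
    return sampled[0, :, 0, :].transpose(0, 1)                    # (P, C_d)
```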
We select MARIGOLD as our depth-prior generator for three primary reasons. (i) Compared to conventional estimators such as MiDaS or DPT, MARIGOLD’s diffusion-based denoising mechanism exhibits superior robustness to occlusions and challenging lighting, yielding more accurate object-level depth details; (ii) being derived from Stable Diffusion, MARIGOLD retains rich visual priors that enable strong zero-shot generalization across diverse, unseen scenes; (iii) to minimize computational overhead, we precompute and cache all MARIGOLD depth maps offline—using even the lightest publicly available pre-trained variant requires only 0.87 s per image—thereby achieving a favorable balance between accuracy and efficiency. Figure 2 presents a side-by-side qualitative comparison of depth maps produced by MiDaS, Omnidata, DPT, and MARIGOLD against ground truth. As highlighted by the arrows, MARIGOLD more faithfully delineates object edges (e.g., chair back) and reduces background noise, validating its selection as the most effective depth prior in our framework.

3.2. Feature Extraction and Fusion

Reconstructing fine geometry requires features that capture both high-frequency local detail and low-frequency global context. We therefore design a two-step encoder: it first extracts local and global features separately, then merges them adaptively with the depth prior.
Local vs. Global Extraction Rationale. Early CNNs (e.g., VGG or ResNet) excel at capturing local texture and edges, making them ideal for fine-grained geometry. However, they lack a mechanism to model long-range dependencies, which can cause shape inference inconsistencies in large, complex scenes. Vision Transformers (ViTs) address this by computing global self-attention, but a standard ViT pre-trained on classification (e.g., ImageNet) may not robustly encode precise boundary details. We choose DINO-ViT [28]—a self-distilled ViT variant—because its unsupervised training learns more semantically consistent patch representations, improving robustness under occlusion and lighting changes.
Shallow Feature Backbone. We begin by extracting shallow features with ResNet-32 up to layer $k$:

$$F_{\text{shallow}} = \mathrm{ResNet32}_{\text{layers } 1 \ldots k}(x),$$

where $F_{\text{shallow}} \in \mathbb{R}^{H' \times W' \times C_s}$. We use ResNet-32 instead of a deeper network because we only need to capture mid-level textures; deeper layers would aggregate too much semantic abstraction and discard edges essential for geometry.

Global Feature via DINO-ViT. To capture scene-wide context, we pass $F_{\text{shallow}}$ through DINO-ViT. Specifically, we flatten or patchify $F_{\text{shallow}}$ to form ViT inputs, obtain the learned global CLS token $C_{\text{vit}} \in \mathbb{R}^{D}$, and project it back to spatial dimensions with a $1 \times 1$ convolution:

$$F_{\text{vit}} = \mathrm{Conv}_{1 \times 1}(C_{\text{vit}}), \qquad F_{\text{vit}} \in \mathbb{R}^{H' \times W' \times C_v}.$$

This $F_{\text{vit}}$ encodes long-range dependencies, enabling coherent shape reasoning across the entire image.

Local Feature via ResNet-18. In parallel, $F_{\text{shallow}}$ is fed into ResNet-18 from layer 1 to $m$ to extract deeper local features:

$$F_{\text{res}} = \mathrm{ResNet18}_{\text{layers } 1 \ldots m}(F_{\text{shallow}}), \qquad F_{\text{res}} \in \mathbb{R}^{H' \times W' \times C_r}.$$
We choose ResNet-18 for local detail because its residual connections help preserve edge information, and it is lightweight enough to avoid overfitting on small datasets.
As shown in Figure 1, our encoder employs a two-stage feature extraction strategy. First, ResNet-32 processes the entire image to produce shallow global features $F_{\text{highD}}$ that capture scene layout and identify regions of interest (e.g., tables, sofas, chairs). Then, each ROI is further refined by ResNet-18 to generate high-fidelity, 256-dimensional local object features $F_{\text{obj}}$. This hierarchical design leverages ResNet-32's capacity for broad contextual reasoning alongside ResNet-18's strength in preserving fine-grained details, yielding a compact yet expressive representation without the parameter overhead of a single deeper network or the loss of context from using only ResNet-18.
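The two parallel streams can be sketched as below; `dino_vit_cls` and `resnet18_trunk` are hypothetical callables standing in for the DINO-ViT CLS-token extraction (including patchification of the shallow map) and for layers 1 to m of ResNet-18, and the channel sizes are placeholders.

```python
import torch
import torch.nn as nn

class GlobalLocalBranches(nn.Module):
    """Sketch of the two parallel streams applied to the shallow feature map.

    `dino_vit_cls` is assumed to patchify the shallow map and return the DINO-ViT
    CLS token; `resnet18_trunk` stands in for layers 1..m of ResNet-18 and is
    assumed to keep the H' x W' resolution. Channel sizes are placeholders.
    """

    def __init__(self, dino_vit_cls, resnet18_trunk, d_cls=768, c_v=128):
        super().__init__()
        self.dino_vit_cls = dino_vit_cls
        self.resnet18 = resnet18_trunk
        self.proj_cls = nn.Conv2d(d_cls, c_v, kernel_size=1)  # 1x1 projection of the CLS token

    def forward(self, f_shallow):
        b, _, h, w = f_shallow.shape
        # Global stream: CLS token -> 1x1 conv -> broadcast over the spatial grid
        cls_tok = self.dino_vit_cls(f_shallow)                       # (B, d_cls)
        f_vit = self.proj_cls(cls_tok.view(b, -1, 1, 1)).expand(-1, -1, h, w)
        # Local stream: deeper residual features computed on the same shallow map
        f_res = self.resnet18(f_shallow)                             # (B, C_r, h, w)
        return f_vit, f_res
```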
Adaptive Fusion of Three Modalities. Having obtained the three feature maps $\{F_{\text{vit}}, F_{\text{res}}, F_{\text{depth}}\}$ with channel counts $(C_v, C_r, C_d)$, we concatenate them:

$$F_{\text{cat}} = F_{\text{vit}} \oplus F_{\text{res}} \oplus F_{\text{depth}} \in \mathbb{R}^{H' \times W' \times (C_v + C_r + C_d)},$$

where $\oplus$ denotes channel-wise concatenation. A $1 \times 1$ convolution followed by a Sigmoid produces channel-wise weights

$$\alpha = \sigma\big(\mathrm{Conv}_{1 \times 1}(F_{\text{cat}})\big), \qquad \alpha \in [0, 1]^{C_v + C_r + C_d},$$

where $\alpha_c$ indicates the relative importance of channel $c$. The fused feature is

$$F_{\text{fusion}} = \sum_{c=1}^{C_v + C_r + C_d} \alpha_c \, F_{\text{cat}}^{(c)}, \qquad F_{\text{fusion}} \in \mathbb{R}^{H' \times W' \times C_f}.$$

This adaptive weighting ensures that in texture-rich regions (where local detail matters), the $F_{\text{res}}$ channels receive higher weights, while in occluded or uniform areas, the $F_{\text{vit}}$ or $F_{\text{depth}}$ channels dominate. Compared to fixed concatenation (which treats all channels equally), this mechanism dynamically balances the three modalities based on local context. Figure 3 shows a qualitative illustration of the three input modalities and their fusion.
Although recent works explore complex cross-modal fusion (e.g., transformer-based attention), we adopt a simple concatenation followed by a $1 \times 1$ convolution and sigmoid activation for channel-wise weighting. This choice is motivated by the spatially aligned and complementary nature of our ResNet, ViT, and depth features, making a lightweight fusion both effective and efficient. To prevent the learned weights from collapsing to uniform or single-modality distributions, we clamp the pre-sigmoid logits to $[-3, 3]$ and apply an L2 weight decay of $1 \times 10^{-4}$ to the fusion layer parameters. These measures ensure sufficient diversity in the per-channel weights, allowing the network to adaptively emphasize the most informative modality at each spatial location.
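A minimal PyTorch sketch of this fusion step, under the assumptions that the three feature maps are spatially aligned and that the weight decay is attached through the optimizer, is:

```python
import torch
import torch.nn as nn

class AdaptiveChannelFusion(nn.Module):
    """Concatenate the ViT, ResNet, and depth feature maps, predict per-channel
    weights with a 1x1 convolution, clamp the pre-sigmoid logits to [-3, 3],
    and re-weight the concatenated channels. Channel sizes are illustrative."""

    def __init__(self, c_v, c_r, c_d):
        super().__init__()
        c_total = c_v + c_r + c_d
        self.gate = nn.Conv2d(c_total, c_total, kernel_size=1)

    def forward(self, f_vit, f_res, f_depth):
        f_cat = torch.cat([f_vit, f_res, f_depth], dim=1)  # (B, C_v + C_r + C_d, H', W')
        logits = self.gate(f_cat).clamp(-3.0, 3.0)         # keep the gate from collapsing
        alpha = torch.sigmoid(logits)                      # per-channel weights in [0, 1]
        return alpha * f_cat                               # fused feature F_fusion
```

The L2 weight decay of $1 \times 10^{-4}$ would then be supplied via an optimizer parameter group, e.g. passing `fusion.gate.parameters()` to Adam with `weight_decay=1e-4`; the released code may wire this differently.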

3.3. Two-Stage Training Objectives

We train DP-AMF in two stages to decouple geometry learning from appearance optimization.
Stage 1: Geometry-Only Optimization. We represent geometry with a signed distance function (SDF) network $h_{\theta}$, which takes as input a 3D point $p$ and its fused feature:

$$s(p) = h_{\theta}\big(p, F_{\text{fusion}}(\pi(p))\big),$$

where $\pi(p)$ projects $p$ to image coordinates and fetches $F_{\text{fusion}}$. We minimize the L1 SDF loss:

$$\mathcal{L}_{\text{sdf}} = \mathbb{E}_{p \in \mathcal{P}}\big[\,\lvert s(p) - s^{*}(p) \rvert\,\big],$$

where $s^{*}(p)$ is the ground-truth signed distance. By focusing solely on geometry in Stage 1, we prevent early texture gradients from distorting the shape.
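A compact sketch of the Stage-1 objective is given below; the MLP architecture (widths, depth, activation) is illustrative and not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SDFDecoder(nn.Module):
    """Sketch of the implicit SDF head h_theta: a small MLP over a 3D point and
    its pixel-aligned fused feature. Widths and depth are illustrative, not the
    paper's exact configuration."""

    def __init__(self, feat_dim, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + feat_dim, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, 1),
        )

    def forward(self, points, fused_feats):
        # points: (P, 3); fused_feats: (P, feat_dim) gathered at pi(p)
        return self.mlp(torch.cat([points, fused_feats], dim=-1)).squeeze(-1)

def sdf_l1_loss(pred_sdf, gt_sdf):
    # Stage-1 objective: L_sdf = E_p |s(p) - s*(p)|
    return (pred_sdf - gt_sdf).abs().mean()
```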
Stage 2: Full Reconstruction. Once the geometry loss has converged to a stable minimum (monitored empirically), we temporarily fix $h_{\theta}$ and introduce three additional losses: color, normal consistency, and depth consistency. The total loss is

$$\mathcal{L} = \lambda_{\text{sdf}} \mathcal{L}_{\text{sdf}} + \lambda_{\text{color}} \mathcal{L}_{\text{color}} + \lambda_{\text{normal}} \mathcal{L}_{\text{normal}} + \lambda_{\text{depth}} \mathcal{L}_{\text{depth}}.$$

Here, $\mathcal{L}_{\text{color}}$ is the photometric error between the rendered color and the ground truth, $\mathcal{L}_{\text{normal}}$ measures the angular difference between the predicted normals $\nabla s(p)$ and the ground-truth normals, and $\mathcal{L}_{\text{depth}}$ enforces consistency between the depth rendered from the implicit SDF and the diffusion-based depth $\hat{d}$.
Specifically, on the large-scale synthetic 3D-FRONT dataset, geometry typically converges at approximately epoch 30. Thus, Stage 2 begins at epoch 30, gradually ramping up appearance-related losses until reaching their full weights by epoch 80, and continuing training through epoch 200. For the smaller yet more complex Pix3D dataset, geometry convergence occurs later—around epoch 50; accordingly, appearance losses are introduced at epoch 50, fully weighted by epoch 150, and training continues until epoch 300 (see Table 1).
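The schedule in Table 1 can be expressed as a simple weight function; the linear ramp shape is an assumption (the text only states that the appearance weights are gradually increased), while the milestones and final values follow the table.

```python
def stage2_loss_weights(epoch, dataset="3D-FRONT"):
    """Per-epoch loss weights following Table 1: lambda_sdf stays at 1.0 while
    the appearance weights ramp from 0 to their full values over the listed
    epoch range (a linear ramp is assumed here)."""
    start, end = (30, 80) if dataset == "3D-FRONT" else (50, 150)  # Pix3D otherwise
    ramp = min(max((epoch - start) / float(end - start), 0.0), 1.0)
    return {"sdf": 1.0, "color": 0.1 * ramp, "normal": 0.01 * ramp, "depth": 0.1 * ramp}
```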
Preliminary experiments indicated that freezing the geometry network before introducing appearance losses maintains stable geometric fidelity. Attempting end-to-end fine-tuning (i.e., allowing appearance gradients into geometry from the outset) led to slower convergence of geometry and provided no substantial improvements in final reconstruction quality. We attribute this to conflicting gradients from geometric and appearance objectives during early training stages.
Although our staged training approach improves stability and ensures clear geometry–texture decoupling, we acknowledge that joint optimization could potentially allow for greater holistic feature sharing. Exploring this balance further constitutes an interesting avenue for future research.

4. Experiments

4.1. Datasets

We evaluate DP-AMF on two widely used benchmarks covering both synthetic and real indoor scenes: 3D-FRONT [29] and Pix3D [30].
3D-FRONT contains over 10,000 synthetic indoor scene models and more than 300,000 individual 3D objects across diverse categories with detailed material and layout annotations. It also provides precise camera poses and both 2D/3D bounding boxes, which facilitate reliable spatial relationship learning and occlusion handling.
Pix3D offers 12,471 real-world image–model pairs spanning 9 object categories (e.g., chairs, tables, sofas) with fine-grained pixel-to-model alignments. We adopt the standard split of 6931 (55.6%) training, 2778 (22.3%) validation, and 2762 (22.1%) testing samples [30].
Following the split scheme of Liu et al. [31], we randomly split these datasets as shown in Table 2.

4.2. Experimental Setup

All experiments were run under Ubuntu 20.04 on an NVIDIA RTX 3090 GPU. We leverage PyTorch (version 2.5.1, developed by Facebook AI Research, Menlo Park, CA, USA) with CUDA and multi-threaded data loading for efficiency. For full reproducibility, we provide detailed training and implementation configurations in Table 3. We also measured the wall-clock inference time on the RTX 3090; over 2000 runs on the 3D-FRONT dataset, geometry combined with texture inference takes around 1.647 s per image.
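For reference, the timing protocol can be sketched as below; `model` and `loader` are placeholders for the reconstruction pipeline and the test-image loader, and GPU synchronization around each forward pass is assumed to be the measurement convention.

```python
import time
import torch

@torch.no_grad()
def mean_inference_time(model, loader, device="cuda", warmup=10):
    """Average per-image wall-clock time, synchronizing the GPU around each
    forward pass. `model` and `loader` are placeholders for the reconstruction
    pipeline and a loader that yields single-image batches."""
    model.eval().to(device)
    times = []
    for i, batch in enumerate(loader):
        batch = batch.to(device)
        torch.cuda.synchronize()
        start = time.perf_counter()
        model(batch)
        torch.cuda.synchronize()
        if i >= warmup:  # discard warm-up iterations before averaging
            times.append(time.perf_counter() - start)
    return sum(times) / max(len(times), 1)
```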

5. Results and Analysis

5.1. Compared Experiments

We evaluate our method against four state-of-the-art single-view 3D reconstruction approaches—MGN [32], LIEN [33], InstPIFu [31], and SSR [12]—under identical training and testing configurations on both the 3D-FRONT and Pix3D datasets. These baselines were chosen because they all target indoor scene data, employ implicit surface representations, and emphasize either holistic scene understanding or high-fidelity object reconstruction.
While more recent implicit reconstruction methods such as POCO [34] and Neural Kernel Surface Reconstruction (NKSR) [35] have been proposed, these techniques primarily operate on 3D point-cloud inputs, focusing explicitly on tasks involving sparse and noisy 3D measurements or large-scale point-cloud data. In contrast, our method and the selected baselines (MGN, LIEN, InstPIFu, SSR) specifically target single-view RGB image inputs, optionally enhanced by depth priors. Directly comparing with POCO or NKSR would thus involve fundamentally different input modalities, task definitions, and experimental setups, potentially resulting in confounding factors.
As summarized in Table 4 and Table 5, we compare CD, F-Score, and NC. In both datasets, our method achieves the lowest CD, highest F-Score, and superior NC. The improvements—highlighted in bold in the tables—demonstrate the effectiveness of our global–local feature fusion and depth-guided alignment over prior work.
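For readers unfamiliar with these metrics, the sketch below shows one common way to compute CD and F-Score from sampled surface points; the distance convention, threshold, and any normalization are assumptions and may differ from the exact evaluation protocol used here.

```python
import torch

def chamfer_and_fscore(pred_pts, gt_pts, tau=0.05):
    """Chamfer Distance and F-Score between two point sets sampled from the
    predicted and ground-truth surfaces. pred_pts: (N, 3), gt_pts: (M, 3).
    For large point sets the pairwise matrix may need chunking."""
    d = torch.cdist(pred_pts, gt_pts)       # (N, M) pairwise Euclidean distances
    d_pred = d.min(dim=1).values            # prediction -> ground truth
    d_gt = d.min(dim=0).values              # ground truth -> prediction
    chamfer = d_pred.mean() + d_gt.mean()   # symmetric Chamfer Distance
    precision = (d_pred < tau).float().mean()
    recall = (d_gt < tau).float().mean()
    fscore = 2 * precision * recall / (precision + recall + 1e-8)
    return chamfer.item(), fscore.item()
```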
For qualitative comparison, we select SSR as the visualization baseline due to its strong performance and representative architecture. Figure 4 shows reconstructed scenes from SSR (the second row) versus our method (the third row). Notably, our fusion of global context and local detail yields more complete object geometry (e.g., the table in the third column from the right) while preserving fine-grained structures (e.g., the sofa in the fourth column from the left). Moreover, incorporating depth information clearly reduces background artifacts—see the chair in the second column from the left, where our method better suppresses background interference. We also showcase additional reconstruction results (geometry and texture) of our method in Figure 5.
For a more detailed analysis, we further compare the additional model complexity introduced by DP-AMF over the strongest baseline SSR. As shown in Table 6, DP-AMF requires only 0.19 M more learnable parameters and incurs an extra 68.41 GFLOPs per forward pass. Despite this modest increase in both parameter count and computational cost, our method consistently outperforms SSR—achieving lower Chamfer Distance and higher F-Score and Normal Consistency—thereby demonstrating that the improvements arise from more effective global–local feature fusion and depth-guided alignment rather than mere scaling of model size.
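The parameter counts in Table 6 correspond to the usual sum over learnable tensors, as sketched below; the GFLOPs figures are typically obtained with a separate profiling tool, which is not shown here.

```python
def count_parameters_m(model):
    # Learnable parameters in millions (the "Params (M)" column of Table 6).
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6
```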
Overall, our baseline selection provides a fair, focused, and meaningful evaluation of DP-AMF against directly comparable methods in terms of methodological similarities, input modalities, and computational requirements.

5.2. Ablation Experiments

We conduct a set of ablation studies on the 3D-FRONT validation split, training each variant for 80 epochs. We examine:
  • Depth backbone: MARIGOLD [14] vs. MiDaS [11] vs. Without depth module.
  • Global encoder: ViT [26] vs. DINO-ViT [28] vs. Without global encoder.
Table 7 reports the resulting CD, F-Score, and NC. Adding a depth prior yields a substantial CD reduction and boosts both F-Score and NC. Furthermore, replacing MiDaS with MARIGOLD provides an additional gain, demonstrating the importance of accurate depth priors. Finally, substituting the traditional ViT with DINO-ViT for global encoding further improves all metrics, confirming that self-supervised features better complement local and depth cues.
These ablations confirm that each component of our framework—depth guidance, choice of depth model, and choice of global encoder—contributes positively and synergistically to single-view 3D reconstruction performance.

6. Discussion

Our results show that integrating high-fidelity diffusion-based depth priors with an adaptive global–local fusion encoder substantially closes the gap left by RGB-only and fixed-fusion methods. In comparison to prior approaches (e.g., SSR [12], InstPIFu [31]), DP-AMF reduces CD by approximately 7.64%, boosts F-Score by about 2.81%, and improves NC by around 5.88% on the 3D-FRONT dataset. These gains confirm that diffusion priors provide reliable geometric cues under occlusion and that our channel-wise weighting mechanism effectively balances fine textures and global context, enabling more complete and accurate reconstructions in cluttered indoor scenes.
Unlike traditional point cloud-based reconstructions often used in architectural heritage documentation [36], the integration of high-fidelity texture reconstruction can enable more realistic digital preservation [37] and immersive visualization in VR platforms [38]. In urban navigation, where 2D maps or images can be confusing [39], textured 3D models could provide more intuitive spatial orientation and landmark recognition [40]. Furthermore, in fields like structural health monitoring [41], detailed textured meshes can enhance the visibility and tracking of surface-level damages such as cracks or erosion over time [42].
Although we did not include explicit statistical significance tests in this study, we conducted extensive ablation experiments demonstrating that removing any key module consistently degrades CD, F-Score, and NC, confirming the robustness of each component’s contribution. In future work, we plan to perform paired t-tests and bootstrap analyses on these metrics to quantitatively evaluate the statistical significance of the observed improvements.
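A sketch of such an analysis on per-image Chamfer Distances is given below; the array names are illustrative and the 95% percentile bootstrap is one of several reasonable choices.

```python
import numpy as np
from scipy import stats

def paired_significance(metric_baseline, metric_ours, n_boot=10000, seed=0):
    """Paired t-test plus a 95% bootstrap confidence interval on per-image
    metric differences (e.g., Chamfer Distances). Array names are illustrative."""
    baseline = np.asarray(metric_baseline, dtype=float)
    ours = np.asarray(metric_ours, dtype=float)
    diffs = baseline - ours                      # > 0 means our method improves
    t_stat, p_value = stats.ttest_rel(baseline, ours)
    rng = np.random.default_rng(seed)
    boot_means = np.array([
        diffs[rng.integers(0, len(diffs), len(diffs))].mean()
        for _ in range(n_boot)
    ])
    ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
    return {"t": float(t_stat), "p": float(p_value), "ci95": (ci_low, ci_high)}
```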
Despite these advances, DP-AMF still depends heavily on large, fully annotated 3D datasets, which limits its out-of-domain generalization. Moreover, the multi-branch encoder and diffusion-based depth generation incur nontrivial computational overhead. Future work could explore semi- or weakly supervised learning methods to reduce annotation demands and investigate lightweight fusion modules or attention distillation techniques to accelerate inference. Extending the framework to dynamic scenes, outdoor environments, or integrating multi-modal cues (e.g., incorporating language–image models such as CLIP [43]) would further broaden its applicability, facilitating real-time augmented reality (AR), VR, and robotic perception tasks.

Author Contributions

Methodology, L.Z.; Validation, L.Z.; Writing—original draft preparation, L.Z.; Writing—review & editing, C.X. and I.K.; Supervision, C.X. and I.K.; Funding acquisition, I.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by JST SPRING (Grant Number JPMJSP2124) and received partial funding from JSPS Grant-in-Aid for Scientific Research (Grant Number 25K03146). This work was also funded by MEXT Promotion of Development of a Joint Usage/Research System Project: Coalition of Universities for Research Excellence Program (CURE) (Grant Number JPMXP1323015474).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The 3D-FRONT dataset is publicly available at https://tianchi.aliyun.com/specials/promotion/alibaba-3d-scene-dataset (accessed on 15 July 2025), and the Pix3D dataset is available at http://pix3d.csail.mit.edu/ (accessed on 15 July 2025). Code is available here: https://github.com/AnnnnnieZhang/DP-AMF (accessed on 15 July 2025).

Acknowledgments

We thank the Computer Vision and Image Media Lab for valuable discussions, and Yoshinari Kameda and Masato Tsukada for their insightful suggestions.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Choy, C.B.; Xu, D.; Gwak, J.; Chen, K.; Savarese, S. 3D-R2N2: A Unified Approach for Single and Multi-view 3D Object Reconstruction. arXiv 2016, arXiv:1604.00449. [Google Scholar]
  2. Godard, C.; Aodha, O.M.; Brostow, G.J. Unsupervised Monocular Depth Estimation with Left-Right Consistency. arXiv 2017, arXiv:1609.03677. [Google Scholar] [CrossRef]
  3. Perwez, U.; Yamaguchi, Y.; Ma, T.; Dai, Y.; Shimoda, Y. Multi-scale GIS-synthetic hybrid approach for the development of commercial building stock energy model. Appl. Energy 2022, 323, 119536. [Google Scholar] [CrossRef]
  4. Li, Q.; Zhao, B.; Wang, X.; Yang, G.; Chang, Y.; Chen, X.; Chen, B.M. Autonomous building material stock estimation using 3D modeling and multilayer perceptron. Sustain. Cities Soc. 2025, 130, 106522. [Google Scholar] [CrossRef]
  5. Palliwal, A.; Song, S.; Tan, H.T.W.; Biljecki, F. 3D city models for urban farming site identification in buildings. Comput. Environ. Urban Syst. 2021, 86, 101584. [Google Scholar] [CrossRef]
  6. Li, Q.; Yang, G.; Bian, C.; Long, L.; Wang, X.; Gao, C.; Wong, C.L.; Huang, Y.; Zhao, B.; Chen, X.; et al. Autonomous design framework for deploying building integrated photovoltaics. Appl. Energy 2025, 377, 124760. [Google Scholar] [CrossRef]
  7. Sun, L.; Jiang, Y.; Guo, Q.; Ji, L.; Xie, Y.; Qiao, Q.; Huang, G.; Xiao, K. A GIS-based multi-criteria decision making method for the potential assessment and suitable sites selection of PV and CSP plants. Resour. Conserv. Recycl. 2021, 168, 105306. [Google Scholar] [CrossRef]
  8. Li, Q.; Long, L.; Li, X.; Yang, G.; Bian, C.; Zhao, B.; Chen, X.; Chen, B.M. Life cycle cost analysis of circular photovoltaic façade in dense urban environment using 3D modeling. Renew. Energy 2025, 238, 121914. [Google Scholar] [CrossRef]
  9. Wang, H.; Zhang, G.; Cao, H.; Hu, K.; Wang, Q.; Deng, Y.; Gao, J.; Tang, Y. Geometry-Aware 3D Point Cloud Learning for Precise Cutting-Point Detection in Unstructured Field Environments. J. Field Robot. 2025, 42, e22567. [Google Scholar] [CrossRef]
  10. Brinatti Vazquez, G.D.; Lacapmesure, A.M.; Martínez, S.; Martínez, O.E. SUPPOSe 3Dge: A Method for Super-Resolved Detection of Surfaces in Volumetric Fluorescence Microscopy. J. Opt. Photonics Res. 2024, 1, 2350. [Google Scholar] [CrossRef]
  11. Ranftl, R.; Bochkovskiy, A.; Koltun, V. MiDaS: High-Quality Depth Estimation with Minimal Training Data. In Proceedings of the European Conference on Computer Vision (ECCV), Virtual, 23–28 August 2020. [Google Scholar]
  12. Wang, Q.; Zhang, H.; Lin, M. Single-view 3D Scene Reconstruction with High-fidelity Shape and Texture. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 876–886. [Google Scholar]
  13. Oechsle, M.; Mescheder, L.; Niemeyer, M.; Strauss, T.; Geiger, A. Texture Fields: Learning Texture Representations in Function Space. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
  14. Ke, B.; Obukhov, A.; Huang, S.; Metzger, N.; Daudt, R.C.; Schindler, K. Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation. arXiv 2024, arXiv:2312.02145. [Google Scholar] [CrossRef]
  15. Fan, H.; Su, H.; Guibas, L. A Point Set Generation Network for 3D Object Reconstruction from a Single Image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 605–613. [Google Scholar]
  16. Knapitsch, A.; Park, J.; Zhou, Q.Y.; Koltun, V. Tanks and Temples: Benchmarking Large-Scale Scene Reconstruction. ACM Trans. Graph. 2017, 36, 1–13. [Google Scholar] [CrossRef]
  17. Smith, J.; Wang, L.; Lee, D. Normal Consistency for Surface Reconstruction in 3D Modeling. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
  18. Xie, H.; Yao, H.; Sun, X.; Zhou, S.; Zhang, S. Pix2Vox: Context-Aware 3D Reconstruction From Single and Multi-View Images. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 2690–2698. [Google Scholar] [CrossRef]
  19. Sun, B.; Jiang, P.; Kong, D.; Shen, T. IV-Net: Single-view 3D volume reconstruction by fusing features of image and recovered volume. Vis. Comput. 2023, 39, 6237–6247. [Google Scholar] [CrossRef]
  20. Groueix, T.; Fisher, M.; Kim, V.G.; Russell, B.C.; Aubry, M. AtlasNet: A Papier-Mâché Approach to Learning 3D Surface Generation. arXiv 2018, arXiv:1802.05384. [Google Scholar]
  21. Shen, Q.; Yang, X.; Wang, X. Anything-3D: Towards Single-view Anything Reconstruction in the Wild. arXiv 2023, arXiv:2304.10261. [Google Scholar]
  22. Ranftl, R.; Lasinger, K.; Hafner, D.; Schindler, K.; Koltun, V. Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 1623–1637. [Google Scholar] [CrossRef]
  23. Ranftl, R.; Bochkovskiy, A.; Koltun, V. Vision Transformers for Dense Prediction. arXiv 2021, arXiv:2103.13413. [Google Scholar]
  24. Kim, T.; Lee, J.; Lee, K.T.; Choe, Y. Single-View 3D Reconstruction Based on Gradient-Applied Weighted Loss. J. Electr. Eng. Technol. 2024, 19, 4523–4535. [Google Scholar] [CrossRef]
  25. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  26. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia, 30 April 2020. [Google Scholar]
  27. Yang, W.J.; Wu, C.C.; Yang, J.F. Residual Vision Transformer and Adaptive Fusion Autoencoders for Monocular Depth Estimation. Sensors 2025, 25, 80. [Google Scholar] [CrossRef]
  28. Caron, M.; Touvron, H.; Misra, I.; Jégou, H.; Mairal, J.; Bojanowski, P.; Joulin, A. Emerging Properties in Self-Supervised Vision Transformers. arXiv 2021, arXiv:2104.14294. [Google Scholar] [CrossRef]
  29. Fu, H.; Cai, B.; Gao, L.; Zhang, L.X.; Wang, J.; Li, C.; Zeng, Q.; Sun, C.; Jia, R.; Zhao, B.; et al. 3D-FRONT: 3D Furnished Rooms with Layouts and Semantics. In Proceedings of the International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021. [Google Scholar]
  30. Sun, X.; Wu, J.; Zhang, X.; Zhang, Z.; Zhang, C.; Xue, T.; Tenenbaum, J.B.; Freeman, W.T. Pix3D: Dataset and Methods for Single-Image 3D Shape Modeling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  31. Liu, H.; Zheng, Y.; Chen, G.; Cui, S.; Han, X. Towards High-Fidelity Single-view Holistic Reconstruction of Indoor Scenes. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022. [Google Scholar]
  32. Li, J.; Wang, X.; Li, D. Total3DUnderstanding: Joint Layout, Object Pose and Mesh Reconstruction for Indoor Scenes from a Single Image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 55–65. [Google Scholar]
  33. Xu, K.; Lin, Y.; Huang, F. Holistic 3D Scene Understanding from a Single Image with Implicit Representation. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 102–114. [Google Scholar]
  34. Boulch, A.; Marlet, R. POCO: Point Convolution for Surface Reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 6302–6314. [Google Scholar]
  35. Huang, J.; Gojcic, Z.; Atzmon, M.; Litany, O.; Fidler, S.; Williams, F. Neural Kernel Surface Reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 4369–4379. [Google Scholar]
  36. Li, Q.; Yang, G.; Gao, C.; Huang, Y.; Zhang, J.; Huang, D.; Zhao, B.; Chen, X.; Chen, B.M. Single drone-based 3D reconstruction approach to improve public engagement in conservation of heritage buildings: A case of Hakka Tulou. J. Build. Eng. 2024, 87, 108954. [Google Scholar] [CrossRef]
  37. Liu, Y.; Chen, J. Research on the Conservation of Historical Buildings Based on Digital 3D Reconstruction. Procedia Comput. Sci. 2023, 228, 593–600. [Google Scholar] [CrossRef]
  38. Shanti, Z.; Al-Tarazi, D. Virtual Reality Technology in Architectural Theory Learning: An Experiment on the Module of History of Architecture. Sustainability 2023, 15, 16394. [Google Scholar] [CrossRef]
  39. Whiton, R.; Chen, J.; Johansson, T.; Tufvesson, F. Urban Navigation with LTE using a Large Antenna Array and Machine Learning. In Proceedings of the 2022 IEEE 95th Vehicular Technology Conference: (VTC2022-Spring), Helsinki, Finland, 19–22 June 2022; pp. 1–5. [Google Scholar] [CrossRef]
  40. Zhang, Y.; Nakajima, T. Exploring the Design of a Mixed-Reality 3D Minimap to Enhance Pedestrian Satisfaction in Urban Exploratory Navigation. Future Internet 2022, 14, 325. [Google Scholar] [CrossRef]
  41. Long, L.; Gan, Z.; Liu, Z.; Zhao, B.; Li, Q. MSD-Det: Masonry structures damage detection dataset for preventive conservation of heritage. J. Cult. Herit. 2025, 73, 358–370. [Google Scholar] [CrossRef]
  42. Yang, G.; Zhao, B.; Zhang, J.; Wen, J.; Li, Q.; Lei, L.; Chen, X.; Chen, B. Det-Recon-Reg: An Intelligent Framework Toward Automated UAV-Based Large-Scale Infrastructure Inspection. IEEE Trans. Instrum. Meas. 2025, 74, 1–16. [Google Scholar] [CrossRef]
  43. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. arXiv 2021, arXiv:2103.00020. [Google Scholar] [CrossRef]
Figure 1. Overall pipeline of the proposed DP-AMF framework. From left to right: (1) MARIGOLD depth-prior branch; (2) shallow CNN, DINO-ViT and ResNet-18 branches; (3) adaptive global–local fusion; (4) two-stage implicit reconstruction for geometry and texture.
Figure 2. Qualitative comparison of monocular depth priors on a representative indoor scene. Arrows highlight regions where MARIGOLD more faithfully preserves object boundaries and reduces noise in occluded areas, demonstrating its superior object-level detail and global consistency [14].
Figure 3. Visualization of our adaptive fusion process. From left to right: the global-vision feature $F_{\text{vit}}$, the local-detail feature $F_{\text{res}}$, the depth-sensitive feature $F_{\text{depth}}$, and the final fused feature $F_{\text{mix}}$ after the $1 \times 1$ convolution.
Figure 4. Reconstruction results of indoor objects on the 3D-FRONT dataset [29]. We compare our method against the strong baseline SSR [12]. Red bounding boxes highlight regions with notable differences, illustrating our improved recovery of fine geometric details.
Figure 5. In the second stage of indoor object reconstruction, the proposed model generates high-quality textured 3D objects. Red bounding boxes highlight the regions of our reconstructed objects.
Table 1. Loss configurations and training milestones for the two-stage training scheme.

Stage | λ_sdf | λ_color | λ_normal | λ_depth
Stage 1 | 1.0 | 0 | 0 | 0
Stage 2 (3D-FRONT) | 1.0 | 0→0.1 (epochs 30–80) | 0→0.01 (epochs 30–80) | 0→0.1 (epochs 30–80)
Stage 2 (Pix3D) | 1.0 | 0→0.1 (epochs 50–150) | 0→0.01 (epochs 50–150) | 0→0.1 (epochs 50–150)
Table 2. Dataset splits for 3D-FRONT and Pix3D.

Dataset | Train | Validation | Test
3D-FRONT | 22,103 (74.5%) | 2550 (8.6%) | 5006 (16.8%)
Pix3D | 6931 (55.6%) | 2778 (22.3%) | 2762 (22.1%)
Table 3. Experimental setup and hyperparameters.

Item | Configuration
System | Ubuntu 20.04, NVIDIA RTX 3090
Framework | PyTorch + CUDA
Training Stages | 2 (Stage 1: geometry only; Stage 2: full reconstruction)
Epochs | 200 (3D-FRONT), 300 (Pix3D)
Optimizer | Adam, LR = 1 × 10⁻⁴
Batch Size | 96 (3D-FRONT), 128 (Pix3D)
Image Resolution | 484 × 648
CNN Backbones | ResNet-32, ResNet-18 (ImageNet pre-trained)
ViT Backbone | DINO-ViT-16 (768-dim CLS)
Depth Prior | MARIGOLD diffusion pre-trained model
Point Sampling | N = 64 samples/ray
Table 4. Evaluation of object reconstruction on the 3D-FRONT dataset [29]. Bold indicates the best performance, underlined the second best, and the shaded row highlights our method.

Metric | Method | Bed | Chair | Sofa | Table | Desk | Nightstand | Cabinet | Bookshelf | Mean
CD ↓ (7.64%) | MGN | 15.48 | 11.67 | 8.72 | 20.90 | 17.59 | 17.11 | 13.13 | 10.21 | 14.07
 | LIEN | 16.81 | 41.40 | 9.51 | 35.65 | 26.63 | 16.78 | 11.70 | 11.70 | 28.52
 | InstPIFu | 18.17 | 14.06 | 7.66 | 23.25 | 33.33 | 11.73 | 6.04 | 8.03 | 14.46
 | SSR | 13.12 | 12.05 | 6.47 | 19.32 | 28.45 | 11.87 | 6.18 | 7.23 | 13.08
 | Ours | 12.05 | 10.89 | 5.94 | 18.47 | 26.87 | 10.15 | 5.27 | 5.82 | 12.08
F-Score ↑ (2.81%) | MGN | 46.81 | 57.49 | 64.61 | 49.80 | 46.82 | 47.91 | 54.18 | 54.55 | 55.64
 | LIEN | 44.28 | 31.61 | 61.40 | 43.22 | 37.04 | 50.76 | 69.21 | 55.33 | 45.63
 | InstPIFu | 47.85 | 59.08 | 67.60 | 56.43 | 48.49 | 57.14 | 73.32 | 66.13 | 61.32
 | SSR | 52.13 | 62.47 | 69.21 | 60.34 | 52.78 | 60.12 | 75.45 | 68.09 | 62.25
 | Ours | 54.21 | 64.37 | 71.08 | 62.01 | 54.66 | 62.45 | 76.12 | 69.30 | 64.00
NC ↑ (5.88%) | MGN | 0.829 | 0.758 | 0.819 | 0.785 | 0.711 | 0.833 | 0.802 | 0.719 | 0.787
 | LIEN | 0.822 | 0.793 | 0.803 | 0.755 | 0.701 | 0.814 | 0.801 | 0.747 | 0.786
 | InstPIFu | 0.799 | 0.782 | 0.846 | 0.804 | 0.708 | 0.844 | 0.841 | 0.790 | 0.810
 | SSR | 0.832 | 0.803 | 0.849 | 0.814 | 0.709 | 0.861 | 0.828 | 0.806 | 0.813
 | Ours | 0.834 | 0.812 | 0.861 | 0.822 | 0.729 | 0.869 | 0.848 | 0.815 | 0.824
Table 5. Evaluation of object reconstruction on the Pix3D dataset [30]. Bold indicates the best performance, underlined the second best, and the shaded row highlights our method.

Metric | Method | Bed | Bookcase | Chair | Desk | Sofa | Table | Tool | Wardrobe | Misc | Mean
CD ↓ (4.84%) | MGN | 22.91 | 33.61 | 56.47 | 33.95 | 9.27 | 81.19 | 94.70 | 10.43 | 137.50 | 44.32
 | LIEN | 11.18 | 29.61 | 40.01 | 65.36 | 10.54 | 146.13 | 29.63 | 4.88 | 144.06 | 51.31
 | InstPIFu | 10.90 | 7.55 | 32.44 | 22.09 | 8.13 | 45.82 | 10.29 | 1.29 | 47.31 | 24.65
 | SSR | 6.31 | 7.21 | 26.23 | 28.63 | 5.68 | 43.87 | 8.29 | 2.07 | 35.03 | 21.79
 | Ours | 6.05 | 6.92 | 25.51 | 27.73 | 5.52 | 42.10 | 7.98 | 1.93 | 34.12 | 20.83
F-Score ↑ (2.74%) | MGN | 34.69 | 28.42 | 35.67 | 65.36 | 51.15 | 17.05 | 57.16 | 52.04 | 10.41 | 36.20
 | LIEN | 37.13 | 15.51 | 25.70 | 26.01 | 49.71 | 21.16 | 5.85 | 59.46 | 11.04 | 31.45
 | InstPIFu | 54.99 | 62.26 | 35.30 | 47.30 | 56.54 | 37.51 | 64.24 | 94.62 | 27.03 | 45.62
 | SSR | 68.78 | 66.69 | 55.18 | 42.49 | 71.22 | 51.93 | 65.38 | 91.84 | 46.92 | 59.71
 | Ours | 69.45 | 67.12 | 56.23 | 43.78 | 72.04 | 53.87 | 66.05 | 92.30 | 48.31 | 61.35
NC ↑ (5.85%) | MGN | 0.737 | 0.592 | 0.525 | 0.633 | 0.756 | 0.794 | 0.531 | 0.809 | 0.563 | 0.659
 | LIEN | 0.706 | 0.514 | 0.591 | 0.581 | 0.775 | 0.619 | 0.506 | 0.844 | 0.481 | 0.646
 | InstPIFu | 0.782 | 0.646 | 0.547 | 0.758 | 0.753 | 0.796 | 0.639 | 0.951 | 0.580 | 0.683
 | SSR | 0.825 | 0.689 | 0.693 | 0.776 | 0.866 | 0.835 | 0.645 | 0.960 | 0.599 | 0.778
 | Ours | 0.831 | 0.696 | 0.702 | 0.781 | 0.871 | 0.842 | 0.652 | 0.965 | 0.610 | 0.791
Table 6. Comparison of parameter count and computational cost between SSR and DP-AMF.

Model | Params (M) | GFLOPs (G) | ΔParams (M) | ΔGFLOPs (G)
SSR | 36.29 | 147.44 | — | —
DP-AMF | 36.48 | 215.85 | +0.19 | +68.41
Table 7. Ablation study results on 3D-FRONT after 80 training epochs.

Depth Module | Global Encoder | CD ↓ | F-Score ↑ | NC ↑
MARIGOLD | DINO-ViT | 16.23 | 59.97 | 0.806
MARIGOLD | ViT | 17.87 (+1.64) | 59.11 (−0.86) | 0.789 (−0.017)
MARIGOLD | — | 19.00 (+2.77) | 59.24 (−0.73) | 0.767 (−0.036)
MiDaS | DINO-ViT | 17.48 (+1.25) | 59.85 (−0.12) | 0.791 (−0.015)
— | DINO-ViT | 21.08 (+4.85) | 56.22 (−3.75) | 0.778 (−0.028)
— | — | 24.15 (+7.92) | 54.99 (−4.98) | 0.770 (−0.036)
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
