From 2D to 3D: A Generative Model from Single Image to Digital 3D of Chinese Three Gorges Cultural Relics

Wu, Guang; Ge, Mingyuan; Wang, Yunxiang; Chen, Youhao; Liu, Li

doi:10.3390/app16062678

by Guang Wu ¹,², Mingyuan Ge ², Yunxiang Wang ², Youhao Chen ² and Li Liu ²,*

¹ Chongqing Cultural Relics and Archaeology Research Institute, Chongqing 400013, China
² School of Big Data & Software Engineering, Chongqing University, Chongqing 400044, China
* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(6), 2678; https://doi.org/10.3390/app16062678

Submission received: 31 January 2026 / Revised: 4 March 2026 / Accepted: 9 March 2026 / Published: 11 March 2026

Abstract

The acquisition of high-quality three-dimensional (3D) models of cultural relics often relies on expensive scanning equipment or multi-view image capture, which limits large-scale deployment in real-world heritage conservation scenarios. Large-scale water impoundment in the Three Gorges region has resulted in the permanent submergence of numerous cultural relics and archaeological remains. For many of these artifacts, only a single two-dimensional image remains as the sole visual record, posing significant challenges for reconstructing their original three-dimensional geometry and appearance. This limitation renders traditional multi-view reconstruction and physical scanning methods infeasible. To address this challenge, we propose a generative framework for reconstructing high-fidelity 3D digital models of Chinese Three Gorges cultural relics from a single two-dimensional (2D) image. Building upon recent advances in generative 3D representation learning, the proposed method adopts a transformer-based image-to-triplane architecture to infer an implicit 3D representation directly from a single RGB image. A vision transformer encoder is employed to extract global and local visual features, which are subsequently projected into a compact triplane representation through a cross-attention-based decoder. The reconstructed triplane features are further decoded by a neural radiance field (NeRF) to synthesize dense geometry and appearance, enabling accurate mesh extraction and novel-view rendering. To enhance robustness under in-the-wild conditions, the model implicitly estimates camera parameters during inference without relying on explicit calibration information. The proposed method is evaluated on a dataset of Chinese Three Gorges cultural relics, covering diverse artifact categories and visual styles. 
Experimental results demonstrate that the proposed framework produces structurally coherent and visually consistent 3D reconstructions from a single image, preserving key morphological characteristics of cultural relics under limited data conditions. Compared with existing single-image and multi-view reconstruction baselines, the proposed framework exhibits better reconstruction accuracy, visual consistency, and generalization capability. This study provides an efficient and scalable solution for the digital reconstruction of submerged heritage artifacts from archival images and contributes to the application of generative 3D modeling techniques in cultural heritage preservation and restoration.

Keywords: 3D reconstruction; single image; cultural relics; Three Gorges

1. Introduction

The digital documentation and reconstruction of cultural relics constitute a fundamental component of cultural heritage preservation, supporting long-term conservation, academic research, virtual exhibition, and public dissemination [1,2]. High-fidelity three-dimensional (3D) digital models enable accurate recording of geometric structure, surface details, and spatial relationships, thereby providing an irreplaceable digital surrogate for fragile or endangered artifacts [3,4]. In recent years, 3D digitization has become an essential tool in heritage science, facilitating not only visual presentation but also structural analysis, restoration planning, and cross-disciplinary studies.

However, the acquisition of high-quality 3D models in real-world heritage scenarios remains highly constrained. Conventional digitization techniques, such as laser scanning [5], structured-light scanning [6], or multi-view photogrammetry [7], typically require controlled environments, specialized equipment, and comprehensive image coverage from multiple viewpoints. These requirements significantly limit their applicability in large-scale heritage documentation, particularly for artifacts that are no longer physically accessible or whose acquisition conditions cannot be replicated [8,9]. As a result, many culturally significant relics lack accurate 3D digital representations, leaving critical historical and structural information insufficiently preserved.

In recent years, with the rapid advancement of NeRF representations in 3D modeling, numerous related techniques have been applied to cultural relic digitization. Figure 1 compares the main families of existing approaches.

As illustrated in Figure 1a, implicit neural scene representations model a 3D scene as a continuous function parameterized by coordinate-based neural networks. These methods, including neural radiance fields and related coordinate networks, are highly expressive and memory-efficient, as they avoid storing dense volumetric grids and instead infer geometry and appearance through function evaluation. Such implicit formulations are well suited for representing complex geometries and fine details. However, in practice, they rely on deep multilayer perceptrons that must be fully evaluated for each spatial query, leading to high computational cost and limited rendering efficiency, especially at high resolutions.

In contrast, explicit 3D representations, exemplified by discrete voxel grids as shown in Figure 1b, store scene information directly in a structured volumetric form. These representations allow fast random access and efficient evaluation during rendering or surface extraction. Nevertheless, their memory consumption grows cubically with spatial resolution, making them difficult to scale to high-fidelity reconstructions or large scenes. This trade-off between computational efficiency and memory usage fundamentally limits purely explicit approaches in practical applications.

Building on this line of research, tri-plane representations have emerged as a particularly effective hybrid design. Instead of maintaining dense volumetric grids, features are explicitly stored on three orthogonal axis-aligned planes, dramatically reducing memory overhead. During rendering, a lightweight implicit decoder aggregates features sampled from these planes to predict density and appearance at arbitrary 3D locations. This design preserves the expressiveness of implicit models while enabling fast evaluation and efficient training, making it especially suitable for high-quality 3D reconstruction and generative modeling.
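The memory trade-off described above can be made concrete with a back-of-the-envelope count of stored feature values. The sketch below is illustrative only: the function names are hypothetical, and the resolution of 64 with 40 channels is borrowed from the upsampled tri-plane shape listed in Table 1 rather than from any memory measurement in the paper.

```python
def voxel_feature_count(resolution: int, channels: int) -> int:
    """Number of feature values in a dense voxel grid: R^3 * C."""
    return resolution ** 3 * channels


def triplane_feature_count(resolution: int, channels: int) -> int:
    """Number of feature values in three axis-aligned planes: 3 * R^2 * C."""
    return 3 * resolution ** 2 * channels


if __name__ == "__main__":
    # Illustrative values only; these are not memory figures reported by the authors.
    resolution, channels = 64, 40
    voxels = voxel_feature_count(resolution, channels)
    planes = triplane_feature_count(resolution, channels)
    print(f"voxel grid: {voxels} values, tri-plane: {planes} values")
    print(f"ratio: {voxels / planes:.1f}x")
```

The ratio equals R/3, so the saving of the tri-plane over a dense grid grows linearly with the per-axis resolution.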

Beyond the limitations in representation mentioned above, applying 3D generation technology to specific Three Gorges archaeological artifacts also faces the following challenges.

First, many cultural relics from the Three Gorges region have been permanently submerged due to large-scale water impoundment projects. Numerous artifacts and archaeological remains can no longer be physically accessed, scanned, or re-photographed. For a large portion of these relics, only a single two-dimensional archival image survives as the sole visual record. This makes conventional multi-view reconstruction pipelines and active 3D scanning techniques fundamentally infeasible.

Second, single-image-based reconstruction is intrinsically ill-posed. A single RGB image lacks explicit depth cues, camera parameters, and multi-view geometric constraints. Critical information such as occluded structures, backside geometry, and true scale is inherently missing, leading to severe ambiguity in 3D shape inference [10,11]. Existing reconstruction methods often suffer from distorted geometry, incomplete surfaces, or inconsistent global structures under such conditions.

Third, cultural relic imagery typically exhibits large variations in appearance and quality. Many archival images contain complex backgrounds, uneven illumination, occlusions, and degradation artifacts caused by aging, digitization, or environmental factors [12,13]. These challenges further complicate the separation of foreground relic structures from irrelevant background content, significantly reducing reconstruction robustness and generalization ability.

Motivated by these challenges, this work aims to explore whether a plausible and structurally coherent 3D representation can be recovered from a single unconstrained image, without relying on explicit camera calibration or multi-view supervision. As illustrated in Figure 1c, recent advances in generative 3D representation learning provide a promising direction for addressing this problem by modeling shape and appearance implicitly in a compact latent space [14,15]. Hybrid representations explicitly store intermediate features in a compact, structured form while employing lightweight implicit decoders to aggregate these features for continuous querying and rendering. Local implicit representations and plane-based hybrid models fall into this category, offering improved scalability and efficiency.

In particular, transformer-based architectures enable effective extraction of global and local visual features from a single image [16,17], while implicit volumetric representations allow continuous modeling of geometry and appearance [18,19]. By integrating these techniques, it becomes possible to infer dense 3D structures from sparse 2D observations, offering a practical pathway for large-scale digital restoration of cultural relics from archival imagery.

Based on the above motivations, this paper makes the following contributions:

We present a generative single-image-to-3D reconstruction framework tailored for cultural relic digitization, enabling the recovery of structurally coherent 3D models from a single RGB image under unconstrained conditions.
We adopt a transformer-based image-to-triplane representation that effectively captures both global structure and fine-grained visual details, and decode it into an implicit volumetric representation for high-quality geometry synthesis.
We demonstrate the effectiveness of the proposed framework on a dataset of Chinese Three Gorges cultural relics, showing superior reconstruction accuracy, surface completeness, and visual consistency compared with existing single-image and multi-view baselines.
This study provides a scalable and practical solution for heritage digitization from limited visual data, contributing to the broader application of generative 3D modeling techniques in cultural heritage preservation and digital restoration.

2. Related Work

2.1. Neural Implicit Representations for 3D Reconstruction

The paradigm of representing 3D scenes through neural implicit functions has fundamentally transformed computer vision and graphics. Occupancy Networks [20] and DeepSDF [15] pioneered the use of continuous neural networks to encode 3D geometry, enabling resolution-independent surface modeling. These approaches represent shapes as decision boundaries of neural classifiers, offering better memory efficiency than explicit voxel grids. However, they primarily focus on geometry alone without appearance modeling. Neural Radiance Fields (NeRF) [18] extended this paradigm by encoding both density and view-dependent color within a single MLP, enabling photorealistic novel view synthesis through differentiable volume rendering. This seminal work sparked extensive follow-up research addressing various limitations. Mip-NeRF [21] and Mip-NeRF 360 [22] improved anti-aliasing and unbounded scene handling, while BARF [23] and NeRF-- [24] relaxed the requirement of known camera parameters through joint pose optimization. Ref-NeRF [25] enhanced modeling of specular reflections, and NeRF in the Wild [26] addressed appearance variations in unconstrained photo collections. Despite these advances, vanilla NeRF requires multiple views for training, limiting direct applicability to single-image reconstruction. Recent surveys by Tewari et al. [27] and Xiao et al. [28] comprehensively review the evolution of neural rendering, highlighting the transition toward efficiency and generalization. Our work builds upon these foundations but addresses the unique challenge of single-image reconstruction without multi-view supervision.

2.2. Single-Image 3D Reconstruction: From Optimization to Feed-Forward Models

Single-image 3D reconstruction has progressed through distinct methodological phases. Early approaches relied on shape-from-shading [29], template-based fitting [30], or category-specific deformable models [31], achieving limited generalization across object categories. The advent of deep learning enabled data-driven shape prediction through encoder-decoder architectures, yet these methods typically produced coarse voxel grids [32] or point clouds [33] lacking fine surface detail. The emergence of large-scale 3D datasets catalyzed a shift toward feed-forward reconstruction models. Pix2Vox [34] and 3D-R2N2 [32] demonstrated voxel prediction from single images using recurrent networks, though resolution remained constrained by memory limitations. Point-based methods like PSGN [33] offered greater flexibility but required post-processing for mesh extraction. More recently, the convergence of implicit representations and transformer architectures has enabled unprecedented reconstruction quality. LRM [35] demonstrated that transformers trained on million-scale datasets can directly predict triplane representations from single images, achieving geometric fidelity previously requiring per-shape optimization. This paradigm shift was rapidly extended: TripoSR [36] improved efficiency through distillation, Instant3D [37] integrated multi-view diffusion for enhanced quality, and ZeroShape [38] explored zero-shot generalization. Concurrently, CRM [39] and MeshLRM [40] replaced volumetric rendering with direct mesh extraction using differentiable iso-surfacing, accelerating inference while maintaining quality. However, these methods predominantly target synthetic object-centric datasets such as Objaverse [41] and ABO [42]. Their performance degrades significantly on real-world archival imagery with complex backgrounds, illumination variations, and imaging artifacts—precisely the conditions characteristic of cultural heritage documentation [43].

2.3. Triplane and Hybrid 3D Representations

The tension between implicit and explicit 3D representations has motivated hybrid approaches combining complementary strengths. Voxel-based methods [44] offer efficient random access but suffer cubic memory scaling, while pure implicit functions [14] enable infinite resolution yet require expensive network evaluations. Triplane representations, introduced by Chan et al. [45], resolve this tension through factorized 3D feature grids. By decomposing volumetric features into three orthogonal 2D planes, triplanes achieve $O(N^{2})$ memory complexity compared to $O(N^{3})$ for voxels, where $N$ is the per-axis resolution, while preserving high spatial resolution. This representation has become foundational for efficient 3D generative modeling, as demonstrated in EG3D [45] for face synthesis and StyleNeRF [46] for general objects. In the context of reconstruction, triplanes enable fast feature sampling through bilinear interpolation followed by lightweight MLP decoding. Recent innovations explore triplane variants: TriplaneGaussian [47] hybridizes triplanes with 3D Gaussian splatting [48] for real-time rendering; volume-triplane combinations enhance expressiveness through progressive refinement; and multi-scale triplanes capture hierarchical geometric detail. Despite these advances, triplane-based methods remain sensitive to training distribution and may produce artifacts when applied to out-of-domain imagery [49]. Our work adopts triplane representations for their efficiency but addresses robustness limitations through improved feature extraction and camera-agnostic inference, critical for handling diverse cultural heritage imagery.

2.4. Vision Transformers and Self-Supervised Learning

The architectural backbone for visual understanding has undergone a fundamental transformation with the rise of Vision Transformers (ViT) [17]. Unlike convolutional networks with inductive biases for local spatial coherence, transformers employ self-attention to capture global relationships, proving particularly effective for high-level semantic tasks [50,51]. For 3D reconstruction, the quality of visual features critically determines geometric fidelity. Supervised pre-training on ImageNet [52] provides generic visual representations but may lack the granularity needed for precise shape inference. Self-supervised learning offers an alternative paradigm: DINO [53] demonstrates that discriminative training without labels yields features capturing object parts and semantic correspondence. DINOv2 [54] scales this approach to larger models and datasets, producing robust visual features transferable to diverse downstream tasks. DINOv3 [55] further advances self-supervised vision learning by scaling and stabilizing the self-distillation framework of DINO [53] and DINOv2 [54], producing more robust, transferable, and spatially consistent visual representations across diverse downstream tasks. The integration of DINO-pretrained encoders into reconstruction pipelines has shown remarkable effectiveness. LRM [35] leverages DINO features for triplane decoding, while subsequent methods [56] explore cross-attention mechanisms for fusing image and 3D representations. However, the application to degraded, low-quality archival images—common in cultural heritage—remains underexplored. Our work builds upon DINO’s robust feature learning but adapts the architecture for handling challenging real-world imagery through implicit camera estimation and background-robust training.

2.5. Digital Documentation and Traditional Reconstruction

3D reconstruction serves as the technical foundation for digital cultural heritage (DCH), enabling “digital twinning” for conservation and analysis [57,58]. Early efforts primarily relied on terrestrial laser scanning (TLS) to capture high-precision point clouds of large-scale sites. To achieve better texture fidelity, multi-view photogrammetry based on Structure-from-Motion (SfM) and Multi-View Stereo (MVS) became the industry standard [59]. Recent studies have further explored the integration of multi-modal sensors (e.g., combining LiDAR with thermal or hyperspectral imaging) to document internal structures and material decay. However, these methods are resource-intensive, requiring controlled lighting and hundreds of images, which are often unavailable for lost or dispersed artifacts.

3. Methodology

3.1. Task Definition and Motivation

The task addressed in this study is the reconstruction of a three-dimensional digital representation of a cultural relic from a single two-dimensional image. Specifically, given a single RGB image $I \in \mathbb{R}^{H \times W \times 3}$, the objective is to infer a 3D representation that faithfully captures the geometric structure and visual characteristics of the depicted artifact.

Formally, the task can be described as learning a mapping function $F : I \to S$, where $S$ denotes an implicit 3D representation of the cultural relic, parameterized as a volumetric field that encodes spatial density and appearance information. The representation $S$ is subsequently converted into an explicit surface mesh $M = \Psi(S)$, where $\Psi(\cdot)$ denotes a surface extraction operator, such as marching cubes.

Unlike conventional 3D reconstruction tasks, this problem assumes that only a single image is available and that camera parameters, including intrinsic and extrinsic information, are unknown. Consequently, the reconstruction problem is severely ill-posed, as depth, shape, and viewpoint must be inferred from limited visual evidence. The task, therefore, requires the model to rely on learned geometric priors and generative reasoning to produce a plausible and structurally consistent 3D reconstruction.

The final output is an explicit 3D mesh $M$, optionally accompanied by a texture representation, which enables visualization, analysis, and long-term digital preservation of cultural relics that are no longer physically accessible. The workflow of our proposed method is illustrated in Figure 2.

3.2. Implicit 3D Representation

We model the artifact using a continuous neural radiance field (NeRF) [18], which represents the object as a function:

$$F : (\mathbf{x}, \mathbf{d}) \to (\sigma, \mathbf{c}), \tag{1}$$

where $\mathbf{x} \in \mathbb{R}^{3}$ denotes a 3D spatial coordinate, $\mathbf{d} \in \mathbb{S}^{2}$ is the viewing direction, $\sigma \in \mathbb{R}^{+}$ represents the volumetric density, and $\mathbf{c} \in \mathbb{R}^{3}$ is the emitted RGB color.

This implicit formulation enables continuous modeling of complex artifact surfaces and avoids the resolution limitations inherent to explicit voxel-based representations.

3.3. Image Encoder

To extract robust visual features from the input image, we employ an image encoder $E_{\theta}$ based on a Vision Transformer (ViT) [17] architecture pre-trained using DINO [53] self-supervised learning.

The encoder maps the input image to a sequence of latent tokens:

$$Z = E_{\theta}(I), \quad Z \in \mathbb{R}^{N \times D}, \tag{2}$$

where $N$ denotes the number of tokens and $D$ is the embedding dimension.

The use of a transformer-based encoder allows the model to capture long-range dependencies and global semantic structure, which is particularly beneficial for reconstructing large-scale shapes and structural motifs commonly found in cultural heritage artifacts.
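As a small worked example of the token count $N$, the sketch below assumes the standard ViT tokenization into non-overlapping square patches and uses the image resolution (512) and patch size (16) listed in Table 1. Whether an additional class token is included is not stated in the text, so it is left out here; the function name is hypothetical.

```python
def vit_patch_token_count(image_size: int, patch_size: int) -> int:
    """Number of patch tokens for a square image split into non-overlapping patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side


if __name__ == "__main__":
    # 512 x 512 input with 16 x 16 patches (values from Table 1); class token not counted.
    print(vit_patch_token_count(512, 16))
```

Under these assumptions each image yields a 32 by 32 grid of patch tokens, each with the embedding dimension $D = 768$ reported as the feature channel width in Table 1.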

3.4. Image-to-Triplane Decoder

To efficiently bridge 2D image representations and 3D geometry, we employ a triplane representation consisting of three orthogonal feature planes:

$$T = \{P_{xy}, P_{yz}, P_{xz}\}, \tag{3}$$

where each plane $P \in \mathbb{R}^{R \times R \times C}$ encodes spatial features along a pair of coordinate axes.

An image-to-triplane decoder $D_{\phi}$ projects the image tokens $Z$ into the triplane feature space:

$$T = D_{\phi}(Z). \tag{4}$$

The decoder is implemented as a stack of transformer layers, each composed of a self-attention module and a cross-attention module. The self-attention mechanism models interactions within the triplane feature space, while the cross-attention mechanism injects semantic and structural information from the image tokens into the 3D representation.
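The cross-attention step, in which triplane tokens query the image tokens, can be sketched as plain scaled dot-product attention. This is a deliberately simplified illustration and not the authors' layer: it uses a single head, omits the learned query, key, and value projections, residual connections, and normalization, and all names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(
    queries: List[Vector], keys: List[Vector], values: List[Vector]
) -> List[Vector]:
    """Single-head scaled dot-product attention without learned projections.

    queries: triplane tokens; keys/values: image tokens (both assumed to share
    the key dimension). Each output row is a convex combination of the values.
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    value_dim = len(values[0])
    outputs = []
    for query in queries:
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[c] for w, v in zip(weights, values)) for c in range(value_dim)]
        )
    return outputs


if __name__ == "__main__":
    triplane_tokens = [[1.0, 0.0], [0.0, 1.0]]           # toy queries
    image_tokens = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]  # toy keys, reused as values
    for row in cross_attention(triplane_tokens, image_tokens, image_tokens):
        print(row)
```

In the self-attention module the same operation would be applied with the triplane tokens serving as queries, keys, and values.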

3.5. Triplane Feature Sampling

Given a 3D query point $\mathbf{x} = (x, y, z)$, its feature representation is obtained by bilinearly sampling the corresponding coordinates on each triplane:

$$f(\mathbf{x}) = \mathrm{Concat}\left(P_{xy}(x, y), P_{yz}(y, z), P_{xz}(x, z)\right), \tag{5}$$

where $f(\mathbf{x}) \in \mathbb{R}^{3C}$ denotes the aggregated feature vector.

This factorized representation significantly reduces memory consumption while preserving sufficient spatial expressiveness to model intricate surface details, such as engravings and relief patterns.
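The sampling in Eq. (5) can be sketched with nested lists standing in for the feature planes. The sketch assumes query coordinates normalized to [-1, 1] and a corner-aligned mapping onto the pixel grid; neither convention is specified in the text, and the function names are hypothetical.

```python
from typing import List, Sequence

Plane = Sequence[Sequence[Sequence[float]]]  # plane[u][v] is a C-dimensional feature


def bilinear_sample(plane: Plane, a: float, b: float) -> List[float]:
    """Bilinearly interpolate a square plane at normalized coordinates (a, b) in [-1, 1]."""
    res = len(plane)
    if res < 2 or any(len(row) != res for row in plane):
        raise ValueError("plane must be square with resolution >= 2")
    # Map [-1, 1] to [0, res - 1] (corner-aligned convention, an assumption) and clamp.
    u = min(max((a + 1.0) * 0.5 * (res - 1), 0.0), res - 1.0)
    v = min(max((b + 1.0) * 0.5 * (res - 1), 0.0), res - 1.0)
    u0 = min(int(u), res - 2)
    v0 = min(int(v), res - 2)
    du, dv = u - u0, v - v0
    channels = len(plane[0][0])
    return [
        (1 - du) * (1 - dv) * plane[u0][v0][c]
        + du * (1 - dv) * plane[u0 + 1][v0][c]
        + (1 - du) * dv * plane[u0][v0 + 1][c]
        + du * dv * plane[u0 + 1][v0 + 1][c]
        for c in range(channels)
    ]


def triplane_feature(
    p_xy: Plane, p_yz: Plane, p_xz: Plane, point: Sequence[float]
) -> List[float]:
    """Concatenate the three sampled features into a vector of length 3C."""
    x, y, z = point
    return (
        bilinear_sample(p_xy, x, y)
        + bilinear_sample(p_yz, y, z)
        + bilinear_sample(p_xz, x, z)
    )


if __name__ == "__main__":
    ramp = [[[0.0], [1.0]], [[2.0], [3.0]]]  # toy 2 x 2 plane with C = 1
    print(triplane_feature(ramp, ramp, ramp, (0.0, 0.0, 0.0)))
```

Each query touches only four texels per plane, which is why the per-point cost stays constant as the plane resolution grows.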

3.6. Radiance Field Prediction

The aggregated feature $f(\mathbf{x})$ and the viewing direction $\mathbf{d}$ are fed into a multilayer perceptron $G_{\psi}$:

$$(\sigma, \mathbf{c}) = G_{\psi}(f(\mathbf{x}), \mathbf{d}), \tag{6}$$

which predicts volumetric density and color.

This formulation decouples spatial feature learning from view-dependent appearance modeling, enabling accurate reconstruction under varying viewpoints.
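The text does not spell out how the predicted densities and colors are turned into pixel values, so the sketch below shows the standard NeRF quadrature rule [18] as an assumption about the renderer: each sample contributes its color weighted by its opacity and by the transmittance accumulated in front of it. The function name and argument layout are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def composite_ray(
    densities: Sequence[float],
    colors: Sequence[Sequence[float]],
    deltas: Sequence[float],
) -> Tuple[List[float], float]:
    """Alpha-composite samples along one ray, ordered from near to far.

    densities: sigma_i >= 0 at each sample
    colors:    RGB triple c_i at each sample
    deltas:    distance between consecutive samples
    Returns the accumulated RGB color and the accumulated opacity.
    """
    transmittance = 1.0  # fraction of light not yet absorbed
    rgb = [0.0, 0.0, 0.0]
    for sigma, color, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weight = transmittance * alpha
        for k in range(3):
            rgb[k] += weight * color[k]
        transmittance *= 1.0 - alpha
    return rgb, 1.0 - transmittance


if __name__ == "__main__":
    # Toy ray: empty space followed by a dense red sample.
    rgb, opacity = composite_ray(
        [0.0, 50.0], [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)], [0.1, 0.1]
    )
    print(rgb, opacity)
```

With the 128 samples per ray listed in Table 1, the three input sequences would each have 128 entries per pixel.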

3.7. Camera-Agnostic Rendering

Unlike conventional NeRF-based approaches that require known camera metadata, our framework employs a learned camera estimation branch to achieve camera-agnostic inference.

We integrate a lightweight MLP-based head $\Phi_{cam}$ that operates on the global latent representation $z$ extracted by the ViT encoder. This branch directly regresses the camera parameters $v = \{f, R, t\}$, where $f$ is the focal length and $(R, t)$ denote the camera pose.

During the training phase, the camera branch is optimized jointly with the triplane generator. The predicted camera parameters $v$ are used by the differentiable renderer $\mathcal{R}$ to generate the reconstructed image $\hat{I}$. The network is supervised by minimizing the rendering loss $\mathcal{L}_{render}$ between the ground-truth image $I$ and the predicted rendering:

$$\mathcal{L}_{render} = \frac{1}{HW} \sum_{i, j} \left\| \mathcal{R}\left(S, \Phi_{cam}(z)\right)_{ij} - I_{ij} \right\|_{1}. \tag{7}$$

To ensure stability under single-view constraints, we do not optimize camera parameters during inference. Instead, the fixed weights of $\Phi_{cam}$ provide a deterministic, feed-forward mapping from image features to camera space. This design allows for robust 3D reconstruction from historical archival images even when their original focal lengths or viewpoints are entirely undocumented.
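For reference, the rendering loss in Eq. (7) amounts to a per-pixel L1 norm over color channels, averaged over the H by W pixels. The sketch below evaluates it on nested lists; in practice the rendered image would come from the differentiable renderer, which is replaced here by a plain input argument, and the function name is hypothetical.

```python
from typing import Sequence

Image = Sequence[Sequence[Sequence[float]]]  # image[i][j] is an RGB triple


def l1_render_loss(rendered: Image, target: Image) -> float:
    """Mean over pixels of the channel-wise L1 difference, following Eq. (7)."""
    height, width = len(target), len(target[0])
    if len(rendered) != height or any(len(row) != width for row in rendered):
        raise ValueError("rendered and target images must have the same size")
    total = 0.0
    for rendered_row, target_row in zip(rendered, target):
        for rendered_px, target_px in zip(rendered_row, target_row):
            total += sum(abs(r - t) for r, t in zip(rendered_px, target_px))
    return total / (height * width)


if __name__ == "__main__":
    target = [[(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]]    # 1 x 2 black image
    rendered = [[(1.0, 1.0, 1.0), (0.0, 0.0, 0.0)]]  # one white pixel
    print(l1_render_loss(rendered, target))
```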

3.8. Surface Extraction and Texture Generation

To obtain an explicit geometric representation, an isosurface is extracted from the predicted density field using the marching cubes algorithm $\Psi$:

$$M = \Psi(\sigma(\mathbf{x}), \tau), \tag{8}$$

where $\tau$ denotes a density threshold.

Optionally, a texture atlas is generated by projecting radiance field colors onto the extracted mesh, $T = \mathrm{Bake}(M, F)$, yielding a textured mesh suitable for visualization, structural analysis, and digital preservation.

4. Experimental Results

4.1. Dataset and Setup

Dataset. We compiled a total of 3302 artifacts from six Three Gorges archaeological sites, including the Daxi Site, Zhongba Site, Xiaotianxi Site, Baidicheng Site, and Linjiang No. 2 Team Zinc Smelting Site. The artifact types include porcelain, pottery, bronze ware, jade objects, stone carvings, and pottery figurines, among others. Since all artifacts were photographed during rescue excavations and no complete physical specimens remain, the ground-truth 3D shapes for most artifacts were synthesized and rendered based on the recollections of the corresponding site archaeologists. Although these are not original 3D data of the artifacts, their accuracy has been jointly verified by archaeologists from the Chongqing Institute of Cultural Relics and Archaeology and the original excavation teams at each site. We therefore use these data as the ground truth. The dataset was divided into three subsets: 2500 cultural relic instances for training, 500 for validation, and the remaining 302 for testing.

Multi-view rendering setup. For all experiments, we render multi-view images using a set of uniformly sampled virtual cameras placed on a circular trajectory around the object. Specifically, we sample $N = 30$ camera positions evenly distributed along a horizontal orbit with a fixed radius $R$, while keeping the camera elevation constant. All cameras are oriented to look at the object center. Formally, the camera position $c_{i} \in \mathbb{R}^{3}$ for the $i$-th view is defined as:

$$c_{i} = \left(R \cos\left(\frac{2 \pi i}{N}\right), R \sin\left(\frac{2 \pi i}{N}\right), 0\right), \tag{9}$$

where $i = 0, 1, \dots, N - 1$ and $R = 1.9$ denotes the camera radius. The camera orientation is defined such that the optical axis always points toward the scene origin, with a fixed up-vector aligned with the global z-axis.
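The camera placement in Eq. (9) can be reproduced with a few lines of code. The sketch below uses the stated values $N = 30$ and $R = 1.9$; the function names are hypothetical, and only the viewing direction toward the origin is computed, not a full camera matrix.

```python
import math
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


def orbit_camera_positions(num_views: int = 30, radius: float = 1.9) -> List[Vec3]:
    """Camera centers evenly spaced on a horizontal circle around the origin, as in Eq. (9)."""
    return [
        (
            radius * math.cos(2.0 * math.pi * i / num_views),
            radius * math.sin(2.0 * math.pi * i / num_views),
            0.0,
        )
        for i in range(num_views)
    ]


def forward_to_origin(position: Vec3) -> Vec3:
    """Unit vector along the optical axis, pointing from the camera to the scene origin."""
    norm = math.sqrt(sum(p * p for p in position))
    return tuple(-p / norm for p in position)


if __name__ == "__main__":
    cameras = orbit_camera_positions()
    for index in (0, 5, 10, 15, 20, 25, 29):
        print(index, cameras[index], forward_to_origin(cameras[index]))
```

Consecutive views are separated by 360/30 = 12 degrees of azimuth, and every camera lies in the plane z = 0.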

Evaluation protocol. To obtain explicit surface representations, we apply the Marching Cubes algorithm [60] to extract meshes from the learned implicit 3D fields. We evaluate the reconstruction performance from three complementary perspectives: geometric accuracy, surface completeness, and computational efficiency. Chamfer Distance (CD) is adopted to measure the geometric discrepancy between the predicted and ground-truth shapes, reflecting the accuracy of global shape reconstruction. A lower CD indicates better alignment of overall geometry. The F-score (FS) is used to assess surface-level consistency by jointly considering precision and recall under different distance thresholds. Higher F-score values imply improved surface completeness and structural integrity, which are critical for preserving fine morphological characteristics of cultural artifacts. FS@0.1 evaluates fine-grained surface accuracy under a strict tolerance, emphasizing local geometric details. FS@0.2 focuses on structural consistency by allowing moderate deviations, while FS@0.5 measures global shape completeness under a relaxed threshold. Together, these metrics provide a comprehensive assessment of surface fidelity from detailed to coarse levels. In addition, inference time is reported to evaluate computational efficiency. This metric reflects the practical applicability of the method for large-scale digital restoration and archival scenarios.
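A minimal sketch of the two geometric metrics on sampled point sets follows. Conventions for both metrics vary across papers (squared versus unsquared distances, sums versus means, strict versus inclusive thresholds), and the text does not state which variant is used, so the choices below (symmetric mean of unsquared nearest-neighbor distances, inclusive threshold) are assumptions. The brute-force search is only suitable for small point sets.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def nearest_distances(source: Sequence[Point], target: Sequence[Point]) -> List[float]:
    """Distance from each source point to its nearest neighbor in target (brute force)."""
    return [min(math.dist(p, q) for q in target) for p in source]


def chamfer_distance(pred: Sequence[Point], gt: Sequence[Point]) -> float:
    """Symmetric Chamfer Distance: mean pred-to-gt plus mean gt-to-pred distance."""
    pred_to_gt = nearest_distances(pred, gt)
    gt_to_pred = nearest_distances(gt, pred)
    return sum(pred_to_gt) / len(pred_to_gt) + sum(gt_to_pred) / len(gt_to_pred)


def f_score(pred: Sequence[Point], gt: Sequence[Point], threshold: float) -> float:
    """Harmonic mean of precision and recall at a given distance threshold."""
    pred_to_gt = nearest_distances(pred, gt)
    gt_to_pred = nearest_distances(gt, pred)
    precision = sum(d <= threshold for d in pred_to_gt) / len(pred_to_gt)
    recall = sum(d <= threshold for d in gt_to_pred) / len(gt_to_pred)
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


if __name__ == "__main__":
    gt_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    pred_points = [(0.0, 0.0, 0.0)]  # misses half of the ground-truth shape
    print(chamfer_distance(pred_points, gt_points))
    for tau in (0.1, 0.2, 0.5):
        print(tau, f_score(pred_points, gt_points, tau))
```

In the toy example the prediction is perfectly precise but recovers only half of the ground-truth points, which the F-score penalizes through recall.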

Computational Efficiency and Resource Consumption. To evaluate the scalability of the proposed framework for large-scale archival deployment, we provide a detailed analysis of the computational resources. The model was trained on NVIDIA RTX 4090 GPUs. For a single input image, the entire pipeline, from triplane generation to mesh extraction via Marching Cubes, takes approximately 1.5 to 2.0 s on an NVIDIA RTX 3090. This is significantly faster than optimization-based methods (e.g., DreamFusion), which typically require 30–60 min per object. The peak GPU memory consumption during inference is approximately 6.4 GB, making the model deployable on consumer-grade hardware. The Image-to-Triplane architecture also allows for batch processing. For a large-scale repository of 1000 relic images, the total reconstruction time is under 35 min, demonstrating high throughput for institutional digital archiving.
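The reported per-image latency and the repository-level figure can be cross-checked with simple arithmetic. The sketch assumes strictly sequential processing with no I/O or batching overhead, which is a simplification rather than a description of the authors' pipeline; the function name is hypothetical.

```python
def sequential_minutes(num_images: int, seconds_per_image: float) -> float:
    """Total wall-clock minutes when images are processed one after another."""
    return num_images * seconds_per_image / 60.0


if __name__ == "__main__":
    # Reported latency range: 1.5 to 2.0 s per image.
    fastest = sequential_minutes(1000, 1.5)
    slowest = sequential_minutes(1000, 2.0)
    print(f"1000 images: {fastest:.1f} to {slowest:.1f} min")
```

Under this assumption, 1000 images take between 25 and roughly 33 minutes, which is consistent with the stated bound of 35 minutes.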

Setup details. Our framework is designed for high-throughput archiving; the Image-to-Triplane feed-forward process avoids the costly per-shape optimization loop. The peak VRAM usage (6.4 GB) and total throughput (approx. 1000 artifacts per 35 min on a single GPU) reported above support its scalability for museum-scale deployment. To ensure model reproducibility, we detail the architecture’s key parameters, including the number of Transformer layers, dimensions of the triplane, specific NeRF model settings, and primary training configurations, as shown in Table 1.

Table 1. Model configuration of Algorithm 1.

Module	Parameter	Value
Image Tokenizer	image resolution	$512 \times 512$
	patch size	16
	attention layers	12
	feature channels	768
Triplane Tokenizer	tokens	$32 \times 32 \times 3$
	channels	16
Backbone	channels	1024
	attention layers	16
	attention heads	16
	attention head dim	64
	cross attention dim	768
Triplane Upsampler	factor	2
	input channels	1024
	output channels	40
	output shape	$64 \times 64 \times 40$
NeRF MLP	width	64
	layers	10
	activation	SiLU
Renderer	samples per ray	128
	radius	0.87
	density activation	exp
	density bias	−1.0
Training	learning rate	$4 \times 10^{-4}$
	optimizer	AdamW
	weight decay	0.05
	lr scheduler	CosineAnnealingLR
	warm-up steps	2000
	batch size	64
	total epochs	40
	training time	∼46 h
	total params	∼438 M
	FLOPs	∼156 G
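Several entries in Table 1 are related by simple arithmetic, which gives a quick sanity check when re-implementing the configuration. The relationships below are inferred from the table rather than stated by the authors; in particular, reading the token shape as three planes of 32 by 32 tokens is an assumption, and the dictionary keys are hypothetical names.

```python
from typing import Dict

# Values copied from Table 1; key names are illustrative.
CONFIG = {
    "image_feature_channels": 768,
    "triplane_token_grid": 32,   # tokens per plane side (assumed reading of 32 x 32 x 3)
    "backbone_channels": 1024,
    "attention_heads": 16,
    "attention_head_dim": 64,
    "cross_attention_dim": 768,
    "upsample_factor": 2,
    "upsampled_shape": (64, 64, 40),
}


def derived_quantities(cfg: Dict) -> Dict[str, int]:
    """Quantities implied by the configuration under the assumptions above."""
    return {
        "triplane_tokens": 3 * cfg["triplane_token_grid"] ** 2,
        "heads_times_head_dim": cfg["attention_heads"] * cfg["attention_head_dim"],
        "upsampled_plane_side": cfg["triplane_token_grid"] * cfg["upsample_factor"],
    }


if __name__ == "__main__":
    derived = derived_quantities(CONFIG)
    print(derived)
    # Head count times head width matches the backbone channel width.
    print(derived["heads_times_head_dim"] == CONFIG["backbone_channels"])
    # Upsampling the 32 x 32 token grid by 2 matches the 64 x 64 output planes.
    print(derived["upsampled_plane_side"] == CONFIG["upsampled_shape"][0])
    # Cross-attention width matches the image feature channels it attends to.
    print(CONFIG["cross_attention_dim"] == CONFIG["image_feature_channels"])
```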

Algorithm 1: From 2D to 3D: A Generative Model from Single Image to Digital 3D of Chinese Three Gorges Cultural Relics

4.2. Qualitative Evaluation

Qualitative comparisons are conducted between the proposed method and the representative baseline LRM under multiple viewing angles, with the corresponding results presented in Figure 3 and Figure 4. These visualizations aim to evaluate geometric consistency, structural stability, and surface appearance of the reconstructed 3D models from single-view inputs.

Figure 3 focuses on the consistency of reconstructed geometry across uniformly sampled viewpoints ($i = 0, 5, 10, 15, 20, 25, 29$). As observed, the proposed method maintains stable object proportions, coherent silhouettes, and consistent axial symmetry throughout viewpoint changes, including regions that are not directly visible in the input image. In contrast, the baseline method exhibits noticeable geometric inconsistencies when rendered from unseen viewpoints, such as irregular curvature, varying thickness, and localized structural degradation. These artifacts indicate limited generalization of the baseline approach under view extrapolation.
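
To illustrate how such view indices might map to camera poses, the sketch below assumes that the indices 0 to 29 correspond to 30 azimuths evenly spaced over a full turn on a circular orbit. The number of views, the orbit radius, and the elevation are assumptions made for illustration; the text states only that the viewpoints are uniformly sampled.

```python
import math

# View indices shown in Figure 3.
SHOWN_VIEWS = [0, 5, 10, 15, 20, 25, 29]


def azimuth_deg(i: int, n_views: int = 30) -> float:
    """Azimuth of view i, assuming n_views evenly spaced over 360 degrees."""
    return 360.0 * i / n_views


def orbit_camera(i: int, n_views: int = 30, radius: float = 2.0,
                 elevation_deg: float = 0.0) -> tuple:
    """Camera position on a circular orbit around the origin (z axis up)."""
    az = math.radians(azimuth_deg(i, n_views))
    el = math.radians(elevation_deg)
    x = radius * math.cos(el) * math.cos(az)
    y = radius * math.cos(el) * math.sin(az)
    z = radius * math.sin(el)
    return (x, y, z)


if __name__ == "__main__":
    for i in SHOWN_VIEWS:
        print(i, azimuth_deg(i), orbit_camera(i))
```

Under this assumption, view 15 faces the object from directly behind the input view, which is where single-image methods must rely most heavily on learned priors.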

Figure 4 provides additional qualitative evidence by highlighting reconstruction performance on artifacts with more complex geometries. For objects featuring narrow necks, protruding rims, or concave structures, the proposed method preserves structural continuity and surface smoothness across rotations. The reconstructed surfaces remain visually stable, whereas the baseline results often contain fragmented geometry and abrupt surface transitions. Such issues become more apparent at oblique viewing angles, where incomplete geometric inference can lead to implausible deformations.

Furthermore, the proposed method demonstrates improved preservation of fine-scale structural characteristics that are relevant for cultural heritage documentation, including rim definition, body curvature, and base geometry. Although texture fidelity is not the primary focus of this qualitative evaluation, the reconstructed appearances show fewer visual artifacts and more uniform surface distributions, contributing to enhanced visual interpretability.

Overall, the qualitative results presented in both figures are consistent with the quantitative evaluation, indicating that the proposed framework is capable of producing structurally stable and visually coherent 3D reconstructions from single images. The observed robustness across diverse viewpoints and object geometries suggests its applicability to practical scenarios such as digital documentation, visualization, and long-term archival preservation of cultural heritage artifacts.

4.3. Quantitative Evaluation

Table 2 presents a quantitative comparison between the proposed method and several representative single-image 3D reconstruction approaches on the test set. Reconstruction quality is evaluated using Chamfer Distance (CD) and F-score (FS) under multiple distance thresholds, which jointly measure geometric accuracy and surface completeness.

Among early generative methods, One-2-3-45 exhibits the weakest performance, with a relatively high Chamfer Distance of 0.378 and low F-scores (0.299, 0.597, and 0.776 at thresholds 0.1, 0.2, and 0.5, respectively), indicating limited reconstruction precision and incomplete surface coverage. ZeroShape improves the CD to 0.223 and increases FS@0.1 to 0.423, but its performance remains constrained under stricter thresholds. TGS further reduces the CD to 0.188 and achieves an FS@0.1 of 0.579, reflecting improved geometric fidelity. More recent approaches, including LRM and VGGT, demonstrate stronger reconstruction accuracy. LRM achieves a CD of 0.177 and an FS@0.2 of 0.755, while VGGT further reduces the CD to 0.170 and attains higher F-scores across most thresholds (FS@0.1 = 0.607, FS@0.5 = 0.838), indicating improved global alignment and surface consistency.

The proposed method achieves the best performance across all evaluation metrics. It records the lowest Chamfer Distance of 0.163, a relative improvement of 4.1% over VGGT. In terms of surface accuracy, it achieves the highest F-scores at all thresholds, with FS@0.1 = 0.622, FS@0.2 = 0.756, and FS@0.5 = 0.848. Compared with the strongest baseline (VGGT), the proposed method improves FS@0.1 and FS@0.5 by 1.5 and 1.0 percentage points, respectively. The consistent gains under stricter thresholds indicate enhanced preservation of fine-scale geometric details while maintaining overall surface completeness.
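
The relative gains over VGGT quoted above follow directly from the reported values, as the short check below shows. The dictionary and function names are illustrative; the numbers are those given in the text.

```python
# Values quoted in the text for VGGT and the proposed method.
vggt = {"CD": 0.170, "FS@0.1": 0.607, "FS@0.5": 0.838}
ours = {"CD": 0.163, "FS@0.1": 0.622, "FS@0.5": 0.848}


def relative_reduction_pct(baseline: float, value: float) -> float:
    """Relative reduction in percent, for lower-is-better metrics such as CD."""
    return (baseline - value) / baseline * 100.0


def gain_in_points(baseline: float, value: float) -> float:
    """Absolute gain in percentage points, for F-scores reported in [0, 1]."""
    return (value - baseline) * 100.0


if __name__ == "__main__":
    print(round(relative_reduction_pct(vggt["CD"], ours["CD"]), 1))
    print(round(gain_in_points(vggt["FS@0.1"], ours["FS@0.1"]), 1))
    print(round(gain_in_points(vggt["FS@0.5"], ours["FS@0.5"]), 1))
```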

These quantitative results demonstrate that the proposed framework delivers more accurate and structurally consistent 3D reconstructions from single images, which is particularly important for digital heritage documentation scenarios that require faithful geometric representation for visualization, analysis, and long-term preservation.

Figure 5 compares our method with the baseline along five axes: inference time, Chamfer Distance, F-score@0.1, F-score@0.2, and F-score@0.5. Every axis is oriented so that points farther from the center indicate better performance. On the inference time and Chamfer Distance axes, points nearer the center correspond to longer inference times and larger distances, respectively; on the three F-score axes, points nearer the center correspond to lower scores. Building on these measurements, the proposed method is analyzed and compared with representative single-image 3D reconstruction approaches in terms of five complementary aspects: geometric accuracy, surface completeness, fine-scale detail preservation, computational efficiency, and overall performance balance.
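
One way to obtain radar axes on which larger radii always mean better performance is to min-max normalize each metric and invert the lower-is-better ones. The sketch below applies this to the Chamfer Distance values quoted in this section. The normalization scheme itself is an assumption; the text does not state how the axes of Figure 5 are scaled.

```python
def radar_radii(values: dict, lower_is_better: bool) -> dict:
    """Map raw metric values to radii in [0, 1], where 1 is always the best."""
    lo, hi = min(values.values()), max(values.values())
    span = hi - lo
    radii = {}
    for name, v in values.items():
        t = 0.0 if span == 0 else (v - lo) / span
        # Invert axes such as CD or inference time, where small values are good.
        radii[name] = 1.0 - t if lower_is_better else t
    return radii


# Chamfer Distance values quoted in the text.
cd = {"One-2-3-45": 0.378, "ZeroShape": 0.223, "TGS": 0.188,
      "LRM": 0.177, "VGGT": 0.170, "Ours": 0.163}
cd_radii = radar_radii(cd, lower_is_better=True)

if __name__ == "__main__":
    for name, r in cd_radii.items():
        print(f"{name}: {r:.3f}")
```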

Geometric accuracy is primarily assessed using Chamfer Distance (CD), which measures the average bidirectional distance between reconstructed and ground-truth surfaces. Lower CD values indicate more precise geometric alignment. The proposed method achieves the lowest CD among all compared approaches, demonstrating improved global shape fidelity and reduced geometric deviation. This suggests that the reconstructed surfaces more closely approximate the true object geometry, even under limited input conditions.
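
For concreteness, a minimal brute-force sketch of a symmetric Chamfer Distance between two sampled point sets is given below. Several variants of the metric exist (squared or unsquared distances, sum or mean of the two directions); this version sums the two mean nearest-neighbor distances and is an assumption, not necessarily the exact variant used in the evaluation.

```python
import math


def _mean_nearest(src, dst) -> float:
    """Mean distance from each point in src to its nearest neighbor in dst."""
    return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)


def chamfer_distance(pred, gt) -> float:
    """Symmetric Chamfer Distance between two non-empty 3D point sets."""
    return _mean_nearest(pred, gt) + _mean_nearest(gt, pred)


if __name__ == "__main__":
    square = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
    shifted = [(x, y, z + 0.1) for x, y, z in square]
    print(chamfer_distance(square, square))
    print(chamfer_distance(square, shifted))
```

The double loop costs O(NM) distance evaluations, so point sets sampled densely from meshes are usually handled with a spatial index such as a k-d tree instead.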

Surface completeness is evaluated using the F-score metric, which jointly considers precision and recall under predefined distance thresholds. At moderate and loose thresholds (FS@0.2 and FS@0.5), the proposed method consistently attains higher scores, indicating more complete surface coverage and fewer missing regions. This reflects the model’s ability to recover coherent object structures rather than fragmented or partially reconstructed shapes.

Fine-scale detail preservation is captured by the F-score under a strict distance threshold (FS@0.1). Performance under this metric is particularly challenging, as it emphasizes accurate recovery of small-scale geometric features. The proposed method demonstrates a clear advantage in this regime, achieving the highest FS@0.1 score. This improvement indicates enhanced sensitivity to subtle geometric variations, which is crucial for faithfully representing intricate surface characteristics.
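
The thresholded F-score discussed above can be sketched as follows: precision is the fraction of predicted points lying within the threshold of the ground truth, recall is the fraction of ground-truth points lying within the threshold of the prediction, and the F-score is their harmonic mean. The strict inequality and the brute-force nearest-neighbor search are assumptions made for this illustration.

```python
import math


def _fraction_within(src, dst, threshold: float) -> float:
    """Fraction of points in src whose nearest neighbor in dst is closer than threshold."""
    hits = sum(1 for p in src if min(math.dist(p, q) for q in dst) < threshold)
    return hits / len(src)


def f_score(pred, gt, threshold: float) -> float:
    """Harmonic mean of precision and recall at the given distance threshold."""
    precision = _fraction_within(pred, gt, threshold)
    recall = _fraction_within(gt, pred, threshold)
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


if __name__ == "__main__":
    pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    gt = [(0.0, 0.0, 0.05), (5.0, 0.0, 0.0)]
    for tau in (0.1, 0.2, 0.5):
        print(tau, f_score(pred, gt, tau))
```

A strict threshold such as 0.1 counts a point as correct only when it lies very close to the other surface, which is why FS@0.1 is the most sensitive of the three scores to small geometric errors.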

Computational efficiency is evaluated through inference time, reflecting the practical cost of reconstruction. While some baseline methods achieve competitive accuracy, they require substantially longer inference times. The proposed approach maintains a favorable efficiency profile, achieving high-quality reconstruction without excessive computational overhead. This balance supports scalable deployment for large artifact collections.

Overall performance balance is illustrated by the combined behavior across all metrics. Rather than optimizing a single criterion at the expense of others, the proposed method exhibits stable and well-rounded performance across accuracy, completeness, detail preservation, and efficiency. This balanced profile highlights its suitability for real-world digital heritage reconstruction tasks, where both geometric fidelity and practical feasibility are essential.

4.4. Failure Case Analysis

Although our model achieves promising results, several typical failure cases remain, as shown in Figure 6:

Case 1: The original object is a pottery jar, but the photograph was taken from above the jar’s mouth, causing the model to misinterpret it as a pottery bowl.
Case 2: The input photo shows a four-legged pottery vessel, but because the photo does not capture all four legs, the model outputs a four-legged vessel with discontinuous geometry.
Case 3: The object in the input image is a clay figurine. However, the photograph was taken from a directly top-down, near-orthographic viewpoint that does not reveal the figurine’s thickness, so the model collapses to a clay figurine of uniform thickness.
Case 4: The object is a bronze mirror (one side smooth, the other decorated). Owing to the model’s limited texture perception, the decorative patterns on the back are lost, and the smooth surface is not reproduced.
Case 5: The object is a jade buckle. Although our model reconstructs part of the outline and shape, the smooth texture of the jade is not effectively reproduced.

Analysis of these five failure cases reveals that our model lacks robust geometric reconstruction capability for instances with complex structures and irregular shapes. Specifically, when a photograph is taken from an extreme viewpoint, such as directly overhead or nearly parallel to the object, the model is prone to misinterpreting the object category, which results in poor recovery of the geometric shape. For artifacts featuring intricate patterns and complex structures, reconstruction quality is poor because the model has not been trained on large-scale datasets, which limits its generalization ability.

4.5. Ablation Study

To verify the effectiveness of our design choices, we conduct a series of ablation experiments. As shown in Table 3, the DINO-based encoder is the most critical component for structural accuracy; without its high-level semantic features, the model struggles with complex topologies. The camera estimation module is indispensable for robustness: its absence causes the coordinate system to collapse when processing uncalibrated archival photos. Furthermore, we observe that while triplane resolution impacts the granularity of the mesh, a 32 × 32 resolution combined with our upsampling strategy provides the most efficient trade-off for large-scale heritage digitization.

As shown in the “w/o DINO” row of Table 3, replacing the DINO-pre-trained encoder with a standard ViT (pre-trained on ImageNet) leads to a significant degradation in reconstruction accuracy (CD increases by 16%). This confirms that DINO provides essential high-level semantic and geometric priors, which are crucial for “hallucinating” the occluded parts of complex cultural artifacts.

Omitting the camera-agnostic estimation branch (the “w/o Camera Estimation” row of Table 3) results in the largest performance drop. Without the ability to infer viewpoint parameters from uncalibrated historical images, the model suffers from severe geometric distortion and alignment errors. This validates the necessity of our module for processing real-world archival data where camera metadata is missing.

We investigated the trade-off between resolution and efficiency. While a lower resolution ($16 \times 16$, Table 3 w/o Triplane Resolution) reduces inference time, it fails to capture the intricate surface textures of bronzeware. Our chosen $32 \times 32$ resolution with a factor-2 upsampler provides the optimal balance, ensuring high-fidelity detail preservation within a reasonable computational budget.

Finally, we investigated the impact of triplane resolution on reconstruction quality and computational efficiency, as summarized in Table 4. The triplane resolution is a strategic hyperparameter that balances geometric fidelity with resource consumption. Our results demonstrate that while increasing the resolution from $32 \times 32$ to $64 \times 64$ offers marginal improvements in capturing fine-grained micro-textures, it significantly escalates the computational overhead of the Transformer decoder and the neural renderer, leading to an approximately 40% increase in GPU memory usage. Conversely, a lower resolution of $16 \times 16$ fails to preserve complex geometric details, resulting in a noticeably higher Chamfer Distance (CD). Therefore, we adopted the $32 \times 32$ resolution with an integrated factor-2 upsampler as the optimal ‘sweet spot’. This configuration ensures high-fidelity geometry preservation while maintaining a low enough resource footprint for the massive parallel processing required in large-scale archival deployment.
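
The reason resolution weighs so heavily on the Transformer decoder can be seen from the token counts alone. The sketch below counts triplane tokens for the three resolutions discussed and gives a naive estimate of self-attention cost, which grows with the square of the token count. This is a back-of-the-envelope estimate under the assumption of dense self-attention; the measured memory increase reported above depends on the actual implementation and need not follow this quadratic trend.

```python
def triplane_tokens(res: int, planes: int = 3) -> int:
    """Number of triplane tokens: one per cell on each of the three planes."""
    return planes * res * res


def relative_attention_cost(res: int, ref_res: int = 32) -> float:
    """Naive dense self-attention cost relative to the reference resolution."""
    return (triplane_tokens(res) / triplane_tokens(ref_res)) ** 2


if __name__ == "__main__":
    for res in (16, 32, 64):
        print(res, triplane_tokens(res), relative_attention_cost(res))
```

Doubling the resolution quadruples the number of tokens, so the pairwise attention term grows by a factor of 16 in this simple model.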

5. Discussion

5.1. Implications for Cultural Heritage Preservation

The proposed single-image-to-3D reconstruction framework offers practical implications for cultural heritage preservation and digital documentation. By enabling the generation of structurally coherent 3D models from a single RGB image, the method significantly reduces the dependence on expensive 3D scanning devices and multi-view image acquisition setups. This characteristic is particularly valuable in heritage scenarios where artifacts are fragile, inaccessible, or documented only through historical photographs and archival images. Moreover, the reduced requirement for physical interaction with cultural relics minimizes the risk of damage during the digitization process. The generated 3D models can support a wide range of heritage-related applications, including virtual exhibition, academic analysis, digital restoration planning, and long-term archival storage. As a result, the proposed framework provides an efficient and scalable solution for large-scale digital archiving projects, especially in contexts where traditional acquisition pipelines are infeasible.

Furthermore, as the digitization of cultural relics scales across multiple institutions, the protection of data privacy and intellectual property becomes a paramount concern for museum archives. High-resolution original datasets are often sensitive and restricted from public sharing. To address this, future frameworks could integrate Federated Learning (FL) with secure aggregation mechanisms, as proposed in recent privacy-preserving studies [64]. By adopting a federated approach, multiple museums can collaboratively train and refine 3D generative models without the need to exchange or centralize their raw image datasets. This decentralized training paradigm not only ensures the security of digital heritage archives but also allows the model to learn from a more diverse global collection of artifacts, ultimately enhancing the robustness and universality of 3D reconstruction in cultural preservation.
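
To make the federated idea concrete, the sketch below shows the weighted parameter averaging step at the core of federated averaging: each institution trains locally and shares only its parameters, which a coordinator combines in proportion to local dataset size. This is an illustrative simplification and not part of the proposed framework. The function name is hypothetical, parameters are represented as flat lists of floats, and the secure aggregation layer mentioned above is deliberately omitted.

```python
def federated_average(client_params, client_sizes):
    """Average client parameter vectors, weighted by local dataset size."""
    if len(client_params) != len(client_sizes) or not client_params:
        raise ValueError("need one dataset size per client")
    total = sum(client_sizes)
    dim = len(client_params[0])
    return [
        sum(params[k] * n for params, n in zip(client_params, client_sizes)) / total
        for k in range(dim)
    ]


if __name__ == "__main__":
    # Two hypothetical museums with 1 and 3 local samples, respectively.
    print(federated_average([[1.0, 2.0], [3.0, 4.0]], [1, 3]))
```

Only parameter vectors cross institutional boundaries in this scheme; the raw images stay with their owners.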

5.2. Limitations

Despite its effectiveness, the proposed approach has several limitations.

First, the method may struggle when processing artifacts with highly reflective, transparent, or textureless surfaces, where visual cues are insufficient to infer accurate geometry. To overcome these constraints, we plan to implement a dual-path data augmentation strategy in our future research. (1) Integrating physically-based synthetic datasets: We are developing a pipeline to generate high-fidelity synthetic 3D models with complex material properties. By training our model on data that simulates metallic, specular, and refractive behaviors, we aim to enable the network to learn essential physical priors that resolve the ambiguities in shiny or metallic objects. (2) Incorporating multi-modal data constraints: We are exploring the fusion of advanced imaging modalities, such as polarization-based imaging or photometric stereo. Unlike standard RGB images, these modalities provide explicit constraints on surface normals and light-matter interaction patterns. By integrating such data, we expect to better guide the geometric reconstruction of challenging materials like jade and glazed pottery, where standard visual cues are insufficient to disambiguate depth and transparency.

Second, the current approach focuses primarily on geometric reconstruction and does not explicitly model material properties or fine-grained texture details, which are important for high-fidelity visual reproduction.

Third, the current framework primarily focuses on recovering the geometric structure and surface RGB colors of cultural relics, but it does not explicitly model complex material properties, such as the Bidirectional Reflectance Distribution Function (BRDF). Cultural heritage artifacts, particularly metallic bronze ware or polished jade, exhibit intricate light-interaction characteristics, including specular highlights, subsurface scattering, and metallic luster. Since our model relies on a standard radiance field representation that entangles geometry and appearance, it may struggle to achieve high photorealism under varying lighting conditions. Future work will explore the integration of physically based rendering (PBR) to decouple albedo, roughness, and metallicity, thereby enhancing the visual fidelity of shiny or semi-transparent relics.

Addressing these limitations may require incorporating additional priors, multimodal data, or limited user interaction in future work.

6. Conclusions

This paper presents a generative framework for reconstructing three-dimensional digital models of cultural relics from a single two-dimensional image. By leveraging transformer-based visual feature extraction and a hybrid explicit–implicit 3D representation, the proposed method enables accurate geometric reconstruction without requiring multi-view inputs or explicit camera calibration. Experimental results demonstrate that the proposed approach achieves improved reconstruction accuracy and surface completeness compared with existing methods, highlighting its robustness under limited data conditions. The framework provides a practical pathway for the digital preservation and restoration of cultural heritage artifacts, particularly in scenarios where only sparse visual records are available. Future work will explore the integration of richer appearance modeling and interactive refinement to further enhance reconstruction fidelity and applicability.

Author Contributions

Conceptualization, G.W.; Methodology, G.W.; Software, Y.W.; Validation, M.G.; Formal analysis, M.G.; Investigation, Y.W. and Y.C.; Resources, Y.W. and Y.C.; Data curation, Y.W. and Y.C.; Writing—original draft, G.W. and M.G.; Writing—review & editing, L.L.; Visualization, Y.W. and Y.C.; Supervision, L.L.; Project administration, L.L.; Funding acquisition, L.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the collaborative R&D project between Chongqing Cultural Relics and Archaeology Research Institute and Chongqing University (Grant No. CQS24C00512).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The code is available at https://github.com/DannielGe/From-2D-to-3D (accessed on 30 January 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

References

Del Giudice, M.; Osello, A. BIM for cultural heritage. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2013, 40, 225–229. [Google Scholar] [CrossRef]
Murphy, M.; McGovern, E.; Pavia, S. Historic building information modelling (HBIM). Struct. Surv. 2009, 27, 311–327. [Google Scholar] [CrossRef]
Remondino, F.; Rizzi, A. Reality-based 3D documentation of natural and cultural heritage sites—techniques, problems, and examples. Appl. Geomat. 2010, 2, 85–100. [Google Scholar] [CrossRef]
Luhmann, T.; Robson, S.; Kyle, S.; Boehm, J. Close-Range Photogrammetry and 3D Imaging; Walter de Gruyter GmbH & Co. KG: Berlin, Germany, 2023. [Google Scholar]
Vosselman, G.; Maas, H.G. Airborne and Terrestrial Laser Scanning; Whittles Publishing: Dunbeath, UK, 2010. [Google Scholar]
Salvi, J.; Fernandez, S.; Pribanic, T.; Llado, X. A state of the art in structured light patterns for surface profilometry. Pattern Recognit. 2010, 43, 2666–2680. [Google Scholar] [CrossRef]
Westoby, M.J.; Brasington, J.; Glasser, N.F.; Hambrey, M.J.; Reynolds, J.M. ‘Structure-from-Motion’photogrammetry: A low-cost, effective tool for geoscience applications. Geomorphology 2012, 179, 300–314. [Google Scholar] [CrossRef]
Kersten, T.P.; Lindstaedt, M. Image-based low-cost systems for automatic 3D recording and modelling of archaeological finds and objects. In Proceedings of the Euro-Mediterranean Conference; Springer: Berlin/Heidelberg, Germany, 2012; pp. 1–10. [Google Scholar]
Gallo, G.; Stanco, F.; Battiato, S. Digital Imaging for Cultural Heritage Preservation; CRC Press: Boca Raton, FL, USA, 2011. [Google Scholar]
Wang, N.; Zhang, Y.; Li, Z.; Fu, Y.; Liu, W.; Jiang, Y.G. Pixel2mesh: Generating 3d mesh models from single rgb images. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 52–67. [Google Scholar]
Kato, H.; Ushiku, Y.; Harada, T. Neural 3d mesh renderer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2018; pp. 3907–3916. [Google Scholar]
Pintus, R.; Pal, K.; Yang, Y.; Weyrich, T.; Gobbetti, E.; Rushmeier, H. A survey of geometric analysis in cultural heritage. In Proceedings of the Computer Graphics Forum; Wiley Online Library: Hoboken, NJ, USA, 2016; Volume 35, pp. 4–31. [Google Scholar]
Barsanti, S.G.; Remondino, F.; Visintini, D. Photogrammetry and Laser Scanning for archaeological site 3D modeling–Some critical issues. In Proceedings of the 2nd Workshop on ‘The New Technologies for Aquileia’, Aquileia, Italy, 25 June 2012; Roberto, V., Fozzati, L., Eds.; Volume 1, pp. 1–10. [Google Scholar]
Chen, Z.; Zhang, H. Learning implicit fields for generative shape modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2019; pp. 5939–5948. [Google Scholar]
Park, J.J.; Florence, P.; Straub, J.; Newcombe, R.; Lovegrove, S. Deepsdf: Learning continuous signed distance functions for shape representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2019; pp. 165–174. [Google Scholar]
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
Dosovitskiy, A. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
Mildenhall, B.; Srinivasan, P.P.; Tancik, M.; Barron, J.T.; Ramamoorthi, R.; Ng, R. Nerf: Representing scenes as neural radiance fields for view synthesis. Commun. ACM 2021, 65, 99–106. [Google Scholar] [CrossRef]
Oechsle, M.; Peng, S.; Geiger, A. Unisurf: Unifying neural implicit surfaces and radiance fields for multi-view reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2021; pp. 5589–5599. [Google Scholar]
Mescheder, L.; Oechsle, M.; Niemeyer, M.; Nowozin, S.; Geiger, A. Occupancy networks: Learning 3d reconstruction in function space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2019; pp. 4460–4470. [Google Scholar]
Barron, J.T.; Mildenhall, B.; Tancik, M.; Hedman, P.; Martin-Brualla, R.; Srinivasan, P.P. Mip-nerf: A multiscale representation for anti-aliasing neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2021; pp. 5855–5864. [Google Scholar]
Barron, J.T.; Mildenhall, B.; Verbin, D.; Srinivasan, P.P.; Hedman, P. Mip-nerf 360: Unbounded anti-aliased neural radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2022; pp. 5470–5479. [Google Scholar]
Lin, C.H.; Ma, W.C.; Torralba, A.; Lucey, S. Barf: Bundle-adjusting neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2021; pp. 5741–5751. [Google Scholar]
Wang, Z.; Wu, S.; Xie, W.; Chen, M.; Prisacariu, V.A. NeRF–: Neural Radiance Fields Without Known Camera Parameters. arXiv 2021, arXiv:2102.07064. [Google Scholar]
Verbin, D.; Hedman, P.; Mildenhall, B.; Zickler, T.; Barron, J.T.; Srinivasan, P.P. Ref-nerf: Structured view-dependent appearance for neural radiance fields. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 47, 9426–9437. [Google Scholar] [CrossRef] [PubMed]
Martin-Brualla, R.; Radwan, N.; Sajjadi, M.S.; Barron, J.T.; Dosovitskiy, A.; Duckworth, D. Nerf in the wild: Neural radiance fields for unconstrained photo collections. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2021; pp. 7210–7219. [Google Scholar]
Tewari, A.; Fried, O.; Thies, J.; Sitzmann, V.; Lombardi, S.; Sunkavalli, K.; Martin-Brualla, R.; Simon, T.; Saragih, J.; Nießner, M.; et al. State of the art on neural rendering. In Proceedings of the Computer Graphics Forum; Wiley Online Library: Hoboken, NJ, USA, 2020; Volume 39, pp. 701–727. [Google Scholar]
Xiao, W.; Cruz, R.S.; Ahmedt-Aristizabal, D.; Salvado, O.; Fookes, C.; Lebrat, L. Nerf director: Revisiting view selection in neural volume rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2024; pp. 20742–20751. [Google Scholar]
Prados, E.; Faugeras, O. Shape from shading. In Handbook of Mathematical Models in Computer Vision; Springer: Berlin/Heidelberg, Germany, 2006; pp. 375–388. [Google Scholar]
Blanz, V.; Vetter, T. A morphable model for the synthesis of 3D faces. In Seminal Graphics Papers: Pushing the Boundaries, Volume 2; Association for Computing Machinery: New York, NY, USA, 2023; pp. 157–164. [Google Scholar]
Anguelov, D.; Srinivasan, P.; Koller, D.; Thrun, S.; Rodgers, J.; Davis, J. Scape: Shape completion and animation of people. In ACM Siggraph 2005 Papers; Association for Computing Machinery: New York, NY, USA, 2005; pp. 408–416. [Google Scholar]
Choy, C.B.; Xu, D.; Gwak, J.; Chen, K.; Savarese, S. 3d-r2n2: A unified approach for single and multi-view 3d object reconstruction. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2016; pp. 628–644. [Google Scholar]
Fan, H.; Su, H.; Guibas, L.J. A point set generation network for 3d object reconstruction from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2017; pp. 605–613. [Google Scholar]
Xie, H.; Yao, H.; Sun, X.; Zhou, S.; Zhang, S. Pix2vox: Context-aware 3d reconstruction from single and multi-view images. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2019; pp. 2690–2698. [Google Scholar]
Hong, Y.; Zhang, K.; Gu, J.; Bi, S.; Zhou, Y.; Liu, D.; Liu, F.; Sunkavalli, K.; Bui, T.; Tan, H. Lrm: Large reconstruction model for single image to 3d. arXiv 2023, arXiv:2311.04400. [Google Scholar]
Tochilkin, D.; Pankratz, D.; Liu, Z.; Huang, Z.; Letts, A.; Li, Y.; Liang, D.; Laforte, C.; Jampani, V.; Cao, Y.P. Triposr: Fast 3d object reconstruction from a single image. arXiv 2024, arXiv:2403.02151. [Google Scholar] [CrossRef]
Li, J.; Tan, H.; Zhang, K.; Xu, Z.; Luan, F.; Xu, Y.; Hong, Y.; Sunkavalli, K.; Shakhnarovich, G.; Bi, S. Instant3d: Fast text-to-3d with sparse-view generation and large reconstruction model. arXiv 2023, arXiv:2311.06214. [Google Scholar]
Huang, Z.; Stojanov, S.; Thai, A.; Jampani, V.; Rehg, J.M. Zeroshape: Regression-based zero-shot shape reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2024; pp. 10061–10071. [Google Scholar]
Wang, Z.; Wang, Y.; Chen, Y.; Xiang, C.; Chen, S.; Yu, D.; Li, C.; Su, H.; Zhu, J. Crm: Single image to 3d textured mesh with convolutional reconstruction model. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2024; pp. 57–74. [Google Scholar]
Wei, X.; Zhang, K.; Bi, S.; Tan, H.; Luan, F.; Deschaintre, V.; Sunkavalli, K.; Su, H.; Xu, Z. Meshlrm: Large reconstruction model for high-quality meshes. arXiv 2024, arXiv:2404.12385. [Google Scholar]
Deitke, M.; Schwenk, D.; Salvador, J.; Weihs, L.; Michel, O.; VanderBilt, E.; Schmidt, L.; Ehsani, K.; Kembhavi, A.; Farhadi, A. Objaverse: A universe of annotated 3d objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2023; pp. 13142–13153. [Google Scholar]
Collins, J.; Goel, S.; Deng, K.; Luthra, A.; Xu, L.; Gundogdu, E.; Zhang, X.; Vicente, T.F.Y.; Dideriksen, T.; Arora, H.; et al. Abo: Dataset and benchmarks for real-world 3d object understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2022; pp. 21126–21136. [Google Scholar]
Barrile, V.; Bilotta, G.; Lamari, D. 3D models of Cultural Heritage. Int. J. Math. Model. Methods Appl. Sci. 2017, 11, 1–8. [Google Scholar]
Tatarchenko, M.; Dosovitskiy, A.; Brox, T. Octree generating networks: Efficient convolutional architectures for high-resolution 3d outputs. In Proceedings of the IEEE International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2017; pp. 2088–2096. [Google Scholar]
Chan, E.R.; Lin, C.Z.; Chan, M.A.; Nagano, K.; Pan, B.; De Mello, S.; Gallo, O.; Guibas, L.J.; Tremblay, J.; Khamis, S.; et al. Efficient geometry-aware 3d generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2022; pp. 16123–16133. [Google Scholar]
Gu, J.; Liu, L.; Wang, P.; Theobalt, C. Stylenerf: A style-based 3d-aware generator for high-resolution image synthesis. arXiv 2021, arXiv:2110.08985. [Google Scholar]
Zou, Z.X.; Yu, Z.; Guo, Y.C.; Li, Y.; Liang, D.; Cao, Y.P.; Zhang, S.H. Triplane meets gaussian splatting: Fast and generalizable single-view 3d reconstruction with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2024; pp. 10324–10335. [Google Scholar]
Kerbl, B.; Kopanas, G.; Leimkühler, T.; Drettakis, G. 3D Gaussian splatting for real-time radiance field rendering. ACM Trans. Graph. 2023, 42, 139-1. [Google Scholar] [CrossRef]
Kania, K.; Yi, K.M.; Kowalski, M.; Trzciński, T.; Tagliasacchi, A. CoNeRF: Controllable neural radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2022; pp. 18623–18632.
Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2021; pp. 10012–10022.
Liu, Z.; Hu, H.; Lin, Y.; Yao, Z.; Xie, Z.; Wei, Y.; Ning, J.; Cao, Y.; Zhang, Z.; Dong, L.; et al. Swin Transformer V2: Scaling up capacity and resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2022; pp. 12009–12019.
Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2009; pp. 248–255.
Caron, M.; Touvron, H.; Misra, I.; Jégou, H.; Mairal, J.; Bojanowski, P.; Joulin, A. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2021; pp. 9650–9660.
Oquab, M.; Darcet, T.; Moutakanni, T.; Vo, H.; Szafraniec, M.; Khalidov, V.; Fernandez, P.; Haziza, D.; Massa, F.; El-Nouby, A.; et al. DINOv2: Learning robust visual features without supervision. arXiv 2023, arXiv:2304.07193.
Siméoni, O.; Vo, H.V.; Seitzer, M.; Baldassarre, F.; Oquab, M.; Jose, C.; Khalidov, V.; Szafraniec, M.; Yi, S.; Ramamonjisoa, M.; et al. DINOv3. arXiv 2025, arXiv:2508.10104.
Wang, J.; Chen, M.; Karaev, N.; Vedaldi, A.; Rupprecht, C.; Novotny, D. VGGT: Visual geometry grounded transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference; IEEE: Piscataway, NJ, USA, 2025; pp. 5294–5306.
Remondino, F. Heritage recording and 3D modeling with photogrammetry and 3D scanning. Remote Sens. 2011, 3, 1104–1138.
Levoy, M.; Pulli, K.; Curless, B.; Rusinkiewicz, S.; Koller, D.; Pereira, L.; Ginzton, M.; Anderson, S.; Davis, J.; Ginsberg, J.; et al. The digital Michelangelo project: 3D scanning of large statues. In Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques; ACM Press: New York, NY, USA, 2000; pp. 131–144.
Schönberger, J.L.; Frahm, J.M. Structure-from-motion revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2016; pp. 4104–4113.
Lorensen, W.E.; Cline, H.E. Marching cubes: A high resolution 3D surface construction algorithm. In Seminal Graphics: Pioneering Efforts that Shaped the Field; Association for Computing Machinery: New York, NY, USA, 1998; pp. 347–353.
Liu, M.; Xu, C.; Jin, H.; Chen, L.; Varma T, M.; Xu, Z.; Su, H. One-2-3-45: Any single image to 3D mesh in 45 s without per-shape optimization. Adv. Neural Inf. Process. Syst. 2023, 36, 22226–22246.
Tang, J.; Chen, Z.; Chen, X.; Wang, T.; Zeng, G.; Liu, Z. LGM: Large multi-view Gaussian model for high-resolution 3D content creation. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2024; pp. 1–18.
Xu, J.; Cheng, W.; Gao, Y.; Wang, X.; Gao, S.; Shan, Y. InstantMesh: Efficient 3D mesh generation from a single image with sparse-view large reconstruction models. arXiv 2024, arXiv:2404.07191.
Kasula, V.K.; Yenugula, M.; Konda, B.; Yadulla, A.R.; Tumma, C.; Rakki, S.B. Federated learning with secure aggregation for privacy-preserving deep learning in IoT environments. In Proceedings of the 2025 IEEE Conference on Computer Applications (ICCA); IEEE: Piscataway, NJ, USA, 2025; pp. 1–7.

Figure 1. Three neural scene representation methods: (a) implicit representation, (b) explicit representation, (c) hybrid representation.

Figure 2. The framework of our proposed method.

Figure 3. Qualitative comparison with LRM.

Figure 4. Visual presentation of other cultural relics.

Figure 5. Comprehensive performance evaluation of the proposed method.

Figure 6. Visualization of failure cases.

Table 2. Quantitative comparison of different methods on the test set. ↑ indicates that higher values are better; ↓ indicates that lower values are better. Purple cells indicate the best results, while gray cells indicate the second-best results. The asterisk (*) denotes that the improvement of our method over the strongest baseline (VGGT or LRM) is statistically significant with p < 0.05 based on a paired t-test.

Methods	CD↓	FS@0.1↑	FS@0.2↑	FS@0.5↑
One-2-3-45 [61]	0.378	0.299	0.597	0.776
ZeroShape [38]	0.223	0.423	0.665	0.809
TGS [47]	0.188	0.579	0.731	0.826
LGM [62]	0.263	0.445	0.583	0.671
InstantMesh [63]	0.177	0.606	0.752	0.833
CRM [39]	0.252	0.561	0.701	0.787
LRM [35]	0.177	0.599	0.755	0.831
VGGT [56]	0.170	0.607	0.740	0.838
Ours	0.163 *	0.622 *	0.756 *	0.848 *
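
For readers unfamiliar with the columns of Table 2, the following is a minimal sketch of how a Chamfer Distance (CD) and an F-score at a distance threshold (FS@τ) are commonly computed between a predicted and a ground-truth point set. It is an illustration only: the function names are hypothetical, and the exact convention used for Table 2 (squared or unsquared distances, sum or mean, point sampling and normalization) is not stated in this section, so the unsquared symmetric mean used here is an assumption.

```python
import math


def nearest_distances(src, dst):
    """For each point in src, the Euclidean distance to its nearest point in dst."""
    return [min(math.dist(p, q) for q in dst) for p in src]


def chamfer_distance(pred, gt):
    """Symmetric Chamfer Distance (assumed convention: mean of unsquared
    nearest-neighbour distances, summed over both directions)."""
    d_pred_to_gt = nearest_distances(pred, gt)
    d_gt_to_pred = nearest_distances(gt, pred)
    return (sum(d_pred_to_gt) / len(d_pred_to_gt)
            + sum(d_gt_to_pred) / len(d_gt_to_pred))


def f_score(pred, gt, tau):
    """F-score at threshold tau: harmonic mean of precision (predicted points
    within tau of the ground truth) and recall (ground-truth points within
    tau of the prediction)."""
    d_pred_to_gt = nearest_distances(pred, gt)
    d_gt_to_pred = nearest_distances(gt, pred)
    precision = sum(d <= tau for d in d_pred_to_gt) / len(d_pred_to_gt)
    recall = sum(d <= tau for d in d_gt_to_pred) / len(d_gt_to_pred)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy point sets (made up for illustration, not taken from the dataset).
pred_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gt_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.3)]

cd = chamfer_distance(pred_points, gt_points)
fs_01 = f_score(pred_points, gt_points, 0.1)
fs_05 = f_score(pred_points, gt_points, 0.5)
```

The brute-force nearest-neighbour search is quadratic in the number of points and is only suitable for small examples; dense point clouds would normally use a spatial index such as a k-d tree.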

Table 3. Ablation study for each module. Bold indicates the best result. ↑ indicates that higher values are better; ↓ indicates that lower values are better.

Configuration	CD↓	FS@0.1↑	FS@0.2↑	FS@0.5↑
w/o DINO	0.189	0.581	0.715	0.808
w/o Camera Estimation	0.215	0.542	0.678	0.789
w/o Triplane Resolution	0.295	0.575	0.705	0.801
Full Model	0.163	0.622	0.756	0.848

Table 4. Ablation study on triplane resolution. ↓ indicates that lower values are better.

Triplane Resolution	Inference Time (s)	Peak VRAM (GB)	CD↓
16 × 16	0.8	4.2	0.295
32 × 32 (Ours)	1.6	6.4	0.163
64 × 64	3.5	10.8	0.153
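
One way to read Table 4 is to compare the extra cost of each resolution step against the CD reduction it buys. The short sketch below does this arithmetic on the table's own values; the helper name `step_tradeoffs` and the choice of ratios are illustrative assumptions, not part of the paper's method.

```python
# Rows copied from Table 4:
# (triplane resolution, inference time in s, peak VRAM in GB, CD).
rows = [
    ("16 x 16", 0.8, 4.2, 0.295),
    ("32 x 32", 1.6, 6.4, 0.163),
    ("64 x 64", 3.5, 10.8, 0.153),
]


def step_tradeoffs(table_rows):
    """For each consecutive pair of rows, compute the cost ratios and the
    relative CD reduction when moving to the higher resolution."""
    results = []
    for (r0, t0, m0, cd0), (r1, t1, m1, cd1) in zip(table_rows, table_rows[1:]):
        results.append({
            "step": f"{r0} -> {r1}",
            "time_ratio": t1 / t0,                      # inference time multiplier
            "vram_ratio": m1 / m0,                      # peak VRAM multiplier
            "cd_reduction_pct": 100.0 * (cd0 - cd1) / cd0,  # relative CD drop
        })
    return results


tradeoffs = step_tradeoffs(rows)
for t in tradeoffs:
    print(f"{t['step']}: time x{t['time_ratio']:.2f}, "
          f"VRAM x{t['vram_ratio']:.2f}, CD -{t['cd_reduction_pct']:.1f}%")
```

On these numbers, going from 16 × 16 to 32 × 32 doubles inference time and lowers CD by about 45%, whereas going from 32 × 32 to 64 × 64 costs roughly 2.2 times the inference time and 1.7 times the peak VRAM for a CD reduction of about 6%.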

Share and Cite

MDPI and ACS Style

Wu, G.; Ge, M.; Wang, Y.; Chen, Y.; Liu, L. From 2D to 3D: A Generative Model from Single Image to Digital 3D of Chinese Three Gorges Cultural Relics. Appl. Sci. 2026, 16, 2678. https://doi.org/10.3390/app16062678
