GC-HG Gaussian Splatting Single-View 3D Reconstruction Method Based on Depth Prior and Pseudo-Triplane

Hua Gong; Peide Wang; Yuanjing Ma; Yong Zhang

doi:10.3390/a18120761

,

and

¹

School of Science, Shenyang Ligong University, Shenyang 110159, China

²

Liaoning Key Laboratory of Intelligent Optimization and Control for Ordnance Industry, Shenyang 110159, China

³

National Key Laboratory of Electromagnetic Space Security, Tianjin 300308, China

^*

Authors to whom correspondence should be addressed.

Algorithms2025, 18(12), 761;https://doi.org/10.3390/a18120761
(registering DOI)

This article belongs to the Special Issue Artificial Intelligence in Modeling and Simulation (2nd Edition)

Version Notes

Order Reprints

Abstract

3D Gaussian Splatting (3DGS) is a multi-view 3D reconstruction method that relies solely on image loss for supervision, lacking explicit constraints on the geometric consistency of the rendering model. It uses a multi-view scene-by-scene training paradigm, which limits generalization to unknown scenes in the case of single-view limited input. To address these issues, this paper proposes a Geometric Consistency-High Generalization (GC-HG), a single-view 3DGS reconstruction framework integrating depth prior and a pseudo-triplane. First, we utilize the VGGT 3D geometry pre-trained model to derive depth prior, back-projecting them into point clouds to construct a dual-modal input alongside the image. Second, we introduce a pseudo-triplane mechanism with a learnable Z-plane token for feature decoupling and pseudo-triplane feature fusion, thereby enhancing geometry perception and consistency. Finally, we integrate a parent–child hierarchical Gaussian renderer into the feed-forward 3DGS framework, combining depth and 3D offsets to model depth and geometry information, while mapping parent and child Gaussians into a linear structure through an MLP. Evaluations on the RealEstate10K dataset validate our approach, demonstrating improvements in geometric modeling and generalization for single-view reconstruction. Our method improves Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index (SSIM), and Learned Perceptual Image Patch Similarity (LPIPS) metrics, demonstrating its advantages in geometric consistency modeling and cross-scene generalization.

Keywords:

single-view 3D reconstruction; Gaussian splatting; depth prior; geometry-aware; pseudo-triplane

1. Introduction

Three-dimensional (3D) reconstruction is a key task in computer vision, with critical applications in embodied intelligence and automatic driving. Traditionally, this requires multiple images taken from different viewpoints, or professional hardware devices such as RGB-D cameras and Lidar to capture geometry and appearance. However, in practical applications such as embodied intelligence, due to the small sample size, only single-view images can be accessed. Consequently, single-view 3D scene reconstruction emerges as a very challenging ill-posed problem of significant research value.

For multi-view 3D reconstruction, three primary neural rendering methods have emerged: Signed Distance Field (SDF), Neural Radiance Fields (NeRF), and 3D Gaussian Splatting(3DGS). Both SDF [] and NeRF [] employ continuous implicit functions, parameterized by neural networks, to model 3D scenes. SDF maps the spatial coordinates to a signed distance function, defining the object surface as its zero-level set. Conversely, NeRF maps a 3D position and viewing direction to volumetric density and color, rendering images by integrating sampled points along camera rays. In contrast, 3DGS [] uses 3D Gaussian point cloud primitives to model 3D scenes, rendering image by rasterizing the position, shape, color, and opacity of the point cloud onto a 2D plane. For single-view 3D reconstruction, NeRF-based methods like MINE [] achieve effective modeling of a 3D scene in the viewing frustum by constructing continuous depth generalized multi-plane images (MPI). Based on the 3DGS method Splatter Image [], a feed-forward 3DGS is proposed, which uses the depth network to regress the Gaussian point cloud attributes and directly reconstruct the 3D scene from a single-view image. Although the latest research of 3DGS shows its powerful ability in reconstructing realistic 3D, the key challenge is to provide rich information to ensure the consistency of reconstructed 3D geometry to generate coherent 3D scenes.

To address this in multi-view 3DGS, scholars have leveraged prior from pre-trained models [], effectively improving the reconstruction quality by incorporating the rich semantic and geometric prior knowledge embedded in large visual models. This kind of method can provide more accurate geometric structure prediction for the reconstruction process, so as to improve the realism and consistency of the rendered image, showing a wider application potential. For instance, DepthSplat [] explores the role of depth information in guiding 3DGS reconstruction by using a depth prior to optimize the geometric structure; at the same time, it leverages the rendering consistency of 3DGS to supervise the depth estimation model, ensuring its output is more aligned with the 3D structure of the real scene. MVSplat [] uses the idea of cost volume from traditional Multi-view Stereo (MVS) to more accurately predict the spatial position of Gaussian centers from sparse multi-view inputs, achieving efficient geometric modeling. PixelSplat [] guides the adjustment of Gaussian point depth by predicting the likelihood distribution of layered depth, and designs the epipolar transformer to infer the scale factor of the scene, which is used to solve the problem of scale ambiguity of camera posture in the real scene, and supports reliable reconstruction using only two frames of images. Some researchers further improve the 3D modeling ability through structural design. For example, EG3D [] first integrated a 3D-aware triplane representation into a Generative Adversarial Network (GAN) framework, endowing 2D generative models with 3D modeling capabilities. Subsequent work introduced a triplane diffusion model [] to generate NeRF, achieving excellent performance of diverse scenes. In addition, TriplaneGaussian [] integrates the triplane structure directly with 3DGS. Its triplane structure maps the point cloud sampling to three orthogonal planes, implicitly expresses the shape and structure of the object, and then processes complex geometry and details to render high-quality 3D reconstruction images.

To address this in single-view 3DGS, scholars have explored a variety of schemes guided by the diffusion model, such as text-driven [,], camera attitude change condition-driven [], and new perspective rendering feature image-driven [,,,]. Dreamfusion [] uses a pre-trained 2D text-to-image diffusion model as a prior to optimizing a 3D NeRF representation, enriching its geometry and appearance. Noise point growth and color perturbation are introduced to enhance the initial Gaussian function. Zero-1-to-3 [] fine-tune a pre-trained diffusion model with conditional constraints on relative camera pose, enabling viewpoint control. This enables the synthesis of multi-view images from a single RGB input, providing supervised signals for subsequent 3D reconstruction. VistaDream [] employs a two-stage pipeline, first constructing a coarse 3D scaffold via expanding view and inpainting, and then refining it with a multi-view consistency sampling algorithm to enhance image quality and consistency. ExScene [] uses a multimodal diffusion model with a panoramic prior to generating high-quality panoramic new perspective images for predicting depth information in the first stage. Then, they are combined to train an initial 3D Gaussian model. In the second stage, a dimension-aware Stable Video Diffusion method is used to restore videos rendered from the initial 3D scene, producing high-quality and consistent multi-view images. Then, the image is used to fine-tune the initial model to obtain more realistic visual effects. These diffusion-guided guidance methods can synthesize novel views to augment the limited input data. However, such methods usually require large-scale computing resources to fine-tune the model, and the training cost is very high, relying on dozens of GPUs to complete the task.

While the above research has improved 3DGS reconstruction, several limitations persist. First, the commonly used pre-trained models rely on 2D visual task training and lack prior learned directly from 3D data. Consequently, modeling the geometric structure of 3D objects and scenes accurately remains challenging [,,,,]. A common strategy to mitigate this involves extending 2D image or video generators to 3D tasks [,,,,]. Second, although the EG3D triplane representation [] encodes 3D geometry into 2D planes by sampling features for a 3D point from three orthogonal planes. The inverse process, mapping 2D features back to 3D structures, offers potential for restoring geometric information. We refer to this inverse mechanism as the “pseudo-triplane”. Inspired by these, this paper uses the 3D geometry pre-trained model Visual Geometry Grounded Transformer (VGGT) as the depth prior. Additionally, we introduce a novel pseudo-triplane mechanism that decouples 2D image features into 3D plane features to better fuse and perceive geometric information. Based on the feed-forward 3DGS end-to-end training framework, this paper proposes Geometric Consistency-High Generalization (GC-HG), a single-view 3DGS reconstruction method that integrates depth prior with a pseudo-triplane. The main contributions of this paper include the following:

Depth prior-guided point cloud initialization: we introduce the pre-trained VGGT 3D geometric model as a prior for depth perception, which is used to extract a high-quality depth map. An initial point cloud is then generated through a confidence sampling strategy, providing 3DGS with the initialization of structure perception ability. This prior can enhance the geometric representation ability and cross-scene generalization performance of the model.
Pseudo-triplane representation enhances geometric consistency: we construct a pseudo-triplane representation by integrating a learnable Z-plane token with deep features extracted by an image encoder. This decouples 2D image information into 3D feature planes on the XZ and YZ planes, enabling a specialized fusion mechanism for joint modeling. This enhances geometric consistency and structural details preservation.
Parent–child hierarchical Gaussian renderer modeling depth hierarchy: in the feed-forward 3DGS framework, we design a parent–child hierarchical Gaussian renderer to models multi-level spatial structure by combining depth and 3D offsets. A multi-layer perceptron (MLP) learns the linear mapping between parent and child Gaussians, enhancing the model’s ability to represent complex depth hierarchies.

2. Basic Principles

2.1. Feed-Forward 3DGS

This section discusses the core idea and algorithm architecture of 3DGS and feed-forward 3DGS. The overall structure of the 3DGS method mainly includes the following core modules: motion recovery structure (SfM), 3D Gaussian point cloud initialization, projection transformation, micro block rasterization rendering, and supervision loss calculation. In the training process, the model uses the adaptive density control mechanism to clone, split, and trim the Gaussian points, and gradually improve the reconstruction quality. The overall architecture of this method is shown in Figure 1.

Figure 1. 3DGS algorithm architecture.

The input of 3DGS is multi-view images, and a sparse point cloud (i.e., SfM point cloud

(x_{i}, y_{i}, z_{i})

represents the initial structure of 3D space) is extracted by the motion restoration structure algorithm. Then, the SfM point cloud is transformed into a Gaussian point cloud, that is, Gaussian attributes (including spatial position

(x_{i}, y_{i}, z_{i})

and shape

(R, S)

) are given to each point based on the original point, and its mathematical expression is:

G (x) = \exp [- \frac{1}{2} {(x - μ)}^{T} Σ (x - μ)] w i t h Σ = R S S^{T} R^{T} μ = {(x_{i}, y_{i}, z_{i})}^{T}

(1)

Among them,

Σ

is decomposed into the product of a rotation matrix

R

and a scaling matrix

S

to ensure its semi-positive definiteness and numerical stability.

In addition, each Gaussian point

P

is given a learnable opacity and a color value related to the viewing angle, that is, the radiation field in the 3D scene is defined on the point cloud. Its opacity and color modeling in the 3D Gaussian domain are as follows:

G_{i} (x) = o_{i} \cdot \exp [- \frac{1}{2} {(x - μ)}^{T} Σ^{- 1} (x - μ)]

(2)

C_{i} (x) = c_{i} \cdot o_{i} \cdot \exp [- \frac{1}{2} {(x - μ)}^{T} Σ^{- 1} (x - μ)]

(3)

where

o_{i}

and

c_{i}

represent the opacity and color of the Gaussian point cloud

{P}

, respectively. The final Gaussian parameters are as follows:

{μ, R, S, o_{i}, c_{i}}

(4)

The rendering process in 3DGS is the forward stage of the differential rendering optimization process. The basic idea is to project all Gaussian points onto the image plane according to the camera parameters to form a series of 2D plane Gaussians, then synthesize the image by superimposing and mixing these Gaussians on the 2D plane. To accurately project the 3D Gaussian distribution to the 2D Gaussian distribution, the ellipse weighted average (EWA) [] technology is adopted, which uses the local affine transformation model to approximate the projection, as follows:

μ^{'} = M_{p r o j} W {[μ, 1]}^{T}

(5)

Σ^{'} = J W Σ W^{T} J^{T}

(6)

where

W

represents the projection transformation from the world coordinate system to the camera coordinate system,

J

represents the first-order affine approximate Jacobian matrix of the viewpoint transformation matrix, and

M_{p r o j}

represents the standard perspective projection matrix.

Rasterization rendering is performed for each pixel, the depth of

N

2D Gaussian ellipsoids on the pixel is sorted, and then transparency blending (i.e., alpha blending) is performed from front to back to synthesize the color value of the pixel. As follows:

C = \sum_{i ϵ N} T_{i} α_{i} c_{i} w i t h \{\begin{array}{l} T_{i} = \prod_{j = 1}^{i - 1} (1 - α_{j}) \\ α_{i} = o_{i} \cdot \exp [- \frac{1}{2} {(x - μ)}^{T} \sum^{- 1} (X - μ)] \end{array}

(7)

where in,

c_{i}

represents the color contribution of Gaussian in the pixel;

α_{i}

represents the opacity contribution of Gaussian in the pixel;

T_{i}

represents the cumulative transmittance of all preceding Gaussians to the pixel.

Based on the previous analysis of the basic principles of 3DGS, its dependence on the initial geometry provided by multi-view inputs and SfM inherently limits its application in single-view reconstruction. To overcome this deficiency, this paper introduces feed-forward 3DGS. Feed-forward 3DGS architecture was first proposed by Szymanowicz [], which describes single-view reconstruction as a problem of designing neural networks. This architecture uses the powerful regression ability of the depth network to transform the information inherent in the 2D image plane into a 3D Gaussian point cloud. Unlike the traditional process that relies on multi-view reconstruction, feed-forward 3DGS does not rely on the initial point cloud or view information provided by SfM but directly predicts the Gaussian point set and its attributes in the scene through a single image to realize the direct conversion from 2D to 3D. The overall process of feed-forward 3DGS is shown in Figure 2.

Figure 2. Feed-forward 3DGS algorithm architecture.

Compared with the traditional 3DGS, the feed-forward 3DGS emphasizes learning depth directly from the image. The calculation formula of the Gaussian parameters and the 3D Gaussian point cloud position is as follows:

{d, μ, R, S, o_{i}, c_{i}}

(8)

P = K^{- 1} {[p, 1]}^{T} d

(9)

where

p

is the 2D coordinates,

P

is the 3D coordinates, and

K

is the camera internal parameter matrix.

In feed-forward 3DGS reconstruction, the main purpose of introducing a depth network is to regress the attribute set of Gaussian points. The input is an RGB image of

H \times W \times 3

, and the output is a tensor of

H \times W \times K

, in which each pixel corresponds to a dimensional feature vector, which is associated with the parameters of the learned Gaussian points. Experiments show that the depth parameters extracted from the pre-trained monocular depth estimation network have more advantages in accuracy and generalization ability than those obtained directly through network regression. Therefore, in the feed-forward 3DGS architecture, the depth attributes of Gaussian points are predicted by the pre-trained monocular depth network to enhance the geometric consistency and cross-scene generalization of reconstruction.

2.2. Algorithm in This Paper

While 3DGS shows excellent image realism and rendering quality in novel view synthesis, it still has obvious deficiencies in geometric structure modeling and hierarchical depth perception. The current methods mostly rely on the pixel-level supervision of the rendered image and indirectly adjust the Gaussian point attributes by optimizing the difference between the reconstructed image and the real image. However, this training paradigm, based on rendering error, lacks explicit modeling and strong constraints on the underlying 3D geometric structure, which makes the model pay more attention to visual consistency than geometric accuracy. Without explicit geometric guidance, the model struggles to resolve regions with close spatial location but small differences in geometric structure. This leads to excessive overlap or ambiguity in the Gaussian distributions, severely degrading the clarity and accuracy of the reconstructed geometry. In the rendered images, this lack of geometric modeling is similar to the phenomenon of “low-pass filtering”: jagged edges, blurred textures, high-frequency aliasing, and even boundary aliasing, which seriously weakens the visual realism and stereo sense. To solve the above problems, this paper proposes the GC-HG framework, a feed-forward 3DGS reconstruction method based on depth prior and a novel pseudo-triplane representation. By introducing the structure depth prior and triplane geometry perception mechanism, this method explicitly enhances the ability of 3D structure modeling and shows improvement in geometric consistency and generalization performance of the reconstruction results while maintaining the image realism. The 3D model and novel view synthesized by the algorithm are shown in Figure 3, showing a good reconstruction quality and structural clarity.

Figure 3. GC-HG single-view 3DGS reconstruction synthesis. (a) Input image. (b) Gaussian Point. (c) Novel-view.

Based on the feed-forward 3DGS architecture, the GC-HG framework introduces three key improvements: the initialization of a point cloud guided by depth prior, the plane decoupling through the pseudo-triplane mechanism with a learnable Z-plane token, and the parent–child hierarchical Gaussian renderer. Optimization is made in terms of prior knowledge integration, geometric perception, and hierarchical depth modeling, respectively.

Specifically, the overall framework of the GC-HG framework is shown in Figure 4. Firstly, the framework takes a single RGB image as inputs and employs the pre-trained VGGT 3D geometric model to extract a high-quality depth map, thereby establishing a robust geometric prior for the scene. This depth map is then back-projecting to generate a dense point cloud, aligning multi-modal information from the image and geometry to provide an initial structure for the subsequent Gaussian splatting. Next, these multi-modal features are sent to the pseudo-triplane geometry perception module. Based on the 2D features extracted by the image encoder, the module introduces a learnable Z-plane token to decouple the geometric information of XZ and YZ planes and constructs a multi-level residual structure to form a pseudo-triplane representation. Through the fusion mechanism, the model can realize the information interaction between 2D features and 3D structure and significantly enhance the ability of geometric perception and structure expression. In the rendering stage, a parent–child hierarchical Gaussian renderer is designed to depict the hierarchical relationship between depth and geometric information. Specifically, the module introduces depth and 3D spatial offsets into the Gaussian point attributes. An MLP learns the mapping between parent and child primitives, creating a hierarchical perceptual Gaussian point cloud representation. This design enhances the model’s ability to represent deep structures and improves the modeling accuracy of complex scene geometry. Finally, the model completes the rendering of novel views based on the learned Gaussian point attribute set and calculates the reconstruction loss with the real image to guide the network for end-to-end back propagation optimization.

Figure 4. The overall framework of the GC-HG single-view 3DGS reconstruction algorithm.

2.3. Depth Prior

In the feed-forward 3DGS framework, the initialization mode of Gaussian points and their distribution density in 3D space directly affect the final 3D reconstruction quality and image rendering fidelity. Therefore, this paper introduces the depth prior as the guide mechanism to improve the initialization accuracy and geometric rationality of the Gaussian point cloud. The process mainly includes two stages: a pre-trained 3D geometric model to predict the depth map, followed by depth map resampling and Gaussian point projection.

To establish a robust depth prior, we employ the pre-trained VGGT model to predict the initial depth map. The architecture is shown in Figure 5. Firstly, the fine-tuning DINO is used as the backbone network to tokenize the image, and then the feature tokens are extracted through the alternate attention mechanism, that is, the alternate application of global attention and frame-wise attention. These extracted tokens are subsequently fed into two heads: a camera head that estimates the intrinsic and extrinsic parameters, and a dense prediction transformer (DPT) head. The DPT module outputs a dense depth map, a dense point map, and dense features for point tracking of each image and provides a confidence score for each output.

Figure 5. Pre-trained VGGT []: Visual Geometry Localization Transformer Architecture.

The depth map resampling and Gaussian point projection strategy are adopted to further improve the quality and distribution density of the initial point cloud. Specifically, through the resampling of confidence, the low-confidence regions are filtered, and the high-confidence regions are resampled based on probability density, which effectively avoids the interference of unreliable depth predictions on Gaussian initialization and ensures that the key geometric regions have sufficient point density support. This strategy provides a stable basis for the accurate allocation of subsequent Gaussian attributes (color, opacity, position, shape, etc.) and improves both the stability of the overall rendering process and the ability to restore details. Finally, the process generates the dual-mode input information of the fusion image and point cloud. The image provides appearance information such as texture and color, while the deep back-projection point cloud provides the geometric structure prior. This fusion of cross-modal information has laid a solid foundation for attribute modeling and rendering optimization of the subsequent 3DGS network. The depth of a prior-driven back-projection process is shown in Figure 6.

Figure 6. Schematic diagram of the initial point cloud for back-projection.

2.4. Pseudo-Triplane Mechanism

The standard feed-forward 3DGS framework lacks explicit geometric modeling, which directly affects the final quality of 3D reconstruction. To address this problem, this paper introduces the pseudo-triplane mechanism as geometric perception, which mainly includes three stages: pixel-level regression, feature decoupling through a learnable Z-plane token, and feature fusion of pseudo-triplane.

In pixel-level regression, firstly, the RGB image and its corresponding depth map are fused in the channel dimension to construct a joint feature representation with appearance and depth information. Then, for each pixel position, the network learns its corresponding complete Gaussian attribute set and assigns the attribute set to the corresponding 3D points in the initial point cloud to achieve high-precision, point-by-point geometric perception. This regression is performed by a network employing ResNet-based encoder–decoder architecture. The encoder first extracts the features of the fused image through a series of ResNet blocks, and then the decoder performs up-sampling to gradually reconstruct the output. The ResNet architecture, with its skip connections design, can effectively alleviate the vanishing gradient problem in deep network training while preserving multi-scale image texture and contextual information. The features extracted by the encoder are represented as depth tokens, which are defined as follows:

v = ϕ_{e n c o d e r} {(I, D)}_{u}

(10)

where

ϕ_{e n c o d e r}

is the encoder of the Resnet block,

I

and

D

are the 2D image and depth image, respectively, which

v

are depth tokens.

The geometric modeling capability of these depth tokens is further enhanced through a feature decoupling process through a learnable Z-plane token. It aims to guide the model to perceive the local structure and geometric information in 2D feature space more effectively. As shown in Figure 7, these enhanced tokens are then used to decode the detailed attributes of each 3D Gaussian, as shown in the following formula:

e v = T r i p l a n e (v)

(11)

where

v

is the depth token, and

e v

is the enhanced depth token.

Figure 7. The self-attention mechanism and cross-attention mechanism used in the pseudo-triplane.

The feature decoupling module is a powerful module that can actively capture local and global feature information. It is also a structured modeling unit that can actively perceive and enhance 3D spatial features. Its core is to understand and enhance features from a real 3D perspective by constructing complex feature residual relationships on decoupled orthogonal planes (XY, XZ, YZ). To efficiently encode these multi-plane features and realize information interaction, self-attention mechanism and cross-attention mechanism are mainly used to model their dynamic interactions. The principles of these attention mechanisms are shown in Figure 7.

To fully capture the feature correlation of the input image in the XY plane (i.e., the image plane), the self-attention mechanism is introduced. This mechanism can effectively extract the internal context of the image, thereby significantly enhancing the internal feature representation

I_{X Y}

of the input image

I

. Its formal expression is:

F_{X Y} = S e l f A t t e n t i o n (Q, K, V) = s o f t \max (\frac{Q K^{T}}{\sqrt{d_{k}}}) V

(12)

The values of query, key, and value

Q, K, V

are tokens

v

.

To model the structural information in the direction of XZ and YZ planes, this paper introduces a learnable Z-plane token as the global context descriptor in the pseudo-triplane mechanism. The Z-plane token is not from the original image input, but as a trainable vector set, which learns the global 3D contextual features in the process of network training. To achieve efficient feature interaction between different planes, cross-attention mechanism is adopted, specifically:

F_{X Z} = C r o s s A t t e n t i o n (Q, K, V) = s o f t \max (\frac{Q K^{T}}{\sqrt{d_{k}}}) V

(13)

F_{Y Z} = C r o s s A t t e n t i o n (Q, K, V) = s o f t \max (\frac{Q K^{T}}{\sqrt{d_{k}}}) V

(14)

The query

Q

is the learned Z-plane token, and the key and value

K, V

are tokens

v

.

Feature fusion of pseudo-triplane: in this paper, three sets of feature relations are jointly modeled: the in-plane context in the XY plane, captured through self-attention; the cross-plane context across the XY and YZ planes, established by the Z-plane token; and a unified representation space integrating these three orthogonal planes. Together, these relations form a multi-scale and multi-dimensional joint feature representation. This pseudo-triplane joint feature not only retains the key local texture information in the image, but also integrates the spatial structure perception ability in the depth direction, which provides a strong geometric perception basis for subsequent Gaussian attribute regression and 3D scene reconstruction. The specific integration is as follows:

e v = F_{X Y} \otimes (F_{X Z} \oplus F_{Y Z})

(15)

The

\oplus

operation is used to fuse the

F_{X Z}

and

F_{Y Z}

features learned from the Z-plane token, which allows the direct superposition of information from different dimensions. While the

\otimes

operation concatenates the image-plane feature

F_{X Y}

with the fused features from the other two planes. This preserves their respective uniqueness and jointly forms a more comprehensive 3D geometric context. Through the designed pseudo-triplane feature decoupling and fusion mechanism, the 2D image features are effectively promoted to the deep feature representation with 3D perception ability.

2.5. Parent–Child Hierarchical Gaussian Renderer

Szymanowicz [] pointed out that in the task of 3D reconstruction and new perspective synthesis for a single object, there are a large number of background pixels that are not directly related to the object surface. These pixels can be “reused” by the model to infer and complete the part of the object that is not observed under the current perspective. However, when the task is extended to scene-level 3D reconstruction, scene reconstruction often includes multiple objects, complex geometric structures and large-scale spatial distribution, which puts forward higher requirements for the information utilization efficiency of each pixel in the image. Therefore, this paper introduces a parent–child hierarchical Gaussian renderer to enhance the modeling ability of the model in occluded areas, field boundaries, and even invisible areas, which mainly includes two parts: depth offset and 3D offset modeling, and MLP parent–child hierarchical relationship construction.

Depth offset and 3D offset are based on linear Gaussian points. Specifically, the model no longer predicts only one Gaussian point at each pixel position but learns a group of multiple Gaussian points in a linear arrangement for modeling the structural details along the depth direction. Each Gaussian point introduces a learnable depth offset based on the depth prior to model the depth structure; at the same time, the 3D offset is introduced to model the geometric structure, which helps to further capture the information of geometric details and non-field of view areas, to improve the modeling ability of occlusion and deep structure in complex scenes. The position of the

i

Gaussian can be expressed as:

P = K^{- 1} {[p, 1]}^{T} (d + \sum_{j}^{i} δ_{j}) + Δ

(16)

where

K

is camera internal parameter,

p

is 2D pixel,

d

is initial depth,

δ

is depth offset,

Δ

is 3D offset, and

P

is 3D Gaussian.

The MLP parent–child hierarchical relationship is used to more efficiently establish the attribute mapping relationship between parent and child Gaussian points and take the attribute set of the parent Gaussian point as the input to generate the corresponding child Gaussian attribute set. Specifically, we utilize the parent Gaussian’s position and scale as input, while the child Gaussian’s position and scale serve as output. In this way, the model can learn a structured linear hierarchical representation while maintaining parameter sharing and computational efficiency. This not only enhances the modeling ability of the model for 3D geometric continuity and local structural details, but also significantly reduces the generation of redundant Gaussian points, thus improving the overall rendering quality. In addition, the hierarchical structure can gradually adjust the connection weight and spatial dependence between parent and child Gaussian through end-to-end optimization in the training process, so as to achieve higher geometric consistency and visual fidelity. As shown in Figure 8.

Figure 8. Depth offset, 3D offset, and MLP schematic diagram.

3. Experimental Results and Analysis

3.1. Dataset and Parameter Settings

This paper conducts a large number of experiments on the public dataset RealEstate10K [] and further verifies the generalization ability of the proposed method on cross-scenario Internet data. RealEstate10K dataset is a large-scale real estate video collection, integrating rich real scene data from YouTube. This paper follows the standard training test partition strategy, using 67,477 scenes as the training set and 7289 scenes as the test set. The scenes in the RealEstate10K dataset typically include both indoor and outdoor environments, featuring complex geometric structures and diverse lighting conditions, which are well-suited for evaluating the model’s performance and generalization ability in real-world, complex scenes. Its diverse perspectives and scene content provide sufficient training and a verification basis for the model.

The GC-HG single-view 3DGS reconstruction model based on depth prior and pseudo-triplane in this paper is composed of a pre-trained VGGT [] 1B model for depth prior, a ResNet50 encoder for image and depth feature coding, multiple decoders and Gaussian attribute decoders for depth offset and 3D offset modeling, and a multi-level residual pseudo-triplane module, that is, the model has a total of 1.5B parameters. The experiment was carried out on Ubuntu 24.04 with a 48-core CPU, Intel (R) Xeon (R) Gold 6248R CPU @ 3.00GHz 2.99 GHz (2 processors), NVIDIA Quadro RTX 8000 GPU and 48GB video memory. In the Python 3.8 environment, based on PyTorch 1.8.0 deep learning framework, and relying on CUDA 11.8 parallel computing ability, the parallel acceleration algorithm for image rendering is implemented. Loss function is the combination of

L_{1}

loss and D-SSIM loss:

L = (1 - λ) L_{1} + λ L_{D - SSIM}

(17)

In all experiments, set the batch size to 4 and the number of iterations to 240,000. Convergence can be achieved in two days on a single RTX8000 GPU. In order to quantify the evaluation results, standard image quality evaluation indicators were used, including PSNR, SSIM, and LPIPS. PSNR aims to evaluate the pixel-level error between the reconstructed image and the original image. The higher the value, the better the performance of the reconstructed image in terms of signal fidelity and noise suppression. SSIM is responsible for measuring the structural similarity between images. A higher value means that the reconstructed image has better retention of the structural layout and texture features of the original image. LPIPS is dedicated to quantifying the perceptual differences between image blocks from the perspective of human eye perception. The smaller the value, the closer the fit between the reconstructed image and human visual habits.

3.2. Ablation Experiment

To further evaluate the role of each module in the overall model performance, we designed ablation experiments across four configurations: a baseline without additional modules, a model using only depth prior and the pseudo-triplane, a model using only the parent–child hierarchical Gaussian renderer, and the full integrated framework. Quantitative comparisons based on PSNR, SSIM, and LPIPS are summarized in Table 1. Results indicate that introducing depth prior and the pseudo-triplane geometric perception mechanism improved all three of the metrics. This suggests that the geometric prior provided by the pre-trained VGGT model compensate for the limited structural initialization in feed-forward 3DGS, thereby enhancing both geometric perception and new view synthesis quality. Conversely, when the parent–child hierarchical Gaussian renderer was introduced separately, the performance of the model is reduced. We attribute this to the increased model complexity inherent in the multi-layer Gaussian design. Without the support of our geometric perception module, this multi-layer representation can cause the chaotic superposition of Gaussian attributes, which weakens the clarity and consistency of the rendered image. However, the complete framework improves the overall performance. This demonstrates that with the support of geometric perception, the multi-layer Gaussian structure can be optimized, enabling representation of complex 3D scenes and better quality of new view reconstruction.

Table 1. The ablation experiment of novel view reconstruction on the RealEstate10K dataset.

3.2.1. Pseudo-Triplane Depth Prior Analysis

Firstly, the advantages of using pseudo-triplane and depth prior in improving the geometric accuracy of the scene are analyzed. Figure 9 shows the qualitative results of ablation experiments in two scenarios. The pseudo-triplane mechanism and depth prior can model the edge geometry of more complex objects. For example, as shown by the contours of the treadmill and chair, the geometric accuracy of the reconstructed objects is significantly improved, demonstrating the model’s enhanced ability of geometric perception. To quantify this improvement, we use edge density, which measures the proportion of edge pixels to reflect the richness of geometric contours, and high-frequency energy, which measures the richness of details and changes in an image. The higher the spectral energy, the more details and the stronger the geometric expression. According to the data in Table 2, when the pseudo-triplane and depth prior are not used, the edge density and high frequency energy are 0.0307 and 95.28, respectively. When the pseudo-triplane depth prior is used, the edge density and high frequency energy are 0.0485 (+57.98%) and 100.96 (+5.9%), respectively. These data indicators show that the pseudo-triplane and depth prior provide substantial geometric perception and enhancement abilities.

Figure 9. Schematic diagram of GC-HG geometric perception. (a) Input Image. (b) w/o pseudo-triplane and depth prior. (c) w/pseudo-triplane and depth prior.

Table 2. In Scene 1 of Figure 9, the geometric perception quantification metric.

To analyze the model’s ability to infer geometric properties from 2D information such as texture and edges, we present a qualitative analysis of Scene1 from Figure 9, with results shown in Figure 10a. Without the depth prior and pseudo-triplane, the baseline model produces sparse edge lines, misses significant contours, and fails to capture a coherent spatial structure, resulting in an incomplete object shape. In contrast, when the two modules are integrated, the geometric reconstruction is more complete and realistic, despite the presence of detail noise. Figure 10b visualizes the spectral energy distribution, obtained through a Fourier Transform (FFT) of the rendered depth map, where vertical and horizontal bright bands correspond to spatial frequencies in their respective directions. Without the depth prior and pseudo-triplane, the spectral energy is concentrated in low-frequency vertical bands, reflecting an overly smooth depth map that lacks texture and rich spatial geometric details. In contrast, when the two modules are integrated, the spectrum becomes more scattered, revealing more directional and high-frequency features, which indicates a better expression of geometric details and sharp edge transitions.

Figure 10. Geometric perception. (a) Edge density analysis; and (b) spectral energy distribution.

At the same time, starting from the basic principle of pixel value perception geometry, this paper explains the ability of geometry perception. The basic principle of pixel value perception geometry is that the brightness, color, texture, edge, and other information of the image reflect the projection expression of 3D geometry in 2D space. By analyzing the statistical laws, change patterns, and spatial distribution of these pixel features, the potential geometric perception ability can be indirectly inferred. To demonstrate this, we present an analysis of the geometric perception ability on Scene1 from Figure 9. First, we analyze the depth histogram shown in Figure 11. When the depth prior and pseudo-triplane are not used, the distribution is relatively concentrated, the depth width range is small, and the geometric depth perception is fuzzy in the range of 40–60 pixels. When using depth prior and pseudo-triplane, there is a clear dominant depth range, and there is a clear scene depth range in general. The perception is clear at 40–60 pixels, but poor at 175–200 pixels. Second, we analyze the depth gradient histogram shown in Figure 12. When the depth prior and pseudo-triplane are not used, the gradient histogram changes violently, the image depth is discontinuous, and the perception is fuzzy. When using depth prior and pseudo-triplane, the gradient distribution is smoother, and the depth geometric edge continuity is better.

Figure 11. Geometric perception—depth histogram.

Figure 12. Geometric perception—depth gradient histogram.

3.2.2. Parent–Child Hierarchical Gaussian Analysis

Secondly, the advantages of using the parent–child hierarchical Gaussian renderer in improving the geometric accuracy of the scene and the mechanism of complete scene reconstruction are analyzed. As shown in Figure 13, the rendered results of Gaussian opacity corresponding to two hierarchical depth layers,

D

and

D + δ

, are visualized to illustrate their distinct roles in scene rendering. The whiter the color in the figure, the higher the Gaussian opacity. The parent Gaussian layer (depth

D

) has the highest opacity at the object boundary (such as escalator and door), which indicates that these regions are the most useful for the depth prediction network. In contrast, for the child Gaussian layer (depth

D + δ

), the model pays more attention to the distant region (the distant doors and windows) and the image edge (the region where the camera angle is not visible).

Figure 13. Geometric enhancement using parent–child hierarchical Gaussian. (a) Input image. (b) Parent Gaussian. (c) Child Gaussian.

3.3. Comparative Experiment

To verify the efficacy of the algorithm, we first conducted an intra-domain evaluation and followed the same protocol established in []. We compared our GC-HG framework with several state-of-the-art (SOTA) single-view 3D reconstruction methods, including SV-MPI [], MINE [], Splatter Image [], and Flash3D [], assessing zero-shot reconstruction performance on large-scale scene data. The evaluation on the RealEstate10K dataset considers varying frame spacings between source and target views, where larger spacing corresponds to increased reconstruction difficulty. Table 3 confirms that our method yields SOTA results for all evaluated source-to-target view distances in this benchmark test. Compared to the feed-forward 3DGS baseline (Splatter Image []), our method improves PSNR, SSIM, and LPIPS by 2.27%, 0.55%, and 11.81%, respectively. Performance on classical scenes is summarized in Table 4. Across scenes such as Estate, Loft, Kitchen, and Balcony, GC-HG demonstrates robust performance at all frame intervals, though results vary by scene complexity, as shown in Table 4. In structurally complex scenes with variable lighting, such as Loft and Kitchen, GC-HG generally outperforms competing methods. However, it encounters performance degradation under large viewpoint extrapolation (U [−30, 30]): PSNR drops (e.g., from 23.79 dB to 20.90 dB in Loft), SSIM drops, and LPIPS rises to 0.173, indicating perceptual distortions. In contrast, in simpler scenes with fewer high-frequency details, such as Estate and Balcony, GC-HG achieves clear advantages in PSNR and SSIM, demonstrating strong geometric reconstruction capability. However, it slightly underperforms MINE on LPIPS, potentially stemming from MINE’s multi-level depth feature modeling, which better preserves fine-grained texture details.

Table 3. Overall comparison of novel view reconstruction on the RealEstate10K dataset.

Table 4. Scene-by-scene comparison of novel view reconstruction on the RealEstate10K dataset.

Table 5 details the convergence time and GPU usage for each method on the RealEstate10K dataset. Notably, Splatter Image, Flash3D, and GC-HG all complete training on a single GPU. However, their efficiency is varied: Splatter Image requires 14 days to converge, while Flash3D, despite employing a similar depth estimation backbone, still needs 4 days. In contrast, GC-HG achieves convergence in just two days, demonstrating the highest efficiency among the single-GPU methods. Comparatively, multi-GPU methods such as SV-MPI and MINE also require approximately two days to converge but demand more resources (16 and 48 GPUs, respectively). This highlights that GC-HG achieves training speeds comparable to multi-GPU frameworks but with lower hardware requirements. In this way, GC-HG offers better computational cost-effectiveness, balancing rapid convergence with minimal hardware requirements.

Table 5. Convergence speed of novel view reconstruction on the RealEstate10K dataset.

Remarkably, despite the integration of complex geometric prior and pseudo-triplane structures, GC-HG achieves faster convergence than existing methods. This efficiency is largely attributed to the optimized parent–child hierarchical rendering strategy. Specifically, the employment of VGGT depth prior ensures a robust geometric signal, enabling the initialization of an accurate point cloud. This reduces the search space required for geometric optimization, thereby accelerating 3DGS convergence and enhancing training stability. The visual display is shown in Figure 14. The first column is the input source image and the second column is the predicted ground truth (GT) image. Differences in the new view reconstruction quality are evident across methods. Compared to Flash3D, our GC-HG framework exhibits fewer aliasing artifacts, particularly in the window region of the first row (highlighted in red). In the second and third rows, our method reconstructs table edges with more consistent geometric appearance and fewer visual artifacts. Furthermore, the fourth row demonstrates that our model can also reasonably reconstruct occluded regions (e.g., the distant grassland), which has robust geometric inference ability.

Figure 14. Comparison experiment rendering effect of novel view reconstruction on the RealEstate10K dataset.

To demonstrate the scene reconstruction capability of our algorithm, Figure 15a–d visualize 3D models generated from both the RealEstate10K dataset and in-the-wild internet data. These visualizations highlight the reconstructed geometric structure of each scene. Compared to Flash3D, GC-HG has better reconstruction of the sofa in scene (a), maintaining higher fidelity in geometric details near the image edges. For the Yingxian Wooden Tower scene (b), our method preserves the consistency and integrity of floor edges, while Flash3D has point cloud fragmentation. The generated 3D models often need to be scaled and displayed according to the needs of different scenes or tasks, such as small-scale local detail analysis (1:1.2 scale in Figure 15a), large-scale urban modeling (1:2 scale in Figure 15b,c), and large military targets (1:3.3 scale in Figure 15d). These changes not only test whether the model can maintain global geometric stability but also test whether it can preserve the consistency of the edge and local details in the process of scaling. Results show that GC-HG preserves both the integrity of the overall scene structure and geometric consistency in fine-grained edge details.

Figure 15. Reconstruction drawing of the 3D model. (a) Indoor scenario. (b) Outdoor scenario. (c) Autonomous driving scenarios. (d) Infrared scenario.

To evaluate the cross-scene generalization of GC-HG in complex real-world scenes, we extended our experiments to indoor, architectural, autonomous driving, and military infrared scenes. As shown in Figure 16a–j, distinct performance characteristics are observed. In indoor scenes a,b, the rendering quality of Flash3D is acceptable, but there are obvious ring artifacts and ripple distortions in flat, low-frequency areas (e.g., near the wall and bedside cabinet). Additionally, the brightness is too high, resulting in overexposure in certain areas. In contrast, while GC-HG retains some textural details, its clarity is reduced, exhibiting slight blurring and dark angle effects. In architectural scenes c–f, Flash3D has geometric deformations on architectural appearances (e.g., domes, wooden towers, and house roofs), with distorted edges. While GC-HG improves geometric preservation, it encounters color smearing and purple color shifts. In autonomous driving scenes g–j, neither performance is ideal. Flash3D partially preserves the shape of vehicle body, while GC-HG has color shifts. Additionally, as the model has not been trained on text within this specific scenario, the reconstructed Chinese characters may appear blurred. In infrared military target scenes k–n(we employed iron-red color mapping to convert the single-channel infrared greyscale image into a color image), Flash3D maintains the target shape, but structural disorder and fuzzy diffusion occur in local areas (e.g., the tank barrel). GC-HG is more compact with fewer artifacts yet presents a certain “atomization” phenomenon. To sum up, Flash3D is prone to artifacts in the sharpening process, while GC-HG can maintain the geometric structure, but there are color shifts in outdoor architectural scenes.

Figure 16. Rendering of comparative experiment in cross-scene data reconstruction. (a,b) Indoor scenario. (c–f) Outdoor scenario. (g–j) Autonomous driving scenarios. (k–n) Infrared scenario.

Based on the comparative, ablation, and cross-scenario generalization experiments, two primary limitations of GC-HG are revealed. First, in scenes with complex geometry or drastic lighting variations, GC-HG struggles to maintain high-fidelity geometric and textural consistency under large-view extrapolation. Second, its bias toward modeling global geometric structure may compromise the recovery of local high-frequency details. This leads to artifacts such as color shifts and anomalous textures. Consequently, the resulting perceptual quality may fall short of methods that are explicitly optimized for perceptual metrics.

4. Conclusions

This paper presents GC-HG, a novel reconstruction framework designed to address the limited generalization and geometric perception in single-view 3DGS. By integrating depth prior with a pseudo-triplane representation, our method enhances the model architecture. Specifically, we leverage the pre-trained VGGT model for depth prior estimation, incorporate the pseudo-triplane module for geometric perception modeling, and utilize the parent–child hierarchical Gaussian renderer to model the hierarchical depth structures. These components improve the model’s geometric consistency and structural integrity. At the same time, the introduction of depth prior also improved the model’s cross-scenario generalization. Ablation studies validate the effectiveness of each module, improving 2.86% in PSNR, 0.99% in SSIM, and 3.06% in LPIPS over the baseline. Comparative experiment on the RealEstate10K public dataset further shows that our method outperforms existing approaches in new view reconstruction within the [−30°, +30°] range. Compared to the feed-forward 3DGS Splatter Image model, GC-HG achieves gains of 2.27% (PSNR), 0.55% (SSIM), and 11.81% (LPIPS), producing qualitative results with better geometric structure and detail preservation, but still exhibits limitations in color shift within autonomous driving scenarios and atomization in infrared scenarios.

Beyond theoretical contributions, GC-HG demonstrates practical potential. In digital twins, it reduces the cost and time required for 3D data acquisition by converting a single image into a high-fidelity 3D model. In autonomous driving simulation, it offers a low-cost and scalable solution for reconstructing geometrically accurate static scenes in virtual testing environments from a single-view image. In military simulation, integrated with physics-based behavior modeling, it supports the creation of realistic training scenarios, enhancing both training effectiveness and the reliability of tactical decision-making. Future work will explore integrating probabilistic generative models to augment the sparse input data. By leveraging prior of diffusion models to generate multi-view images, followed by feed-forward 3DGS reconstruction, we aim to further enhance reconstruction quality and structural integrity of Single-Image-to-3D tasks.

Author Contributions

Conceptualization, P.W. and H.G.; methodology, P.W. and H.G.; software, Y.M.; validation, Y.M.; formal analysis, Y.Z. and Y.M.; investigation, Y.M. and Y.Z.; resources, Y.M. and Y.Z.; data curation, P.W.; writing— original draft preparation, P.W.; writing—review and editing, P.W., H.G. and Y.M.; visualization, P.W. and H.G.; supervision, H.G. and Y.M.; project administration, H.G. and Y.M.; funding acquisition, H.G. and Y.M. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Liaoning Province Science and Technology Joint Plan (Key Research and Development Program Project)(2025JH2/101800394); the Opening Funding of National Key Laboratory of Electromagnetic Space Security (JCKY2024240C008); the scientific research support plan for introducing high-level talents in Shenyang Ligong University (1010147001133); and the Shenyang Xing-Shen Talents Plan Project for Master Teachers (Grant No. XSMS2206003).

Data Availability Statement

The data are available upon request from the corresponding authors.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Park, J.J.; Florence, P.; Straub, J.; Newcombe, R.; Lovegrove, S. DeepSDF: Learning Continuous Signed Distance Functions for Shape Representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 165–174. [Google Scholar]
Mildenhall, B.; Srinivasan, P.P.; Tancik, M.; Barron, J.T.; Ramamoorthi, R.; Ng, R. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. Commun. ACM 2021, 65, 99–106. [Google Scholar] [CrossRef]
Kerbl, B.; Kopanas, G.; Leimkühler, T.; Drettakis, G. 3D Gaussian Splatting for Real-Time Radiance Field Rendering. ACM Trans. Graph. 2023, 42, 1–14. [Google Scholar] [CrossRef]
Li, J.; Feng, Z.; She, Q.; Ding, H.; Wang, C.; Lee, G.H. MINE: Towards Continuous Depth MPI with NeRF for Novel View Synthesis. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 12578–12588. [Google Scholar]
Szymanowicz, S.; Rupprecht, C.; Vedaldi, A. Splatter Image: Ultra-Fast Single-View 3D Reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 10208–10217. [Google Scholar]
Xu, H.; Peng, S.; Wang, F.; Blum, H.; Barath, D.; Geiger, A.; Pollefeys, M. DepthSplat: Connecting Gaussian Splatting and Depth. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025; pp. 16453–16463. [Google Scholar]
Chen, Y.; Xu, H.; Zheng, C.; Zhuang, B.; Pollefeys, M.; Geiger, A.; Cham, T.-J.; Cai, J. MVSplat: Efficient 3D Gaussian Splatting from Sparse Multi-View Images. In Proceedings of the European Conference on Computer Vision (ECCV), Milan, Italy, 29 September–4 October 2024; Springer Nature: Cham, Switzerland, 2024; pp. 370–386. [Google Scholar]
Charatan, D.; Li, S.L.; Tagliasacchi, A.; Sitzmann, V. pixelSplat: 3D Gaussian Splats from Image Pairs for Scalable Generalizable 3D Reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 19457–19467. [Google Scholar]
Chan, E.R.; Lin, C.Z.; Chan, M.A.; Nagano, K.; Pan, B.; De Mello, S.; Gallo, O.; Guibas, L.; Tremblay, J.; Khamis, S.; et al. Efficient Geometry-aware 3D Generative Adversarial Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 16123–16133. [Google Scholar]
Shue, J.R.; Chan, E.R.; Po, R.; Ankner, Z.; Wu, J.; Wetzstein, G. 3D Neural Field Generation using Triplane Diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 20875–20886. [Google Scholar]
Zou, Z.X.; Yu, Z.; Guo, Y.C.; Li, Y.; Liang, D.; Cao, Y.P.; Zhang, S.H. Triplane Meets Gaussian Splatting: Fast and Generalizable Single-View 3D Reconstruction with Transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 10324–10335. [Google Scholar]
Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-Resolution Image Synthesis with Latent Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 10684–10695. [Google Scholar]
Yi, T.; Fang, J.; Wang, J.; Wu, G.; Xie, L.; Zhang, X.; Liu, W.; Tian, Q.; Wang, X. GaussianDreamer: Fast Generation from Text to 3D Gaussians by Bridging 2D and 3D Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 6796–6807. [Google Scholar]
Liu, R.; Wu, R.; Van Hoorick, B.; Tokmakov, P.; Zakharov, S.; Vondrick, C. Zero-1-to-3: Zero-shot One Image to 3D Object. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 9298–9309. [Google Scholar]
Chan, E.R.; Nagano, K.; Chan, M.A.; Bergman, A.W.; Park, J.J.; Levy, A.; Aittala, M.; De Mello, S.; Karras, T.; Wetzstein, G. Generative Novel View Synthesis with 3D-Aware Diffusion Models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 4217–4229. [Google Scholar]
Zhou, Z.; Tulsiani, S. SparseFusion: Distilling View-conditioned Diffusion for 3D Reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 12588–12597. [Google Scholar]
Ma, B.; Gao, H.; Deng, H.; Luo, Z.; Huang, T.; Tang, L.; Wang, X. You See it, You Got it: Learning 3D Creation on Pose-Free Videos at Scale. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025; pp. 2016–2029. [Google Scholar]
Gong, T.; Li, B.; Zhong, Y.; Wang, F. ExScene: Free-View 3D Scene Reconstruction with Gaussian Splatting from a Single Image. arXiv 2025, arXiv:2503.23881. [Google Scholar]
Li, J.; Tan, H.; Zhang, K.; Xu, Z.; Luan, F.; Xu, Y.; Hong, Y.; Sunkavalli, K.; Shakhnarovich, G.; Bi, S. Instant3D: Fast Text-to-3D with Sparse-View Generation and Large Reconstruction Model. arXiv 2023, arXiv:2311.06214. [Google Scholar]
Melas-Kyriazi, L.; Rupprecht, C.; Laina, I.; Vedaldi, A. RealFusion: 360° Reconstruction of Any Object from a Single Image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 8446–8455. [Google Scholar]
Melas-Kyriazi, L.; Laina, I.; Rupprecht, C.; Neverova, N.; Vedaldi, A.; Gafni, O.; Kokkinos, F. IM-3D: Iterative Multiview Diffusion and Reconstruction for High-Quality 3D Generation. arXiv 2024, arXiv:2402.08682. [Google Scholar]
Shi, Y.; Wang, P.; Ye, J.; Long, M.; Li, K.; Yang, X. MVDream: Multi-view Diffusion for 3D Generation. arXiv 2023, arXiv:2308.16512. [Google Scholar]
Zheng, C.; Vedaldi, A. Free3D: Consistent Novel View Synthesis without 3D Representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 9720–9731. [Google Scholar]
Blattmann, A.; Dockhorn, T.; Kulal, S.; Mendelevitch, D.; Kilian, M.; Lorenz, D.; Levi, Y.; English, Z.; Voleti, V.; Letts, A.; et al. Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets. arXiv 2023, arXiv:2311.15127. [Google Scholar] [CrossRef]
Dai, X.; Hou, J.; Ma, C.Y.; Tsai, S.; Wang, J.; Wang, R.; Zhang, P.; Vandenhende, S.; Wang, X.; Dubey, A.; et al. Emu: Enhancing Image Generation Models Using Photogenic Needles in a Haystack. arXiv 2023, arXiv:2309.15807. [Google Scholar] [CrossRef]
Girdhar, R.; Singh, M.; Brown, A.; Duval, Q.; Azadi, S.; Rambhatla, S.S.; Shah, A.; Yin, X.; Parikh, D.; Misra, I. Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning. In Proceedings of the European Conference on Computer Vision (ECCV), Milan, Italy, 29 September–4 October 2024; Springer Nature: Cham, Switzerland, 2024; pp. 205–224. [Google Scholar]
Podell, D.; English, Z.; Lacey, K.; Blattmann, A.; Dockhorn, T.; Müller, J.; Penna, J.; Rombach, R. SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis. arXiv 2023, arXiv:2307.01952. [Google Scholar] [CrossRef]
Rombach, R.; Esser, P.; Ommer, B. Geometry-Free View Synthesis: Transformers and no 3D Prior. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 14356–14366. [Google Scholar]
Zwicker, M.; Pfister, H.; Van Baar, J.; Gross, M. EWA Splatting. IEEE Trans. Vis. Comput. Graph. 2002, 8, 223–238. [Google Scholar] [CrossRef]
Wang, J.; Chen, M.; Karaev, N.; Vedaldi, A.; Rupprecht, C.; Novotny, D. VGGT: Visual Geometry Grounded Transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025; pp. 5294–5306. [Google Scholar]
Zhou, T.; Tucker, R.; Flynn, J.; Fyffe, G.; Snavely, N. Stereo Magnification: Learning View Synthesis using Multiplane Images. arXiv 2018, arXiv:1805.09817. [Google Scholar] [CrossRef]
Belghazi, M.I.; Baratin, A.; Rajeswar, S.; Ozair, S.; Bengio, Y.; Courville, A.; Hjelm, R.D. MINE: Mutual Information Neural Estimation. arXiv 2018, arXiv:1801.04062. [Google Scholar]
Tucker, R.; Snavely, N. Single-View View Synthesis with Multiplane Images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–20 June 2020; pp. 551–560. [Google Scholar]
Szymanowicz, S.; Insafutdinov, E.; Zheng, C.; Campbell, D.; Henriques, J.F.; Rupprecht, C.; Vedaldi, A. Flash3D: Feed-Forward Generalisable 3D Scene Reconstruction from a Single Image. arXiv 2024, arXiv:2406.04343. [Google Scholar]

Figure 1. 3DGS algorithm architecture.

Figure 2. Feed-forward 3DGS algorithm architecture.

Figure 3. GC-HG single-view 3DGS reconstruction synthesis. (a) Input image. (b) Gaussian Point. (c) Novel-view.

Figure 4. The overall framework of the GC-HG single-view 3DGS reconstruction algorithm.

Figure 5. Pre-trained VGGT []: Visual Geometry Localization Transformer Architecture.

Figure 6. Schematic diagram of the initial point cloud for back-projection.

Figure 7. The self-attention mechanism and cross-attention mechanism used in the pseudo-triplane.

Figure 8. Depth offset, 3D offset, and MLP schematic diagram.

Figure 9. Schematic diagram of GC-HG geometric perception. (a) Input Image. (b) w/o pseudo-triplane and depth prior. (c) w/pseudo-triplane and depth prior.

Figure 10. Geometric perception. (a) Edge density analysis; and (b) spectral energy distribution.

Figure 11. Geometric perception—depth histogram.

Figure 12. Geometric perception—depth gradient histogram.

Figure 13. Geometric enhancement using parent–child hierarchical Gaussian. (a) Input image. (b) Parent Gaussian. (c) Child Gaussian.

Figure 14. Comparison experiment rendering effect of novel view reconstruction on the RealEstate10K dataset.

Figure 15. Reconstruction drawing of the 3D model. (a) Indoor scenario. (b) Outdoor scenario. (c) Autonomous driving scenarios. (d) Infrared scenario.

Figure 16. Rendering of comparative experiment in cross-scene data reconstruction. (a,b) Indoor scenario. (c–f) Outdoor scenario. (g–j) Autonomous driving scenarios. (k–n) Infrared scenario.

Table 1. The ablation experiment of novel view reconstruction on the RealEstate10K dataset.

Designs	PSNR/dB↑	SSIM↑	LPIPS↓
Baseline	24.45	0.825	0.163
w/Pseudo-triplane Depth Prior	24.86 (+1.67%)	0.828 (+0.37%)	0.160 (+1.84%)
w/Parent–child hierarchical Gaussian	24.01 (−1.79%)	0.806 (−2.35%)	0.176 (−7.97%)
GC-HG	25.15 (+2.86%)	0.833 (+0.99%)	0.158 (+3.06%)

Table 2. In Scene 1 of Figure 9, the geometric perception quantification metric.

Metric	Edge Density↑	High-Frequency Energy↑
w/o Pseudo-triplane Depth Prior	0.0307	95.28
w/Pseudo-triplane Depth Prior	0.0485 (+57.98%)	100.96 (+5.9%)

Table 3. Overall comparison of novel view reconstruction on the RealEstate10K dataset.

	5 Frames			10 Frames			U [−30, 30] Frames
Model	PSNR/dB	SSIM	LPIPS	PSNR/dB	SSIM	LPIPS	PSNR/dB	SSIM	LPIPS
SV-MPI []	27.10	0.870	0.129	24.40	0.812	0.170	23.52	0.785	0.830
MINE []	28.45	0.897	0.111	25.89	0.850	0.150	24.75	0.820	0.179
Splatter Image []	28.15	0.894	0.110	25.34	0.842	0.144	24.15	0.810	0.177
Flash3D []	28.46	0.899	0.100	25.94	0.857	0.133	24.93	0.833	0.160
GC-HG	28.79	0.899	0.097	26.07	0.857	0.132	25.15	0.833	0.158

Table 4. Scene-by-scene comparison of novel view reconstruction on the RealEstate10K dataset.

		5 Frames			10 Frames			U [−30, 30] Frames
Datatype	Method	PSNR/dB	SSIM	LPIPS	PSNR/dB	SSIM	LPIPS	PSNR/dB	SSIM	LPIPS
Estate	MINE []	36.12	0.958	0.028	35.24	0.946	0.043	31.62	0.933	0.051
	Splatter Image []	35.47	0.955	0.035	34.00	0.939	0.051	31.21	0.930	0.062
	Flash3D []	36.25	0.957	0.033	34.89	0.942	0.049	31.25	0.929	0.066
	GC-HG	37.53	0.959	0.031	35.91	0.946	0.046	32.09	0.940	0.052
Loft	MINE []	23.43	0.911	0.080	22.50	0.863	0.150	20.34	0.808	0.177
	Splatter Image []	23.26	0.910	0.089	22.52	0.864	0.149	20.70	0.810	0.185
	Flash3D []	23.33	0.908	0.092	22.83	0.861	0.144	20.09	0.798	0.187
	GC-HG	23.79	0.916	0.074	22.96	0.870	0.147	20.90	0.821	0.173
Kitchen	MINE []	25.85	0.906	0.054	21.99	0.840	0.084	18.94	0.746	0.156
	Splatter Image []	25.76	0.906	0.053	21.79	0.837	0.085	18.68	0.738	0.152
	Flash3D []	25.61	0.901	0.057	21.33	0.820	0.089	18.09	0.699	0.159
	GC-HG	26.36	0.913	0.046	22.26	0.851	0.074	19.23	0.767	0.141
Balcony	MINE []	31.67	0.944	0.027	28.88	0.922	0.044	19.86	0.800	0.203
	Splatter Image []	31.32	0.940	0.029	28.61	0.916	0.046	19.90	0.799	0.206
	Flash3D []	31.35	0.939	0.032	28.52	0.915	0.048	19.32	0.790	0.210
	GC-HG	32.07	0.946	0.028	29.14	0.926	0.044	19.98	0.799	0.201

Table 5. Convergence speed of novel view reconstruction on the RealEstate10K dataset.

Model	GPU/Count	Convergence Rate/Day
SV-MPI []	16	2.0
MINE []	48	2.0
Splatter Image []	1	14.0
Flash3D []	1	4.0
GC-HG	1	2.0

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

GC-HG Gaussian Splatting Single-View 3D Reconstruction Method Based on Depth Prior and Pseudo-Triplane

Abstract

1. Introduction

2. Basic Principles

2.1. Feed-Forward 3DGS

2.2. Algorithm in This Paper

2.3. Depth Prior

2.4. Pseudo-Triplane Mechanism

2.5. Parent–Child Hierarchical Gaussian Renderer

3. Experimental Results and Analysis

3.1. Dataset and Parameter Settings

3.2. Ablation Experiment

3.2.1. Pseudo-Triplane Depth Prior Analysis

3.2.2. Parent–Child Hierarchical Gaussian Analysis

3.3. Comparative Experiment

4. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Article Metrics

Citations

Article Access Statistics