Article

Gaussian Splatting-Based Color and Shape Deformation Fields for Dynamic Scene Reconstruction

1 CAD Research Center, Tongji University, Shanghai 200092, China
2 Department of Geotechnical Engineering, Tongji University, Shanghai 200092, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(12), 2347; https://doi.org/10.3390/electronics14122347
Submission received: 28 April 2025 / Revised: 3 June 2025 / Accepted: 3 June 2025 / Published: 8 June 2025
(This article belongs to the Special Issue 3D Computer Vision and 3D Reconstruction)

Abstract

The 3DGS (3D Gaussian Splatting) family of methods has achieved significant success in novel view synthesis, but dynamic scene reconstruction still requires further research. In this paper, we propose a new 3DGS-based framework for dynamic scene reconstruction problems that involve color changes. Our approach employs a multi-stage training strategy combining motion and color deformation fields to accurately model dynamic geometry and appearance changes. Additionally, we design two modular components: a Dynamic Component for capturing motion variations and a Color Component for managing material and color changes. These components flexibly adapt to different scenes, enhancing the versatility of our method. Experimental results demonstrate that our method renders in real time at 80 FPS on an RTX 4090, attains higher reconstruction accuracy than baseline methods such as HexPlane and Deformable3DGS, and reduces training time by approximately 10%, indicating improved training efficiency. These quantitative results confirm the effectiveness of our approach in delivering high-fidelity 4D reconstruction of complex dynamic environments.

1. Introduction

Research on novel view synthesis and scene reconstruction has broad real-world applications, such as virtual reality (VR), augmented reality (AR), movie special effects, digital twins, and game development. In recent years, the NeRF (Neural Radiance Field) family of methods has achieved excellent quality in rendering real scenes [1,2,3,4,5], but the complex training of neural networks and the high computational cost of volume rendering still incur non-negligible time costs. This issue becomes even more pronounced in dynamic scene reconstruction.
Recently, 3D Gaussian Splatting (3DGS) has emerged as a significant technology in novel view synthesis. By representing scenes as sets of 3D Gaussians, it replaces the complex and cumbersome NeRF pipeline and raises rendering speed to real-time levels. Several studies have since applied it to dynamic scene reconstruction [6,7].
However, current methods primarily model Gaussian rigid-body motion. All dynamic changes in the scene are attributed to Gaussian motion. This approach overlooks other complex changes that may exist in the scene, such as color, lighting, or non-rigid deformations, thereby limiting the expressive capabilities of dynamic scene reconstruction.
We have analyzed the causes of color changes in real-world scenarios. Changes in the color of an object as perceived by the human eye can be due to external conditions, such as variations in the wavelength or intensity of light, or to alterations in the geometry of the object (such as ripples on the surface of water), which modify the interaction between the object and light. Additionally, changes in the material properties of the object itself can also lead to color variations, although this is less common. For instance, in dynamic scenes, a PC screen or projection screen displaying video may exhibit color changes across its surface (assuming the screen is treated as a non-emissive surface).
These color changes on a planar surface are often subtle and highly dependent on view direction, lighting, and material properties, making it necessary to represent the scene geometry with high precision to ensure accurate rendering [8,9]. Although a Gaussian whose color changes could, in principle, simply be replaced by another Gaussian, doing so would significantly increase the number of points, the computational cost, and the training time.
Our objective is to utilize the Gaussian scene representation for dynamic scene reconstruction while handling the color changes of objects within the dynamic scene more carefully. Here, a deformation network is employed to represent the color changes of the Gaussians. The Gaussians that make up an object exhibit correlated motion; however, the content shown in two frames of a played-back video may be entirely uncorrelated.
Figure 1 illustrates the overall architecture of our proposed GBC framework that leverages Gaussian Splatting and deformation fields to reconstruct scenes from a set of input images of dynamic scenes. The proposed framework has been evaluated on both synthetic datasets and real-world video playback scenarios, demonstrating real-time rendering capabilities and improved reconstruction accuracy compared to baseline methods.
Although generating a new Gaussian can handle such cases, we leverage the color deformation field to transform the Gaussian’s color, thereby reducing the need for dynamically generating new Gaussians.
The key novel contributions of this paper are as follows:
  • We propose a framework called GBC (Gaussian Splatting-Based Color and Shape Deformation Fields), which integrates motion deformation fields, color deformation fields, and 3DGS to handle dynamic scenes with material color variations.
  • We construct a multi-stage training model and design dynamic and color components, enabling the pipeline to be flexibly decomposed and specialized for different environments.
  • Our approach achieves real-time rendering while maintaining high-fidelity 4D scene reconstruction.

2. Related Works

In this section, we review relevant literature that has laid the foundation for dynamic scene reconstruction and novel view synthesis. We first discuss classical and recent advances in novel view synthesis, which is closely related to our work. Subsequently, we examine methods focusing on dynamic scene modeling, deformation techniques, and Gaussian-based representations.

2.1. Novel View Synthesis

Novel view synthesis aims to model a scene from a limited number of input images and render new views from unseen viewpoints, which has been a long-standing research problem in the fields of computer vision and graphics. Early methods utilized triangular meshes as representations of the scene, combined with knowledge such as light fields to render new views [10]. Further advancements involved the use of discrete voxel grids [11] and multi-plane representations [12] for gradient-based optimization, effectively enabling view synthesis.
Recently, the emergence of Neural Radiance Fields (NeRF) [13] has brought revolutionary progress to scene representation and novel view synthesis. Subsequent works have built upon this foundation, attempting to improve the quality of synthesized views [1,2], enhance reconstruction capabilities from sparse views [14,15], achieve more accurate geometric reconstruction [16,17], and improve rendering capabilities by combining voxel grids [18] or hash tables [19]. Additionally, there have been efforts to integrate with specific application domains, such as combining with generative models to generate 3D content [20,21,22,23,24].
More recently, 3D Gaussian Splatting (3DGS) [25] has emerged, starting from explicit point clouds to construct 3D Gaussians. It utilizes efficient differentiable splatting [26] to project 3D Gaussians onto a 2D image plane, rasterizing them to obtain rendered images, thereby achieving real-time rendering.

2.2. Dynamic Scene Representation

Unlike static novel view synthesis, dynamic novel view synthesis deals with scenes that change over time, including moving objects, lighting variations, and other complex factors, as demonstrated in prior works such as HyperNeRF [27] and D-NeRF [28]. Therefore, dynamic novel view synthesis not only needs to address the issue of spatial viewpoint changes but also has to handle temporal variations to ensure the consistency and smoothness of the generated views both spatially and temporally.
Some works model dynamic scenes based on the implicit representation of NeRF. For instance, D-NeRF [28] extends NeRF to a 6D coordinate system (spatial position, viewpoint direction, and time t). It consists of two network modules: a deformation MLP network that calculates the displacement of each point relative to the standard space at a specific time, and another MLP that inputs the position and viewpoint to output color and density. Other dynamic NeRF [27,29] algorithms similarly use deformation MLPs to process inputs at a specific time and then query the canonical NeRF. After applying NeRF to dynamic scene novel view synthesis tasks, due to the long training time and slow convergence, some methods have been proposed to accelerate the learning of dynamic radiance fields [30,31,32]. For example, NeRFPlayer [33] models dynamic scenes by dividing the 4D space (3D spatial coordinates + time) into a sequence of 3D spatial grids indexed by timestamps. Each grid corresponds to a specific time slice, enabling efficient encoding of temporal changes. However, this approach requires storing a large number of high-resolution 3D grids corresponding to different time steps in memory. As the temporal resolution increases, the number of 3D volumes grows rapidly, leading to significant memory overhead and limiting scalability.
In recent years, various other scene representations have emerged for dynamic scene modeling, such as K-Planes [34] and HexPlane [35]. Following the success of more efficient explicit 3DGS methods, there has been increased research on their application to dynamic scene novel view tasks. For instance, 4DRotorGS [36] introduces N-dimensional rigid body motion [37] to model Gaussian representations based on rotors in 4D space, and slices along the time dimension to generate dynamic 3D Gaussians at each timestamp.
Our work leverages the efficient representation of 3D Gaussian Splatting (3DGS) and extends it to dynamic scenes by introducing color and shape deformation fields. In addition, we adopt a multi-stage, multi-component optimization strategy to better adapt to dynamic scene modeling. This enables not only faster training and rendering, but also more accurate modeling of time-varying appearance and geometry, which has not been sufficiently addressed in prior 3DGS-based methods. The improved efficiency stems from our design of deformable color and shape fields that are lightweight and decoupled from high-dimensional voxel representations. Combined with the compact structure of 3D Gaussians, this reduces the computational and memory burden during both training and rendering. We continue to use MLPs to model decoupled deformation fields while accounting for changes in the object's surface during deformation. Additionally, we modularize the network structure into independent components to enhance the flexibility and adaptability of our approach.

3. Method

The overview of the entire pipeline is illustrated in Figure 2. Upon inputting image frames and camera poses from a dynamic scene, the multi-stage network structure described in Section 3.2 is utilized to train scene Gaussians, which are further optimized geometrically through normals. Finally, the loss used during training is introduced in Section 3.3.

3.1. Preliminary

This section reviews the foundations used in our pipeline: the original 3DGS formulation and the geometric normal optimization applied during training.

3.1.1. Original 3DGS

3D Gaussian Splatting is a recent scene representation technique that generates a sparse point cloud with structure-from-motion (SfM) and subsequently constructs a set of Gaussians from it [25]. Each 3D Gaussian is defined by its center position $x$ and a covariance matrix. The covariance matrix can be decomposed into a scaling matrix $S$ and a rotation matrix $R$, expressed as $\Sigma = R S S^T R^T$. This decomposition is further parameterized by scaling factors $s \in \mathbb{R}^3$ and quaternions $r \in \mathbb{R}^4$.
3DGS improves the geometry of the 3D Gaussians by optimizing the scaling factors $s$, quaternions $r$, and positions $x$, together with techniques such as pruning and densification. During the rendering of new views, the view transformation matrix $W$ and the Jacobian matrix $J$ of the affine approximation of the projective transformation are used to obtain the covariance matrix in camera space, given by $\Sigma' = J W \Sigma W^T J^T$. The color of each pixel is ultimately calculated using the following formula (1):
$$C = \sum_{i \in N} c_i \alpha_i T_i$$
Here, $T_i = \prod_{j=1}^{i-1} (1 - \alpha_j)$ is the transmittance, $\alpha_i = \sigma_i e^{-\frac{1}{2} x^T \Sigma'^{-1} x}$, $\sigma_i$ is the opacity of the Gaussian, and $c_i$ is the color of the Gaussian.
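As a concrete reference for Equation (1), the following minimal PyTorch sketch performs front-to-back alpha compositing for a single pixel; the function and variable names, and the tensor layout, are illustrative assumptions, since the official 3DGS rasterizer implements this step in CUDA.

```python
import torch

def composite_pixel(colors: torch.Tensor, alphas: torch.Tensor) -> torch.Tensor:
    """Front-to-back alpha compositing of N depth-sorted Gaussians (Eq. 1).

    colors: (N, 3) per-Gaussian colors c_i, sorted near-to-far.
    alphas: (N,)  per-Gaussian opacities alpha_i after the 2D Gaussian falloff.
    """
    # T_i = prod_{j<i} (1 - alpha_j): transmittance accumulated before Gaussian i.
    transmittance = torch.cumprod(
        torch.cat([alphas.new_ones(1), 1.0 - alphas[:-1]]), dim=0
    )
    weights = alphas * transmittance                      # contribution of each Gaussian
    return (weights.unsqueeze(-1) * colors).sum(dim=0)    # C = sum_i c_i * alpha_i * T_i
```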

3.1.2. Geometric Normal Optimization

When simulating the surface of objects, the geometric shape of the Gaussian sphere is crucial, particularly for planar objects in the scene. During the training process, the normals of the Gaussians that make up the plane should maintain consistent directions to ensure the smoothness of the plane. We introduce the concept of Gaussian normals to describe and constrain the directional properties of Gaussians. As mentioned in GaussianShader [9], during training, Gaussians gradually flatten, and the shortest axis of the Gaussian is used as the initial normal, which is closest to the surface normal of the scene. Due to the dynamic nature of the scene, the shape of the Gaussian changes over time, and the shortest axis of the deformed Gaussian is constantly changing. Storing the normal as an attribute of the Gaussian would significantly increase storage costs and make it difficult to adapt to the changing needs of Gaussian normals over time in dynamic scenes. We calculate the normals approximated by the shortest axis during training using the following formula (2):
$$\mathrm{Normal} = F_{normalize}\big(F_{align}(F_{sort}(s_t), r_t, v_t)\big)$$
The $F_{sort}$ function sorts the scaling factors and selects the shortest axis of the Gaussian, whose direction is then adjusted using the quaternion $r_t$. The function $F_{align}$ aligns the normal to the outer surface, where $v_t$ denotes the viewing direction at time $t$. Finally, $F_{normalize}$ normalizes the resulting vector.
For the optimization of normals, the normal data is first passed into the rasterizer to obtain the rendered normal map $n_1$. A reference normal map $n_2$ is then derived from the rendered depth map using its gradient information. The difference between $n_1$ and $n_2$ is minimized using formula (3):
$$L_{normal} = \lVert n_1 - n_2 \rVert_2$$
This method exhibits good adaptability to dynamic changes in the scene because the calculation of normals is performed in real-time.
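To make Equations (2) and (3) concrete, the sketch below shows one way to derive per-Gaussian normals from the shortest axis and to penalize their deviation from depth-derived normals; the helper names, the sign-based alignment rule, and the tensor shapes are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn.functional as F

def gaussian_normals(scales, rotations, view_dirs):
    """Approximate per-Gaussian normals from the shortest axis (Eq. 2).

    scales:     (N, 3) per-Gaussian scaling factors s_t.
    rotations:  (N, 3, 3) rotation matrices built from the quaternions r_t.
    view_dirs:  (N, 3) unit vectors from the Gaussian centers toward the camera (v_t).
    """
    # F_sort: pick the axis with the smallest scale as the candidate normal.
    shortest = torch.argmin(scales, dim=-1)                       # (N,)
    axes = rotations[torch.arange(scales.shape[0]), :, shortest]  # (N, 3) world-space axis
    # F_align: flip the axis so it points toward the camera (outer surface).
    sign = torch.sign((axes * view_dirs).sum(-1, keepdim=True))
    sign = torch.where(sign == 0, torch.ones_like(sign), sign)
    # F_normalize: return unit-length normals.
    return F.normalize(axes * sign, dim=-1)

def normal_loss(rendered_normals, depth_normals):
    """L_normal = ||n1 - n2||_2 between rendered and depth-derived normal maps (Eq. 3)."""
    return (rendered_normals - depth_normals).norm(dim=-1).mean()
```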

3.2. Multi-Stage Network Architecture

Current methods for reconstructing dynamic scenes typically involve predicting Gaussian rigid body motion, including parameters such as position, scaling, and rotation. Additionally, dynamic scenes not only undergo changes in geometry but also involve complex factors such as color and lighting. Relying solely on predicting motion to comprehensively model Gaussian deformations not only incurs significant computational overhead but also places a high reliance on data and models.
Here, a three-stage training strategy is introduced, starting from constructing scene geometry, to Gaussian motion deformation, and finally to joint adjustment of Gaussian motion and color. To enhance the flexibility of the method, the network structure of the above stages is divided into several independent components, allowing for the selection of specific geometric modeling components or dynamic adjustment modules based on scene requirements. Each stage is detailed below.

3.2.1. First Stage with 3DGS

In the first stage, we utilize the original 3D Gaussian Splatting (3DGS) to perform an initial scene reconstruction from the input images and camera poses. The objective is to quickly construct a rough geometric representation and basic attributes of the scene. Given the potential significant variability in the input image sequence, particularly in dynamic scenes where object shapes and attributes may change, the reconstruction results from the original 3DGS method may be suboptimal. Therefore, high-precision results are not pursued in this stage.
During this stage, we do not optimize normal information or fit detailed dynamics in the scene. This is because the shape changes and occlusions in dynamic scenes are complex, and optimizing normals at this stage would yield limited benefits and potentially introduce additional overhead. Instead, this stage employs photometric loss between rendered images and ground truth to guide the optimization of Gaussian positions, scales, rotations, and spherical harmonic function coefficients. The photometric loss is defined in Equation (4).
$$L_{color} = \lVert I_{render} - I_{gt} \rVert_1$$

3.2.2. Dynamic Stage with Dynamic Component

After completing the rough Gaussian geometric reconstruction in the first stage, in the second stage we adopt a modified HexPlane representation to continue modeling the dynamic scene.
In dynamic scenes, the temporal dimension must be handled, so the Gaussian coordinates are extended to four dimensions $(x, y, z, t)$. For a Gaussian at position $(x, y, z)$ at time $t$, the HexPlane projects $(x, y, z, t)$ onto six planes, each spanned by a pair of coordinate axes (such as $xy$ and $zt$, or $xz$ and $yt$).
The feature vectors of the projected points on the paired planes are multiplied element-wise and then concatenated into a single feature vector, which serves as the feature of the point in the 4-dimensional $xyzt$ space, as shown in Equation (5) below.
$$f_h = \mathrm{hex}(x, y, z, t) = (P_{x,y} \odot P_{z,t}) \oplus (P_{x,z} \odot P_{y,t}) \oplus (P_{x,t} \odot P_{y,z})$$
Here, $P_{x,y}$ denotes the feature of the point projected onto the $xy$ plane, which is multiplied element-wise ($\odot$) with the feature of the paired plane $P_{z,t}$; the three products are then concatenated ($\oplus$).
Before inputting the Gaussian coordinates $(x, y, z, t)$ into the HexPlane, we apply positional encoding to enhance the expressiveness of the positional features, thereby better capturing the complex changes in dynamic scenes, as shown in Equation (6) below.
$$E(p) = \big(\sin(2^k \pi p), \cos(2^k \pi p)\big), \quad k = 0, 1, \ldots, L-1$$
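The encoding in Equation (6) is the standard frequency (NeRF-style) positional encoding; the short function below is a minimal sketch of how it can be applied to the $(x, y, z, t)$ coordinates, with the number of frequencies $L$ as an assumed hyperparameter.

```python
import math
import torch

def positional_encoding(p: torch.Tensor, num_freqs: int = 10) -> torch.Tensor:
    """Encode coordinates p of shape (..., D) into (..., 2 * num_freqs * D) features (Eq. 6)."""
    feats = []
    for k in range(num_freqs):                 # k = 0, 1, ..., L-1
        freq = (2.0 ** k) * math.pi
        feats.append(torch.sin(freq * p))
        feats.append(torch.cos(freq * p))
    return torch.cat(feats, dim=-1)
```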
To further capture the changes in Gaussian geometric features within dynamic scenes, we design a deformation decoder component based on a Multi-Layer Perceptron (MLP), which takes the feature vector $f_h$ generated by the HexPlane as input and predicts the variations in Gaussian attributes. Within this component, a separate decoder is designed for each attribute, mapping $f_h$ to the corresponding change. For instance, the change in Gaussian position is $\mathcal{D}_p = \phi_p(f_h) = (\Delta x, \Delta y, \Delta z)$, the change in scaling is $\mathcal{D}_s = \phi_s(f_h) = \Delta s$, and the change in rotation is $\mathcal{D}_r = \phi_r(f_h) = \Delta r$.
In the second stage, we design the part of the network structure used for dynamic scene reconstruction as independent modular components (as shown in Figure 3). The decoder is implemented as an 8-layer fully connected neural network, with each layer followed by a ReLU activation. This modular design not only enhances the flexibility of the method but also allows for dynamic adjustment of the complexity of scene reconstruction based on requirements.
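A compact sketch of the Dynamic Component is given below: learnable feature planes are queried for the three axis pairs, combined as in Equation (5), and decoded by separate MLP heads into position, scale, and rotation offsets. The grid resolution, feature width, bilinear sampling, and the normalization of coordinates to [-1, 1] are illustrative assumptions; only the overall structure (HexPlane features plus per-attribute decoders) follows the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicComponent(nn.Module):
    """HexPlane features + per-attribute deformation decoders (simplified sketch)."""

    def __init__(self, feat_dim=32, res=64, hidden=256, layers=8):
        super().__init__()
        # Six learnable feature planes, paired as (xy, zt), (xz, yt), (xt, yz).
        self.planes = nn.ParameterList(
            [nn.Parameter(0.1 * torch.randn(1, feat_dim, res, res)) for _ in range(6)]
        )
        def mlp(out_dim):
            blocks = [nn.Linear(3 * feat_dim, hidden), nn.ReLU()]
            for _ in range(layers - 2):
                blocks += [nn.Linear(hidden, hidden), nn.ReLU()]
            return nn.Sequential(*blocks, nn.Linear(hidden, out_dim))
        self.pos_head, self.scale_head, self.rot_head = mlp(3), mlp(3), mlp(4)

    def plane_feat(self, plane, uv):
        # uv in [-1, 1], shape (N, 2) -> (N, feat_dim) via bilinear interpolation.
        grid = uv.view(1, -1, 1, 2)
        return F.grid_sample(plane, grid, align_corners=True)[0, :, :, 0].T

    def forward(self, xyzt):
        x, y, z, t = xyzt.unbind(-1)
        pairs = [((x, y), (z, t)), ((x, z), (y, t)), ((x, t), (y, z))]
        feats = []
        for i, (a, b) in enumerate(pairs):
            fa = self.plane_feat(self.planes[2 * i], torch.stack(a, -1))
            fb = self.plane_feat(self.planes[2 * i + 1], torch.stack(b, -1))
            feats.append(fa * fb)                 # element-wise product (Eq. 5)
        f_h = torch.cat(feats, dim=-1)            # concatenation of the three products
        return self.pos_head(f_h), self.scale_head(f_h), self.rot_head(f_h)
```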

3.2.3. Color Stage with Color Component

After the second stage, the scene Gaussians are already able to represent the motion of dynamic objects to a certain extent. The focus of this article is the reconstruction of planar video playback within the scene. The Dynamic Component predicts the motion deformation of Gaussians. While the video content could be reconstructed by generating new Gaussians, the video primarily consists of color changes that depend on time and position, and of abrupt changes between unrelated content, rather than significant geometric motion. This stage therefore focuses on color-related changes and drastic changes in objects.
We employ a deformable MLP to predict the color changes of objects, which takes as simultaneous input the positional encoding of the spherical harmonic (SH) coefficients and the spatiotemporal features of the Gaussians, as shown in formula (7); the color changes of a Gaussian depend on both time and position. The MLP then decodes the features to obtain the changes in the SH coefficients (shs).
$$f_{shs} = \Theta(E_{shs} + f_h), \qquad \mathcal{D}_{shs} = \phi_{shs}(f_{shs}) = \Delta shs$$
Here, $\Theta$ denotes the deformation MLP for the SH coefficients, $E_{shs}$ is the encoded shs, and $\phi_{shs}$ is the feature decoder for shs. Additionally, this stage performs the normal optimization described in Section 3.1, further refining the geometry of the Gaussians and reducing surface discontinuities.
We combine the aforementioned network structure into the Color Component, as illustrated in Figure 3. The MLPs in the Color Component also adopt an 8-layer architecture with ReLU activation functions, and residual connections are introduced between layers to better capture complex appearance changes.
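The Color Component can be sketched as follows, assuming the encoded SH feature $E_{shs}$ and the spatiotemporal feature $f_h$ share the same dimensionality so they can be summed as in Equation (7); the hidden width, output size, and residual layout are illustrative choices rather than the exact released architecture.

```python
import torch
import torch.nn as nn

class ColorComponent(nn.Module):
    """Predicts SH-coefficient offsets from encoded SHs and spatiotemporal features (sketch)."""

    def __init__(self, feat_dim=96, hidden=256, layers=8, sh_dim=48):
        super().__init__()
        self.inp = nn.Linear(feat_dim, hidden)
        self.blocks = nn.ModuleList([nn.Linear(hidden, hidden) for _ in range(layers - 2)])
        self.out = nn.Linear(hidden, sh_dim)     # delta for the flattened SH coefficients
        self.act = nn.ReLU()

    def forward(self, e_shs, f_h):
        # f_shs = Theta(E_shs + f_h): fuse encoded SHs with the HexPlane feature.
        h = self.act(self.inp(e_shs + f_h))
        for block in self.blocks:
            h = h + self.act(block(h))           # residual connections between layers
        return self.out(h)                       # D_shs = delta_shs
```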

3.2.4. Shape Disruption

During the process of dynamic scene reconstruction, the phenomenon of motion propagation occurs. A Gaussian with motion ( Δ x , Δ y , Δ z ) indicates that nearby Gaussians, which are part of the same object, may also undergo similar movements. However, neural networks might overlook the influence of surrounding Gaussians during prediction. The absence of an appropriate propagation mechanism can result in unnatural or discontinuous overall motion of the object. Another issue is that the motion of some Gaussians may be too drastic. When there are insufficient training samples, the neural network may not predict changes accurately.
Our approach involves adding noise, which introduces shape disruption. At certain fixed iteration counts, the shape of the deformed Gaussians is adjusted, such as reducing the length of two major axes by half. This leads to the generation of more smaller Gaussians in subsequent training, helping to quickly adapt to the dynamic changes of the object. Additionally, it temporarily increases the sparsity of the surrounding space, offering more possibilities to enrich the details of the scene.
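A minimal sketch of the shape-disruption step is shown below: at fixed iteration intervals, the two longest axes of each Gaussian are halved, which frees space and encourages subsequent densification to fill it with smaller Gaussians. The interval value and the log-space parameterization of the scales are assumptions about implementation details.

```python
import torch

DISRUPTION_INTERVAL = 3000  # assumed: apply disruption every 3k iterations

def shape_disruption(log_scales: torch.Tensor, iteration: int) -> torch.Tensor:
    """Halve the two longest axes of every Gaussian at fixed iteration counts.

    log_scales: (N, 3) per-Gaussian scales stored in log space (as in 3DGS).
    """
    if iteration % DISRUPTION_INTERVAL != 0:
        return log_scales
    scales = log_scales.exp()
    # Indices of the two largest axes per Gaussian.
    _, top2 = torch.topk(scales, k=2, dim=-1)
    shrink = torch.ones_like(scales)
    shrink.scatter_(dim=-1, index=top2, value=0.5)   # halve the two major axes
    return (scales * shrink).log()
```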

3.3. Loss Function

In the initial phase of the original 3DGS, we employ the L1 photometric loss described in Equation (4) for optimization.
Subsequently, in the second stage involving the hex component, we introduce an additional total variation loss to enhance the spatiotemporal consistency of the Gaussians. The loss function in stage 2 is defined in Equation (8).
$$L_{stage2} = L_{color} + L_{tv}$$
In the third stage, we use the L1 photometric loss together with $L_{normal}$ supervision for training, as defined in Equation (9).
$$L_{stage3} = L_{color} + L_{normal}$$
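The total variation term $L_{tv}$ in Equation (8) is not spelled out above; a common formulation, sketched here as an assumption, penalizes differences between neighboring cells of each HexPlane feature grid so that nearby space-time locations carry similar features.

```python
import torch

def tv_loss(planes) -> torch.Tensor:
    """Total variation regularizer over a list of (1, C, H, W) feature planes.

    A hedged sketch of the L_tv term in Equation (8): it encourages neighboring
    grid cells (and thus nearby points in space-time) to carry similar features.
    """
    loss = 0.0
    for p in planes:
        loss = loss + (p[..., 1:, :] - p[..., :-1, :]).abs().mean()  # vertical neighbors
        loss = loss + (p[..., :, 1:] - p[..., :, :-1]).abs().mean()  # horizontal neighbors
    return loss
```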

4. Results and Discussion

In this section, we experimentally evaluate the proposed method. Section 4.1 describes the experimental setup and the datasets used. Section 4.2 presents comparisons of our method with recent approaches on several datasets. Section 4.3 reports ablation experiments that demonstrate the effectiveness of our design. Finally, Section 4.4 presents a case study on data captured by ourselves, and Section 4.5 analyzes the limitations of our method.

4.1. Experimental Details

We implemented the entire framework using PyTorch 1.13.1 [38] and conducted experiments on both the synthetic dataset from D-NeRF [28] and the real-world dataset from HyperNeRF [27]. In the first training stage, we obtained a coarse static Gaussian representation of the scene after 3k iterations. Subsequently, we trained the dynamic and color components of the network for a total of 20k iterations.
Optimization was performed using the Adam optimizer [39]. The learning rate for the Gaussians was set according to the original 3DGS [25], while the learning rate for the deformation network decayed exponentially during training from $1.6 \times 10^{-4}$ to $1.6 \times 10^{-6}$. All experiments were completed on a single NVIDIA RTX 4090.
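The exponential decay of the deformation-network learning rate from $1.6 \times 10^{-4}$ to $1.6 \times 10^{-6}$ can be reproduced with a log-linear schedule such as the one below; the helper name and the per-step update of the optimizer's parameter group are illustrative, mirroring common 3DGS-style training loops.

```python
def deform_lr(step: int, total_steps: int = 20_000,
              lr_init: float = 1.6e-4, lr_final: float = 1.6e-6) -> float:
    """Exponentially decay the deformation-network learning rate over training."""
    t = min(max(step / total_steps, 0.0), 1.0)
    return lr_init * (lr_final / lr_init) ** t   # log-linear interpolation

# usage sketch: update the Adam parameter group for the deformation network each step
# for group in optimizer.param_groups:
#     if group.get("name") == "deformation":
#         group["lr"] = deform_lr(step)
```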

4.2. Experimental Results

4.2.1. Comparisons on Synthetic Dataset

In our experiments, we benchmarked our method on the monocular synthetic dataset from D-NeRF [28]. We selected representative scenes such as jumping jacks, LEGO, and hook. All images were resized to 800 × 800 resolution for consistency, and camera poses were provided with the dataset. The results on the synthetic dataset are summarized in Table 1, with some metric data sourced from Deformable3DGS [7], HexPlane [35], and our own reproductions. Qualitative results are provided in Figure 4; the images for the other methods were obtained from our own reproductions. The Deformable3DGS paper reports that 40k training iterations are required to achieve the results presented in the table, whereas our method achieves comparable rendering quality with significantly fewer training iterations on the synthetic dataset.
It is worth noting that in the LEGO scene experiments, the training and test sets capture different viewpoints and motions, which often leads to unsatisfactory experimental results for many conventional methods. This is mainly due to inaccuracies in camera pose estimation and insufficient quantity of images in the dataset, both of which increase the difficulty of accurate dynamic scene reconstruction. In contrast, the Deformable3DGS paper evaluates on the LEGO validation set, which is more consistent with the training data, resulting in better performance. Additionally, other methods also suffer from long training times. The results for the synthetic dataset are summarized in Table 2. The above experimental results demonstrate that our method exhibits superior performance.

4.2.2. Comparisons on Real-World Dataset

We compared our method against the baselines using the dataset from HyperNeRF [27], including scenes such as hand and cut-lemon. The dataset provides camera intrinsics and poses. We resized the images to a resolution of 800 × 800 and adopted the standard train/test split used by the authors. Because many camera poses in real-world datasets are inaccurate and the PSNR metrics vary considerably between runs, we do not report a quantitative analysis here; qualitative comparisons are shown in Figure 5.
It can be observed that our method achieves faster training speeds and reliable rendering quality.

4.2.3. Depth and Normal Maps Visualization

The depth maps and normal maps of several scenes from the synthetic dataset are visualized in Figure 6, illustrating the effectiveness of our method in modeling detailed scene geometry. Compared to baseline methods, our approach produces smoother and more accurate depth transitions, particularly around complex object boundaries and regions exhibiting significant motion or deformation. The enhanced smoothness in depth maps reduces artifacts often caused by abrupt changes in geometry. Moreover, the normal maps generated by our method exhibit more consistent and precise orientations, reflecting an improved capture of fine geometric details. This demonstrates that our deformation network effectively models both spatial and temporal variations in dynamic scenes, leading to more realistic and coherent reconstructions.

4.3. Ablation Study

Our ablation experiments are presented in Table 3. The experiments used the hook scene from the synthetic dataset. Since the total number of training iterations was fixed at 20k, omitting a component noticeably changed the training time. Because both shape disruption and normal optimization are applied during the color-stage training, we conducted further experiments to isolate their effects.
The effect of introducing normals is demonstrated in Figure 6, showing an improvement in the geometry of the scenes. We also observe that introducing shape disruption during training increases the number of scene points and enhances adaptability to drastic changes, thereby improving the capture of object deformations.

4.4. Case Study

To demonstrate the applicability of our method in real-world scenarios, we present a case study using our own scene data, as shown in Figure 7. The case involves a dynamic scene in which a video is being played on a computer screen. The video content consists of a set of frames in which a single image is shifted to various hues, together with entirely different images.

4.5. Analysis and Limitations

Our approach has demonstrated rapid reconstruction and rendering capabilities in many moderately dynamic scenes. However, several limitations remain. First, when capturing fast motions using only a single camera, motion blur may occur, making it difficult to extract accurate geometric and appearance information. Additionally, large changes between consecutive frames may lead to insufficient temporal sampling, especially when only 1–2 training images are available at a given time step, resulting in poor reconstruction quality. Sparse training data further complicates dynamic scene modeling. Moreover, in large-scale scenes, a substantial number of Gaussians must be generated, which increases both computational and memory demands, thus requiring longer processing times.

4.6. Conclusions

In this paper, we propose GBC, a novel dynamic scene reconstruction framework that incorporates Gaussian Splatting with color and shape deformation fields. By jointly modeling the motion and appearance changes of Gaussians, GBC achieves higher reconstruction accuracy and real-time performance, with faster convergence compared to baseline methods. Our dual-module design enhances adaptability across diverse dynamic scenes. Experimental results demonstrate that our approach achieves real-time rendering speeds and improved reconstruction fidelity on both synthetic and real-world datasets. While the method performs well under moderate motion and lighting conditions, it currently faces challenges in handling extreme temporal changes and complex global illumination. Future work will focus on extending our framework to better handle complex lighting conditions and ensure long-term temporal consistency in real-world dynamic scenes.

Author Contributions

Conceptualization, K.B.; methodology, K.B.; software, K.B.; validation, K.B.; formal analysis, K.B.; investigation, K.B.; resources, Y.H.; data curation, K.B.; writing—original draft preparation, K.B.; writing—review and editing, K.B.; visualization, K.B.; supervision, W.W. and Y.H.; project administration, W.W. and Y.H.; funding acquisition, W.W. and Y.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Sichuan Transportation Science and Technology Program (No. 2018-ZL-02).

Data Availability Statement

The data presented in this study are openly available in D-NeRF at 10.1109/CVPR46437.2021.01018 and in HyperNeRF at 10.1145/3478513.3480487.

Acknowledgments

We thank Wu for his comments on the writing, and Hao for his valuable comments and support, as well as for the hardware facilities provided by his CAD laboratory.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Barron, J.T.; Mildenhall, B.; Tancik, M.; Hedman, P.; Martin-Brualla, R.; Srinivasan, P.P. Mip-nerf: A multiscale representation for anti-aliasing neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 5855–5864. [Google Scholar]
  2. Barron, J.T.; Mildenhall, B.; Verbin, D.; Srinivasan, P.P.; Hedman, P. Mip-nerf 360: Unbounded anti-aliased neural radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5470–5479. [Google Scholar]
  3. Zhang, J.; Yao, Y.; Li, S.; Liu, J.; Fang, T.; McKinnon, D.; Tsin, Y.; Quan, L. Neilf++: Inter-reflectable light fields for geometry and material estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 3601–3610. [Google Scholar]
  4. Yao, Y.; Zhang, J.; Liu, J.; Qu, Y.; Fang, T.; McKinnon, D.; Tsin, Y.; Quan, L. Neilf: Neural incident light field for physically-based material estimation. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 700–716. [Google Scholar]
  5. Liu, Y.; Wang, P.; Lin, C.; Long, X.; Wang, J.; Liu, L.; Komura, T.; Wang, W. Nero: Neural geometry and brdf reconstruction of reflective objects from multiview images. ACM Trans. Graph. 2023, 42, 1–22. [Google Scholar] [CrossRef]
  6. Luiten, J.; Kopanas, G.; Leibe, B.; Ramanan, D. Dynamic 3d gaussians: Tracking by persistent dynamic view synthesis. In Proceedings of the 2024 International Conference on 3D Vision (3DV), Davos, Switzerland, 18–21 March 2024; pp. 800–809. [Google Scholar]
  7. Yang, Z.; Gao, X.; Zhou, W.; Jiao, S.; Zhang, Y.; Jin, X. Deformable 3d gaussians for high-fidelity monocular dynamic scene reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 20331–20341. [Google Scholar]
  8. Yan, Z.; Li, C.; Lee, G.H. Nerf-ds: Neural radiance fields for dynamic specular objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 8285–8295. [Google Scholar]
  9. Jiang, Y.; Tu, J.; Liu, Y.; Gao, X.; Long, X.; Wang, W.; Ma, Y. Gaussianshader: 3d gaussian splatting with shading functions for reflective surfaces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 5322–5332. [Google Scholar]
  10. Wood, D.N.; Azuma, D.I.; Aldinger, K.; Curless, B.; Duchamp, T.; Salesin, D.H.; Stuetzle, W. Surface light fields for 3D photography. In Seminal Graphics Papers: Pushing the Boundaries; Association for Computing Machinery: New York, NY, USA, 2023; Volume 2, pp. 487–496. [Google Scholar]
  11. Lombardi, S.; Simon, T.; Saragih, J.; Schwartz, G.; Lehrmann, A.; Sheikh, Y. Neural volumes: Learning dynamic renderable volumes from images. ACM Trans. Graph. 2019, 38, 65. [Google Scholar] [CrossRef]
  12. Flynn, J.; Broxton, M.; Debevec, P.; DuVall, M.; Fyffe, G.; Overbeck, R.; Snavely, N.; Tucker, R. Deepview: View synthesis with learned gradient descent. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2367–2376. [Google Scholar]
  13. Mildenhall, B.; Srinivasan, P.P.; Tancik, M.; Barron, J.T.; Ramamoorthi, R.; Ng, R. Nerf: Representing scenes as neural radiance fields for view synthesis. Commun. ACM 2021, 65, 99–106. [Google Scholar] [CrossRef]
  14. Yu, A.; Ye, V.; Tancik, M.; Kanazawa, A. pixelnerf: Neural radiance fields from one or few images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 4578–4587. [Google Scholar]
  15. Chen, A.; Xu, Z.; Zhao, F.; Zhang, X.; Xiang, F.; Yu, J.; Su, H. Mvsnerf: Fast generalizable radiance field reconstruction from multi-view stereo. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 14124–14133. [Google Scholar]
  16. Wang, P.; Liu, L.; Liu, Y.; Theobalt, C.; Komura, T.; Wang, W. NeuS: Learning neural implicit surfaces by volume rendering for multi-view reconstruction. In Proceedings of the 35th International Conference on Neural Information Processing Systems (NIPS’21), Red Hook, NY, USA, 6–14 December 2021. [Google Scholar]
  17. Oechsle, M.; Peng, S.; Geiger, A. Unisurf: Unifying neural implicit surfaces and radiance fields for multi-view reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 5589–5599. [Google Scholar]
  18. Fridovich-Keil, S.; Yu, A.; Tancik, M.; Chen, Q.; Recht, B.; Kanazawa, A. Plenoxels: Radiance fields without neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5501–5510. [Google Scholar]
  19. Müller, T.; Evans, A.; Schied, C.; Keller, A. Instant neural graphics primitives with a multiresolution hash encoding. ACM Trans. Graph. 2022, 41, 1–15. [Google Scholar] [CrossRef]
  20. Schwarz, K.; Liao, Y.; Niemeyer, M.; Geiger, A. Graf: Generative radiance fields for 3d-aware image synthesis. Adv. Neural Inf. Process. Syst. 2020, 33, 20154–20166. [Google Scholar]
  21. Xie, J.; Ouyang, H.; Piao, J.; Lei, C.; Chen, Q. High-fidelity 3d gan inversion by pseudo-multi-view optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 321–331. [Google Scholar]
  22. Lin, C.H.; Gao, J.; Tang, L.; Takikawa, T.; Zeng, X.; Huang, X.; Kreis, K.; Fidler, S.; Liu, M.Y.; Lin, T.Y. Magic3d: High-resolution text-to-3d content creation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 300–309. [Google Scholar]
  23. Ouyang, H.; Zhang, B.; Zhang, P.; Yang, H.; Yang, J.; Chen, D.; Chen, Q.; Wen, F. Real-time neural character rendering with pose-guided multiplane images. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 192–209. [Google Scholar]
  24. Wang, C.; Chai, M.; He, M.; Chen, D.; Liao, J. Clip-nerf: Text-and-image driven manipulation of neural radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 3835–3844. [Google Scholar]
  25. Kerbl, B.; Kopanas, G.; Leimkühler, T.; Drettakis, G. 3D Gaussian Splatting for Real-Time Radiance Field Rendering. ACM Trans. Graph. 2023, 42, 139. [Google Scholar] [CrossRef]
  26. Yifan, W.; Serena, F.; Wu, S.; Öztireli, C.; Sorkine-Hornung, O. Differentiable surface splatting for point-based geometry processing. ACM Trans. Graph. 2019, 38, 230. [Google Scholar] [CrossRef]
  27. Park, K.; Sinha, U.; Hedman, P.; Barron, J.T.; Bouaziz, S.; Goldman, D.B.; Martin-Brualla, R.; Seitz, S.M. HyperNeRF: A higher-dimensional representation for topologically varying neural radiance fields. ACM Trans. Graph. 2021, 40, 238. [Google Scholar] [CrossRef]
  28. Pumarola, A.; Corona, E.; Pons-Moll, G.; Moreno-Noguer, F. D-nerf: Neural radiance fields for dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 10318–10327. [Google Scholar]
  29. Park, K.; Sinha, U.; Barron, J.T.; Bouaziz, S.; Goldman, D.B.; Seitz, S.M.; Martin-Brualla, R. Nerfies: Deformable neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 5865–5874. [Google Scholar]
  30. Fang, J.; Yi, T.; Wang, X.; Xie, L.; Zhang, X.; Liu, W.; Nießner, M.; Tian, Q. Fast dynamic radiance fields with time-aware neural voxels. In Proceedings of the SIGGRAPH Asia 2022 Conference Papers, Daegu, Republic of Korea, 6–9 December 2022; pp. 1–9. [Google Scholar]
  31. Guo, X.; Chen, G.; Dai, Y.; Ye, X.; Sun, J.; Tan, X.; Ding, E. Neural deformable voxel grid for fast optimization of dynamic view synthesis. In Proceedings of the Asian Conference on Computer Vision, Macao, China, 4–8 December 2022; pp. 3757–3775. [Google Scholar]
  32. Liu, J.W.; Cao, Y.P.; Mao, W.; Zhang, W.; Zhang, D.J.; Keppo, J.; Shan, Y.; Qie, X.; Shou, M.Z. Devrf: Fast deformable voxel radiance fields for dynamic scenes. Adv. Neural Inf. Process. Syst. 2022, 35, 36762–36775. [Google Scholar]
  33. Song, L.; Chen, A.; Li, Z.; Chen, Z.; Chen, L.; Yuan, J.; Xu, Y.; Geiger, A. Nerfplayer: A streamable dynamic scene representation with decomposed neural radiance fields. IEEE Trans. Vis. Comput. Graph. 2023, 29, 2732–2742. [Google Scholar] [CrossRef] [PubMed]
  34. Fridovich-Keil, S.; Meanti, G.; Warburg, F.R.; Recht, B.; Kanazawa, A. K-planes: Explicit radiance fields in space, time, and appearance. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 12479–12488. [Google Scholar]
  35. Cao, A.; Johnson, J. Hexplane: A fast representation for dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 130–141. [Google Scholar]
  36. Duan, Y.; Wei, F.; Dai, Q.; He, Y.; Chen, W.; Chen, B. 4D-Rotor Gaussian Splatting: Towards Efficient Novel View Synthesis for Dynamic Scenes. In Proceedings of the ACM SIGGRAPH 2024 Conference Papers (SIGGRAPH’24), New York, NY, USA, 27 July– 1 August 2024. [Google Scholar] [CrossRef]
  37. Bosch, M.T. N-dimensional rigid body dynamics. ACM Trans. Graph. 2020, 39, 55. [Google Scholar] [CrossRef]
  38. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. Pytorch: An imperative style, high-performance deep learning library. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Volume 32. [Google Scholar]
  39. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
Figure 1. Overview of our reconstruction framework using Gaussian Splatting and deformation fields.
Figure 2. The overall pipeline of our model. Given a set of dynamic scene images, the first stage uses original 3DGS to generate a coarse Gaussian representation. The second stage applies the shape deformation field from the hex component, along with position and temporal features, to estimate shape variations at time t. The third stage handles rapid color changes using the color deformation field to predict the Gaussian’s color updates.
Figure 3. Diagram of all components. The hex component is used in the second stage to derive the Gaussian’s spatiotemporal features and, through a multi-decoder, predicts the Gaussian’s shape changes. In the third stage, the spatiotemporal features are combined with the color component’s color features to predict the Gaussian’s color changes.
Figure 4. Qualitative comparison of baseline and our monocular synthetic dataset approach. Compared to other models [7,25,28,34,35], our approach achieves superior rendering quality on the dataset.
Figure 5. Qualitative comparisons of baselines and our method on real-world dataset. Experimental results demonstrate that our method achieves superior rendering quality on real-world datasets and is more efficient.
Figure 6. Depth and normal maps of selected scenes in synthetic dataset. The first row shows the depth map, and the second row shows the normal map.
Figure 7. Case Study. We captured actual scenes for testing, with the first row showing the real images, the second row presenting the rendered images from early training, and the third row displaying the final rendering results.
Table 1. Quantitative comparison on synthetic dataset. We utilized PSNR, SSIM, LPIPS (VGG) metrics, and mark the first, second, and third cells with the best results in different colors in each scenario. Our method can achieve the above effect in a much shorter time (less than 10 min), while all other methods take much longer, and Deformable3DGS takes more than 20 min to achieve the same effect.
Method | Bouncing Balls (PSNR↑ / SSIM↑ / LPIPS↓) | Hell Warrior (PSNR↑ / SSIM↑ / LPIPS↓) | Hook (PSNR↑ / SSIM↑ / LPIPS↓) | Jumping Jacks (PSNR↑ / SSIM↑ / LPIPS↓)
3D-GS | 23.20 / 0.9591 / 0.0600 | 24.53 / 0.9336 / 0.0580 | 21.71 / 0.8876 / 0.1034 | 20.64 / 0.9297 / 0.0828
D-NeRF | 38.17 / 0.9891 / 0.0323 | 24.06 / 0.9440 / 0.0707 | 29.02 / 0.9595 / 0.0546 | 32.70 / 0.9779 / 0.0388
K-Planes | 40.05 / 0.9934 / 0.0322 | 24.58 / 0.9520 / 0.0824 | 28.12 / 0.9489 / 0.0662 | 31.11 / 0.9708 / 0.0468
HexPlane | 39.69 / 0.9915 / 0.0323 | 24.24 / 0.9443 / 0.0732 | 28.71 / 0.9572 / 0.0505 | 31.65 / 0.9729 / 0.0398
Deformable3DGS | 41.01 / 0.9953 / 0.0093 | 41.54 / 0.9873 / 0.0234 | 37.42 / 0.9867 / 0.0144 | 37.72 / 0.9897 / 0.0126
Ours | 43.98 / 0.9972 / 0.0108 | 41.60 / 0.9815 / 0.0223 | 37.77 / 0.9883 / 0.0120 | 39.02 / 0.9914 / 0.0103

Method | Lego (PSNR↑ / SSIM↑ / LPIPS↓) | Mutant (PSNR↑ / SSIM↑ / LPIPS↓) | Stand Up (PSNR↑ / SSIM↑ / LPIPS↓) | Trex (PSNR↑ / SSIM↑ / LPIPS↓)
3D-GS | 22.10 / 0.9384 / 0.0607 | 24.53 / 0.9336 / 0.0580 | 21.91 / 0.9301 / 0.0785 | 21.93 / 0.9539 / 0.0487
D-NeRF | 25.56 / 0.9363 / 0.8210 | 30.31 / 0.9672 / 0.0392 | 33.13 / 0.9781 / 0.0355 | 30.61 / 0.9671 / 0.0535
K-Planes | 28.91 / 0.9695 / 0.0331 | 32.50 / 0.9713 / 0.0362 | 33.10 / 0.9793 / 0.0310 | 30.43 / 0.9737 / 0.0343
HexPlane | 25.22 / 0.9388 / 0.0437 | 33.79 / 0.9802 / 0.0261 | 34.36 / 0.9839 / 0.0261 | 30.67 / 0.9749 / 0.0273
Deformable3DGS | 33.07 / 0.9794 / 0.0183 | 42.63 / 0.9951 / 0.0052 | 44.62 / 0.9951 / 0.0063 | 38.10 / 0.9933 / 0.0098
Ours | 26.60 / 0.9484 / 0.0507 | 41.18 / 0.9921 / 0.0170 | 42.61 / 0.9958 / 0.0052 | 39.13 / 0.9940 / 0.0248
Table 2. Quantitative results on the synthetic dataset. “Time” in the table stands for training times.
Model | PSNR ↑ | SSIM ↑ | LPIPS ↓ | Time ↓ | FPS ↑
3DGS | 23.19 | 0.93 | 0.08 | 10 min | 170
K-Planes | 31.61 | 0.97 | 0.03 | 52 min | 0.97
HexPlane | 31.04 | 0.97 | 0.04 | 12 min | 2.5
Deformable3DGS | 33.29 | 0.98 | 0.02 | 12 min | 65.8
Ours | 34.15 | 0.98 | 0.02 | 10 min | 80
Table 3. Ablation studies on synthetic datasets using our proposed methods. Shape disruption and normal behavior are more prominently reflected in the model’s ability to adapt to drastic changes and its geometric properties.
Model | PSNR ↑ | SSIM ↑ | LPIPS ↓ | Time ↓
Ours w/o Dynamic stage | 28.48 | 0.95 | 0.04 | 7 min
Ours w/o Color stage | 33.18 | 0.97 | 0.03 | 9 min
Ours w/o only Shape disruption | 35.60 | 0.98 | 0.02 | 12 min
Ours w/o only Normal | 36.29 | 0.98 | 0.02 | 12 min
Ours | 36.46 | 0.99 | 0.02 | 14 min
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Bao, K.; Wu, W.; Hao, Y. Gaussian Splatting-Based Color and Shape Deformation Fields for Dynamic Scene Reconstruction. Electronics 2025, 14, 2347. https://doi.org/10.3390/electronics14122347

