Article

Relaxing Accurate Initialization for Monocular Dynamic Scene Reconstruction with Gaussian Splatting

College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(3), 1321; https://doi.org/10.3390/app16031321
Submission received: 31 December 2025 / Revised: 19 January 2026 / Accepted: 20 January 2026 / Published: 28 January 2026
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

Monocular dynamic scene reconstruction is a challenging task due to the inherent limitation of observing the scene from a single viewpoint at each timestamp, particularly in the presence of object motion and illumination changes. Recent methods combine Gaussian Splatting with deformation modeling to enable fast training and rendering; however, their performance in real-world scenarios strongly depends on accurate point cloud initialization. When such initialization is unavailable and random point clouds are used instead, reconstruction quality degrades significantly. To address this limitation, we propose an optimization strategy that relaxes the requirement for accurate initialization in Gaussian-Splatting-based monocular dynamic scene reconstruction. The scene is first reconstructed under a static assumption using all monocular frames, allowing stable convergence of background regions. Based on reconstruction errors, a subset of Gaussians is then activated as dynamic to model motion and deformation. In addition, an annealing jitter regularization term is introduced to improve robustness to camera pose inaccuracies commonly observed in real-world datasets. Extensive experiments on established benchmarks demonstrate that the proposed method enables stable training from randomly initialized point clouds and achieves reconstruction performance comparable to approaches relying on accurate point cloud initialization.

1. Introduction

Recently, 3D Gaussian Splatting (3DGS) [1] has emerged as a remarkably expressive scene representation, bringing great breakthroughs in novel view synthesis by enabling fast, high-fidelity model training and GPU/CUDA-friendly, real-time rendering. The position, covariance, opacity, and directional appearance (typically spherical harmonics) of Gaussians are optimized to capture the complex geometry and appearance of a scene. Due to its effectiveness, researchers explore extending 3DGS to other scene-related applications [2,3,4,5], including monocular dynamic scene reconstruction [6].
The datasets used in monocular dynamic scene reconstruction are usually categorized into two types: synthetic [7] and real-world [6,8]. As noted in 3DGS, initialization plays an important role for real-world datasets, especially in scenes with complicated backgrounds and areas not well covered by the training cameras [1]. We can infer that accurate initialization (usually a high-quality point cloud) becomes even more vital in dynamic scenes, since discontinuity is manifested in both time and space. High-quality point clouds for real-world datasets are commonly obtained with Structure-from-Motion (SfM) techniques [9]. Obtaining point clouds from a monocular dynamic real-world dataset involves annotating masks for the static part of each frame and feeding the masked images into COLMAP, which yields sparse point clouds representing the static part of the scene. Some methods [10] generate dense point clouds by simply inputting all images into COLMAP and further applying image undistortion, depth-map computation and fusion, point cloud fusion, and surface reconstruction to improve initialization quality. However, on the one hand, annotating masks is quite cumbersome, and there are scenes where SfM techniques struggle to converge [11]. On the other hand, obtaining dense point clouds incurs extra computational cost and makes the pipeline more complex.
Recently, some methods have attempted monocular dynamic scene reconstruction without COLMAP. SplineGS [12] leverages masks provided by the datasets to separate static and dynamic regions and proposes a motion-adaptive spline method to control the dynamic Gaussians. GFlow [13] uses depth and optical flow to initialize the point cloud. However, these methods require additional information as assistance, which consumes extra computing resources or demands complex annotation efforts.
In this work, we propose a novel optimization strategy for monocular dynamic scene reconstruction with Gaussian Splatting that relaxes the requirement of accurate initialization and supports a random point cloud as initialization, as shown in Figure 1 and Figure 2. This allows us to bypass the reliance on precise initialization, which is often difficult to obtain, particularly in complex dynamic scenes. Inspired by the way initial sparse point clouds are obtained for monocular dynamic real-world datasets, as mentioned above, we first take all training views as input to reconstruct a 3D static scene starting from random initialization, without considering dynamic transformation over time. This initial reconstruction serves as a coarse representation of the scene: static areas are reconstructed quite well, while dynamic areas appear blurry due to the lack of temporal modeling (inconsistent supervision drives the result toward the average of all supervisions). Consequently, when evaluating each view, the discrepancy between dynamic areas and ground truth is much greater than that of static areas. In the next stage, therefore, the error between renderings and ground truth images is calculated for each Gaussian based on the initial 3D reconstruction, as performed in [14]. The portion of Gaussians with the highest error values is activated as dynamic, that is, their properties are conditioned on time. The static and dynamic areas of the scene are thereby separated and optimized by different Gaussians, effectively capturing the unique characteristics of each. Furthermore, since real-world datasets usually lack exact camera poses, we utilize an annealing jitter regularization term to enhance the robustness of the model.
In summary, our main contributions are as follows:
  • We introduce an innovative optimization approach that relaxes the requirement of initial point cloud for monocular dynamic scene reconstruction, providing the possibility of using Gaussian Splatting in situations where obtaining accurate point clouds is challenging.
  • We propose an error-based method that separates Gaussians representing static areas from those representing dynamic ones, enabling precise and distinct control over each group.
  • Extensive experiments on different datasets are conducted to demonstrate the effectiveness of our method, showing that our method using randomly initialized point clouds achieves comparable or even better results compared to methods trained with accurate point clouds.

2. Related Work

2.1. Dynamic Scene Reconstruction

Reconstructing dynamic scenes and synthesizing novel views as well as interpolating time intervals are challenging tasks in computer vision and graphics. Early works [16,17,18] in this field focus on constructing dynamic primitives or interpolation.
With the advent of Neural Radiance Fields (NeRF) [19], remarkable success has been demonstrated in capturing the complex geometry and appearance of static scenes from multiple viewpoints, which has inspired researchers to extend NeRFs to dynamic scenarios. One type of NeRF-based approach [20,21,22,23,24,25] leverages time t as an additional input. However, this strategy entangles position and time variations, which constrains adaptability for downstream applications. Another type of method [6,7,26,27,28] utilizes deformation, modeling motion in the scene by applying deformations to an initial static reconstruction. These methods are particularly powerful for dynamic scenes where objects move and deform over time, representing the dynamic components as a series of deformations applied to a base reconstruction.
Three-dimensional Gaussian Splatting (3DGS) quickly attracted attention by enabling real-time radiance field rendering, and follow-up works [10,15,29] leverage 3D Gaussians for dynamic scene reconstruction. DeformableGS [15] employs an MLP to model the deformation of Gaussians, while [10] introduces a hexplane-based encoder to improve the efficiency of deformation queries. GauFRe [29] separates the static and dynamic components and optimizes Gaussians for each, respectively. However, their reconstruction of real-world scenes depends heavily on accurate point clouds generated by SfM methods. Some recent works [12,13] utilize additional information (such as dynamic object masks and optical flow) to handle settings without SfM, but they require extra computing resources or complex annotation efforts, and there remain scenes where SfM fails to generate point clouds at all. To relax the initialization condition without relying on such extra inputs, we propose an optimization strategy that starts reconstruction from randomly initialized point clouds.

2.2. Initialization for 3DGS

Initialization plays an important role in 3DGS, as noted in its experiments [1]. In 3D Gaussian Splatting (3DGS), the initialization phase lays the groundwork for how the Gaussians are positioned and distributed within the scene, which directly influences accuracy. A good initialization can significantly speed up the convergence of the model and improve the quality of the final reconstruction. For synthetic datasets, optimization starts from random initialization, as the cameras capture the scene uniformly and the background is pure white. For real-world datasets, however, the background is complex and some areas are not captured well by the cameras, so randomly initialized point clouds lead to performance drops. Thus, 3DGS utilizes an SfM point cloud as a prior for real-world scene reconstruction.
Nevertheless, since real-world scenes are far more complex, SfM methods may fail to capture the geometry of the scene and thus fail to generate point clouds. RAIN-GS [30] proposes a strategy to train with randomly initialized point clouds and achieves performance on par with or even better than 3DGS trained with accurate SfM point clouds.
In this paper, we explore relaxing the initialization for dynamic scenes, especially monocular dynamic scenes, a challenging task that inherently involves sparse reconstruction from a single viewpoint.

3. Preliminary

3.1. Three-Dimensional Gaussian Splatting

Three-dimensional Gaussian Splatting [1] proposes to fit a 3D scene with a collection of anisotropic 3D Gaussians, where each Gaussian is defined by its mean (i.e., center position) μ and covariance Σ and featured with an opacity α and a set of spherical harmonics (SH) coefficients. For rendering from a specific view, 3D Gaussians are projected to 2D via the world-to-image transformation:
$\Sigma' = J W \Sigma W^\top J^\top$, (1)
where $\Sigma'$ is the 2D covariance in camera coordinates, J is the Jacobian of the affine approximation of the projective transformation, and W is the viewing transformation matrix. With the projected 2D Gaussians, the color is calculated via alpha blending as follows:
$C = \sum_{i \in N} c_i \alpha_i \prod_{j=1}^{i-1} (1 - \alpha_j)$. (2)
Here, N is the ordered set of Gaussians overlapping the pixel, $c_i$ denotes the view-dependent color of each 2D Gaussian computed from its SH coefficients, and $\alpha_i$ is obtained by multiplying the learned opacity of the i-th 3D Gaussian with the evaluation of the corresponding projected 2D Gaussian.
Given images captured from a scene, 3D Gaussians are optimized by minimizing the difference between renderings and ground truth images. During optimization, 3DGS utilizes an adaptive density control strategy to clone/split Gaussians in under-/over-reconstructed regions, which helps obtain representations with higher quality.
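The alpha-blending rule in Equation (2) can be sketched for a single pixel. This is an illustrative numpy implementation of front-to-back compositing, not the paper's CUDA rasterizer; the per-pixel sorting and the evaluation of each 2D Gaussian are assumed to have happened beforehand.

```python
import numpy as np

def composite_pixel(colors, alphas):
    """Front-to-back alpha blending for one pixel (Eq. 2 sketch).

    colors: (N, 3) per-Gaussian colors, sorted near-to-far.
    alphas: (N,) effective opacities, i.e. learned opacity times the
            evaluation of the projected 2D Gaussian at this pixel.
    """
    C = np.zeros(3)
    transmittance = 1.0  # running product of (1 - alpha_j) over nearer Gaussians
    for c, a in zip(colors, alphas):
        C += c * a * transmittance
        transmittance *= (1.0 - a)
    return C
```

For two Gaussians with alphas 0.5 each, the second contributes only through the remaining transmittance 0.5, so its effective weight is 0.25.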

3.2. Deformation for Dynamics

Since only one frame is available at each moment of a dynamic scene in monocular datasets, it is challenging to reconstruct the scene both spatially and temporally. Previous methods [6,8,10,15] simplify the problem using deformation. Taking the space at the starting time $t_0$ as the canonical space, properties representing the scene (e.g., position x, color c) change along with time t:
$x_t = T_x(x_{t_0}, t), \quad c_t = T_c(c_{t_0}, t)$, (3)
where $T_x$ and $T_c$ represent the deformation in time of position and color, respectively. Note that Equation (3) is just an example of how properties are deformed; the deformation can be applied to any property, depending on what a given method is concerned with.
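As a minimal illustration of Equation (3), the sketch below deforms a canonical position with a toy linear-motion model. The `velocity` parameter is purely hypothetical; in the actual methods discussed here, $T_x$ is an MLP (DeformableGS) or a hexplane query (4DGS).

```python
import numpy as np

def deform_position(x0, t, velocity):
    """Toy stand-in for T_x: canonical position x0 deformed to time t.
    Real methods replace this linear model with a learned network."""
    return x0 + velocity * t

def deform_color(c0, t):
    """Toy stand-in for T_c: colors kept time-invariant in this sketch."""
    return c0
```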

4. Method

There exist scenes in which SfM techniques struggle to converge [11]. In such situations, accurate point clouds cannot be obtained for initialization, making reconstruction with Gaussian Splatting methods difficult. As analyzed in 3DGS, real-world scenes with complex backgrounds and areas not well covered by the training views are challenging to reconstruct from randomly initialized point clouds. Our method therefore focuses on relaxing the precise-initialization requirement for real-world datasets in monocular dynamic scene reconstruction with Gaussian Splatting.
Inspired by point clouds provided in real-world datasets [6,8], which are generated by running COLMAP using images with background masks, we first regard the scene as a static one and reconstruct it using each frame from the monocular video. This allows us to initially achieve a good reconstruction of the static areas, which are essentially the background, while inherently ignoring areas of the scene that change over time. These dynamic regions tend to exhibit poorer reconstruction results, typically appearing blurred. Subsequently, we evaluate the error of Gaussians in each view, and activate a portion of Gaussians with the highest error values as dynamic. The scene is now divided into static and dynamic parts, which are optimized separately. An additional regularization term—annealing jitter—is applied to reduce the deviation caused by inaccurate poses when test images are rendered. We will first discuss the scene modeling process in Section 4.1, and then introduce the regularization term in Section 4.2.

4.1. Dynamic Scene Modeling

4.1.1. Initial Scene Modeling as Static

Because the positions, shapes, and appearances of some objects in dynamic scenes change over time, the frames of a monocular dynamic video differ from one another; these differences can be subtle or significant, depending on the nature of the motion and the time interval between frames. However, such variations cover only a small portion of the scene, as most areas in real-world datasets are static, commonly referred to as the background. Given these inconsistent frames, we treat the scene as static for the initial reconstruction, i.e., we forcibly treat dynamic objects as static as well. In this way, areas that are consistent across viewpoints (the background) can be well reconstructed, but artifacts inevitably appear in the inconsistent (i.e., dynamic) areas, as the true motions are not yet accounted for.
Following RAIN-GS [30] where the constraint for accurate initialization of 3DGS is removed, we randomly generate a point cloud with sparse-large-variance Gaussians and use it as our scene initialization. After a few epochs of optimization, the static areas of the scene converge quickly to a stable solution, with the differences between renderings and the ground truth images being smaller compared to the dynamic ones.
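A sparse-large-variance initialization in the spirit of RAIN-GS can be sketched as follows. The scene bound, the uniform sampling, and the scale and opacity values are illustrative assumptions, not the exact scheme of RAIN-GS or our implementation.

```python
import numpy as np

def random_sparse_init(n=10, scene_radius=1.0, seed=0):
    """Sketch of sparse-large-variance random initialization.

    A handful of Gaussians (n is small, e.g. 10) are scattered
    uniformly in a normalized scene bound, with scales large enough
    that together they cover the whole scene at the start of training.
    """
    rng = np.random.default_rng(seed)
    positions = rng.uniform(-scene_radius, scene_radius, size=(n, 3))
    scales = np.full((n, 3), scene_radius)  # large variance per axis
    opacities = np.full(n, 0.1)             # low initial opacity
    return positions, scales, opacities
```

Adaptive density control then clones and splits these few coarse Gaussians as optimization proceeds.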

4.1.2. Error-Based Dynamic Activation

Based on the scene modeled above, we can render images in which static areas are well reconstructed while the originally dynamic areas are of low quality. Thus, we can distinguish dynamic from static areas by the varying degrees of difference between renderings and ground truth images. However, the error is commonly calculated at the pixel level, e.g., the mean absolute error (i.e., $L_1$ loss), and each pixel error entangles the contributions of multiple Gaussians. This entanglement makes it challenging to attribute the error to specific Gaussian primitives, as the influence of each Gaussian is intertwined across the entire image. Since all optimization in Gaussian Splatting is carried out on Gaussians, we need the error for each Gaussian to distinguish static and dynamic areas at the Gaussian level.
It is worth emphasizing that the terms dynamic and static in our formulation do not strictly correspond to semantically dynamic or static objects in the scene. Instead, they describe whether a region exhibits consistent appearance over time under monocular observations. The core objective of our method is therefore not to explicitly disentangle static objects from dynamic ones, but to first stabilize optimization by reconstructing the scene under a static assumption without accurate initialization, and subsequently apply dynamic correction to regions that consistently violate this assumption as indicated by elevated reconstruction errors.
To propagate per-pixel errors to per-Gaussian errors, we rethink the alpha-blending process in Gaussian Splatting. In Equation (2), the alpha-compositing coefficient can be represented as
$w_i = \alpha_i \prod_{j=1}^{i-1} (1 - \alpha_j)$. (4)
When distributing per-pixel errors to each Gaussian, the process is reversed: the distribution should be proportional to the contribution $w_i$ of each Gaussian to the pixel. Specifically, for a view $\pi$, let $E^\pi(u)$ denote the rendering error at pixel u. The error for each Gaussian g under view $\pi$ can then be calculated as follows:
$E_g^\pi = \sum_{u \in \mathrm{Pix}} E^\pi(u)\, w_g^\pi(u)$, (5)
where the sum runs over the image pixels. For each Gaussian, we track the maximum value of the error E g π across all training views, i.e.,
$E_g = \max_{\pi \in \mathrm{View}} E_g^\pi$. (6)
We now obtain per-Gaussian errors for all the Gaussians in the former reconstructed scene. Gaussians with high error values are inherently considered to be the dynamic parts of the scene. We sort the Gaussians based on their error values and activate the top scoring Gaussians as dynamic.
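Given the per-view, per-Gaussian errors of Equation (5), the max-reduction of Equation (6) and the top-ratio selection can be sketched as below; the 15% ratio matches our implementation detail, while the threshold-by-sorting mechanics are an illustrative choice.

```python
import numpy as np

def activate_dynamic(per_view_errors, ratio=0.15):
    """Select dynamic Gaussians from per-view, per-Gaussian errors.

    per_view_errors: (V, G) array; entry [pi, g] is E_g^pi, i.e. pixel
    errors already distributed by the blending weights w (Eq. 5).
    Returns a boolean mask marking the top `ratio` of Gaussians ranked
    by their maximum error across views (Eq. 6).
    """
    e_g = per_view_errors.max(axis=0)           # E_g = max over views
    k = max(1, int(round(ratio * e_g.size)))    # number to activate
    threshold = np.sort(e_g)[-k]                # k-th largest error
    return e_g >= threshold                     # ties may slightly exceed k
```

The masked Gaussians are then conditioned on time via the deformation of Section 3.2, while the rest remain static.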
Notably, this error-based criterion also effectively handles objects undergoing fast and large-displacement motion across the scene. When such objects are forced to conform to the static assumption during the initial reconstruction, they tend to form elongated, low-opacity structures distributed along their motion trajectories. These structures consistently produce large reconstruction errors across multiple views, causing the associated Gaussians to be stably ranked in the high-error regime and selected for dynamic activation.
After the activation, the scene is represented by static Gaussians for background and dynamic Gaussians for motion, respectively. The properties of dynamic Gaussians are conditioned on time using deformation described in Section 3.2. By applying deformations that are time-dependent, we can track the movement of each Gaussian, reflecting the actual motion of the objects they represent.

4.2. Annealing Jitter Regularization

The most challenging problem for real-world datasets, compared to synthetic ones, is the lack of accuracy in camera pose estimation, which can significantly impact the quality of scene reconstruction [6]. Imprecise camera poses can lead to overfitting to the training views, so the reconstructed scene becomes tailored to their particularities. Previous methods such as [15] only account for the degradation of rendering quality at interpolated times, without considering the quality of interpolated views.
To improve our model’s robustness when rendered from an interpolated view and time, we propose to utilize a regularization term to reduce the effect of pose inaccuracies on the reconstruction quality:
$X_t(i) = \mathcal{N}(0,1) \cdot \beta \cdot \Delta t \cdot (1 - i/\tau), \quad X_p(i) = \mathcal{N}(0,1) \cdot \beta \cdot \Delta p \cdot (1 - i/\tau)$. (7)
Here, $X_t(i)$ and $X_p(i)$ represent the linearly decaying Gaussian noise at the i-th training iteration for time and camera pose, respectively; $\mathcal{N}(0,1)$ denotes the standard Gaussian distribution; $\beta$ is an empirically determined scaling factor set to 0.1; $\Delta t$ and $\Delta p$ are the mean time interval and mean pose interval; and $\tau$ is the threshold iteration for annealed smooth training (empirically set to 20k).
We represent how the regularization term is applied to our model using position x as an example:
$x_t = T_x(x_{t_0}, t + X_t(i))$. (8)
Perturbing the original camera pose p to $p + X_p(i)$ likewise injects noise into the rendering view, simulating camera pose jitter and making the model more robust to inaccurate poses during training.
By incorporating this regularization term into our model, we can reduce the impact of pose inaccuracies on the reconstruction process. This leads to more reliable and accurate reconstructions, even when rendered from interpolated views and times. The result is a model that is not only robust to variations in the input data but also capable of generalizing well to new conditions, making it well suited for real-world dataset applications.

5. Results

5.1. Implementation Details

We incorporate our optimization strategy into two existing Gaussian-Splatting-based methods that use different deformation networks for dynamic scene reconstruction: Deformable-GS [15], which uses an MLP, and 4DGS [10], which uses hexplanes. We replace their initialization with random point clouds, first treat the scene as static, and then separate it into static and dynamic parts. Following RAIN-GS [30], we set the initial number of random Gaussians to N = 10. After 5k epochs of training, we apply error-based dynamic activation and mark the Gaussians with the top 15% error values as dynamic. These dynamic Gaussians are jointly trained with the corresponding deformation network. The loss function and its hyper-parameters are consistent with the original versions, and we leave the adaptive density control unchanged. All experiments are performed on a single NVIDIA RTX 3090 (24 GB) GPU. Our code will be released.

Datasets

We focus on real-world scenes and use those provided by HyperNeRF [6] and NeRF-DS [8]. The training/testing splits and the image resolutions follow the original papers.

5.2. Quantitative Results

We conduct quantitative comparisons with NeRF-based methods [6,8,21], which do not require any particular point cloud initialization, and with Deformable-GS and 4DGS trained both with SfM-initialized point clouds and with randomly initialized point clouds using our optimization strategy.
Following previous methods, we report PSNR, LPIPS, and SSIM as evaluation metrics between the reconstruction renderings and ground truth images. These metrics provide a comprehensive assessment of reconstruction quality by evaluating different aspects of the rendered images compared to the actual scene captures. Table 1 and Table 2 show that both Deformable-GS and 4DGS are highly dependent on the accuracy of the initial point clouds: their performance drops greatly when trained with random point clouds. In contrast, our strategy trained from random point clouds shows comparable or even better results than the same methods trained with accurate SfM point clouds.

5.3. Qualitative Results

We show our results in Figure 3 and Figure 4 on the NeRF-DS dataset, applied to DeformableGS and 4DGS, respectively. Comparing the last two columns demonstrates the effectiveness of our strategy, which reconstructs complete structures in the scene and removes high-frequency floaters and artifacts. In Figure 5, we show different timestamps and views of the scene peel-banana in the HyperNeRF dataset, carried out with DeformableGS. Because static and dynamic Gaussians are trained separately, each group can be optimized for its respective role within the scene: the static Gaussians are refined to accurately represent the unchanging parts of the environment, while the dynamic Gaussians are free to adapt and evolve in response to motion and changes over time. Moreover, separate training lets us manage computational resources more effectively: static Gaussians, once converged, require less frequent updates, allowing more computational effort to be allocated to the dynamic Gaussians, which need more iterations to accurately model the evolving scene. A supplementary video demonstrating the qualitative results of our method is provided in the Supplementary Materials.

5.4. Ablation Study

In Table 3, we validate the effectiveness of each setting in our strategy, trained on the NeRF-DS dataset [8] with randomly initialized point clouds. Our ablation study systematically evaluates the contribution of each component of our approach, providing insight into how each aspect influences the final reconstruction quality. We first compare the ratio of Gaussians used for dynamic activation to find the most suitable proportion of dynamic Gaussians that are activated and optimized separately. We also compare the number of epochs for initial scene reconstruction: too few epochs lead to underfitting and impair the accuracy of the subsequent error-based dynamic activation, while too many lead to overfitting and reduce convergence efficiency. The annealing jitter regularization term is applied to reduce the impact of camera pose inaccuracies and increase robustness for real-world scene reconstruction; the last two rows present quantitative results with and without this regularization. All three ablations show that our settings perform better individually, and using all three together achieves the best results.
We note that the current choice of a fixed activation ratio is primarily motivated by simplicity, stability, and reproducibility. Nevertheless, the error-based formulation of our method naturally lends itself to more adaptive strategies. For example, the activation threshold could be determined based on the statistical properties of the error distribution, such as bimodal separation of per-Gaussian errors, adaptive percentile selection, or even learnable gating mechanisms. Exploring such adaptive or data-driven activation criteria is a promising direction for future work and may further improve robustness across scenes with varying levels of dynamic complexity.

6. Discussion

Our work was motivated by the hypothesis that the strong dependence of Gaussian-Splatting-based monocular dynamic scene reconstruction on accurate point cloud initialization is not intrinsic, but instead arises from jointly optimizing static geometry and dynamic deformation from the beginning of training. The experimental results support this hypothesis, showing that by first reconstructing the scene under a static assumption and subsequently activating dynamic Gaussians based on reconstruction errors, comparable performance can be achieved even when starting from randomly initialized point clouds. In contrast to prior methods such as DeformableGS and 4DGS, which rely heavily on SfM-derived geometric priors, the proposed strategy alleviates initialization sensitivity without introducing additional supervision or external motion cues.
From the perspective of previous studies on dynamic scene reconstruction, the effectiveness of the proposed approach can be attributed to the predominance of static content in real-world monocular datasets. Similar to observations in NeRF-based dynamic reconstruction, enforcing a static model in the early optimization stage enables stable convergence of background geometry, while dynamic regions consistently exhibit higher reconstruction errors. By exploiting these residuals, the error-based dynamic activation mechanism provides an implicit yet representation-aligned way to disentangle static and dynamic components at the Gaussian level. The ablation results further indicate that both the timing and proportion of activated Gaussians are important for balancing reconstruction accuracy and optimization stability.
Another important finding is that the proposed annealing jitter regularization improves robustness to camera pose inaccuracies, which are common in real-world monocular capture. While prior work primarily emphasizes temporal deformation modeling, these results suggest that controlled stochastic perturbations during training can effectively reduce overfitting to noisy poses and enhance generalization to interpolated views. This observation is consistent with broader trends in representation learning and highlights pose robustness as a complementary factor to geometric initialization in dynamic scene reconstruction.
In a broader context, the proposed strategy lowers the practical barrier to deploying Gaussian-Splatting-based methods by reducing reliance on SfM pipelines and dense point cloud preprocessing. This has implications for applications such as robotics, augmented reality, and free-viewpoint video, where monocular input is often the most accessible data modality. Nevertheless, the method assumes that dynamic regions produce consistently higher reconstruction errors under a static model, which may not hold for scenes with subtle motion. Future work may explore adaptive or learnable activation criteria, joint optimization of camera poses and scene representation, and tighter integration with Gaussian densification strategies to further improve robustness and scalability.

7. Conclusions

In this work, we present a novel optimization strategy for monocular dynamic scene reconstruction with Gaussian Splatting that starts from randomly initialized point clouds. We initiate the reconstruction process by treating the scene as static across monocular video frames and modeling it with sparse-large-variance Gaussians. We then assess per-Gaussian errors, activate the most erroneous Gaussians as dynamic, and use them to model the motions and deformations of the scene. An additional regularization term is introduced to mitigate the impact of camera pose inaccuracies. Extensive experimental results demonstrate the effectiveness of our method on real-world scenes and show the strength of our approach against the state of the art when using random point clouds as initialization.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/app16031321/s1.

Author Contributions

Conceptualization, X.W. and J.C.; methodology, X.W. and J.C.; software, X.W.; validation, L.Z., H.L. and W.X.; formal analysis, L.Z., H.L. and W.X.; investigation, X.W. and J.C.; resources, L.Z.; data curation, X.W.; writing—original draft preparation, X.W.; writing—review and editing, J.C. and L.Z.; visualization, X.W.; supervision, L.Z.; project administration, L.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by Zhejiang Province Program (2024C03263, 2025C01068, LZ25F020006), and Ningbo Science and Technology Plan Project (2025Z052, 2025Z062, 2022Z167, 2023Z137).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data that support the findings of this study are openly available: HyperNeRF [6] at https://github.com/google/hypernerf/releases/tag/v0.1 and NeRF-DS [8] at https://github.com/JokerYan/NeRF-DS/releases/tag/v0.1-pre-release, both accessed before 19 January 2026.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Kerbl, B.; Kopanas, G.; Leimkühler, T.; Drettakis, G. 3D Gaussian Splatting for Real-Time Radiance Field Rendering. ACM Trans. Graph. 2023, 42, 139. [Google Scholar] [CrossRef]
  2. Chen, Z.; Wang, F.; Wang, Y.; Liu, H. Text-to-3d using gaussian splatting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 21401–21412. [Google Scholar]
  3. Guédon, A.; Lepetit, V. Sugar: Surface-aligned gaussian splatting for efficient 3d mesh reconstruction and high-quality mesh rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 5354–5363. [Google Scholar]
  4. Tang, J.; Chen, Z.; Chen, X.; Wang, T.; Zeng, G.; Liu, Z. Lgm: Large multi-view gaussian model for high-resolution 3d content creation. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer: Berlin/Heidelberg, Germany, 2025; pp. 1–18. [Google Scholar]
  5. Yu, Z.; Chen, A.; Huang, B.; Sattler, T.; Geiger, A. Mip-splatting: Alias-free 3d gaussian splatting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 19447–19456. [Google Scholar]
  6. Park, K.; Sinha, U.; Hedman, P.; Barron, J.T.; Bouaziz, S.; Goldman, D.B.; Martin-Brualla, R.; Seitz, S.M. HyperNeRF: A higher-dimensional representation for topologically varying neural radiance fields. ACM Trans. Graph. (TOG) 2021, 40, 1–12. [Google Scholar] [CrossRef]
  7. Pumarola, A.; Corona, E.; Pons-Moll, G.; Moreno-Noguer, F. D-nerf: Neural radiance fields for dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 10318–10327. [Google Scholar]
  8. Yan, Z.; Li, C.; Lee, G.H. Nerf-ds: Neural radiance fields for dynamic specular objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 8285–8295. [Google Scholar]
  9. Snavely, N.; Seitz, S.M.; Szeliski, R. Photo tourism: Exploring photo collections in 3D. In ACM Siggraph 2006 Papers; Association for Computing Machinery: New York, NY, USA, 2006; pp. 835–846. [Google Scholar]
  10. Wu, G.; Yi, T.; Fang, J.; Xie, L.; Zhang, X.; Wei, W.; Liu, W.; Tian, Q.; Wang, X. 4d gaussian splatting for real-time dynamic scene rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 20310–20320. [Google Scholar]
  11. Bian, W.; Wang, Z.; Li, K.; Bian, J.W.; Prisacariu, V.A. Nope-nerf: Optimising neural radiance field with no pose prior. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 4160–4169. [Google Scholar]
  12. Park, J.; Bui, M.Q.V.; Bello, J.L.G.; Moon, J.; Oh, J.; Kim, M. Splinegs: Robust motion-adaptive spline for real-time dynamic 3d gaussians from monocular video. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 26866–26875. [Google Scholar]
  13. Wang, S.; Yang, X.; Shen, Q.; Jiang, Z.; Wang, X. Gflow: Recovering 4d world from monocular video. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 7862–7870. [Google Scholar]
  14. Rota Bulò, S.; Porzi, L.; Kontschieder, P. Revising densification in gaussian splatting. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 347–362. [Google Scholar]
  15. Yang, Z.; Gao, X.; Zhou, W.; Jiao, S.; Zhang, Y.; Jin, X. Deformable 3d gaussians for high-fidelity monocular dynamic scene reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 20331–20341. [Google Scholar]
  16. Broxton, M.; Flynn, J.; Overbeck, R.; Erickson, D.; Hedman, P.; Duvall, M.; Dourgarian, J.; Busch, J.; Whalen, M.; Debevec, P. Immersive light field video with a layered mesh representation. ACM Trans. Graph. (TOG) 2020, 39, 86. [Google Scholar] [CrossRef]
  17. Collet, A.; Chuang, M.; Sweeney, P.; Gillett, D.; Evseev, D.; Calabrese, D.; Hoppe, H.; Kirk, A.; Sullivan, S. High-quality streamable free-viewpoint video. ACM Trans. Graph. (ToG) 2015, 34, 69. [Google Scholar] [CrossRef]
  18. Dou, M.; Davidson, P.; Fanello, S.R.; Khamis, S.; Kowdle, A.; Rhemann, C.; Tankovich, V.; Izadi, S. Motion2fusion: Real-time volumetric performance capture. ACM Trans. Graph. (ToG) 2017, 36, 246. [Google Scholar] [CrossRef]
  19. Mildenhall, B.; Srinivasan, P.P.; Tancik, M.; Barron, J.T.; Ramamoorthi, R.; Ng, R. Nerf: Representing scenes as neural radiance fields for view synthesis. Commun. ACM 2021, 65, 99–106. [Google Scholar] [CrossRef]
  20. Cao, A.; Johnson, J. Hexplane: A fast representation for dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 130–141. [Google Scholar]
  21. Fang, J.; Yi, T.; Wang, X.; Xie, L.; Zhang, X.; Liu, W.; Nießner, M.; Tian, Q. Fast dynamic radiance fields with time-aware neural voxels. In Proceedings of the SIGGRAPH Asia 2022 Conference Papers, Daegu, Republic of Korea, 6–9 December 2022; pp. 1–9. [Google Scholar]
  22. Fridovich-Keil, S.; Meanti, G.; Warburg, F.R.; Recht, B.; Kanazawa, A. K-planes: Explicit radiance fields in space, time, and appearance. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 12479–12488. [Google Scholar]
  23. Li, T.; Slavcheva, M.; Zollhoefer, M.; Green, S.; Lassner, C.; Kim, C.; Schmidt, T.; Lovegrove, S.; Goesele, M.; Newcombe, R.; et al. Neural 3d video synthesis from multi-view video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5521–5531. [Google Scholar]
  24. Park, S.; Son, M.; Jang, S.; Ahn, Y.C.; Kim, J.Y.; Kang, N. Temporal interpolation is all you need for dynamic neural radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 4212–4221. [Google Scholar]
  25. Wang, F.; Tan, S.; Li, X.; Tian, Z.; Song, Y.; Liu, H. Mixed neural voxels for fast multi-view video synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 19706–19716. [Google Scholar]
  26. Attal, B.; Huang, J.B.; Richardt, C.; Zollhoefer, M.; Kopf, J.; O’Toole, M.; Kim, C. Hyperreel: High-fidelity 6-dof video with ray-conditioned sampling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 16610–16620. [Google Scholar]
  27. Park, K.; Sinha, U.; Barron, J.T.; Bouaziz, S.; Goldman, D.B.; Seitz, S.M.; Martin-Brualla, R. Nerfies: Deformable neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 5865–5874. [Google Scholar]
  28. Song, L.; Chen, A.; Li, Z.; Chen, Z.; Chen, L.; Yuan, J.; Xu, Y.; Geiger, A. Nerfplayer: A streamable dynamic scene representation with decomposed neural radiance fields. IEEE Trans. Vis. Comput. Graph. 2023, 29, 2732–2742. [Google Scholar] [CrossRef] [PubMed]
  29. Liang, Y.; Khan, N.; Li, Z.; Nguyen-Phuoc, T.; Lanman, D.; Tompkin, J.; Xiao, L. Gaufre: Gaussian deformation fields for real-time dynamic novel view synthesis. arXiv 2023, arXiv:2312.11458. [Google Scholar]
  30. Jung, J.; Han, J.; An, H.; Kang, J.; Park, S.; Kim, S. Relaxing Accurate Initialization Constraint for 3D Gaussian Splatting. arXiv 2024, arXiv:2403.09413. [Google Scholar] [CrossRef]
Figure 1. Given a sequence of monocular views from a dynamic scene, previous reconstruction methods, e.g., DeformableGS [15], show high-quality reconstruction results with accurate initialization (SfM point cloud), as shown in the second column. When using a randomly initialized point cloud, the performance drops greatly (the third column). Our strategy aims to achieve comparable performance starting from random point clouds.
Figure 2. An overview of our method. Our reconstruction starts from a set of randomly initialized sparse-large-variance Gaussians, and we first regard the scene as a static one. Subsequently, based on the above reconstruction, we implement dynamic activation according to the error value of Gaussians. Afterwards, static and dynamic Gaussians are optimized separately to obtain the final reconstruction.
Figure 3. Qualitative comparisons on the NeRF-DS dataset [8] using 4DGS [10]. Both dense and sparse SfM point clouds come from COLMAP, and the former requires much more computation on the basis of the latter. Our strategy makes it possible to reconstruct on-par dynamic scenes from randomly initialized point clouds compared to methods using accurate initialization.
Figure 4. Qualitative comparisons on the NeRF-DS dataset [8] using DeformableGS [15]. DeformableGS trained with randomly initialized point clouds shows low-quality results with missing structures and high-frequency artifacts. Our strategy demonstrates performance comparable to DeformableGS trained with SfM-initialized point clouds.
Figure 5. Qualitative comparisons on scene peel-banana in HyperNeRF dataset [6]. Our strategy is able to better capture the motion and deformation with random point cloud initialization.
Table 1. Quantitative comparisons on the NeRF-DS dataset. NeRF-based methods such as TiNeuVox [21], HyperNeRF [6], and NeRF-DS [8] are trained without point cloud initialization. SfM 1 refers to sparse SfM points generated by COLMAP, and SfM 2 refers to dense ones.
| Method | Init. Points | PSNR | LPIPS | SSIM |
|---|---|---|---|---|
| TiNeuVox | - | 21.61 | 0.2766 | 0.8234 |
| HyperNeRF | - | 23.45 | 0.1990 | 0.8488 |
| NeRF-DS | - | 23.60 | 0.1816 | 0.8494 |
| DeformableGS | SfM 1 | 24.11 | 0.1769 | 0.8524 |
| DeformableGS | Random | 23.7513 | 0.1859 | 0.8461 |
| DeformableGS + Our Strategy | Random | 23.89 | 0.1901 | 0.8491 |
| 4DGS | SfM 2 | 23.47 | 0.1651 | 0.8288 |
| 4DGS | SfM 1 | 21.01 | 0.2887 | 0.7243 |
| 4DGS | Random | 20.10 | 0.3417 | 0.6683 |
| 4DGS + Our Strategy | Random | 22.6695 | 0.2259 | 0.8203 |
Table 2. Quantitative comparisons on the HyperNeRF vrig dataset. SfM 1 refers to sparse COLMAP SfM points, and SfM 2 refers to dense ones. Our strategy shows on-par or even better performance compared to methods trained with SfM points.
| Method | Init. Points | PSNR (3D Printer) | LPIPS | SSIM | PSNR (Broom) | LPIPS | SSIM |
|---|---|---|---|---|---|---|---|
| DeformableGS | SfM 1 | 20.78 | 0.2846 | 0.6579 | 20.03 | 0.7006 | 0.2679 |
| DeformableGS | Random | 20.36 | 0.3871 | 0.6165 | 19.56 | 0.8325 | 0.2351 |
| DeformableGS + Our Strategy | Random | 20.43 | 0.3137 | 0.6341 | 20.29 | 0.4074 | 0.3742 |
| 4DGS | SfM 2 | 21.94 | 0.3227 | 0.7084 | 22.24 | 0.5451 | 0.3858 |
| 4DGS | SfM 1 | 21.75 | 0.3144 | 0.6792 | 21.13 | 0.5802 | 0.3393 |
| 4DGS | Random | 20.70 | 0.3915 | 0.6745 | 20.46 | 0.5821 | 0.3169 |
| 4DGS + Our Strategy | Random | 21.63 | 0.3203 | 0.7036 | 21.46 | 0.5693 | 0.3483 |

| Method | Init. Points | PSNR | LPIPS | SSIM | PSNR | LPIPS | SSIM |
|---|---|---|---|---|---|---|---|
| DeformableGS | SfM 1 | 23.16 | 0.2286 | 0.6273 | 25.96 | 0.1603 | 0.8502 |
| DeformableGS | Random | 23.13 | 0.2322 | 0.6279 | 21.10 | 0.2502 | 0.7462 |
| DeformableGS + Our Strategy | Random | 23.26 | 0.3063 | 0.6533 | 25.04 | 0.1751 | 0.8136 |
| 4DGS | SfM 2 | 28.83 | 0.2771 | 0.8158 | 28.69 | 0.1842 | 0.8666 |
| 4DGS | SfM 1 | 27.79 | 0.2566 | 0.7855 | 22.72 | 0.2805 | 0.7439 |
| 4DGS | Random | 27.11 | 0.3164 | 0.7152 | 24.52 | 0.2617 | 0.7721 |
| 4DGS + Our Strategy | Random | 27.50 | 0.2729 | 0.7732 | 26.54 | 0.2369 | 0.8620 |
Table 3. Ablation study on the ratio of activation, the epochs to start up activation, and the impact of the annealing jitter regularization term in our strategy. ✓ indicates that the annealing jitter regularization term is enabled, whereas ✗ indicates that the regularization term is not applied.
| Activation Ratio | Activation Epoch | Reg. | PSNR | LPIPS | SSIM |
|---|---|---|---|---|---|
| 0.1 | 5000 | ✗ | 22.64 | 0.2212 | 0.8284 |
| 0.2 | 5000 | ✗ | 22.82 | 0.2285 | 0.8241 |
| 0.25 | 5000 | ✗ | 22.41 | 0.2351 | 0.8192 |
| 0.15 | 3000 | ✗ | 21.34 | 0.2563 | 0.7891 |
| 0.15 | 4000 | ✗ | 23.16 | 0.2170 | 0.8185 |
| 0.15 | 6000 | ✗ | 19.91 | 0.2922 | 0.7666 |
| 0.15 | 7000 | ✗ | 22.18 | 0.2291 | 0.8204 |
| 0.15 | 5000 | ✗ | 23.20 | 0.2175 | 0.8329 |
| 0.15 | 5000 | ✓ | 23.89 | 0.1901 | 0.8491 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Wang, X.; Chen, J.; Xing, W.; Lin, H.; Zhao, L. Relaxing Accurate Initialization for Monocular Dynamic Scene Reconstruction with Gaussian Splatting. Appl. Sci. 2026, 16, 1321. https://doi.org/10.3390/app16031321

