UDR-GS: Enhancing Underwater Dynamic Scene Reconstruction with Depth Regularization

Abstract: Representing and rendering dynamic underwater scenes present significant challenges due to the medium’s inherent properties, which result in image blurring and information ambiguity. To overcome these challenges and accomplish real-time rendering of dynamic underwater environments while maintaining efficient training and storage, we propose Underwater Dynamic Scene Reconstruction Gaussian Splatting (UDR-GS), a method based on Gaussian Splatting. By leveraging prior information from a pre-trained depth estimation model and smoothness constraints between adjacent images, our approach uses the estimated depth as a geometric prior to aid in color-based optimization, significantly reducing artifacts and improving geometric accuracy. By integrating depth guidance into the Gaussian Splatting (GS) optimization process, we achieve more precise geometric estimations. To ensure higher stability, smoothness constraints are applied between adjacent images, maintaining consistent depth for neighboring 3D points in the absence of boundary conditions. The symmetry concept is inherently applied in our method by maintaining uniform depth and color information across multiple viewpoints, which enhances the reconstruction quality and visual coherence. Using 4D Gaussian Splatting (4DGS) as a baseline, our strategy demonstrates superior performance in both RGB novel view synthesis and 3D geometric reconstruction. On average, across multiple datasets, our method shows an improvement of approximately 1.41% in PSNR and a 0.75% increase in SSIM compared with the baseline 4DGS method, significantly enhancing the visual quality and geometric fidelity of dynamic underwater scenes.


Introduction
Creating high-quality reconstructions and realistic renderings of dynamic scenes from a sequence of input images is crucial for applications such as AR/VR, 3D content creation, and entertainment. Previous methods for modeling these dynamic scenes largely relied on mesh-based representations, as evidenced by methods described in [1][2][3][4]. However, these strategies often face inherent limitations, such as a lack of detail and realism, the absence of semantic information, and difficulty in adapting to topological changes. The introduction of neural rendering technologies has significantly shifted this paradigm. Implicit scene representations, particularly those implemented by NeRF [5], have demonstrated commendable efficacy in tasks like novel view synthesis, scene reconstruction, and light decomposition.
In contrast, 3D Gaussian Splatting (3DGS) [6] describes the scene explicitly using 3D Gaussians and markedly improves rendering speed, achieving real-time performance. It replaces the cumbersome volumetric rendering of the original NeRF with efficient differentiable splatting [7], which projects the 3D Gaussian points directly onto a 2D plane. 3DGS not only delivers real-time rendering speeds but also offers a more explicit scene representation, thereby simplifying scene manipulation.
However, 3DGS focuses on static scenes. Applying 3DGS to temporally continuous scenes introduces additional challenges that must be addressed. These include handling temporal variations to ensure smooth transitions between frames, maintaining consistency in scene representation across different time points, and addressing the increased computational complexity that comes with processing sequences of images. To handle real-world 4D scenes, dynamic Gaussian splatting methods [8,9] use efficient Gaussian deformation field networks to represent Gaussian motion and shape changes, transforming the original Gaussians to new positions with new shapes. These methods have achieved encouraging performance in simulating above-water scenes. However, can Gaussian splatting accurately simulate dynamic underwater scenes? The answer is no.
First, light in underwater environments is significantly attenuated and scattered by water, leading to degraded image quality and distorted color information. In contrast, light usually travels farther in above-water scenes with less attenuation and scattering. Additionally, color information in underwater environments is easily affected by the absorption of different wavelengths of light by water, particularly as red light travels a shorter distance. As the distance between objects and the camera increases, objects become increasingly difficult to observe, leading to the most critical dynamic factor in underwater scenes: depth-related visibility.
Moreover, underwater environments often contain numerous plankton and suspended particles, and the underwater ecosystem is filled with marine plants and animals, which cause image blurring and increased noise, making reconstruction based solely on color and brightness information more challenging. This introduces additional dynamic artifacts. Furthermore, underwater images typically have low contrast, poor visibility, and blurred object boundaries, which introduce geometric ambiguity rarely encountered in above-water scenes. The interplay of these dynamic factors creates a complex and multifaceted environment that current GS models struggle to accurately understand and represent. Therefore, developing methods to effectively manage this environment is crucial for a neural representation of underwater scenes.
In this paper, we propose a method specifically designed for dynamic underwater scenes, utilizing prior information from a pre-trained depth estimation model [10] and smoothness constraints between adjacent images. We employ the estimated depth as a geometric prior to aid in color-based optimization, reducing artifacts and achieving accurate geometric information. We observe that depth guidance strongly aids in reconstructing geometric information in GS. Additionally, to achieve higher stability, we employ smoothness constraints between adjacent images, ensuring similar depth for neighboring 3D points in the absence of boundary conditions. Our method leverages the concept of symmetry by maintaining consistent depth and color information across multiple views, ensuring balanced and uniform reconstruction quality throughout the scene. We use 4DGS as a baseline and compare our method against it. Our strategy produces more reasonable results not only in RGB novel view synthesis but also in 3D geometric reconstruction.
In summary, our contributions are as follows:
• We propose the first method specifically designed for reconstructing dynamic underwater scenes.
• We introduce a depth guidance strategy to enhance the Gaussian splatting optimization process, resulting in a more accurate geometric estimation and improved reconstruction performance.
• Our experimental results demonstrate that our method achieves state-of-the-art performance across various challenging scenarios, including those with large movements, indistinguishable backgrounds and moving objects, small movements, and multiple moving objects.

Related Works

Neural Rendering Techniques for Dynamic Environments
Neural rendering has garnered significant attention in academia due to its remarkable capabilities in generating photorealistic images. Neural Radiance Fields (NeRFs) [5], which employ Multi-Layer Perceptrons (MLPs), have notably advanced research in novel view synthesis. Subsequent studies have extended NeRFs to a variety of applications, including mesh reconstruction from image collections [11,12], relighting through inverse rendering [13][14][15], camera parameter optimization [16][17][18], and few-shot learning [19,20]. Among these, constructing radiance fields for dynamic scenes is a crucial branch of NeRF development, holding significant relevance for real-world applications.
Rendering dynamic scenes inherently involves sparse reconstruction from single viewpoints, where encoding and effectively utilizing temporal information is a primary challenge, especially in monocular dynamic scene reconstruction. One approach to addressing this challenge involves modeling scene deformations by incorporating time t as an additional input to the radiance field. However, this method couples positional changes due to temporal variations with the radiance field, lacking geometric priors on the effect of time on the scene. As a result, extensive regularization is required to ensure temporal consistency in the rendered outputs.
An alternative approach [21][22][23] introduces a deformation field to decouple time and the radiance field. This method maps point coordinates to a canonical space corresponding to time t, facilitating the learning of noticeable rigid-body movements while remaining flexible enough to accommodate scenes with topological changes. Other strategies aim to improve the quality of dynamic neural rendering by segmenting static and dynamic objects within the scene [24,25], incorporating geometric priors through depth information [26], introducing 2D Convolutional Neural Networks (CNNs) to encode scene priors [27,28], and leveraging redundancy in multi-view videos for keyframe compression storage to expedite rendering [29].
Despite advancements in accelerating training speeds, real-time rendering of dynamic scenes remains challenging, particularly with monocular inputs. This challenge is further exacerbated in underwater scenes due to the medium's impact on light propagation, which makes geometric information more difficult to obtain. Without geometric priors, the accuracy of reconstruction is significantly compromised. Our approach aims to establish an efficient training and rendering pipeline for underwater dynamic scenes, ensuring high-quality outputs even under the influence of the medium, thus addressing the prevalent challenges in such environments.

Underwater Imaging
For several decades, the computer vision community has focused extensively on analyzing underwater images. Commonly, physics-based scene modeling is utilized to address the impact of water on light propagation, as explained by the radiative transfer equation [30]. However, executing a full computation with Monte Carlo simulations is too costly for real-time rendering. Consequently, many physics-based approaches incorporate specific priors to simplify the problem, such as the dark channel prior [31], white balance [32], or haze line prior [33], to separate backscatter and transmission components. Some methods address the challenge for a single water type with a constant attenuation coefficient [34,35], while others employ data-driven optimization techniques to mitigate image degradation [36]. Despite these advancements, removing the effects of water from 2D images remains a challenging and ill-posed problem.
A more physics-grounded model concerning light propagation has been proposed for underwater image restoration [37,38], but it still requires known depth information. Recent advancements in underwater scene reconstruction have extended the application of underwater images to three dimensions. Sethuraman et al. [39] and Zhou et al. [40] developed methods to estimate medium parameters from histogram-equalized images, learn color distributions, and restore underwater images independently of rendering.
Zhang et al. [41], in their neural-sea framework, proposed a physical model for underwater robots with integrated light sources, addressing and correcting color distortions by considering specific lighting conditions and analyzing coefficients of the underwater physical model across different distances. Additionally, Zhang et al. [42] modularized the underwater 3D scene reconstruction system to enhance both color restoration and structural reconstruction. Levy et al.'s SeaThru-NeRF [43] demonstrates the significant benefits of incorporating diverse scattering media models into the rendering equation, which simultaneously reconstructs the scene and its 3D structure.
However, it is important to note that existing underwater image processing methods still primarily operate in 2D, and 3D scene reconstruction methods center on static scenes, overlooking the dynamic information frequently present in real-world underwater data. These dynamic elements are crucial and are often encountered in the underwater environments we aim to capture and reconstruct.

Preliminaries
In this section, we provide a concise review of the representation and rendering process of 3DGS [6] in Section 3.1 along with the formulation of dynamic NeRFs in Section 3.2.

3D Gaussian Splatting
3D Gaussian Splatting (3DGS) [6] is a point-based technique employing anisotropic 3D Gaussians to represent scenes, facilitating fast and precise rendering. Each Gaussian primitive is characterized by its central position x = (x, y, z), opacity σ, and a 3D covariance matrix Σ.
Each 3D Gaussian is also associated with spherical harmonic (SH) coefficients to represent view-dependent appearance characteristics. During rendering, these SH coefficients reconstruct color and luminance according to the viewing direction, enabling dynamic adjustment of each point's appearance through Gaussian splatting. This approach merges detailed volumetric rendering with real-time performance via a CUDA-optimized, differentiable Gaussian rasterization process. The mathematical representation of a Gaussian is defined by the following [44]:

$$G(X) = e^{-\frac{1}{2} X^{T} \Sigma^{-1} X}$$

where X is the offset of a query point from the Gaussian center x. To facilitate the learning of 3D Gaussians, Σ is divided into two learnable components: the quaternion [45] r represents rotation, and the 3D vector s represents scaling. These elements are subsequently converted into the respective rotation and scaling matrices R and S. The resulting Σ can be expressed as follows:

$$\Sigma = R S S^{T} R^{T}$$

3D Gaussians are mapped to 2D through a 2D covariance matrix Σ′, as defined in the following [44]:

$$\Sigma' = J V \Sigma V^{T} J^{T}$$

where J denotes the Jacobian matrix of the affine approximation of the projective transformation, and V represents the view matrix that maps world coordinates to camera coordinates. Similar to other point-based rendering techniques [46][47][48][49], 3DGS uses alpha blending to compute the color C of each pixel as follows:

$$C = \sum_{i=1}^{N} T_i \alpha_i c_i, \quad \alpha_i = 1 - e^{-\sigma_i \delta_i}, \quad T_i = \prod_{j=1}^{i-1} (1 - \alpha_j)$$

where $c_i$ is the color of the i-th point, $\alpha_i$ is its alpha value, $\sigma_i$ is its opacity, $\delta_i$ is the distance between points, and $T_i$ is the accumulated transmittance.
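The construction of Σ from a rotation quaternion r and a scaling vector s, and its projection to the 2D covariance Σ′, can be illustrated with a short dependency-free sketch. The compositions Σ = R S Sᵀ Rᵀ and Σ′ = J V Σ Vᵀ Jᵀ are the standard 3DGS forms [6]; the function names, the (w, x, y, z) quaternion ordering, and the use of a 3×3 matrix for V are assumptions made for illustration, not the CUDA implementation used in practice.

```python
import math


def quat_to_rotmat(q):
    """Convert a quaternion (w, x, y, z) into a 3x3 rotation matrix R."""
    w, x, y, z = q
    n = math.sqrt(w * w + x * x + y * y + z * z)  # normalize first
    w, x, y, z = w / n, x / n, y / n, z / n
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]


def matmul(a, b):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def covariance_3d(q, s):
    """Sigma = R S S^T R^T, with S the diagonal scaling matrix built from s."""
    R = quat_to_rotmat(q)
    S = [[s[0], 0.0, 0.0], [0.0, s[1], 0.0], [0.0, 0.0, s[2]]]
    RS = matmul(R, S)
    return matmul(RS, transpose(RS))


def covariance_2d(J, V, sigma):
    """Sigma' = J V Sigma V^T J^T; J is 2x3 here, V is assumed to be 3x3."""
    JV = matmul(J, V)
    return matmul(matmul(JV, sigma), transpose(JV))
```

Because Σ is assembled as (RS)(RS)ᵀ, it stays symmetric and positive semi-definite for any values of r and s, which is why the factorized parameters are convenient to optimize directly by gradient descent.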

Dynamic Scene Rendering with Deformation Fields
NeRFs dealing with dynamic scenes are mainly categorized into two classes: canonical-mapping volume rendering [21][22][23]50,51] and time-aware volume rendering [52][53][54][55][56]. All dynamic NeRF algorithms so far can be represented as follows:

$$(c, \sigma) = \mathrm{MLP}(x, d, t)$$

where MLP is a mapping that transforms the 6D space (x, d, t) into the 4D space (c, σ). Canonical-mapping volume rendering maps each sampled point to a canonical space using a deformation network $\phi_t : (x, t) \rightarrow \Delta x$. Then, a canonical network $\phi_c$ is used to compute the volume density and view-dependent RGB color for each ray.
Drawing inspiration from NeRF methods, a 4D Gaussian splatting framework introduces an innovative rendering technique to enable real-time dynamic scene rendering. This method applies a Gaussian deformation field network F at time t to transform 3D Gaussians, followed by differentiable splatting.
In essence, given a view matrix M = [R, T] and a timestamp t, the 4D Gaussian splatting framework incorporates 3D Gaussians G and a Gaussian deformation field network F. A novel-view image Î is generated using differentiable splatting [7] DS as described below:

$$\hat{I} = \mathrm{DS}(M, G')$$

where $G' = \Delta G + G$. In particular, the deformation of 3D Gaussians, ∆G, is introduced through the Gaussian deformation field network as follows:

$$\Delta G = F(G, t)$$

In this framework, the spatial-temporal structure encoder E captures both temporal and spatial characteristics of the 3D Gaussians as follows:

$$f_d = E(G, t)$$

The multi-head Gaussian deformation decoder D then processes these encoded features to predict the deformation of each 3D Gaussian as follows:

$$\Delta G = D(f_d)$$

The deformed 3D Gaussians, denoted as G′, are subsequently utilized for rendering the final image.
This method enables efficient and accurate dynamic scene rendering by directly incorporating temporal and spatial deformations of 3D Gaussians. However, this approach is designed for above-water scenes. Compared with above-water scenes, underwater scenes are affected by distance-dependent light propagation through the water medium, by the presence of diverse marine life, and by the resulting degradation of color detail. Therefore, simple color constraints are insufficient for underwater dynamic scene reconstruction. We propose a depth-regularized optimization method for underwater dynamic scene reconstruction, UDR-GS, which utilizes depth information independent of the water medium to achieve precise and high-fidelity underwater dynamic scene reconstruction.
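The deformation step of the 4D Gaussian splatting framework described in this subsection, in which a field F predicts a per-Gaussian offset at time t that is added to the canonical Gaussians, can be sketched as follows. The `Gaussian` container and `toy_deformation_field` are hypothetical stand-ins introduced only for illustration: in 4DGS the field is a learned encoder and decoder network, not a closed-form function, and it also deforms rotation.

```python
from dataclasses import dataclass


@dataclass
class Gaussian:
    position: tuple  # (x, y, z) center of the Gaussian
    scale: tuple     # (sx, sy, sz) scaling vector


def toy_deformation_field(gaussian, t):
    """Hypothetical stand-in for the learned field F(G, t).

    Returns (delta_position, delta_scale). Here every Gaussian simply
    drifts along the x axis at a constant rate and keeps its shape.
    """
    delta_position = (0.1 * t, 0.0, 0.0)
    delta_scale = (0.0, 0.0, 0.0)
    return delta_position, delta_scale


def deform(gaussians, t, field=toy_deformation_field):
    """Apply G' = G + delta_G to every canonical Gaussian at time t."""
    deformed = []
    for g in gaussians:
        d_pos, d_scale = field(g, t)
        deformed.append(Gaussian(
            position=tuple(p + d for p, d in zip(g.position, d_pos)),
            scale=tuple(s + d for s, d in zip(g.scale, d_scale)),
        ))
    return deformed
```

Keeping the canonical Gaussians unchanged and predicting only offsets means one set of primitives is stored for the whole sequence, and the deformed set for any timestamp is produced on demand before splatting.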

Methods
In this section, we detail our methodology for optimizing and rendering dynamic underwater scenes. Our approach starts with preprocessing steps, including structure from motion (SfM) to obtain essential camera parameters and a sparse point cloud. We then employ state-of-the-art models to estimate the depth for each image at the current pose (Section 4.1). These depths are integrated into the Gaussian splatting process to enhance geometric accuracy. We employ color rasterization to render depth from Gaussian splats and enforce a depth constraint using the dense depth prior (Section 4.2). Additionally, we incorporate smoothness constraints for the depths of adjacent pixels to ensure higher stability (Section 4.3).
Figure 1 presents an overview of our method. Our approach optimizes for a set of underwater images $\{I_i\}_{i=0}^{k-1}$, $I_i \in [0, 1]^{H \times W \times 3}$. Similar to 3DGS, we first run structure-from-motion (SfM) methods like COLMAP to obtain camera poses $R_i \in \mathbb{R}^{3 \times 3}$, $t_i \in \mathbb{R}^{3}$, intrinsic parameters $K_i \in \mathbb{R}^{3 \times 3}$, and a sparse point cloud $P \in \mathbb{R}^{n \times 3}$. Our method is built on 4DGS, which extends 3DGS. The Gaussian splats are optimized based on the rendered image through a color loss function $L_{color}$ and a D-SSIM loss $L_{D\text{-}SSIM}$.
Figure 1. Overview of UDR-GS. Similar to 3DGS, a set of multi-view images is used as input, and the initial point cloud and camera poses are obtained using structure from motion (SfM). The point cloud is initialized as 3D Gaussian primitives, and the scene state at time t is derived following the method of 4DGS. During scene optimization, the depth predicted by Depth Anything supervises the rendered depth, and the depths of adjacent frames are smoothed.

Enhanced Explanation with Depth Anything Model Integration
Gaussian splats exhibit locality and are insufficient for accurately guiding the synthesis of reasonable geometric shapes. Additionally, underwater images typically suffer from low contrast, poor visibility, and blurred object boundaries. This issue is exacerbated when there are moving objects in the scene, as object motion and deformation further lead to inaccurate geometry, increasing the difficulty of reconstruction. Depth information helps better define object boundaries and improves reconstruction accuracy. However, constructing such information poses significant challenges. Structure-from-motion (SfM) methods can obtain sparse depth maps, but in underwater environments, it is difficult to capture densely sampled images of the scene. Consequently, the effective number of points is limited, making it challenging to achieve dense depth information.
To obtain global geometric information for accurate geometric reconstruction and to further guide the learning of color information, we utilize a state-of-the-art monocular depth estimation model, Depth Anything [10], to offer dense guidance for optimization. Depth Anything is designed to perform robust monocular depth estimation under diverse conditions by leveraging a vast amount of unlabeled data. By scaling up the dataset and employing challenging optimization targets and auxiliary supervision from pre-trained encoders, Depth Anything achieves impressive generalization ability across various unseen scenes.
From the training image $I_i$ at pose $R_i$, the monocular depth estimation model $D_\theta$ produces a dense depth map $D_\theta(I_i)$. We utilize this dense depth information to regularize the optimization of Gaussian splatting. To integrate the depth information effectively, we define the depth loss $L_{depth}$ (Equation (13)) over the N pixels of each image as the per-pixel discrepancy between $D_{depth}(I_i)$, the depth map estimated from the Gaussian splatting model, and $D_\theta(I_i)$, the depth map provided by the Depth Anything model. This regularization ensures that the geometric structure of the scene is accurately captured, which is crucial for reconstructing underwater scenes with moving objects and challenging visibility conditions.
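One plausible instance of such a depth loss is the mean absolute difference between the rendered depth map and the prior depth map, averaged over the N pixels. The choice of an L1 penalty is an assumption made for this sketch, as are the function name and the nested-list image layout; the text above does not fix the norm.

```python
def depth_loss(rendered_depth, prior_depth):
    """Mean absolute difference between two depth maps of equal size.

    Both arguments are lists of rows. The L1 penalty is an assumed
    choice for illustration, not a confirmed detail of the method.
    """
    total = 0.0
    n_pixels = 0
    for rendered_row, prior_row in zip(rendered_depth, prior_depth):
        for rendered, prior in zip(rendered_row, prior_row):
            total += abs(rendered - prior)
            n_pixels += 1
    return total / n_pixels
```

In practice a monocular estimator typically predicts depth only up to an unknown scale and shift, so the prior would usually need to be aligned to the scale of the rendered depth before such a comparison is meaningful.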

Depth Rendering through Rasterization
3D Gaussian splatting employs a rasterization pipeline to render the disconnected and unstructured splats, leveraging the parallel architecture of GPUs. Using differentiable point-based rendering techniques, the splats are rasterized through α-blending to render an image. Point-based approaches utilize a similar equation to NeRF-style volume rendering, rasterizing a pixel color with ordered points that cover that pixel as follows:

$$C = \sum_{i \in N} c_i \alpha_i T_i, \quad T_i = \prod_{j=1}^{i-1} (1 - \alpha_j) \tag{14}$$

where C represents the pixel color, c denotes the color of the splats, and α is the learned opacity multiplied by the 2D Gaussian evaluated with the projected covariance. This formulation gives priority to the color c of opaque splats that are positioned closer to the camera, significantly influencing the final color C. Drawing inspiration from the depth implementation in NeRF, we utilize the rasterization pipeline to render the depth map of Gaussian splats as follows:

$$D = \sum_{i \in N} d_i \alpha_i T_i \tag{15}$$

where D represents the rendered depth and $d_i$ is the depth of each splat from the camera. Equation (15) allows for the direct use of $\alpha_i$ and $T_i$ calculated in Equation (14), facilitating rapid depth rendering with minimal computational load.
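The compositing described above, in which each splat contributes in proportion to its alpha value and to the transmittance accumulated in front of it, and in which depth reuses the same weights as color, can be sketched for a single pixel as follows. The function name and the tuple layout are assumptions made for illustration; the actual rasterizer runs this per pixel in parallel on the GPU.

```python
def composite_pixel(splats):
    """Blend color and depth for one pixel.

    `splats` is a list of (color, depth, alpha) tuples sorted front to
    back, where color is an (r, g, b) tuple. Returns (color, depth).
    """
    color = [0.0, 0.0, 0.0]
    depth = 0.0
    transmittance = 1.0  # T_i: fraction of light not yet absorbed
    for splat_color, splat_depth, alpha in splats:
        weight = alpha * transmittance  # shared by color and depth
        color = [acc + weight * ch for acc, ch in zip(color, splat_color)]
        depth += weight * splat_depth
        transmittance *= (1.0 - alpha)
    return color, depth
```

For example, two splats with alpha 0.5 at depths 1 and 2 receive weights 0.5 and 0.25, so the rendered depth is 0.5 · 1 + 0.25 · 2 = 1.0. Since the weights need not sum to one, the rendered depth is biased toward zero wherever the accumulated opacity is low.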

Unsupervised Smoothness Constraint
In underwater dynamic scenes, complex lighting conditions and moving objects introduce noise and discontinuities in depth map estimation. Accurately capturing the edges and details of objects in such dynamic environments is a crucial challenge. Inspired by [57], we propose an unsupervised depth smoothness constraint that minimizes the depth value differences between adjacent pixels. This approach reduces discontinuities between neighboring pixels and ensures that points in similar 3D positions have consistent depths on the image plane, thereby enhancing geometric consistency. To avoid incorrect regularization in boundary areas, we use the Canny edge detector [58] as a mask, ensuring that regions with significant depth differences along boundaries remain unregularized.
For a depth $d_i$ and its adjacent depth $d_j$, we regularize the difference between them with a smoothness loss $L_{smooth}$ (Equation (16)), in which each pair is weighted by $\mathbb{1}_{ne}$, an indicator function that indicates whether both depths lie outside edge regions.
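A minimal sketch of such an edge-masked smoothness term is given below. It assumes a squared difference between each pixel and its right and lower neighbors, and it takes the edge mask as a precomputed boolean grid (for instance the output of a Canny detector); the penalty form, the neighborhood, and the function name are assumptions made for illustration.

```python
def smoothness_loss(depth, edge_mask):
    """Sum of squared depth differences between adjacent non-edge pixels.

    `depth` and `edge_mask` are lists of rows of equal size; a True
    entry in `edge_mask` marks an edge pixel. A pair contributes only
    when neither pixel lies on an edge.
    """
    height, width = len(depth), len(depth[0])
    total = 0.0
    for y in range(height):
        for x in range(width):
            # Visit the right and lower neighbors so each pair counts once.
            for ny, nx in ((y, x + 1), (y + 1, x)):
                if ny >= height or nx >= width:
                    continue
                if edge_mask[y][x] or edge_mask[ny][nx]:
                    continue
                total += (depth[y][x] - depth[ny][nx]) ** 2
    return total
```

Masking out edge pixels keeps a genuine depth discontinuity, such as the boundary between a fish and the seabed behind it, from being smoothed away.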

Loss Function
We finalize the loss by adding the depth loss from Equation (13) and the smoothness loss from Equation (16), weighted by their respective hyperparameters $\lambda_{depth}$ and $\lambda_{smooth}$, to the two preceding loss terms $L_{color}$ and $L_{D\text{-}SSIM}$, which correspond to the original 3D Gaussian splatting losses [6]. The hyperparameters are set to $\lambda_{depth} = 0.05$ and $\lambda_{smooth} = 0.1$.
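The weighted combination can be sketched as follows. The values 0.05 and 0.1 are those stated above; mixing the color and D-SSIM terms as (1 − λ) and λ with λ = 0.2 follows the original 3DGS formulation [6] and is an assumption here, since the text does not say how the two photometric terms are balanced.

```python
def total_loss(l_color, l_dssim, l_depth, l_smooth,
               lambda_depth=0.05, lambda_smooth=0.1, lambda_dssim=0.2):
    """Combine photometric and geometric loss terms into one scalar.

    lambda_dssim = 0.2 is the 3DGS default and an assumption here;
    lambda_depth and lambda_smooth are the values given in the text.
    """
    photometric = (1.0 - lambda_dssim) * l_color + lambda_dssim * l_dssim
    geometric = lambda_depth * l_depth + lambda_smooth * l_smooth
    return photometric + geometric
```

With these weights the geometric terms act as a mild regularizer: for loss values of comparable magnitude they contribute well under a fifth of the total, so the photometric terms still dominate the optimization.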

Experiments
Experiment Settings

Datasets
Our method, primarily designed for dynamic underwater scenes, was validated on four representative underwater dynamic scene clips captured from the internet (data source: https://www.youtube.com/watch?v=jBAV9pdzjfQ, accessed on 31 July 2024): (1) the Robot dataset, representing the most common case of a single moving object; (2) the Fish dataset, where the fish have colors very similar to the seabed background; (3) the Coral dataset, characterized by very small movements; (4) the Streaks dataset, which includes multiple moving objects within the scene simultaneously.
All data were adjusted to a resolution of 560 × 360 pixels, with each sequence containing between 30 and 110 images, depending on the number of images obtained from the clips. Examples of the data are illustrated in Figure 2; these data will be open-sourced at https://drive.google.com/drive/folders/12xmcnH6pZPV3jGsRsrrUl76XhDRnIM6E (accessed on 31 July 2024).

Implementation Details
To ensure a fair comparison, we processed all images of each scene using COLMAP to obtain consistent reference camera poses. The experiments were conducted on a single NVIDIA GeForce RTX 3090 GPU. Our optimization parameters were fine-tuned according to the configurations specified in 3DGS. We employed a coarse-to-fine training strategy, with the coarse stage consisting of 3k iterations, and we report the results from 14k iterations for the fine stage.

Experimental Results
Unfortunately, there are currently no methods specifically designed for handling underwater dynamic scenes. Therefore, we compared our approach, UDR-GS, with the state-of-the-art static scene reconstruction method, 3DGS, and the state-of-the-art dynamic reconstruction method, 4DGS. As illustrated in Figure 3, the qualitative results indicate that our method significantly outperforms the baseline methods on datasets representing various scenarios. The advantage is clearest in depth rendering, where our method produces superior results for single moving objects, multiple moving objects, small movements, and cases where moving objects are difficult to distinguish from the background. In contrast, 4DGS often results in depth blurring or outright depth rendering errors. The quantitative results are detailed in Table 1, which compares our method with the baselines across the four proposed datasets using commonly used metrics: PSNR, SSIM, and LPIPS. Our method outperforms the baselines on nearly all metrics.
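For reference, PSNR, the first of the metrics listed above, follows directly from the mean squared error between a rendered image and its ground truth. The sketch below uses the standard definition; representing an image as a flat list of pixel values in [0, 1] is an assumption made for brevity, and this is not the evaluation code behind Table 1.

```python
import math


def psnr(rendered, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two flat pixel lists."""
    if len(rendered) != len(target) or not rendered:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - t) ** 2 for r, t in zip(rendered, target)) / len(rendered)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)
```

Because PSNR is logarithmic in the error, a relative gain of about 1% on a typical score corresponds to a gain of a few tenths of a decibel.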

Ablation Studies
To assess the effectiveness of each proposed component, we performed ablation experiments with 4DGS as the baseline. We specifically tested the impact of the depth-guided component and the depth smoothness constraint. Table 2 presents the ablation results on the Coral dataset, analyzing the metrics of PSNR, SSIM, and LPIPS. The introduction of the depth-guided component significantly improves all metrics for this dataset, and the addition of the smoothness constraint further enhances the performance metrics. Figure 4 illustrates the qualitative results of this process. In the RGB images, the improvement can be observed by comparing the red animal's antennae in our results with those in the ground truth (GT) images. For depth rendering, comparing the GT depth with the rendered depths shows that 4DGS produces erroneous depth estimates. As we incrementally incorporate our constraints, the rendered depths progressively align more closely with the GT. Although there is still room for improvement in rendering distant depths, we attribute this to defocusing effects in the dataset for images at greater distances. In future work, we plan to introduce physical priors of underwater imaging to address this issue. These ablation experiments highlight the necessity of the proposed components. Additionally, for scenarios represented by the Coral dataset where object movement is minimal, we observed depth estimation errors in the upper-left corner across all methods. This issue arises due to changes in focal length. The Coral dataset appears to have been captured using macro photography, leading to background blurriness, which subsequently causes depth blurriness during rendering.

Discussion
Our method outperforms the current state-of-the-art static and dynamic scene reconstruction techniques on almost all commonly used metrics across four datasets, demonstrating the feasibility of depth guidance for dynamic underwater scene reconstruction. However, it has certain limitations.

Limitation
First, it is highly dependent on the performance of the monocular depth estimation model. If the depth estimation model is not accurate or robust enough, it can negatively impact the overall reconstruction quality. This dependency means that advancements in monocular depth estimation are crucial for improving our method. Second, due to the often-limited perspectives in underwater optical data, complex parameter learning may lead to overfitting on the training set. This is especially problematic in underwater environments where obtaining diverse and representative training data is challenging. Overfitting can reduce the generalizability of the model to new, unseen data, limiting its practical applicability. Moreover, if the dataset contains defocused data, it can also result in inaccurate depth rendering. Underwater images often suffer from defocus due to the scattering and absorption properties of water, which can further complicate depth estimation and reconstruction tasks. Lastly, COLMAP may fail to obtain camera parameters on smooth, textureless surfaces or challenging surfaces, affecting the reconstruction accuracy. This limitation is particularly evident in underwater scenes where many surfaces lack distinct features or textures for reliable camera parameter estimation. Consequently, the accuracy and reliability of the reconstructed scenes can be compromised.

Future Work
In the future, we plan to explore incorporating the physical properties of water flow or underwater objects into dynamic underwater scene reconstruction to achieve more reliable geometric and color consistency.

Conclusions
Reconstructing dynamic underwater scenes is of great significance in marine biology, environmental monitoring, and underwater robotic operations. The accurate reconstruction of dynamic underwater environments is crucial for advancing our understanding and preservation of underwater ecosystems. In this paper, we extended 4D Gaussian splatting to represent dynamic underwater environments, addressing key challenges such as distance-related variations, dynamic objects, and ambiguous geometric cues. Our method demonstrates the effectiveness of geometric guidance through the application of depth information. Specifically, we implemented depth constraints and smoothness constraints, which are essential for underwater scene reconstruction, particularly for dynamic underwater scenes.
Our experiments show that these constraints significantly enhance the reconstruction performance of dynamic underwater scenes based on Gaussian splatting. The introduction of geometric priors allows for more accurate and robust reconstructions, mitigating issues such as depth blurring and rendering errors that are common in previous methods. Although having a larger dataset would provide more comprehensive insights, it would also significantly increase the experimental duration. Therefore, we chose to conduct experiments on the most representative dataset to validate our conclusions. This paper pioneers the application of depth and smoothness constraints in Gaussian splatting for dynamic underwater scene reconstruction, pushing the boundaries of underwater scene reconstruction and unlocking new opportunities for research and applications across various underwater domains.

Figure 2. Examples of underwater dynamic scene datasets.

Table 1. Comparison of different methods on various datasets. The arrows indicate performance trends, where an upward arrow (↑) denotes that higher values indicate better performance, and a downward arrow (↓) denotes that lower values indicate better performance.

Table 2. Ablation study results on the Coral dataset.
Figure 4. Ablation results on testing sets.