Gaussian-UDSR: Real-Time Unbounded Dynamic Scene Reconstruction with 3D Gaussian Splatting
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
In this article, the authors propose Gaussian-UDSR, a novel 3D Gaussian-based representation that efficiently reconstructs and renders high-quality, unbounded dynamic scenes in real-time. The article presents an interesting topic, but it requires some improvements:
- It is not usual to present terms in bold in the abstract.
- The problem addressed needs to be clearly defined.
- The contributions presented in the introduction summarise results. In this section of the article, the contributions expected from the research's development are presented prior to the results. They need to focus on the academic/practical point of view.
- The theoretical support for the methodological decisions of the article is superficial. The authors should provide more explanations and justifications for their choices from a methodological point of view, explaining the algorithm and defined parameters.
- The way the data was processed should be better explained.
- Tables 1 to 3 and figures 3 to 8 of the article are not explained sufficiently. The figures and tables need to be further detailed and discussed.
- Section 4.4 is superficial. It is necessary to present greater detail on the research applications, comment on the advantages, and compare it with other similar studies to demonstrate where this research differs and advances from a scientific point of view.
- The conclusion seems to exaggerate a bit, extrapolating the real results obtained from the research.
- What were the limitations of the research? This needs to be clear in the conclusion.
- Based on the results, are there future research perspectives? This could be addressed in the conclusion.
Author Response
Thank you very much for your guidance and comments on our paper. Please find the following
detailed responses to your comments and suggestions.
Comments 1: It is not usual to present terms in bold in the abstract.
Response 1: We sincerely thank the reviewer for the careful reading and apologize for this oversight. We have removed the bold formatting from the terms in the abstract to follow standard academic writing conventions.
Location in revised manuscript: Abstract, Page 1, Lines 17–24.
[Revised sentence:]
In this paper, we propose Gaussian-UDSR, a novel 3D Gaussian-based representation that efficiently reconstructs and renders high-quality, unbounded dynamic scenes in real-time. Our approach fuses LiDAR point clouds and Structure-from-Motion (SfM) point clouds obtained from an RGB camera, significantly improving depth estimation and geometric accuracy. To address dynamic appearance variations, we introduce a Gaussian color feature prediction network, which adaptively captures global and local feature information, enabling robust rendering under changing lighting conditions. Additionally, a pose-tracking mechanism ensures precise motion estimation for dynamic objects, enhancing realism and consistency.
Comments 2: The problem addressed needs to be clearly defined.
Response 2: Thank you for your valuable feedback. We have revised the abstract and introduction to explicitly define the problem: the challenge of reconstructing unbounded dynamic scenes in real time with high fidelity in the presence of environmental complexities such as motion, lighting variation, and sensor noise.
Location in revised manuscript: Abstract, Page 1, Lines 14–17; Introduction, Page 2, Lines 39–42.
[Abstract revised sentence:]
However, existing methods struggle to reconstruct dynamic scenes in unbounded outdoor environments due to challenges such as lighting variation, object motion, and sensor limitations, leading to inaccurate geometry and low rendering fidelity.
[Introduction revised sentence:]
These capabilities require us to reconstruct 3D scenes from captured environmental information efficiently and render high-quality novel views in real time, which remains a challenge in unbounded dynamic environments.
Comments 3: The contributions presented in the introduction summarise results. In this section of the article, the contributions expected from the research's development are presented prior to the results. They need to focus on the academic/practical point of view.
Response 3: Thank you for your valuable observation. We have rewritten the contributions in the introduction to emphasize the expected academic and practical significance, avoiding direct mention of results.
Location in revised manuscript: Page 2, Lines 69-86.
[Revised sentence:]
In this work, we propose Gaussian-UDSR, a novel representation framework designed to tackle the core challenge of real-time reconstruction and rendering of unbounded dynamic scenes, which is essential for autonomous driving and other large-scale dynamic environments. Our key idea is to adopt 3D Gaussian splatting as a unified representation to jointly model static backgrounds and dynamic foreground objects. This representation is lightweight, differentiable, and well-suited for real-time processing. It also integrates LiDAR and SfM point cloud data, leveraging the precise geometry from LiDAR and dense texture information from SfM, thus enhancing scene reconstruction quality in unbounded outdoor environments.
Our second contribution is the introduction of a Gaussian feature prediction network, which replaces traditional spherical harmonic basis functions with a learnable module. This network effectively captures both global contextual information and local object-specific features, enabling robust appearance modeling under varying lighting conditions and occlusions.
Finally, we conduct comprehensive experiments on the Waymo and KITTI datasets. The results demonstrate that our method significantly outperforms existing state-of-the-art approaches in terms of both rendering quality and speed, validating its practical effectiveness in real-world autonomous driving scenarios [9, 12].
Comments 4: The theoretical support for the methodological decisions of the article is superficial. The authors should provide more explanations and justifications for their choices from a methodological point of view, explaining the algorithm and defined parameters.
Response 4: We appreciate this important suggestion. We have expanded Section 3 (Methodology) by supplementing the theoretical basis for our choice of the Gaussian model and the prediction network, and by providing detailed explanations of key parameters such as the quaternions and Jacobian matrices, which are visually explained through Figures 2 and 3. We also elaborate on the principle of the gradient-feedback-based adaptive density control module, which optimizes the distribution direction and shape of the Gaussian ellipsoids to ensure good coverage of the scene regions. A detailed comparison of the computational complexity of NeRF and our method is also presented.
We further clarify the description of the dynamic mask generator, with Section 3.3 providing detailed specifications of its input, output, internal processing logic, and mechanism. Additionally, the structure of the Gaussian Color Feature Prediction Network, the dimensions of its output features, processing procedures, and regularization strategies are explained in greater detail. Explanations of defined parameters have been added to Section 4.1.
Location in revised manuscript: Pages 5-11, Lines 171-379
[Revised sentence:]
Static background model. The static background model is represented as a set of three-dimensional Gaussian points in the world coordinate system, and the covariance matrix, the positional mean, the opacity, and the color jointly influence its shape expression:
Meanwhile, the covariance matrix $\Sigma$ can be decomposed into a scaling matrix $S$ and a rotation matrix $R$, where $S$ is represented by its diagonal elements and $R$ is represented by a unit quaternion. Quaternions are a compact and numerically stable representation of 3D rotations, defined by a four-dimensional vector $(q_w, q_x, q_y, q_z)$ subject to the unit-norm constraint. Compared to Euler angles, quaternions avoid gimbal lock and provide smooth interpolation (e.g., via SLERP), which is crucial for continuous pose tracking in dynamic scenes. Moreover, quaternions are more efficient and numerically stable than rotation matrices, as they require fewer parameters and avoid the need for orthonormalization. These advantages make quaternions particularly suitable for representing and optimizing camera and object orientations in our dynamic scene reconstruction framework. As shown in Figure 2, the unit sphere in the left figure is transformed into the ellipsoid in the right figure, and Eq. 2 describes this transformation process. The rotation matrix $R$ changes the orientation of the unit sphere, while the scaling matrix $S$ scales it. Through rotation and scaling operations, the originally isotropic unit sphere is transformed into an anisotropic ellipsoid, which can more flexibly describe the multivariate correlations in complex environments. The covariance matrix $\Sigma$ can be expressed as:
Figure 2. Visualization of linear transformations of a sphere
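For illustration, a minimal PyTorch sketch of how a unit quaternion and a diagonal scale vector yield the standard 3DGS covariance $\Sigma = R S S^{T} R^{T}$ (the function names and toy values are ours, not the paper's):

```python
import torch

def quaternion_to_rotation(q: torch.Tensor) -> torch.Tensor:
    """Convert a unit quaternion (q_w, q_x, q_y, q_z) into a 3x3 rotation matrix."""
    w, x, y, z = (q / q.norm()).tolist()           # re-normalize for numerical safety
    return torch.tensor([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])

def covariance_from_scaling_rotation(scale: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
    """Sigma = R S S^T R^T, with S the diagonal scaling matrix and R derived from q."""
    M = quaternion_to_rotation(q) @ torch.diag(scale)
    return M @ M.T                                  # symmetric positive semi-definite by construction

# Toy usage: an ellipsoid with semi-axes (2, 1, 0.5) rotated 90 degrees about the z-axis.
sigma = covariance_from_scaling_rotation(
    scale=torch.tensor([2.0, 1.0, 0.5]),
    q=torch.tensor([0.7071, 0.0, 0.0, 0.7071]),     # (w, x, y, z)
)
```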
In addition to this, rendering needs to project each Gaussian ellipsoid from 3D space onto the 2D view plane in order to obtain the image in the specified view direction. This projection is nonlinear, and it is handled by a local linear approximation of the multivariate projection function at a point, expressed through the Jacobian matrix. As shown in Figure 3, the Jacobian matrix $J$ (Eq. 3) in the projection model characterizes the local linear transformation from 3D spatial coordinates to the 2D projection plane, with its elements consisting of the partial derivatives of the projection coordinates with respect to the spatial coordinates. Specifically, the elements $f_x/z$ and $f_y/z$ in the matrix describe the reciprocal relationship between the scaling factors in the x and y directions and the depth z, reflecting the linear response of the projection coordinates to the spatial positions. In contrast, $-f_x x/z^{2}$ and $-f_y y/z^{2}$ capture the nonlinear perspective contraction effect of depth changes on the projections in the x and y directions, whose absolute values increase as the spatial points move away from the optical center (i.e., as z decreases). Additionally, the $1/z^{2}$ dependence of these terms quantifies the attenuation of the scaling factor by depth, revealing the nonlinear degradation of depth information during the projection process. The following is the local linear approximation of the multivariate function at a point:
Figure 3. Visualization of Jacobian Matrix in Projection Model
where $f_x$ is the focal length in the X-axis direction, $f_y$ is the focal length in the Y-axis direction, and the Z-axis row is set to 0, disregarding the Z-axis direction; $W$ is the transformation matrix from the world coordinate system to the camera coordinate system.
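As a minimal sketch (assuming the standard EWA-style projection used in 3DGS, with $W$ the world-to-camera rotation), the Jacobian above and the resulting 2D screen-space covariance can be written as:

```python
import torch

def projection_jacobian(p_cam: torch.Tensor, fx: float, fy: float) -> torch.Tensor:
    """Local linear approximation of the perspective projection at camera-space point (x, y, z);
    the last row is zeroed, matching the convention of disregarding the Z direction."""
    x, y, z = p_cam.tolist()
    return torch.tensor([
        [fx / z, 0.0,    -fx * x / z ** 2],
        [0.0,    fy / z, -fy * y / z ** 2],
        [0.0,    0.0,     0.0],
    ])

def project_covariance(cov_world: torch.Tensor, W: torch.Tensor,
                       p_cam: torch.Tensor, fx: float, fy: float) -> torch.Tensor:
    """2D covariance of the splatted Gaussian: keep the top-left 2x2 block of J W Sigma W^T J^T."""
    J = projection_jacobian(p_cam, fx, fy)
    return (J @ W @ cov_world @ W.T @ J.T)[:2, :2]
```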
Based on the view frustum, Gaussians whose 99% confidence intervals intersect it are retained and invalid Gaussians are culled; the remaining Gaussians are quickly sorted by depth in camera space, and the attributes of each 2D Gaussian are then queried for pixel-by-pixel rendering based on point-based α-blending:
where $\alpha_i$ represents the opacity value of the current point $i$, and $\alpha_j$ ($j < i$) represents the opacity value of each point in front of it. Multiplying $\alpha_i$ by the accumulated transparency $\prod_{j=1}^{i-1}(1-\alpha_j)$ as the color weight means that the more transparent all the previous points are, the greater the contribution of the color of point $i$ to the rendering. The scene semantics and depth can likewise be derived from the rendering (Eq. 5).
Figure 4. 3D Gaussians Splatting schematic.
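For illustration, a minimal sketch of point-based α-blending over Gaussians already sorted front-to-back by camera depth (toy values; not the paper's rasterizer):

```python
import torch

def alpha_blend(colors: torch.Tensor, alphas: torch.Tensor) -> torch.Tensor:
    """Composite the 2D Gaussians covering one pixel, sorted front-to-back.
    colors: (N, 3) RGB values, alphas: (N,) opacities in [0, 1].
    Each point contributes c_i * alpha_i * prod_{j<i}(1 - alpha_j)."""
    transmittance = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alphas[:-1]]), dim=0)
    weights = alphas * transmittance
    return (weights.unsqueeze(-1) * colors).sum(dim=0)

# Toy usage: a mostly opaque red point in front of a green point.
pixel = alpha_blend(colors=torch.tensor([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]),
                    alphas=torch.tensor([0.9, 0.8]))
```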
When expressing the scene, a large number of Gaussian distributions are superimposed to describe each scene region; if a region is under-reconstructed or over-reconstructed, the Gaussian distribution cannot express it accurately. The adaptive density control module therefore copies the Gaussian distribution of an under-reconstructed region into two Gaussian distributions and optimizes the direction and shape of the Gaussian ellipsoids based on the returned gradient to ensure that they fill the region well. For an over-reconstructed region, the larger Gaussian distribution is first split into two, the two resulting Gaussian ellipsoids are scaled down by a scaling factor, and their positions and shapes are likewise optimized according to the returned gradient so that they cover the region well.
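A schematic sketch of this clone/split rule is given below; the thresholds and the split factor are illustrative placeholders, not the values used in the paper:

```python
import torch

def densify(means, scales, grads, grad_threshold=2e-4, scale_threshold=0.01, split_factor=1.6):
    """One adaptive density control step: clone small Gaussians in under-reconstructed regions,
    split and shrink large Gaussians in over-reconstructed regions.
    means, scales, grads: (N, 3) tensors; grads holds accumulated positional gradients."""
    needs_update = grads.norm(dim=-1) > grad_threshold
    small = scales.max(dim=-1).values <= scale_threshold
    clone_mask = needs_update & small          # under-reconstructed: copy the Gaussian
    split_mask = needs_update & ~small         # over-reconstructed: split and scale down

    new_means = [means[clone_mask]]
    new_scales = [scales[clone_mask]]
    for _ in range(2):                         # two children per split Gaussian
        offsets = torch.randn_like(means[split_mask]) * scales[split_mask]
        new_means.append(means[split_mask] + offsets)
        new_scales.append(scales[split_mask] / split_factor)

    keep = ~split_mask                         # split parents are replaced by their children
    return torch.cat([means[keep]] + new_means), torch.cat([scales[keep]] + new_scales)
```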
Dynamic object modeling. Dynamic objects interact with the environment differently at different moments, and their scaling matrix and opacity are kept consistent with the static background model. The difference, however, is that the dynamic object's pose is defined in the object's local coordinate system. In order to transform it into the world coordinate system (i.e., the static background coordinate system) and introduce time $t$, we propose a pose tracking mechanism. Specifically, in the world coordinate system, we represent the change of the dynamic object's position in the time-flow field by a set of time-dependent translation vectors and rotation matrices:
where the upper limit denotes the maximum value of time (i.e., the time range over which the object's position changes).
At the same time, we add optimizable parameters that describe minor variations of the object's pose, aiming to reduce motion estimation errors and tracker noise:
The following equation gives the dynamic object's positional representation in the world coordinate system:
where the position and covariance on the right-hand side are those of the dynamic object defined in its local coordinate system, and its covariance matrix in the world coordinate system is:
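For illustration, a minimal sketch of placing a dynamic object's Gaussian into the world frame at one timestep; the function and variable names are ours, and this is a sketch of the idea rather than the paper's exact parameterization:

```python
import torch

def object_to_world(mu_obj, cov_obj, R_t, t_t, delta_R=None, delta_t=None):
    """Transform a dynamic object's Gaussian (local mean mu_obj, local covariance cov_obj)
    into the world frame using the tracked rotation R_t and translation t_t at time t,
    optionally refined by small optimizable corrections delta_R / delta_t."""
    R = R_t if delta_R is None else R_t @ delta_R
    t = t_t if delta_t is None else t_t + delta_t
    mu_world = R @ mu_obj + t
    cov_world = R @ cov_obj @ R.T          # rotation re-orients the covariance; scale and opacity stay fixed
    return mu_world, cov_world

# Toy usage: identity orientation with a half-metre translation along x at this timestep.
mu_w, cov_w = object_to_world(torch.zeros(3), 0.1 * torch.eye(3),
                              torch.eye(3), torch.tensor([0.5, 0.0, 0.0]))
```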
The computational complexity of NeRF mainly comes from two aspects: the evaluation of the neural network for each query point in the volume rendering process and the construction of the neural network itself. Let us assume that in a traditional NeRF-based method, the neural network has $L$ layers with $N_l$ neurons in the $l$-th layer ($l = 1, \dots, L$). For a single query point, the forward-pass computation in the neural network has a complexity of $O\left(\sum_{l=1}^{L-1} N_l N_{l+1}\right)$. In the volume rendering process, if we consider a scene with $V$ volume elements (voxels) and $R$ rays for rendering, the overall computational complexity is approximately $O\left(R \cdot V \cdot \sum_{l=1}^{L-1} N_l N_{l+1}\right)$.
Gaussian Splatting reduces the number of elements that need to be processed compared to the voxel-based approach in NeRF. Specifically, we represent the scene with $G$ Gaussian primitives, where $G \ll V$. The evaluation of the Gaussian-based model for a single ray has a complexity of $O(G)$. Moreover, our deep learning network is designed in a more lightweight way. Suppose our network has $L'$ layers with $N'_l$ neurons in the $l$-th layer ($l = 1, \dots, L'$), where $L' < L$ and $N'_l < N_l$ for most $l$. The forward-pass computation for a single query point in our network has a complexity of $O\left(\sum_{l=1}^{L'-1} N'_l N'_{l+1}\right)$. Considering the same number of rays $R$ for rendering, the overall computational complexity of our method is approximately $O\left(R \cdot G \cdot \sum_{l=1}^{L'-1} N'_l N'_{l+1}\right)$.
By comparison, it is evident that our method significantly reduces the computational complexity. In practical scenarios, we have observed that the reduction in the number of elements ($G \ll V$) and the lightweight design of the neural network lead to a decrease in computational complexity of at least several times compared to traditional NeRF-based methods, thus achieving higher efficiency in dynamic scene reconstruction for autonomous driving.
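As a back-of-envelope illustration of this comparison under purely hypothetical sizes (an 8-layer, 256-wide NeRF MLP queried at 128 samples per ray versus a 4-layer, 64-wide network evaluated once per Gaussian overlapping a ray; none of these numbers are taken from the paper):

```python
# Rough multiply-accumulate counts per rendered frame under the assumed sizes above.
nerf_layers = [256] * 8
ours_layers = [64] * 4
per_point_nerf = sum(a * b for a, b in zip(nerf_layers[:-1], nerf_layers[1:]))
per_point_ours = sum(a * b for a, b in zip(ours_layers[:-1], ours_layers[1:]))

rays, samples_per_ray, gaussians_per_ray = 1920 * 1080, 128, 32
nerf_cost = rays * samples_per_ray * per_point_nerf
ours_cost = rays * gaussians_per_ray * per_point_ours
print(f"NeRF ~{nerf_cost:.2e} MACs, ours ~{ours_cost:.2e} MACs, ratio ~{nerf_cost / ours_cost:.0f}x")
```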
3.3. Gaussian Color Feature Prediction Network
Illumination affects the expression of object color, i.e., the spherical harmonic coefficient. In contrast, the default ambient illumination in the original 3DGS is constant, so the object's color will be optimized incorrectly once the illumination changes. Hence, the spherical harmonic coefficient in the original 3DGS only applies to static scenes. Therefore, we use a neural network to predict the color features instead of the traditional spherical harmonic coefficients. The color prediction consists of five parts: feature extractor, dynamic mask generator, dynamic feature sampler, feature fusion network and color decoder.
The feature extractor and dynamic mask generator are based on the U-Net [39] architecture. As shown in Figure 5, ResNet [40] is used as the backbone to capture multi-level feature information from the image, and residual connections are introduced to alleviate the gradient vanishing and explosion problems during network training. Multiple intermediate-layer features are extracted from ResNet and fused in the decoder to recover finer-grained spatial information in the scene, combining low-level detail information, mid-level local semantic information, and high-level global semantic information to improve the segmentation accuracy between dynamic objects and the static background. The input to the dynamic mask generator includes multi-scale feature maps extracted by the U-Net from the input image, as well as the 2D projections of each Gaussian point. The purpose of this module is to identify which areas of the image correspond to dynamic objects and to generate a binary dynamic mask. For each Gaussian point, the corresponding feature responses are sampled from multiple feature map layers based on its image-plane position. These responses are then fused using a residual network and a semantic decoder to determine whether the point lies within a dynamic region. The output is a binary mask aligned with the image space, highlighting the areas associated with dynamic elements. This mask helps the feature prediction network focus on dynamic-specific cues, improving the accuracy of color and motion estimation and enhancing the overall fidelity of dynamic scene reconstruction.
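To make the data flow concrete, here is a schematic PyTorch sketch of such a mask generator; the layer widths and channel counts are hypothetical, and the actual module follows the U-Net/ResNet design described above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicMaskGenerator(nn.Module):
    """Schematic sketch: multi-scale feature maps are sampled at each Gaussian's 2D projection,
    fused by a small residual block, and decoded to a per-Gaussian dynamic probability."""
    def __init__(self, feat_dims=(64, 128, 256), hidden=128):
        super().__init__()
        self.proj = nn.Linear(sum(feat_dims), hidden)
        self.res_block = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                       nn.Linear(hidden, hidden))
        self.decode = nn.Linear(hidden, 1)

    def forward(self, feature_maps, uv):
        # feature_maps: list of (1, C_k, H_k, W_k) U-Net outputs;
        # uv: (N, 2) Gaussian projections in normalized [-1, 1] image coordinates.
        grid = uv.view(1, -1, 1, 2)
        samples = [F.grid_sample(fm, grid, align_corners=True)[0, :, :, 0].T
                   for fm in feature_maps]                # each: (N, C_k)
        h = self.proj(torch.cat(samples, dim=-1))
        h = h + self.res_block(h)                         # residual fusion
        return torch.sigmoid(self.decode(h)).squeeze(-1)  # (N,) dynamic-vs-static confidence

# Toy usage with three hypothetical feature scales.
fmaps = [torch.randn(1, c, 64 // 2 ** i, 64 // 2 ** i) for i, c in enumerate((64, 128, 256))]
mask = DynamicMaskGenerator()(fmaps, uv=torch.rand(500, 2) * 2 - 1)   # (500,)
```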
The dynamic feature sampler enables Gaussian points to dynamically capture global and local dynamic appearance feature information on feature map slices by introducing Gaussian point position attributes, focusing on meaningful regions on the feature map. Gaussian points are sampled on multiple feature maps denoted as:
where the sampled value is the predicted feature of the i-th sampled point on the m-th feature map; the coordinates of the learnable sampling point on the m-th feature map are obtained through the camera transform and the learning mechanism; the weight functions weight the contributions according to the horizontal and vertical coordinates of the sampled point on the m-th feature map; and the neighborhood indices, which run from 1 to 2, are used to calculate the weighted average of the neighborhood around each sampled point.
The predicted features of the i-th Gaussian point sampled from the k feature maps are concatenated to represent the dynamic appearance of the i-th Gaussian point:
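For illustration, a minimal sketch of this sampling step using bilinear interpolation (the 2×2 weighted neighborhood average described above); the channel counts and the normalized-coordinate convention are assumptions on our part:

```python
import torch
import torch.nn.functional as F

def sample_dynamic_features(feature_maps, uv):
    """Bilinearly sample every feature map slice at each Gaussian's learnable 2D location uv
    (assumed normalized to [-1, 1]) and concatenate the per-slice responses into one
    dynamic appearance vector per Gaussian point."""
    grid = uv.view(1, -1, 1, 2)                                        # (1, N, 1, 2)
    per_slice = [F.grid_sample(fm, grid, mode="bilinear", align_corners=True)[0, :, :, 0].T
                 for fm in feature_maps]                               # each: (N, C_m)
    return torch.cat(per_slice, dim=-1)                                # (N, sum of C_m)

# Toy usage with k = 4 hypothetical feature map slices of 16 channels each -> 64-dim features.
maps = [torch.randn(1, 16, 64, 64) for _ in range(4)]
features = sample_dynamic_features(maps, uv=torch.rand(100, 2) * 2 - 1)   # (100, 64)
```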
The Gaussian Color Feature Prediction Network uses an encoder-decoder structure that outputs a 64-dimensional Gaussian appearance feature vector during the color prediction phase. As shown in Figure 5, this representation is fused with positional and view-direction embeddings before being decoded into final RGB values. To prevent overfitting and promote dynamic feature sparsity, we apply dropout (p=0.1) before encoding and use an entropy loss to regularize the dynamic mask output. This helps encourage binary-like confidence and improves the separation of dynamic and static regions.
Specifically, dropout with a rate of 0.1 is applied to the input of the encoder when dropout=True, and entropy-based sparsity regularization is applied to the dynamic mask, formulated as:
where $m_i \in [0,1]$ is the predicted dynamic mask value for the $i$-th Gaussian and $\epsilon$ is a small constant to prevent numerical instability.
This regularization encourages the network to produce confident (close to 0 or 1) binary dynamic masks, which improves segmentation quality and downstream color modeling.
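A minimal sketch of such an entropy regularizer, consistent with the description above (the symbol names and the averaging over Gaussians are our assumptions):

```python
import torch

def mask_entropy_loss(mask: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Binary-entropy penalty on predicted dynamic mask values in [0, 1];
    low entropy pushes each value toward a confident 0 or 1."""
    return -(mask * torch.log(mask + eps)
             + (1.0 - mask) * torch.log(1.0 - mask + eps)).mean()

# Confident masks incur a much smaller penalty than uncertain ones.
print(mask_entropy_loss(torch.tensor([0.02, 0.98, 0.01])),   # small
      mask_entropy_loss(torch.tensor([0.50, 0.45, 0.55])))   # large
```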
The quality of the color decoder is evaluated indirectly through rendering-based perceptual metrics, including PSNR (Peak Signal-to-Noise Ratio), SSIM (Structural Similarity Index Measure), and LPIPS (Learned Perceptual Image Patch Similarity). These metrics compare the rendered images, which incorporate the color decoder’s outputs, with the corresponding ground-truth images. A higher PSNR and SSIM, and a lower LPIPS, indicate better performance of the color decoder in predicting accurate and perceptually consistent color information under varying lighting and viewpoint conditions.
Figure 5. Gaussian Color Feature Prediction Network
3.4. Model training
Loss function. We use an L1 loss and a structural similarity index (SSIM) loss to compute the reconstruction loss between the rendered and ground-truth images. A depth loss is defined as the L1 loss between the rendered depth and the depth generated by projecting the sparse LiDAR points onto the camera plane, and a scaling loss is applied. We also introduce the entropy regularization term and the perceptual loss, and our objective loss function is formulated as:
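For illustration, a sketch of how such a combined objective could be assembled; the weights are placeholders (not the paper's hyperparameters), and ssim_fn / lpips_fn stand for any differentiable SSIM and LPIPS implementations:

```python
import torch.nn.functional as F

def total_loss(render, gt, render_depth, lidar_depth, lidar_valid, scales,
               mask_entropy, ssim_fn, lpips_fn,
               w=(0.8, 0.2, 0.05, 0.01, 0.01, 0.05)):
    """Weighted sum of reconstruction, depth, scaling, entropy, and perceptual terms."""
    l1 = F.l1_loss(render, gt)
    d_ssim = 1.0 - ssim_fn(render, gt)                                 # SSIM loss
    depth = F.l1_loss(render_depth[lidar_valid], lidar_depth[lidar_valid])
    scale_reg = scales.prod(dim=-1).mean()                             # one possible scaling loss
    perceptual = lpips_fn(render, gt).mean()
    return (w[0] * l1 + w[1] * d_ssim + w[2] * depth
            + w[3] * scale_reg + w[4] * mask_entropy + w[5] * perceptual)
```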
4. Experimental evaluation
This section highlights the contribution of each component of our proposed method. We present quantitative and qualitative results to validate the performance of our approach compared to state-of-the-art methods.
4.1. Experimental setup
We preprocessed LiDAR and camera data by aligning their timestamps, transforming the camera images into point clouds via SfM, and fusing them with LiDAR point clouds using iterative closest point (ICP) for spatial alignment. We implement our method in Python using the PyTorch framework [41] and train the proposed neural network using the Adam optimizer [42]. In our experiments, the position learning rate decays exponentially to 0.01, the opacity learning rate is set to 0.05, and a densification gradient threshold is applied; the opacity is reset every 3,000 iterations to remove redundant points. We also set the feature learning rate, the number of feature map slices k, the optimizable parameters for object position change, and the loss-weight hyperparameters. We set the scene resolution to 1024 to capture high-frequency details in the sky, and the rest of the parameters were set based on 3DGS [31]. All our experiments were performed on a system equipped with an Intel Xeon(R) Silver 4214R CPU and an Nvidia RTX 3090 GPU for 30,000 iterations.
Comments 5: The way the data was processed should be better explained.
Response 5: Thank you for pointing this out. We have revised Section 4.1 to explain our data preprocessing steps, including how LiDAR and SfM data are fused, how frames are selected, and how ground truth alignment is achieved.
Location in revised manuscript: Page 11, Lines 366-368
[Revised sentence:]
We preprocessed LiDAR and camera data by aligning their timestamps, transforming the camera images into point clouds via SfM, and fusing them with LiDAR point clouds using iterative closest point (ICP) for spatial alignment.
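For illustration, a minimal Open3D sketch of this alignment step (the correspondence threshold is a placeholder, not the value used in the paper):

```python
import numpy as np
import open3d as o3d

def fuse_sfm_with_lidar(sfm_xyz: np.ndarray, lidar_xyz: np.ndarray,
                        max_corr_dist: float = 0.5) -> o3d.geometry.PointCloud:
    """Align the SfM point cloud to the LiDAR point cloud with point-to-point ICP and merge them."""
    sfm = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(sfm_xyz))
    lidar = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(lidar_xyz))
    result = o3d.pipelines.registration.registration_icp(
        sfm, lidar, max_corr_dist, np.eye(4),
        o3d.pipelines.registration.TransformationEstimationPointToPoint())
    sfm.transform(result.transformation)       # bring SfM points into the LiDAR frame
    return sfm + lidar                         # fused cloud (e.g., to initialize the Gaussians)
```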
Comments 6: Tables 1 to 3 and Figures 3 to 8 of the article are not explained sufficiently. The figures and tables need to be further detailed and discussed.
Response 6: Thank you for highlighting this. We have provided more comprehensive explanations for Tables 2–4 (formerly Tables 1–3) and Figures 5–10 (formerly Figures 3–8), clarifying the experimental design, evaluation metrics, and their underlying significance.
Table 2 (formerly Table 1; location in revised manuscript: Page 12, Line 430)
[Table 2 revised sentence:]
Table 2 compares our method with the baseline methods regarding rendering quality and speed. We use PSNR, SSIM, and LPIPS [48] as metrics for evaluating rendering quality. Our method achieves the best overall performance across all metrics. Specifically, Gaussian-UDSR attains real-time rendering speeds of 128 FPS on Waymo and 136 FPS on KITTI, significantly outperforming most learning-based methods such as Mars, EmerNeRF, and SUDS, which operate below 0.1 FPS and are impractical for real-time deployment. While 3DGS and PVG also support fast rendering, their reconstruction quality is substantially lower than ours. Our approach achieves the highest PSNR (36.43 on Waymo and 35.63 on KITTI) and SSIM (0.971 and 0.964, respectively), indicating superior fidelity and structural accuracy. Furthermore, we obtain the lowest LPIPS scores (0.047 on Waymo and 0.013 on KITTI), demonstrating that our reconstructions are the most perceptually faithful to the ground truth. Across all metrics, our model achieves the best performance among all methods, with an 8.8% improvement in PSNR, a 75% reduction in LPIPS, and a four-orders-of-magnitude improvement in rendering speed over the NeRF-based methods [8, 46], while completing the whole training process in about one hour. Although 3DGS renders faster than our method, it can only be applied to static scenes, and its rendering quality under dynamic scenes decreases significantly. These results validate that Gaussian-UDSR not only provides high-quality rendering but also enables real-time performance, making it particularly well-suited for dynamic scene reconstruction in autonomous driving applications.
Tables 3 and 4 (formerly Tables 2 and 3; location in revised manuscript: Page 12, Lines 431–432)
[Tables 3 and 4 revised sentence:]
We also selected EmerNeRF and StreetSurf for PSNR comparisons on dynamic and static scenes, respectively, as shown in Tables 3 and 4. We conducted a comprehensive comparison between our Gaussian-UDSR method and two state-of-the-art approaches, EmerNeRF and StreetSurf, on the tasks of image reconstruction and novel view synthesis. The results clearly demonstrate the superior performance of our method. Compared with EmerNeRF across seven sequences, our method achieves a significantly higher average PSNR of 35.33 vs. 28.59 in image reconstruction, and 33.15 vs. 28.29 in novel view synthesis, indicating improvements of 6.74 dB and 4.86 dB, respectively. Similarly, when compared with StreetSurf on another set of seven sequences, our method achieves the same average PSNR of 35.33, while StreetSurf only reaches 28.59, again showing a notable improvement of 6.74 dB. These consistent gains highlight the effectiveness of our 3D Gaussian-based representation and dynamic feature modeling in both preserving image fidelity and synthesizing novel views, even in challenging dynamic and unbounded environments.
Figure 5 (formerly Figure 3, Location in revised manuscript: Page 9-10, Lines 328–351.)
[Figure 5 revised sentence:]
The Gaussian Color Feature Prediction Network uses an encoder-decoder structure that outputs a 64-dimensional Gaussian appearance feature vector during the color prediction phase. As shown in Figure 5, this representation is fused with positional and view-direction embeddings before being decoded into final RGB values. To prevent overfitting and promote dynamic feature sparsity, we apply dropout (p=0.1) before encoding and use an entropy loss to regularize the dynamic mask output. This helps encourage binary-like confidence and improves the separation of dynamic and static regions.
Specifically, dropout with a rate of 0.1 is applied to the input of the encoder when dropout=True, and entropy-based sparsity regularization is applied to the dynamic mask, formulated as:
where $m_i \in [0,1]$ is the predicted dynamic mask value for the $i$-th Gaussian and $\epsilon$ is a small constant to prevent numerical instability.
This regularization encourages the network to produce confident (close to 0 or 1) binary dynamic masks, which improves segmentation quality and downstream color modeling.
The quality of the color decoder is evaluated indirectly through rendering-based perceptual metrics, including PSNR (Peak Signal-to-Noise Ratio), SSIM (Structural Similarity Index Measure), and LPIPS (Learned Perceptual Image Patch Similarity). These metrics compare the rendered images, which incorporate the color decoder’s outputs, with the corresponding ground-truth images. A higher PSNR and SSIM, and a lower LPIPS, indicate better performance of the color decoder in predicting accurate and perceptually consistent color information under varying lighting and viewpoint conditions.
Figure 6 (formerly Figure 4, Location in revised manuscript: Page 12-13, Lines 433–441.)
[Figure 6 revised sentence:]
Figure 6 presents qualitative comparison results of our method (Ours) with Mars and 3DGS [8, 31] on dynamic scenes from the Waymo dataset. In complex dynamic environments such as urban streets and highways, our method is capable of accurately reconstructing fine details of moving objects—for example, the text and structure on the orange sightseeing bus and the contours of vehicles on the road. In contrast, Mars and 3DGS suffer from significant blurring and distortion, especially when handling fast-moving objects, with 3DGS failing to recover the object appearance in many cases. Compared to the Ground Truth, our method produces images that are closer in visual quality and structural consistency.
Figure 7 (formerly Figure 5, Location in revised manuscript: Page 13, Lines 442–448.)
[Figure 7 revised sentence:]
Figure 7 shows additional comparisons on the KITTI dataset, further demonstrating the robustness of our approach. In scenes with multiple moving vehicles, our method successfully reconstructs object poses and edge details, yielding sharp and natural results. In comparison, Mars exhibits evident motion blur and ghosting, while 3DGS struggles to reconstruct fast-moving objects. These two sets of experiments consistently indicate that our method significantly outperforms state-of-the-art baselines in handling dynamic scenes, preserving and restoring complex motion-related details more effectively.
Figure 8 (formerly Figure 6, Location in revised manuscript: Page 13-14, Lines 451–464.)
[Figure 8 revised sentence:]
In our dynamic sampling strategy, Gaussian points are dynamically distributed across feature map slices to capture both global and local dynamic appearance features. The number of feature map slices, denoted as k, influences the final dynamic appearance characteristics. To assess its effect, we controlled for other variables and performed experiments with varying k values through linear transformations. Figure 8 illustrates the impact of varying the number of dynamic feature maps k on model performance, evaluated using PSNR, SSIM, LPIPS, and FPS. As the number of feature maps k increases, both PSNR and SSIM peak at k=4, indicating optimal image reconstruction accuracy and structural consistency. Meanwhile, the LPIPS value is relatively low at this point, reflecting better perceptual quality. However, FPS gradually decreases as the number of feature maps k increases, showing that more feature maps k introduce greater computational overhead and reduce real-time rendering speed. Overall, using four dynamic feature maps achieves the best trade-off between image quality and rendering efficiency, representing the optimal comprehensive performance, and thus, we selected this value for further analysis.
Figure 9 (formerly Figure 7, Location in revised manuscript: Page 14-15, Lines 493–511.)
[Figure 9 revised sentence:]
Figure 9 presents a visual comparison for the ablation experiments, aiming to explore the roles of the feature prediction and pose tracking modules in our method. The experiment sets up four groups of comparisons: "Ours" (the complete method), "Without Feature prediction" (the method with the feature prediction module removed), "Without Pose tracking" (the method with the pose tracking module removed), and "Ground Truth" (the real-world scene).
The first-row images depict an intersection scene that is dark, with a wet and reflective road surface. In the "Ours" image, objects are clear with rich details; in the "Without Feature prediction" image, the scene is blurry and object outlines and details are missing; in the "Without Pose tracking" image, vehicles show obvious trailing. The second-row urban street scene shows similar results: the "Without Feature prediction" image has reduced clarity and lost texture details, and the "Without Pose tracking" image has blurry and ghosted vehicles.
From this, it is evident that the feature prediction module is of great significance for image clarity and detail restoration, and the pose tracking module is indispensable for the accurate representation of dynamic objects. Our complete method effectively avoids these problems and better restores the real-world scene. This not only validates the effectiveness of the two modules but also provides a solid foundation for the overall performance of our proposed method.
Figure 10 (formerly Figure 8, Location in revised manuscript: Page 15-16, Lines 516–529.)
[Figure 10 revised sentence:]
Figure 10 shows the editing operations on the Waymo dataset, including four parts: Reconstruct scene, Static background, Dynamic objects, and Deep rendering. The Reconstruct scene presents the overall visual effect. The Static background and Dynamic objects demonstrate the method's ability to separate scene elements, while the Deep rendering shows depth information through color-coding. In terms of applications, this research method can conveniently edit the behaviors of dynamic and static objects in autonomous driving scene editing, providing diverse scenarios for algorithm training. In sensor simulation, the deep rendering data helps optimize sensor configuration and algorithms. Compared with traditional methods, it has the advantages of high efficiency, accuracy, and data-driven flexibility. In terms of innovation, this research is the first to integrate deep learning and geometric reconstruction techniques. Through the collaborative work of multiple modules, it addresses the deficiencies of existing methods in handling dynamic scenes, offering a new and effective solution for autonomous driving scene simulation and analysis.
Comments 7: Section 4.4 is superficial. It is necessary to present greater detail on the research applications, comment on the advantages, and compare it with other similar studies to demonstrate where this research differs and advances from a scientific point of view.
Response 7: We fully agree. Section 4.4 has been substantially revised to discuss potential applications such as scene editing and trajectory planning, as well as how our method improves over related methods in dynamic scene handling.
Location in revised manuscript: Page 15, Lines 516–529.
[Revised sentence:]
Figure 10 shows the editing operations on the Waymo dataset, including four parts: Reconstruct scene, Static background, Dynamic objects, and Deep rendering. The Reconstruct scene presents the overall visual effect. The Static background and Dynamic objects demonstrate the method's ability to separate scene elements, while the Deep rendering shows depth information through color-coding. In terms of applications, this research method can conveniently edit the behaviors of dynamic and static objects in autonomous driving scene editing, providing diverse scenarios for algorithm training. In sensor simulation, the deep rendering data helps optimize sensor configuration and algorithms. Compared with traditional methods, it has the advantages of high efficiency, accuracy, and data-driven flexibility. In terms of innovation, this research is the first to integrate deep learning and geometric reconstruction techniques. Through the collaborative work of multiple modules, it addresses the deficiencies of existing methods in handling dynamic scenes, offering a new and effective solution for autonomous driving scene simulation and analysis.
Comments 8: The conclusion seems to exaggerate a bit, extrapolating the real results obtained from the research.
Response 8: Thank you for this critical note. We have rewritten the conclusion to maintain a realistic tone and focus on verified achievements.
Location in revised manuscript: Page 16, Lines 532–539.
[Revised sentence:]
In this study, a method is proposed for reconstructing the unbounded dynamic 3D scenes that autonomous vehicles encounter. This method innovatively utilizes the 3D Gaussian Splatting technique and introduces a deep learning network on this basis. Through LiDAR-SfM point cloud fusion, the Gaussian color feature prediction network, and the pose tracking mechanism, certain achievements have been made in autonomous driving scene reconstruction. Experimental results show that this method performs well in key metrics. For example, in metrics such as PSNR, it approaches the baseline method using ground-truth poses, validating the effectiveness of modules like the pose tracking mechanism.
Comments 9: What were the limitations of the research? This needs to be clear in the conclusion.
Response 9: Thank you for your important advice! We have added a paragraph in the conclusion explicitly discussing limitations, such as the need for high-quality synchronized sensor data and challenges in the pose initialization of dynamic objects.
Location in revised manuscript: Page 16, Lines 540–546.
[Revised sentence:]
However, this research has certain limitations. Firstly, the current method relies on the precise spatio-temporal synchronization of LiDAR and cameras. In monocular or low-frame-rate sensor scenarios, due to the lack of sufficient depth information and continuous observations in the time dimension, the performance may decline. Secondly, the pose initialization of dynamic objects still requires manual intervention and has not achieved full automation, which will increase labor and time costs in large-scale data processing and practical applications.
Comments 10: Based on the results, are there future research perspectives? This could be addressed in the conclusion.
Response 10: Thank you for your suggestions. We have added future research directions in the conclusion, including plans for lightweight deployment and the extension of Gaussian models to multimodal data.
Location in revised manuscript: Page 16, Lines 547–555.
[Revised sentence:]
Based on these results, future research can be carried out in the following directions. On the one hand, there are plans to expand the Gaussian model to multi-modal data, integrating data from more sensors such as IMUs (Inertial Measurement Units) and millimeter-wave radars, so as to enhance robustness in complex environments (such as extreme weather and heavily occluded scenes). On the other hand, we will explore lightweight network design, using techniques such as model compression and pruning to reduce computational complexity and support real-time deployment on edge devices, thus promoting the widespread use of autonomous driving simulators in practical application scenarios.
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
This paper presents Gaussian-UDSR, a novel real-time reconstruction framework for unbounded dynamic scenes based on 3D Gaussian splatting. The system integrates LiDAR and Structure-from-Motion (SfM) point clouds to build a hybrid static/dynamic scene model and introduces a Gaussian color feature prediction network to improve rendering under varying lighting conditions. A pose-tracking module further enhances dynamic object realism. Experiments on the Waymo and KITTI datasets demonstrate rendering quality improvements of up to 8.8% in PSNR and a 75% reduction in LPIPS, along with rendering speeds up to 136 FPS, significantly outperforming prior methods. The manuscript is generally well written, clearly structured, and demonstrates strong experimental results of interest to researchers working on neural rendering, autonomous driving, and 3D scene reconstruction. I have, however, the following comments:
- The paper could benefit from a more explicit breakdown of computational complexity compared to traditional NeRF-based methods to better quantify the gains in efficiency.
- Figure 1 outlines the Gaussian-UDSR architecture well, but the dynamic mask generator remains somewhat opaque; a clearer explanation or illustrative pseudocode would improve accessibility.
- Section 3.3 describes the Gaussian color feature network, but it would be helpful to specify the dimensionality of the learned features and whether any regularization was used during training.
- Given the reliance on LiDAR and RGB fusion, reporting the relative contributions of each sensor to final scene quality (e.g., via ablation) would clarify their individual roles.
- Releasing the code and trained models used in the study would significantly enhance reproducibility and encourage further exploration and adoption.
Author Response
We sincerely thank you for your detailed and constructive suggestions. Your feedback has helped us to improve the quality, clarity, and scientific rigor of our manuscript. Please find below our point-by-point responses.
Comments 1: The paper could benefit from a more explicit breakdown of computational complexity compared to traditional NeRF-based methods to better quantify the gains in efficiency.
Response 1: Thank you very much for your insightful suggestion. In Section 3.2, we have added a detailed comparison of the computational complexity of the models between NeRF and our method. Meanwhile, in Section 4.2, we have supplemented the processing time of our method.
Location in revised manuscript: Section 3.2: Page 8, Lines 261–283. Section 4.2: Page 11, Lines 407–415.
[Section 3.2 Revised sentence:]
The computational complexity of NeRF mainly comes from two aspects: the evaluation of the neural network for each query point in the volume rendering process and the construction of the neural network itself. Let us assume that in a traditional NeRF-based method, the neural network has $L$ layers with $N_l$ neurons in the $l$-th layer ($l = 1, \dots, L$). For a single query point, the forward-pass computation in the neural network has a complexity of $O\left(\sum_{l=1}^{L-1} N_l N_{l+1}\right)$. In the volume rendering process, if we consider a scene with $V$ volume elements (voxels) and $R$ rays for rendering, the overall computational complexity is approximately $O\left(R \cdot V \cdot \sum_{l=1}^{L-1} N_l N_{l+1}\right)$.
Gaussian Splatting reduces the number of elements that need to be processed compared to the voxel-based approach in NeRF. Specifically, we represent the scene with $G$ Gaussian primitives, where $G \ll V$. The evaluation of the Gaussian-based model for a single ray has a complexity of $O(G)$. Moreover, our deep learning network is designed in a more lightweight way. Suppose our network has $L'$ layers with $N'_l$ neurons in the $l$-th layer ($l = 1, \dots, L'$), where $L' < L$ and $N'_l < N_l$ for most $l$. The forward-pass computation for a single query point in our network has a complexity of $O\left(\sum_{l=1}^{L'-1} N'_l N'_{l+1}\right)$. Considering the same number of rays $R$ for rendering, the overall computational complexity of our method is approximately $O\left(R \cdot G \cdot \sum_{l=1}^{L'-1} N'_l N'_{l+1}\right)$.
By comparison, it is evident that our method significantly reduces the computational complexity. In practical scenarios, we have observed that the reduction in the number of elements ($G \ll V$) and the lightweight design of the neural network lead to a decrease in computational complexity of at least several times compared to traditional NeRF-based methods, thus achieving higher efficiency in dynamic scene reconstruction for autonomous driving.
[Section 4.2 Revised sentence:]
Across all metrics, our model achieves the best performance among all methods, with an 8.8% improvement in PSNR, a 75% reduction in LPIPS, and a four-orders-of-magnitude improvement in rendering speed over the NeRF-based methods [8, 46], while completing the whole training process in about one hour. Although 3DGS renders faster than our method, it can only be applied to static scenes, and its rendering quality under dynamic scenes decreases significantly. These results validate that Gaussian-UDSR not only provides high-quality rendering but also enables real-time performance, making it particularly well-suited for dynamic scene reconstruction in autonomous driving applications.
Comment 2: Figure 1 outlines the Gaussian-UDSR architecture well, but the dynamic mask generator remains somewhat opaque; a clearer explanation would improve accessibility.
Response 2:
Thank you for your thoughtful comment. We agree that the description of the dynamic mask generator needed clarification. In response, we have revised Section 3.3 to provide a clearer explanation of its input, output, internal processing, and significance.
Location in revised manuscript: Page 9, Lines 301–311.
[Revised sentence:]
The input to the dynamic mask generator includes multi-scale feature maps extracted by a U-Net from the input image, as well as the 2D projections of each Gaussian point. The purpose of this module is to identify which areas of the image correspond to dynamic objects and to generate a binary dynamic mask. For each Gaussian point, the corresponding feature responses are sampled from multiple feature map layers based on its image-plane position. These responses are then fused using a residual network and semantic decoder to determine whether the point lies within a dynamic region. The output is a binary mask aligned with the image space, highlighting the areas associated with dynamic elements. This mask helps the feature prediction network to focus on dynamic-specific cues, improving the accuracy of color and motion estimation and enhancing the overall fidelity of dynamic scene reconstruction.
Comment 3: Section 3.3 describes the Gaussian color feature network, but it would be helpful to specify the dimensionality of the learned features and whether any regularization was used during training.
Response 3:
Thank you for your helpful suggestion. We have revised Section 3.3 to clearly describe the structure and behavior of the Gaussian color feature network. Specifically, we now clarify that the learned appearance feature vector produced by the encoder-decoder pipeline is 64-dimensional. This representation is fused with positional and view direction embeddings before being decoded into final RGB values. To improve generalization and reduce overfitting, we applied multiple regularization techniques during training.
Location in revised manuscript: Page 9, Lines 328–343
[Revised sentence:]
The Gaussian Color Feature Prediction Network uses an encoder-decoder structure that outputs a 64-dimensional Gaussian appearance feature vector during the color prediction phase. This representation is fused with positional and view direction embeddings before being decoded into final RGB values. To prevent overfitting and promote dynamic feature sparsity, we apply dropout (p=0.1) before encoding and use an entropy loss to regularize the dynamic mask output. This helps encourage binary-like confidence and improves separation of dynamic and static regions.
Specifically, dropout with a rate of 0.1 is applied to the input of the encoder when dropout=True, and entropy-based sparsity regularization is applied to the dynamic mask, formulated as:
where $m_i \in [0,1]$ is the predicted dynamic mask value for the $i$-th Gaussian and $\epsilon$ is a small constant to prevent numerical instability.
This regularization encourages the network to produce confident (close to 0 or 1) binary dynamic masks, which improves segmentation quality and downstream color modeling.
Comments 4: Given the reliance on LiDAR and RGB fusion, reporting the relative contributions of each sensor to final scene quality (e.g., via ablation) would clarify their individual roles.
Response 4: We fully agree with your suggestion and appreciate your insightful feedback. Accordingly, we have added a new ablation study to analyze the individual contributions of LiDAR and SfM (RGB) inputs to the final scene reconstruction quality. The corresponding results have been added in the revised manuscript as Table 5.
Location in revised manuscript: Page 14-15, Lines 468-492,512–513.
[Revised sentence:]
In this section, we conduct ablation studies to evaluate the individual contributions of key components within our proposed method. In particular, we analyze the impact of LiDAR depth, SfM geometry, their fusion module, the feature prediction network, and the pose tracking mechanism. To validate the effectiveness of each module, we selected eight sequential scenarios from the Waymo dataset, covering diverse conditions such as rainy and foggy weather, high traffic with many moving objects, sunny days, and cloudy weather. These experiments allow us to assess the robustness and generalization of each component under various dynamic and challenging environments.
Table 5 presents the quantitative results of ablation studies, evaluating the impact of removing key components from our method. Removing the LiDAR depth input ("w/o lidar depth") slightly decreases PSNR to 36.22 and increases LPIPS to 0.050, indicating that LiDAR's precise geometry is crucial for fine-grained depth accuracy, though the overall structure remains robust. Omitting the SfM input ("w/o SfM") leads to a more pronounced degradation, with PSNR dropping to 34.65, SSIM decreasing to 0.959, and LPIPS rising to 0.059. This shows that the camera-based geometric cues provided by SfM are essential for enhancing structural consistency and compensating for LiDAR sparsity, especially in distant or texture-poor regions. Omitting the feature prediction module ("w/o Feature prediction") also leads to significant degradation: PSNR drops to 34.91, SSIM falls to 0.962, and LPIPS rises to 0.056. This confirms that the feature prediction network is vital for capturing dynamic appearance variations under changing lighting, as its absence causes color inconsistencies and texture blurring. Removing the pose tracking mechanism ("w/o Pose tracking") results in a PSNR of 36.14 and LPIPS of 0.049, showing moderate performance decline due to motion estimation errors in dynamic objects, though static background reconstruction remains relatively stable. Our full method ("Ours") achieves the highest PSNR (36.43) and SSIM (0.971) with the lowest LPIPS (0.047), demonstrating that the combination of LiDAR-SfM fusion, feature prediction, and pose tracking is essential for achieving high-fidelity dynamic scene reconstruction.
Table 5. Ablation study on the effects of Gaussian-UDSR.
| Method | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
|---|---|---|---|
| w/o lidar depth | 36.22 | 0.970 | 0.050 |
| w/o SfM | 34.65 | 0.959 | 0.059 |
| w/o Feature prediction | 34.91 | 0.962 | 0.056 |
| w/o Pose tracking | 36.14 | 0.964 | 0.049 |
| Ours | 36.43 | 0.971 | 0.047 |
Comment 5: Releasing the code and trained models used in the study would significantly enhance reproducibility and encourage further exploration and adoption.
Response 5: Thank you for your important suggestion. We completely agree that open-source code is critical for reproducibility and further community adoption. In response, we have made our full implementation, along with pretrained models and detailed instructions, publicly available at: https://github.com/zhouyue270/Gaussian-UDSR
We have also updated the manuscript to include this link in the conclusion section.
Location in revised manuscript: Page 16, Lines 556–558.
[Revised sentence:]
To promote reproducibility and encourage future research, we have publicly released the source code and pretrained models at: https://github.com/zhouyue270/Gaussian-UDSR.
Author Response File: Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
Section 2 is shallow, please include a more detailed description of state-of-the-art methods in a comparative table.
The captions in the figures should provide a more complete description. Please don't forget to include unexplained acronyms in the same statistics and/or captions.
Gaussian equation (1), does it require some kind of normalization?
In Line 177, the authors use the term quaternions; it would be interesting to include some foundations of quaternions' properties, and what are the advantages of their use?
A detailed plot should be used to describe equations (2) and (3).
How was the color decoder quality measured in the output image in Figure 2?
The output image is not clearly presented in Figure 3.
Please, make a double-check of the equations.
The code should be uploaded to a repository for reproducibility purposes.
The label of Section 4.3 is moved into Figure 6. Using LaTeX, we can avoid these mistakes.
A subsection focused on "Discussions" should be included in Section 4.
The processing times or the computing complexity of the proposed method are missing.
Author Response
We sincerely thank you for your detailed and constructive suggestions. Your feedback has helped us to improve the quality, clarity, and scientific rigor of our manuscript. Please find below our point-by-point responses.
Comment 1: Section 2 is shallow, please include a more detailed description of state-of-the-art methods in a comparative table.
Response 1: Thank you for your suggestion. We have significantly expanded Section 2 by adding a comparative table (Table 1) that summarizes key state-of-the-art methods in dynamic scene reconstruction, including their core contributions, key assumptions, and limitations.
Location in revised manuscript: Page 3, Lines 126–133.
[Revised sentence:]
Table 1 summarizes the representative dynamic scene reconstruction approaches based on NeRF and 3DGS, comparing them across input modality, scene scale, dynamic modeling capability, rendering speed, and quality. While methods like SC-GS and GauFRe show promise in small-scale scenes, they are limited in rendering resolution and real-time applicability. In contrast, our proposed Gaussian-UDSR targets large-scale unbounded dynamic scenes with improved motion separation and real-time rendering, tailored for autonomous driving scenarios.
Table 1. Comparison of Dynamic Scene Reconstruction Methods
| Method | Input Modality | Scene Scale | Dynamic Modeling | Rendering Speed | Rendering Quality |
|---|---|---|---|---|---|
| D-NeRF [25] | RGB | Bounded (small) | MLP-based | Slow | Medium |
| 4DGS [32] | RGB + TriPlane | Bounded (small) | plane features | Medium | High |
| SC-GS [38] | RGB + Control Points | Bounded (small) | control points + KNN | Medium | Medium |
| GauFRe [37] | RGB + MLP | Bounded (small) | separated Gaussians | Medium | High |
| Ours | LiDAR + RGB + MLP | Unbounded (large) | motion separation | Real-time (Fast) | High |
Comment 2: The captions in the figures should provide a more complete description. Please don't forget to include unexplained acronyms in the same statistics and/or captions.
Response 2: Thank you for pointing this out. We have thoroughly reviewed all charts and made revisions. We have provided more comprehensive explanations for Tables 2–4 (formerly Tables 1–3) and Figures 5–10 (formerly Figures 3–8), clarifying the experimental design, evaluation metrics, and their underlying significance.
Table 2 (formerly Table 1; location in revised manuscript: Page 12, Line 430)
[Table 2 revised sentence:]
Table 2 compares our method with the baseline methods regarding rendering quality and speed. We use PSNR, SSIM, and LPIPS [48] as metrics for evaluating rendering quality. Our method achieves the best overall performance across all metrics. Specifically, Gaussian-UDSR attains real-time rendering speeds of 128 FPS on Waymo and 136 FPS on KITTI, significantly outperforming most learning-based methods such as Mars, EmerNeRF, and SUDS, which operate below 0.1 FPS and are impractical for real-time deployment. While 3DGS and PVG also support fast rendering, their reconstruction quality is substantially lower than ours. Our approach achieves the highest PSNR (36.43 on Waymo and 35.63 on KITTI) and SSIM (0.971 and 0.964, respectively), indicating superior fidelity and structural accuracy. Furthermore, we obtain the lowest LPIPS scores (0.047 on Waymo and 0.013 on KITTI), demonstrating that our reconstructions are the most perceptually faithful to the ground truth. Across all metrics, our model achieves the best performance among all methods, with an 8.8% improvement in PSNR, a 75% reduction in LPIPS, and a four-orders-of-magnitude improvement in rendering speed over the NeRF-based methods [8, 46], while completing the whole training process in about one hour. Although 3DGS renders faster than our method, it can only be applied to static scenes, and its rendering quality under dynamic scenes decreases significantly. These results validate that Gaussian-UDSR not only provides high-quality rendering but also enables real-time performance, making it particularly well-suited for dynamic scene reconstruction in autonomous driving applications.
Tables 3 and 4 (formerly Tables 2 and 3; location in revised manuscript: Page 12, Lines 431–432)
[Tables 3 and 4 revised sentence:]
We also selected EmerNeRF and StreetSurf for PSNR comparisons on dynamic and static scenes, respectively, as shown in Tables 3 and 4. We conducted a comprehensive comparison between our Gaussian-UDSR method and two state-of-the-art approaches, EmerNeRF and StreetSurf, on the tasks of image reconstruction and novel view synthesis. The results clearly demonstrate the superior performance of our method. Compared with EmerNeRF across seven sequences, our method achieves a significantly higher average PSNR of 35.33 vs. 28.59 in image reconstruction, and 33.15 vs. 28.29 in novel view synthesis, indicating improvements of 6.74 dB and 4.86 dB, respectively. Similarly, when compared with StreetSurf on another set of seven sequences, our method achieves the same average PSNR of 35.33, while StreetSurf only reaches 28.59, again showing a notable improvement of 6.74 dB. These consistent gains highlight the effectiveness of our 3D Gaussian-based representation and dynamic feature modeling in both preserving image fidelity and synthesizing novel views, even in challenging dynamic and unbounded environments.
Figure 5 (formerly Figure 3, Location in revised manuscript: Page 9-10, Lines 328–351.)
[Figure 5 revised sentence:]
The Gaussian Color Feature Prediction Network uses an encoder-decoder structure that outputs a 64-dimensional Gaussian appearance feature vector during the color prediction phase. As shown in Figure 5, this representation is fused with positional and view-direction embeddings before being decoded into final RGB values. To prevent overfitting and promote dynamic feature sparsity, we apply dropout (p=0.1) before encoding and use an entropy loss to regularize the dynamic mask output. This encourages binary-like confidence and improves the separation of dynamic and static regions.
Specifically, dropout with a rate of 0.1 is applied to the input of the encoder when dropout=True, and an entropy-based sparsity regularization is applied to the dynamic mask, formulated as
\[
\mathcal{L}_{\mathrm{entropy}} \;=\; -\frac{1}{N}\sum_{i=1}^{N}\Big[\, m_i \log(m_i+\epsilon) + (1-m_i)\log(1-m_i+\epsilon) \,\Big],
\]
where m_i ∈ [0, 1] is the predicted dynamic mask value for the i-th Gaussian and ε is a small constant to prevent numerical instability.
This regularization encourages the network to produce confident (close to 0 or 1) binary dynamic masks, which improves segmentation quality and downstream color modeling.
The quality of the color decoder is evaluated indirectly through rendering-based perceptual metrics, including PSNR (Peak Signal-to-Noise Ratio), SSIM (Structural Similarity Index Measure), and LPIPS (Learned Perceptual Image Patch Similarity). These metrics compare the rendered images, which incorporate the color decoder’s outputs, with the corresponding ground-truth images. A higher PSNR and SSIM, and a lower LPIPS, indicate better performance of the color decoder in predicting accurate and perceptually consistent color information under varying lighting and viewpoint conditions.
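For concreteness, the following PyTorch sketch shows one way the dropout placement and the entropy-based sparsity regularization described above could be implemented; the module layout, layer widths, and names (ColorFeatureHead, entropy_sparsity_loss) are illustrative assumptions and not the released implementation.

```python
# Minimal sketch of the color feature head with dropout (p = 0.1) and an
# entropy regularizer on the per-Gaussian dynamic mask. Layer sizes and the
# exact fusion of positional / view-direction embeddings are assumptions.
import torch
import torch.nn as nn

class ColorFeatureHead(nn.Module):
    def __init__(self, in_dim=64, hidden=128, use_dropout=True):
        super().__init__()
        self.dropout = nn.Dropout(p=0.1) if use_dropout else nn.Identity()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, hidden), nn.ReLU())
        self.rgb_head = nn.Linear(hidden, 3)    # decoded RGB
        self.mask_head = nn.Linear(hidden, 1)   # per-Gaussian dynamic mask

    def forward(self, feat):
        h = self.encoder(self.dropout(feat))
        rgb = torch.sigmoid(self.rgb_head(h))
        mask = torch.sigmoid(self.mask_head(h)).squeeze(-1)  # m in [0, 1]
        return rgb, mask

def entropy_sparsity_loss(mask, eps=1e-6):
    # Binary-entropy regularizer: pushes each mask value toward 0 or 1.
    return -(mask * torch.log(mask + eps)
             + (1.0 - mask) * torch.log(1.0 - mask + eps)).mean()

feat = torch.randn(1024, 64)           # 64-D Gaussian appearance features
rgb, mask = ColorFeatureHead()(feat)
loss_reg = entropy_sparsity_loss(mask)
```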
Figure 6 (formerly Figure 4, Location in revised manuscript: Page 12-13, Lines 433–441.)
[Figure 6 revised sentence:]
Figure 6 presents qualitative comparison results of our method (Ours) with Mars and 3DGS [8, 31] on dynamic scenes from the Waymo dataset. In complex dynamic environments such as urban streets and highways, our method is capable of accurately reconstructing fine details of moving objects—for example, the text and structure on the orange sightseeing bus and the contours of vehicles on the road. In contrast, Mars and 3DGS suffer from significant blurring and distortion, especially when handling fast-moving objects, with 3DGS failing to recover the object appearance in many cases. Compared to the Ground Truth, our method produces images that are closer in visual quality and structural consistency.
Figure 7 (formerly Figure 5, Location in revised manuscript: Page 13, Lines 442–448.)
[Figure 7 revised sentence:]
Figure 7 shows additional comparisons on the KITTI dataset, further demonstrating the robustness of our approach. In scenes with multiple moving vehicles, our method successfully reconstructs object poses and edge details, yielding sharp and natural results. In comparison, Mars exhibits evident motion blur and ghosting, while 3DGS struggles to reconstruct fast-moving objects. These two sets of experiments consistently indicate that our method significantly outperforms state-of-the-art baselines in handling dynamic scenes, preserving and restoring complex motion-related details more effectively.
Figure 8 (formerly Figure 6, Location in revised manuscript: Page 13-14, Lines 451–464.)
[Figure 8 revised sentence:]
In our dynamic sampling strategy, Gaussian points are dynamically distributed across feature map slices to capture both global and local dynamic appearance features. The number of feature map slices, denoted as k, influences the final dynamic appearance characteristics. To assess its effect, we controlled for other variables and performed experiments with varying k values through linear transformations. Figure 8 illustrates the impact of varying the number of dynamic feature maps k on model performance, evaluated using PSNR, SSIM, LPIPS, and FPS. As k increases, both PSNR and SSIM peak at k=4, indicating optimal image reconstruction accuracy and structural consistency, while LPIPS is relatively low at this point, reflecting better perceptual quality. However, FPS gradually decreases as k increases, showing that additional feature maps introduce greater computational overhead and reduce real-time rendering speed. Overall, using four dynamic feature maps achieves the best trade-off between image quality and rendering efficiency, and we therefore selected this value for further analysis.
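As a purely illustrative sketch of the sweep just described, the snippet below loops over candidate values of k and records the quality/speed metrics; train_and_evaluate is a hypothetical placeholder for the actual training and evaluation pipeline, and the returned numbers are dummies.

```python
# Hypothetical sweep harness for the k ablation; train_and_evaluate stands in
# for the real training/rendering pipeline and returns dummy metrics here.
def train_and_evaluate(k):
    # ... train with k dynamic feature maps, render the test views,
    # and compute PSNR / SSIM / LPIPS / FPS (placeholders below).
    return {"PSNR": 0.0, "SSIM": 0.0, "LPIPS": 0.0, "FPS": 0.0}

results = {k: train_and_evaluate(k) for k in (1, 2, 4, 8)}
# With real metrics, inspecting `results` reproduces the trade-off in Figure 8,
# where k = 4 gives the best balance of quality and rendering speed.
best_k = max(results, key=lambda k: results[k]["PSNR"])
print(results, best_k)
```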
Figure 9 (formerly Figure 7, Location in revised manuscript: Page 14-15, Lines 493–511.)
[Figure 9 revised sentence:]
Figure 9 presents a visual comparison for the ablation experiments, exploring the roles of the feature prediction and pose tracking modules in our method. The experiment sets up four groups: "Ours" (the complete method), "Without Feature prediction" (the method with the feature prediction module removed), "Without Pose tracking" (the method with the pose tracking module removed), and "Ground Truth" (the real-world scene).
The first-row images depict a dark intersection scene with a wet, reflective ground. In the "Ours" image, objects are clear with rich details; the "Without Feature prediction" image is blurry, with object outlines and details missing; in the "Without Pose tracking" image, vehicles show obvious trailing. The second-row urban street scene shows similar results: the "Without Feature prediction" image has reduced clarity and lost texture details, and the "Without Pose tracking" image has blurry, ghosted vehicles.
From this, it is evident that the feature prediction module is essential for image clarity and detail restoration, and the pose tracking module is indispensable for the accurate representation of dynamic objects. Our complete method effectively avoids these problems and better restores the real-world scene. This not only validates the effectiveness of the two modules but also provides a solid foundation for the overall performance of our proposed method.
Figure 10 (formerly Figure 8, Location in revised manuscript: Page 15-16, Lines 516–529.)
[Figure 10 revised sentence:]
Figure 10 shows the editing operations on the Waymo dataset, including four parts: Reconstruct scene, Static background, Dynamic objects, and Deep rendering. The Reconstruct scene presents the overall visual effect. The Static background and Dynamic objects demonstrate the method's ability to separate scene elements, while the Deep rendering shows depth information through color-coding. In terms of applications, this research method can conveniently edit the behaviors of dynamic and static objects in autonomous driving scene editing, providing diverse scenarios for algorithm training. In sensor simulation, the deep rendering data helps optimize sensor configuration and algorithms. Compared with traditional methods, it has the advantages of high efficiency, accuracy, and data-driven flexibility. In terms of innovation, this research is the first to integrate deep learning and geometric reconstruction techniques. Through the collaborative work of multiple modules, it addresses the deficiencies of existing methods in handling dynamic scenes, offering a new and effective solution for autonomous driving scene simulation and analysis.
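To illustrate the kind of static/dynamic separation used for such editing, the following hypothetical sketch splits a reconstructed scene into static and dynamic Gaussians using a per-Gaussian dynamic mask; the tensor layout and the render call are assumptions for illustration, not the released code.

```python
# Hypothetical sketch of scene editing with a per-Gaussian dynamic mask: keep
# only static Gaussians for the background, or only dynamic ones for moving
# objects, before re-rendering. The renderer itself is a placeholder.
import torch

def split_scene(positions, params, dynamic_mask, thr=0.5):
    static_sel = dynamic_mask < thr          # Gaussians believed to be static
    dynamic_sel = ~static_sel                # Gaussians believed to be dynamic
    static_scene = (positions[static_sel], params[static_sel])
    dynamic_objects = (positions[dynamic_sel], params[dynamic_sel])
    return static_scene, dynamic_objects

# Example usage (with a hypothetical renderer):
#   static_scene, dynamic_objects = split_scene(xyz, gauss_params, mask)
#   clean_background = render(static_scene, camera)  # street with vehicles removed
```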
Comment 3: Gaussian equation (1), does it require some kind of normalization?
Response 3: Thank you for the technical question. The expression
\[
G(x) \;=\; \exp\!\Big(-\tfrac{1}{2}\,(x-\mu)^{\top}\Sigma^{-1}(x-\mu)\Big)
\]
is an unnormalized Gaussian function, which is commonly used in the context of 3D Gaussian Splatting to represent the spatial influence of each Gaussian primitive. It does not require normalization, since it is not used as a probability density function, but rather as a weighting kernel in rendering. The normalization factor is typically omitted to reduce computational cost and because relative rather than absolute values are sufficient for visual effects.
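A minimal NumPy sketch of such an unnormalized weighting kernel is given below, assuming a 3D center mu and covariance Sigma; the probability-density normalization factor (2π)^{3/2}|Σ|^{1/2} is deliberately omitted, exactly as discussed, and the function name is illustrative.

```python
# Unnormalized Gaussian weighting kernel: exp(-0.5 * (x - mu)^T Sigma^{-1} (x - mu)).
import numpy as np

def gaussian_weight(x, mu, sigma):
    d = x - mu
    return np.exp(-0.5 * d @ np.linalg.inv(sigma) @ d)

mu = np.zeros(3)
sigma = np.diag([0.5, 0.2, 0.1])
print(gaussian_weight(np.array([0.1, 0.0, 0.0]), mu, sigma))  # value in (0, 1]
```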
Comment 4: In Line 177, the authors use the term quaternions; it would be interesting to include some foundations of quaternions' properties, and which are the advantages of their use?
Response 4: Thank you for the thoughtful suggestion. We have added a brief background paragraph on quaternions, their mathematical properties (e.g., rotation representation without gimbal lock), and why they are preferred in pose tracking over Euler angles.
Location in revised manuscript: Page 5, Lines 178–185.
[Revised sentence:]
Quaternions are a compact and numerically stable representation of 3D rotations, defined by a four-dimensional vector (w, x, y, z) subject to the unit-norm constraint. Compared to Euler angles, quaternions avoid gimbal lock and provide smooth interpolation (e.g., via SLERP), which is crucial for continuous pose tracking in dynamic scenes. Moreover, quaternions are more efficient and numerically stable than rotation matrices, as they require fewer parameters and avoid the need for orthonormalization. These advantages make quaternions particularly suitable for representing and optimizing camera and object orientations in our dynamic scene reconstruction framework.
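For reference, the snippet below shows the standard conversion of a unit quaternion (w, x, y, z) to a 3x3 rotation matrix, illustrating the unit-norm constraint and the orthonormality mentioned above; the code is an illustrative sketch, not an excerpt from our implementation.

```python
# Standard quaternion (w, x, y, z) to rotation matrix conversion.
import numpy as np

def quat_to_rotmat(q):
    q = q / np.linalg.norm(q)          # enforce the unit-norm constraint
    w, x, y, z = q
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

R = quat_to_rotmat(np.array([0.92, 0.0, 0.38, 0.0]))   # rotation about the y-axis
assert np.allclose(R @ R.T, np.eye(3), atol=1e-6)       # R is orthonormal
```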
Comment 5: A detailed plot should be used to describe equations (2) and (3).
Response 5: Thank you for the suggestion. We have added Figure 2 and Figure 3 in Section 3.2 to fully explain the geometric relationships in Equation 2 and Equation 3, specifically describing how the Gaussian centers undergo spatio-temporal transformations.
Location in revised manuscript: Figure 2: Page 5, Lines 185–192; Figure 3: Page 6, Lines 197–210.
[Figure 2 Revised sentence:]
As shown in Figure 2, the unit sphere in the left figure is transformed into the ellipsoid in the right figure, and Equation 2 describes this transformation process. The rotation matrix R changes the orientation of the unit sphere, while the scaling matrix S scales it. Through rotation and scaling operations, the originally isotropic unit sphere is transformed into an anisotropic ellipsoid, which can more flexibly describe the multivariate correlations in complex environments.
Figure 2. Visualization of linear transformations of a sphere
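For readers who want the transformation written out, the display below gives the standard 3D Gaussian Splatting construction that Figure 2 visualizes; the exact notation of Equation 2 in the manuscript may differ slightly, so this should be read as a sketch rather than a quotation.

```latex
% Covariance built from a rotation R and a diagonal scaling S (standard 3DGS form).
\[
  \Sigma \;=\; R\,S\,S^{\top}R^{\top}, \qquad A \;=\; R\,S .
\]
```

Here the linear map A = RS sends the isotropic unit sphere {x : ||x|| = 1} to the anisotropic ellipsoid {Ax : ||x|| = 1}, whose covariance is Σ = AAᵀ, which is exactly the sphere-to-ellipsoid transformation shown in the figure.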
[Figure 3 Revised sentence:]
As shown in Figure 3, the Jacobian matrix J (Eq. 3) in the projection model characterizes the local linear transformation from 3D spatial coordinates to the 2D projection plane, with its elements consisting of the partial derivatives of the projection coordinates with respect to the spatial coordinates. Specifically, the elements f_x/z and f_y/z describe the reciprocal relationship between the scaling factors in the x and y directions and the depth z, reflecting the linear response of the projection coordinates to the spatial positions. In contrast, the terms −f_x·x/z² and −f_y·y/z² capture the nonlinear perspective contraction effect of depth changes on the projections in the x and y directions, whose absolute values increase as the spatial points approach the camera (i.e., as z decreases). Additionally, the quadratic 1/z² dependence of these terms quantifies the attenuation of the scaling factor by depth, revealing the nonlinear degradation of depth information during the projection process.
Figure 3. Visualization of Jacobian Matrix in Projection Model
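For completeness, the display below writes out the projection Jacobian in the form commonly used for a pinhole camera with focal lengths f_x and f_y, where the projection is (u, v) = (f_x·x/z, f_y·y/z); the manuscript's Equation 3 may use slightly different notation or include an additional row, so this is an assumed form rather than a quotation.

```latex
% Local linearization of the pinhole projection (u, v) = (f_x x / z, f_y y / z).
\[
  J \;=\;
  \begin{pmatrix}
    \frac{\partial u}{\partial x} & \frac{\partial u}{\partial y} & \frac{\partial u}{\partial z} \\[4pt]
    \frac{\partial v}{\partial x} & \frac{\partial v}{\partial y} & \frac{\partial v}{\partial z}
  \end{pmatrix}
  \;=\;
  \begin{pmatrix}
    \frac{f_x}{z} & 0 & -\frac{f_x\,x}{z^{2}} \\[4pt]
    0 & \frac{f_y}{z} & -\frac{f_y\,y}{z^{2}}
  \end{pmatrix}.
\]
```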
Comment 6: How was the color decoder quality measured in the output image in Figure 2?
Response 6: Thank you for the comment. In Section 3.3, we have clarified that the quality of the color decoder output was evaluated using three standard perceptual metrics: PSNR, SSIM, and LPIPS. This clarification is now explicitly stated in the main text.
Location in revised manuscript: Page 10, Lines 344–351.
[Revised sentence:]
The quality of the color decoder is evaluated indirectly through rendering-based perceptual metrics, including PSNR (Peak Signal-to-Noise Ratio), SSIM (Structural Similarity Index Measure), and LPIPS (Learned Perceptual Image Patch Similarity). These metrics compare the rendered images, which incorporate the color decoder’s outputs, with the corresponding ground-truth images. A higher PSNR and SSIM, and a lower LPIPS, indicate better performance of the color decoder in predicting accurate and perceptually consistent color information under varying lighting and viewpoint conditions.
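As a concrete example of the first of these metrics, the snippet below computes PSNR between a rendered image and its ground truth (pixel values in [0, 1]); SSIM and LPIPS are normally taken from their reference implementations rather than re-implemented, so they are only noted in a comment.

```python
# Minimal PSNR sketch; SSIM and LPIPS would come from their published
# reference implementations and are not re-derived here.
import numpy as np

def psnr(rendered, ground_truth, max_val=1.0):
    mse = np.mean((rendered - ground_truth) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

gt = np.random.rand(256, 256, 3)                            # stand-in ground truth
pred = np.clip(gt + 0.01 * np.random.randn(*gt.shape), 0.0, 1.0)
print(f"PSNR: {psnr(pred, gt):.2f} dB")                     # higher is better
```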
Comment 7: The output image is not clearly presented in Figure 3.
Response 7: Thank you for pointing this out. We have redrawn Figure 5 (formerly Figure 3) to more clearly illustrate the input, output, and processing steps of the framework.
Figure 5. Gaussian Color Feature Prediction Network
Comment 8: Please, make a double-check of the equations.
Response 8: Thank you for the reminder. We have thoroughly reviewed and verified the correctness and formatting of all equations in the manuscript. Minor typos and misalignments have been corrected.
Location in revised manuscript: Throughout Sections 3 and 4.
Comment 9: The code should be uploaded to a repository for reproducibility purposes.
Response 9: Thank you for your important suggestion. We completely agree that open-source code is critical for reproducibility and further community adoption. In response, we have made our full implementation, along with pretrained models and detailed instructions, publicly available at: https://github.com/zhouyue270/Gaussian-UDSR
We have also updated the manuscript to include this link in the conclusion section.
Location in revised manuscript: Page 16, Lines 556–558.
[Revised sentence:]
To promote reproducibility and encourage future research, we have publicly released the source code and pretrained models at: https://github.com/zhouyue270/Gaussian-UDSR.
Comment 10: The label of Section 4.3 is moved into Figure 6. Using LaTeX, we can avoid these mistakes.
Response 10: Thank you for noticing this layout issue. We have corrected the misplaced section heading and ensured all section titles and figure placements are properly aligned.
Location in revised manuscript: Page 14, Line 467.
Comment 11: A subsection focused on "Discussions" should be included in Section 4.
Response 11: Thank you for your important advice! We have added a paragraph in the conclusion explicitly discussing limitations, such as the need for high-quality synchronized sensor data and challenges in the pose initialization of dynamic objects. We have added future research directions in the conclusion, including plans for lightweight deployment and the extension of Gaussian models to multimodal data.
Location in revised manuscript: Page 16, Lines 532–555.
[Revised sentence:]
In this study, a method is proposed for reconstructing the unbounded dynamic 3D scenes that autonomous driving cars encounter. This method innovatively utilizes the 3D Gaussian Splatting technique and introduces a deep learning network on this basis. Through LiDAR-SfM point cloud fusion, the Gaussian color feature prediction network, and the pose tracking mechanism, certain achievements have been made in autonomous driving scene reconstruction. Experimental results show that this method performs well in key metrics. For example, in metrics such as PSNR, it approaches the baseline method using ground-truth poses, validating the effectiveness of modules like the pose tracking mechanism.
However, this research has certain limitations. Firstly, the current method relies on the precise spatio-temporal synchronization of LiDAR and cameras. In monocular or low-frame-rate sensor scenarios, the performance may decline due to the lack of sufficient depth information and continuous observations in the time dimension. Secondly, the pose initialization of dynamic objects still requires manual intervention and has not been fully automated, which increases labor and time costs in large-scale data processing and practical applications.
Based on these results, future research can proceed in the following directions. On the one hand, we plan to extend the Gaussian model to multi-modal data, integrating data from more sensors such as IMUs (Inertial Measurement Units) and millimeter-wave radars, so as to enhance robustness in complex environments (such as extreme weather and heavily occluded scenes). On the other hand, we will explore lightweight network design, using techniques such as model compression and pruning to reduce computational complexity and support real-time deployment on edge devices, thereby promoting the widespread use of autonomous driving simulators in practical application scenarios.
Comment 12: The processing times or the computing complexity of the proposed method are missing.
Response 12: Thank you for this important comment. In Section 3.2, we have added a detailed comparison of the computational complexity between NeRF and our method. Meanwhile, in Section 4.2, we have supplemented the processing time of our method.
Location in revised manuscript: Section 3.2: Page 8, Lines 261–283. Section 4.2: Page 11, Lines 407–415.
[Section 3.2 Revised sentence:]
The computational complexity of NeRF mainly comes from two aspects: the evaluation of the neural network for each query point in the volume rendering process and the construction of the neural network itself. Assume that, in a traditional NeRF-based method, the neural network has L layers with N_l neurons in the l-th layer (l = 1, ..., L). For a single query point, the forward pass through the network has a complexity of O(∑_{l=1}^{L} N_{l−1} N_l). In the volume rendering process, if we consider a scene with V volume elements (voxels) and R rays for rendering, the overall computational complexity is approximately O(R · V · ∑_{l=1}^{L} N_{l−1} N_l).
3D Gaussian Splatting reduces the number of elements that need to be processed compared to the voxel-based approach in NeRF. Specifically, we represent the scene with M Gaussian primitives, where M ≪ V. The evaluation of the Gaussian-based model for a single ray has a complexity of O(M). Moreover, our deep learning network is designed to be more lightweight: suppose it has L′ layers with N′_l neurons in the l-th layer (l = 1, ..., L′), with L′ < L and N′_l < N_l for most l. The forward-pass computation for a single query point in our network then has a complexity of O(∑_{l=1}^{L′} N′_{l−1} N′_l). Considering the same number of rays R for rendering, the overall computational complexity of our method is approximately O(R · (M + ∑_{l=1}^{L′} N′_{l−1} N′_l)).
By comparison, it is evident that our method significantly reduces the computational complexity. In practical scenarios, we have observed that the reduction in the number of processed elements (from V voxels to M Gaussians, with M ≪ V) and the lightweight design of the neural network decrease the computational cost by at least several times compared to traditional NeRF-based methods, thus achieving higher efficiency in dynamic scene reconstruction for autonomous driving.
[Section 4.2 Revised sentence:]
Across all metrics, our model achieves the best performance among all methods, with an 8.8% improvement in PSNR, a 75% reduction in LPIPS, and a four-orders-of-magnitude improvement in rendering speed over the NeRF-based methods [8, 46]; our model completes the whole training process in about one hour. Although 3DGS renders faster than our method, it is applicable only to static scenes, and its rendering quality degrades significantly in dynamic scenes. These results validate that Gaussian-UDSR not only provides high-quality rendering but also enables real-time performance, making it particularly well-suited for dynamic scene reconstruction in autonomous driving applications.
Author Response File: Author Response.pdf
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
I congratulate the authors for the excellent review work carried out. The article can be accepted.