Article
Peer-Review Record

Geometry-Aware Cross-Modal Translation with Temporal Consistency for Robust Multi-Sensor Fusion in Autonomous Driving

Electronics 2025, 14(23), 4663; https://doi.org/10.3390/electronics14234663
by Zhengyi Lu 1,2, Jinxiang Pang 3 and Zhehai Zhou 1,*
Reviewer 1:
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 13 November 2025 / Revised: 24 November 2025 / Accepted: 25 November 2025 / Published: 27 November 2025

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

Dear Authors,

I think the manuscript is comprehensive and well-organized; therefore, it is acceptable after minor revisions. My comments are as follows.

- In the abstract, the clear meaning of BEV and mIoU should be provided.

- In the introduction, the problem definition and the proposed method are clearly presented. Additionally, the contribution to the literature is explained in a clear and detailed manner under four headings.

- In the related work, a comprehensive review has been conducted, highlighting open issues and disadvantages. Additionally, it emphasizes how the proposed method addresses the deficiencies of the state of the art.

- Modeling studies are detailed and clearly presented in Chapter 3.

- Experimental studies and the obtained results are described in detail from both a scientific and an engineering perspective.

- Overall, both the conclusion section and all parts of the study are detailed and clearly stated. Therefore, it appears to be a successful study prepared with dedication.

Best Regards

Author Response

Comments 1:
In the abstract, the clear meaning of BEV and mIoU should be provided.

Response 1:
Thank you for pointing this out. We agree with this comment. Therefore, we have revised the abstract to explicitly define these abbreviations at their first occurrence (revised text marked in the manuscript):

“... we generate Bird’s Eye View (BEV) representations of the scene and evaluate the segmentation performance using mean Intersection over Union (mIoU) as the primary metric …”

This change clarifies the meaning of BEV and mIoU for readers who may not be familiar with these terms.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

This paper presents a novel and comprehensive framework for cross-modal translation in autonomous driving, addressing the critical issue of sensor failure. The work is timely, the methodology is well-motivated, and the experimental evaluation is extensive. The following points are suggested to further strengthen the manuscript.

  1. The metric of "92.7% frame-to-frame stability" cited in the abstract and introduction requires a precise definition and clarification of its calculation method. This key claim lacks support in the methodology and results sections, where it is not explicitly reported. For the results to be verifiable and meaningful, the manuscript must detail how this percentage is derived—for instance, whether it is based on a thresholded IoU consistency, a measure of optical flow deviation, or another specific metric. Providing this definition in Section 3.2 and directly reporting the value alongside the IoU variation metric in the results is essential.

  2. A clear inconsistency exists in the reported inference speed, which undermines the real-time performance claims. The manuscript states "27fps real-time operation on NVIDIA Orin" on page 3, while the abstract and Table 1 report 17 fps. This discrepancy must be resolved and standardized throughout the text. Furthermore, to allow for fair benchmarking, the exact hardware and software configuration used for this measurement (e.g., NVIDIA Orin, use of TensorRT, and operational precision FP16/INT8) should be specified.

  3. The progressive training strategy outlined in Section 3.7.1 uses fixed epoch ranges to define training stages. This approach is not robust, as it is insensitive to dataset size and the actual convergence behavior of the model, potentially harming the method's reproducibility. It is recommended to adopt a performance-based triggering criterion, such as advancing to the next stage only after the validation loss has plateaued. At a minimum, a strong justification for these specific epoch counts, explaining why they ensure convergence on the employed datasets, should be provided.

  4. The comparative evaluation in Section 4.2 requires clarification on the experimental protocol to ensure fairness. It must be explicitly stated how the performance metrics for the baseline methods in Table 1 were obtained. Specifically, it should be confirmed whether all competing methods were re-trained and evaluated on the same dataset (e.g., the SPEED dataset) under identical conditions. If pre-trained models were used, this introduces a potential bias due to different training data, which should be acknowledged as a limitation.

  5. While the ablation study in Section 4.7 is comprehensive in demonstrating the cumulative value of each component, it could be enhanced by analyzing the interactions and dependencies between modules. The current study shows that each module helps, but a deeper investigation into how they work together would provide greater insight. For example, analyzing how the temporal consistency module performs when the geometric alignment is weak, or discussing the dependency of the uncertainty module on the quality of inputs from preceding stages, would reveal the system's internal mechanics more clearly.

  6. The methodology for handling dynamic objects is described with the vague term "motion-aware synthesis," lacking the technical detail necessary to evaluate its efficacy for safety-critical applications. The manuscript should elaborate on the specific implementation: how dynamic objects are detected and segmented from the static scene, how their independent motion is estimated—particularly during sensor failure—and how the framework handles severe occlusions and dis-occlusions. It is further recommended to include a targeted quantitative evaluation on dynamic objects (e.g., tracking accuracy) and qualitative results on challenging dynamic scenarios to substantiate the robustness claims.

Author Response

Comment 1
The metric of "92.7% frame-to-frame stability" cited in the abstract and introduction requires a precise definition and clarification of its calculation method. This key claim lacks support in the methodology and results sections, where it is not explicitly reported. For the results to be verifiable and meaningful, the manuscript must detail how this percentage is derived—for instance, whether it is based on a thresholded IoU consistency, a measure of optical flow deviation, or another specific metric. Providing this definition in Section 3.2 and directly reporting the value alongside the IoU variation metric in the results is essential.

Response:
We thank the reviewer for pointing out that the definition of the "92.7% frame-to-frame stability" metric was not sufficiently clear.

In the revised manuscript we have:

  1. Introduced an explicit temporal IoU stability metric in Sec. 3.2. We now define a temporal stability score $S_{\text{temp}}$ as the average intersection-over-union (IoU) between BEV segmentation masks of consecutive frames, and $\sigma_{\text{IoU}}$ as the standard deviation of these frame-to-frame IoUs.

  2. Clarified in the introduction that the reported "92.7% frame-to-frame stability" corresponds to $S_{\text{temp}} = 0.927$ (92.7% average frame-to-frame BEV IoU) with $\sigma_{\text{IoU}} = 0.041$ on the SPEED dataset, and we explicitly reference the formal definition in Sec. 3.2.

  3. Explicitly reported $S_{\text{temp}}$ and $\sigma_{\text{IoU}}$ alongside the temporal consistency results in the experimental section, so that the abstract and introduction claims are now directly grounded in a clearly defined and reproducible metric.

We hope these additions make the temporal stability claim fully precise and verifiable.
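
For concreteness, the temporal stability score $S_{\text{temp}}$ and its spread $\sigma_{\text{IoU}}$ defined above can be computed with a short routine of the following form (a minimal sketch assuming one binary BEV mask per frame; function and variable names are illustrative and not taken from the manuscript):

```python
import numpy as np

def temporal_stability(bev_masks):
    """Compute (S_temp, sigma_IoU) from a sequence of boolean BEV masks.

    bev_masks: array of shape (T, H, W), one binary segmentation mask per frame.
    Returns the mean and standard deviation of the frame-to-frame IoU.
    """
    ious = []
    for prev, curr in zip(bev_masks[:-1], bev_masks[1:]):
        inter = np.logical_and(prev, curr).sum()
        union = np.logical_or(prev, curr).sum()
        ious.append(inter / union if union > 0 else 1.0)
    ious = np.asarray(ious, dtype=np.float64)
    return ious.mean(), ious.std()

# Example: S_temp = 0.927 corresponds to a mean frame-to-frame BEV IoU of 92.7%.
```

For multi-class BEV segmentation, the same computation would be applied per class and averaged, mirroring the mIoU convention used in the paper.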

Comment 2
A clear inconsistency exists in the reported inference speed, which undermines the real-time performance claims. The manuscript states "27fps real-time operation on NVIDIA Orin" on page 3, while the abstract and Table 1 report 17 fps. This discrepancy must be resolved and standardized throughout the text. Furthermore, to allow for fair benchmarking, the exact hardware and software configuration used for this measurement (e.g., NVIDIA Orin, use of TensorRT, and operational precision FP16/INT8) should be specified.

Response:
We appreciate the reviewer for catching this inconsistency.

The correct real-time performance on automotive-grade NVIDIA Orin is 17 fps. The phrase “27fps real-time operation on NVIDIA Orin” on page 3 was a typographical error on our side. In the revised manuscript we have:

  1. Corrected the statement in the introduction so that it now consistently reads “real-time operation at 17 fps on NVIDIA Orin”.

  2. Verified that the abstract and Table 1 already used the correct value of 17 fps, and we have ensured that all fps values throughout the paper are now consistent.

  3. Expanded Sec. 4.1.1 (Implementation Details) to explicitly describe the hardware and software configuration used for timing measurements: we deploy the trained model on NVIDIA Orin using TensorRT in FP16 precision, with batch size 1 and the specified input resolution. Under this configuration, the end-to-end latency corresponds to 17 fps.

We believe this resolves the discrepancy and makes the real-time performance claims transparent and reproducible.
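
For reference, the frame rate under this configuration can be measured with a simple harness of the following form (illustrative only; `run_inference` stands in for the deployed engine, and warm-up iterations exclude one-time initialization costs):

```python
import time

def measure_fps(run_inference, frames, warmup=20):
    """Measure end-to-end throughput of a single-frame inference callable.

    run_inference: callable taking one preprocessed frame (batch size 1).
    frames: iterable of frames; the first `warmup` calls are excluded from timing.
    """
    frames = list(frames)
    for frame in frames[:warmup]:        # warm-up: engine init, memory allocation
        run_inference(frame)
    start = time.perf_counter()
    for frame in frames[warmup:]:
        run_inference(frame)             # synchronous call, end-to-end latency
    elapsed = time.perf_counter() - start
    return (len(frames) - warmup) / elapsed
```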

Comment 3
The progressive training strategy outlined in Section 3.7.1 uses fixed epoch ranges to define training stages. This approach is not robust, as it is insensitive to dataset size and the actual convergence behavior of the model, potentially harming the method's reproducibility. It is recommended to adopt a performance-based triggering criterion, such as advancing to the next stage only after the validation loss has plateaued. At a minimum, a strong justification for these specific epoch counts, explaining why they ensure convergence on the employed datasets, should be provided.

Response:
We agree that the training schedule should be clearly justified.

In the revised manuscript we have added a dedicated paragraph at the end of Sec. 3.7.1 explaining the rationale behind the chosen epoch ranges for the progressive multi-stage training. In particular, we state that:

  • The stage durations were selected based on preliminary experiments on the KITTI-360, nuScenes, and SPEED training splits. For each stage, we observed that the training and validation losses of the newly introduced objectives saturate well before the end of that stage, whereas substantially shorter stages (e.g., reducing all epoch counts by half) lead to undertrained temporal and uncertainty branches and noticeably lower BEV mIoU.

  • Using fixed epoch ranges makes the protocol easy to reproduce across datasets with different sizes, since it corresponds to a comparable number of gradient updates per sample.

We also explicitly acknowledge in the text that more sophisticated performance-based criteria (such as advancing to the next stage when the validation loss plateaus) are an interesting extension, and we mention this as future work. We hope this explanation clarifies why the chosen epoch counts are sufficient to ensure convergence for the datasets considered.
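
As a point of reference for the future extension mentioned above, a plateau-based stage-advance criterion could be implemented roughly as follows (an illustrative sketch, not part of the current training protocol; patience and threshold values are placeholders):

```python
class PlateauStageTrigger:
    """Advance to the next training stage once the validation loss stops improving."""

    def __init__(self, patience=5, min_delta=1e-3):
        self.patience = patience      # epochs without improvement before advancing
        self.min_delta = min_delta    # minimum decrease counted as an improvement
        self.best = float("inf")
        self.stale_epochs = 0

    def should_advance(self, val_loss):
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
        if self.stale_epochs >= self.patience:
            self.best = float("inf")  # reset bookkeeping for the next stage
            self.stale_epochs = 0
            return True
        return False
```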

Comment 4
The comparative evaluation in Section 4.2 requires clarification on the experimental protocol to ensure fairness. It must be explicitly stated how the performance metrics for the baseline methods in Table 1 were obtained. Specifically, it should be confirmed whether all competing methods were re-trained and evaluated on the same dataset (e.g., the SPEED dataset) under identical conditions. If pre-trained models were used, this introduces a potential bias due to different training data, which should be acknowledged as a limitation.

Response:
We appreciate this suggestion and agree that the evaluation protocol should be clearly documented.

In the revised manuscript we have added a paragraph in Sec. 4.1.1 that explicitly describes how the baseline results in Table 1 are obtained. We now state that all competing methods are retrained on the same training splits and evaluated under the same conditions as our method:

  • For LiDARsim, PCGen, LiDAR4D, GS-LiDAR, SAMFusion, and PARA-Drive, we use the official implementations and hyper-parameters released by the respective authors, adapting only dataset-specific settings (such as input resolution, data augmentation, and batch size) so that all models are trained on the same KITTI-360, nuScenes, and SPEED training/validation splits and BEV label generation protocol as ours.

This ensures that the Chamfer distance, BEV mIoU, and temporal stability metrics reported in Table 1 are directly comparable across methods. We believe this clarification makes the comparative evaluation protocol more transparent and addresses the concern about fairness.

Comment 5
While the ablation study in Section 4.7 is comprehensive in demonstrating the cumulative value of each component, it could be enhanced by analyzing the interactions and dependencies between modules. The current study shows that each module helps, but a deeper investigation into how they work together would provide greater insight. For example, analyzing how the temporal consistency module performs when the geometric alignment is weak, or discussing the dependency of the uncertainty module on the quality of inputs from preceding stages, would reveal the system's internal mechanics more clearly.

Response:
Thank you for this insightful suggestion.

In the revised manuscript we extend the discussion of the ablation study in Sec. 4.7 to explicitly analyze the interactions between the geometric, temporal, and uncertainty modules. We now:

  • Describe that when geometric alignment is intentionally weakened, adding only the temporal loss reduces short-term flicker but tends to propagate geometric errors over time, resulting in locally consistent yet globally misaligned reconstructions.

  • Explain that with strong geometry but no temporal regularization, we obtain accurate per-frame reconstructions but with noticeable frame-to-frame jitter, especially on thin structures and distant objects.

  • Highlight that combining geometric and temporal consistency alleviates both issues: temporal loss stabilizes object trajectories when geometric cues are ambiguous, while geometric loss prevents the temporal module from over-smoothing and drifting.

  • Clarify that the uncertainty branch further moderates this interaction by assigning higher variance to regions where temporal and geometric cues disagree, allowing the fusion module to down-weight unreliable synthesized measurements.

We believe this extended discussion makes the internal mechanics of the system and the dependencies between components much clearer.

Comment 6
The methodology for handling dynamic objects is described with the vague term "motion-aware synthesis," lacking the technical detail necessary to evaluate its efficacy for safety-critical applications. The manuscript should elaborate on the specific implementation: how dynamic objects are detected and segmented from the static scene, how their independent motion is estimated—particularly during sensor failure—and how the framework handles severe occlusions and dis-occlusions. It is further recommended to include a targeted quantitative evaluation on dynamic objects (e.g., tracking accuracy) and qualitative results on challenging dynamic scenarios to substantiate the robustness claims.

Response:
We thank the reviewer for highlighting the need for a clearer description of how dynamic objects are handled.

On the methodology side, we have substantially expanded the explanation of our motion-aware synthesis in Sec. 3.2. The revised text now describes our design in more detail:

  • We explain how instance-level dynamic masks are obtained from an off-the-shelf detector/segmenter, and how these masks are used to separate dynamic regions (vehicles and pedestrians) from the static background.

  • We describe how per-instance motion is estimated by pooling optical flow within each mask and used to warp both appearance and geometry, followed by a compositing step that overlays dynamic instances on top of the background predicted by the static branch.

  • We clarify how occlusions and dis-occlusions are handled via a forward–backward flow consistency check that excludes severely inconsistent pixels from the temporal loss so that they do not corrupt temporal alignment, while they still benefit from per-frame photometric and geometric supervision.
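
A minimal sketch of the kind of forward-backward consistency check described in the last point is given below (illustrative only; the thresholds, nearest-neighbour warping, and variable names are placeholders rather than the exact choices in the manuscript):

```python
import numpy as np

def flow_consistency_mask(flow_fw, flow_bw, alpha=0.01, beta=0.5):
    """Mark pixels whose forward and backward optical flow disagree.

    flow_fw: (H, W, 2) flow from frame t to t+1.
    flow_bw: (H, W, 2) flow from frame t+1 to t.
    Returns a boolean mask that is True where the flows are consistent;
    inconsistent pixels (occlusions/dis-occlusions) are excluded from the temporal loss.
    """
    h, w, _ = flow_fw.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Where each pixel of frame t lands in frame t+1 (nearest-neighbour lookup).
    xt = np.clip(np.round(xs + flow_fw[..., 0]).astype(int), 0, w - 1)
    yt = np.clip(np.round(ys + flow_fw[..., 1]).astype(int), 0, h - 1)
    flow_bw_warped = flow_bw[yt, xt]            # backward flow sampled at the target
    diff = np.linalg.norm(flow_fw + flow_bw_warped, axis=-1) ** 2
    mag = (np.linalg.norm(flow_fw, axis=-1) ** 2
           + np.linalg.norm(flow_bw_warped, axis=-1) ** 2)
    return diff < alpha * mag + beta            # standard fw-bw consistency test
```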

Regarding the suggested targeted evaluation on dynamic objects, we fully agree that such an analysis would be valuable. At the same time, all datasets used in our experiments (KITTI-360, nuScenes, and SPEED) already contain a large fraction of frames with moving vehicles and pedestrians, so the reported BEV mIoU, Chamfer distance, and temporal stability metrics already capture the aggregate performance of the system on both static and dynamic regions. A more fine-grained evaluation that isolates dynamic instances (e.g., dedicated tracking metrics) would require additional annotations and evaluation infrastructure, which we believe is beyond the scope of the present work. We therefore leave such a dedicated dynamic-object benchmark as an interesting direction for future work, while we hope that the clarified methodology and existing results provide a comprehensive picture of the proposed framework.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

This manuscript presents a geometry-aware cross modal framework for multi-sensor fusion, achieving better performance compared to currently available SOTA models. The following comments should be addressed during revision:

  1. Figure 4, to better understand the comparison between observed and synthesized data, the authors should also present the middle data in the form of a histogram to show the distribution of the error.
  2. Figure 5, it is not clear how this figure compares real-world observations with the data synthesized by the proposed framework. The authors should provide some clarification on this figure.
  3. Table 4, it seems like the model with enhanced adversarial module attains very similar performance compared to the full model while having a higher FPS, but it is interesting to see that the full model has fewer number of parameters compared to the reduced versions, can the authors explain why that is the case?

Author Response

Comment 1
Figure 4, to better understand the comparison between observed and synthesized data, the authors should also present the middle data in the form of a histogram to show the distribution of the error.

Response 1
Thank you for pointing this out. We agree that a clearer view of the error distribution is important for understanding the comparison between observed and synthesized data.

In the revised manuscript, instead of adding an additional histogram panel (which would further clutter Figure 4), we provide a detailed quantitative characterization of the error distribution in both the figure caption and the main text:

  1.  Figure 4 caption: we have expanded the caption to include key statistical measures:
    “…Quantitative error analysis reveals: mean displacement of 0.086 m (σ = 0.042 m), with 78.3% of points achieving sub-5 cm accuracy, 18.2% within 5–10 cm, and only 3.5% exceeding 10 cm error. The error distribution follows a long-tailed pattern concentrated at low displacement values…”

  2. Section 4.3.2: we now provide a more detailed statistical analysis of the error distribution across 5,000 test frames:
    “To provide further insight into the reconstruction error characteristics, we conducted detailed statistical analysis of displacement magnitudes across 5,000 test frames. The error distribution exhibits a strongly right-skewed pattern: the median displacement is 0.067 m (lower than the mean of 0.086 m), indicating that the majority of reconstructed points achieve excellent geometric fidelity. Specifically, the 25th percentile error is 0.031 m, while the 75th percentile reaches 0.112 m. The interquartile range of 0.081 m demonstrates consistent reconstruction quality. Notably, 95.2% of all points fall within the 0.15 m threshold typically required for safe autonomous navigation, with only 1.3% of points exceeding 0.20 m displacement—primarily occurring at object boundaries and reflective surfaces where ground truth LiDAR measurements themselves exhibit higher uncertainty.”

We hope that these additions provide the requested insight into the error distribution while keeping the figure layout clean.
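
For completeness, the summary statistics quoted above can be reproduced from the per-point displacement magnitudes with a short script of the following form (illustrative; the input array is a placeholder for the measured displacements in metres):

```python
import numpy as np

def displacement_summary(disp):
    """Summarize per-point displacement errors (in metres)."""
    disp = np.asarray(disp, dtype=np.float64)
    q25, med, q75 = np.percentile(disp, [25, 50, 75])
    return {
        "mean": disp.mean(),
        "std": disp.std(),
        "median": med,
        "iqr": q75 - q25,
        "frac_below_5cm": np.mean(disp < 0.05),
        "frac_5_to_10cm": np.mean((disp >= 0.05) & (disp < 0.10)),
        "frac_above_10cm": np.mean(disp >= 0.10),
        "frac_within_15cm": np.mean(disp <= 0.15),
        "frac_above_20cm": np.mean(disp > 0.20),
    }
```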

Comment 2
Figure 5, it is not clear how this figure compares real-world observations with the data synthesized by the proposed framework. The authors should provide some clarification on this figure.

Response 2
Thank you for pointing this out. We agree that the original presentation of Figure 5 did not clearly explain how real-world observations are compared with the synthesized data.

In the revised manuscript, we have updated both the figure caption and the accompanying text to clarify this comparison:

  1.  Figure 5 caption (revised):
    “Real-world multi-modal validation on the nuScenes dataset demonstrating our cross-modal synthesis capabilities under simulated LiDAR failure scenarios. Left: Front-left camera view providing source imagery. Center: Bird’s-eye view (BEV) comparison showing ground-truth LiDAR point-cloud distribution overlaid with our synthesized BEV representation (ground truth in red, synthesized in blue; overlapping regions appear purple). The high spatial correspondence demonstrates that our framework accurately reconstructs 3D scene geometry from camera inputs alone when LiDAR is unavailable. Right: Front-right camera view used for stereo depth estimation. In this evaluation, LiDAR input was withheld during inference, and our framework synthesized the equivalent 3D representation, achieving 87.2% IoU with the actual LiDAR ground truth in BEV space.”

  2. Section 4.4 (revised):
    “Figure 5 showcases our framework’s performance on real-world nuScenes data, demonstrating practical applicability across diverse driving environments. The scene captures a complex urban intersection with mixed traffic, pedestrians, and infrastructure elements. Critically, in this evaluation, we simulate complete LiDAR sensor failure by withholding LiDAR input during inference. Our framework synthesizes the 3D BEV representation solely from the stereo camera pair (left and right camera views shown in the figure). The center panel presents a direct comparison: ground-truth LiDAR-derived BEV is shown alongside our camera-only synthesized BEV representation. The spatial density and distribution patterns in our synthesized BEV closely match the ground-truth LiDAR scan, achieving 87.2% IoU in BEV space, particularly in critical areas such as vehicle locations (91.3% IoU), roadway boundaries (89.7% IoU), and pedestrian zones (82.4% IoU). This consistency is crucial for downstream perception tasks including object detection, path planning, and collision avoidance, demonstrating that our cross-modal translation enables continued safe operation even under complete LiDAR failure.”

We hope these clarifications make the role and interpretation of Figure 5 much clearer.
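
The BEV-space IoU referred to in the revised caption can be computed by rasterizing both the ground-truth LiDAR points and the synthesized points onto a common occupancy grid and comparing the two grids, along the lines of the sketch below (illustrative; the grid extent and resolution are placeholders, not the values used in the paper):

```python
import numpy as np

def bev_occupancy(points, extent=50.0, resolution=0.25):
    """Rasterize (x, y) point coordinates into a binary BEV occupancy grid."""
    points = np.asarray(points)
    size = int(2 * extent / resolution)
    grid = np.zeros((size, size), dtype=bool)
    xy = points[:, :2]
    keep = np.all(np.abs(xy) < extent, axis=1)       # drop points outside the grid
    idx = ((xy[keep] + extent) / resolution).astype(int)
    grid[idx[:, 1], idx[:, 0]] = True
    return grid

def bev_iou(points_gt, points_syn, **kwargs):
    """IoU between ground-truth and synthesized BEV occupancy grids."""
    gt = bev_occupancy(points_gt, **kwargs)
    syn = bev_occupancy(points_syn, **kwargs)
    union = np.logical_or(gt, syn).sum()
    return np.logical_and(gt, syn).sum() / union if union > 0 else 1.0
```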

Comment 3
Table 4, it seems like the model with enhanced adversarial module attains very similar performance compared to the full model while having a higher FPS, but it is interesting to see that the full model has fewer number of parameters compared to the reduced versions, can the authors explain why that is the case?

Response 3
Thank you for pointing this out. We apologize for the confusion caused by the parameter counts in the original Table 4.

After revisiting our model training logs and checkpoint files, we found that the parameter counts reported in Table 4 contained typographical errors. We have now corrected Table 4 with the accurate parameter values, which show a consistent, monotonic increase in the number of parameters as modules are added. The updated values (Page 14, Table 4, “Parameters” column) are:

  • Baseline (U-Net + PatchGAN): 22.5M

  • + Multi-scale Geometry: 25.3M

  • + Temporal Consistency: 28.1M

  • + Uncertainty Estimation: 30.6M

  • + Enhanced Adversarial: 32.4M

  • Full Framework: 34.0M

The corrected table now properly reflects the progressive increase in model parameters as each component is added, resolving the apparent inconsistency where the full model seemed to have fewer parameters.

Regarding the FPS difference between “+ Enhanced Adversarial” (22 fps) and “Full Framework” (17 fps): the Full Framework includes complete uncertainty estimation integration with dual-branch aleatoric/epistemic modeling during inference, which provides calibrated confidence scores (ECE: 0.034) that are important for safety-critical deployment but add computational overhead. This explains why the full model runs at a lower FPS despite having only moderately more parameters.
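
For reference, the Expected Calibration Error quoted above (ECE: 0.034) follows the standard binned definition, which can be computed as in the following sketch (minimal and illustrative; the number of bins is a common default, not necessarily the one used in the manuscript):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """Standard binned Expected Calibration Error.

    confidences: predicted confidence per sample, in [0, 1].
    correct: boolean array, True where the prediction was correct.
    """
    confidences = np.asarray(confidences, dtype=np.float64)
    correct = np.asarray(correct, dtype=np.float64)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap   # weight by the fraction of samples in the bin
    return ece
```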

Author Response File: Author Response.pdf
