Article
Peer-Review Record

Dynamic Feature Fusion for Sparse Radar Detection: Motion-Centric BEV Learning with Adaptive Task Balancing

Sensors 2026, 26(3), 968; https://doi.org/10.3390/s26030968
by Yixun Sang, Junjie Cui *, Yaoguang Sun, Fan Zhang, Yanting Li and Guoqiang Shi
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Reviewer 4: Anonymous
Submission received: 26 December 2025 / Revised: 21 January 2026 / Accepted: 27 January 2026 / Published: 2 February 2026
(This article belongs to the Section Radar Sensors)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors


This paper proposes a motion-aware framework for 4D millimeter-wave (mmWave) radar detection in autonomous driving. The core contributions (velocity vector decomposition encoding, an uncertainty-weighted multi-task loss function, and a two-stage progressive training strategy) effectively address key challenges such as point cloud sparsity, high noise levels, and optimization conflicts between multiple tasks. Experimental results on the TJ4D dataset demonstrate that the model achieves 33.25% mAP. Remarkably, the model remains lightweight with only 1.73M parameters while maintaining real-time performance.

However, several critical issues require further clarification and revision before publication.

1. In Section 3.2, the author introduces a velocity vector decomposition strategy. However, the physical justification for applying independent convolutions to the decomposed XY components is not sufficiently explained. The author should provide a more rigorous theoretical explanation of how this specific architectural design captures the true motion orientation and heading of dynamic objects.

2. Gating Mechanism for Noise Suppression: Equation (3) describes a fusion gating mechanism based on intensity features. The author needs to further clarify how the Sigmoid-activated intensity features accurately distinguish between high-frequency noise (e.g., clutter or multipath reflections) and genuine small targets, such as pedestrians, especially given the low SNR of radar data.

3. The paper reports a real-time performance of 24.4 FPS on the Jetson AGX Orin platform. It is essential to specify whether this frame rate includes the temporal overhead of the "multi-frame point cloud preprocessing" (including motion compensation and registration). Transparency regarding end-to-end latency is vital for practical autonomous driving deployment.

4. The current ablation studies in Table 2 follow a cumulative "step-by-step" integration approach. To rigorously exclude non-linear interference between modules and verify the independent contribution of each innovation, the author should provide a full-factorial ablation study (e.g., testing "Baseline + TPT" without "MA-BEV").

5. The manuscript cites several preprints and works from late 2024 and 2025. The author must ensure that all citation formats are complete. For articles that have moved from "early access" to formal publication, please update the volume, issue, and page numbers.

6. The author may consider referencing recent advancements in lightweight 3D detection and interpretable neural networks to broaden the discussion on architectural efficiency and physical constraints, such as:
1. *RailVoxelDet: A Lightweight 3D Object Detection Method for Railway Transportation Driven by on-Board LiDAR Data.*
2. *A novel physical constraint-guided quadratic neural networks for interpretable bearing fault diagnosis under zero-fault sample.*

 

Author Response

Response to Reviewer 1

Thank you for recognizing the contributions of our work and providing valuable suggestions for improvement.

Comment 1.1: Physical justification for velocity decomposition

Reviewer comment:

In Section 3.2, the author introduces a velocity vector decomposition strategy. However, the physical justification for applying independent convolutions to the decomposed XY components is not sufficiently explained. The author should provide a more rigorous theoretical explanation of how this specific architectural design captures the true motion orientation and heading of dynamic objects.

Response:

Thank you for this important comment. We have added explicit physical justification in Section 3.2 of the revised manuscript.

Physical interpretation:

On the BEV plane, a target's horizontal motion is represented as a 2D velocity vector $\mathbf{v} = (V_x, V_y)$, where:

  • Heading direction is uniquely determined by the arctangent of $V_y / V_x$
  • Speed magnitude equals the square root of $V_x^2 + V_y^2$

Therefore, $(V_x, V_y)$ forms an orthogonal basis that preserves complete directional information.
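The heading and speed relations above can be checked with a few lines of Python. This is only an illustrative sketch: the function name `heading_and_speed` is hypothetical, and the quadrant-aware `math.atan2` is used in place of a plain arctangent so that the signs of both components are respected.

```python
import math

def heading_and_speed(vx: float, vy: float) -> tuple[float, float]:
    """Return (heading in radians, speed) for a BEV velocity vector (vx, vy)."""
    heading = math.atan2(vy, vx)  # quadrant-aware arctangent of vy / vx
    speed = math.hypot(vx, vy)    # sqrt(vx**2 + vy**2)
    return heading, speed

# Hypothetical example values, not measurements from the paper.
heading, speed = heading_and_speed(3.0, 4.0)
print(f"heading = {math.degrees(heading):.2f} deg, speed = {speed:.2f}")
```

Both quantities are recoverable from the two components alone, which is the sense in which the decomposition loses no directional information.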

Why independent encoding?

Our design applies separate convolutional encoders to Vx and Vy to:

  1. Learn axis-specific local motion consistency (e.g., sign coherence along each direction)
  2. Capture spatial gradients independently before mixing
  3. Avoid premature entanglement of direction, magnitude, and measurement noise in shared filters

The subsequent fusion stage then learns the coupling between components in a data-driven manner, enabling stable encoding of arbitrary motion orientations.
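As a rough illustration of the independent-encoding idea (not the authors' implementation), the sketch below filters the Vx and Vy maps with separate 3x3 kernels and only then mixes the two responses. The helper names `conv3x3` and `dual_branch_encode`, the fixed kernels, and the linear `mix` weights are all assumptions; the real model would use learned multi-channel filters and a learned fusion stage.

```python
def conv3x3(grid, kernel):
    """Zero-padded 3x3 cross-correlation over a 2D list of floats."""
    h, w = len(grid), len(grid[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    y, x = i + di, j + dj
                    if 0 <= y < h and 0 <= x < w:
                        acc += kernel[di + 1][dj + 1] * grid[y][x]
            out[i][j] = acc
    return out

def dual_branch_encode(vx_map, vy_map, kernel_x, kernel_y, mix=(1.0, 1.0)):
    """Filter Vx and Vy with separate kernels, then combine the two responses."""
    fx = conv3x3(vx_map, kernel_x)  # branch that sees only Vx
    fy = conv3x3(vy_map, kernel_y)  # branch that sees only Vy
    return [[mix[0] * a + mix[1] * b for a, b in zip(row_x, row_y)]
            for row_x, row_y in zip(fx, fy)]

# Hypothetical 2x2 BEV velocity maps and identity kernels, for illustration only.
IDENTITY = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
encoded = dual_branch_encode([[1, 2], [3, 4]], [[10, 20], [30, 40]], IDENTITY, IDENTITY)
print(encoded)
```

Because each branch has its own kernel, the filters for one axis are never forced to fit the statistics of the other; the coupling is deferred to the mixing step.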

Revision in manuscript:

We added a detailed paragraph in Section 3.2 explaining this physical interpretation and architectural motivation, including the relationship between Vx, Vy decomposition and object heading/magnitude representation.

Comment 1.2: Gating mechanism for noise suppression

Reviewer comment:

Equation (3) describes a fusion gating mechanism based on intensity features. The author needs to further clarify how the Sigmoid-activated intensity features accurately distinguish between high-frequency noise (e.g., clutter or multipath reflections) and genuine small targets, such as pedestrians, especially given the low SNR of radar data.

Response:

Thank you for raising this concern. We clarify that the gating mechanism is not a hard discriminator but rather a soft confidence modulation.

Key clarifications:

  1. What the gate modulates: The gate operates on the fused motion-density representation, not on raw measurements. It re-weights the combined features based on learned intensity-related confidence.
  2. How it's implemented: The intensity feature is generated from radar return cues (RCS/SNR statistics) via a lightweight convolutional subnetwork, then mapped to [0,1] via sigmoid activation.
  3. Physical motivation: Returns with higher SNR are generally more reliable, while low-SNR regions are prone to clutter/multipath. The gate uses this as a reliability prior.
  4. Why it doesn't suppress small targets: The gate is learned end-to-end under detection supervision. It modulates the combined motion and density signals, so small targets with consistent motion patterns are preserved even at lower intensity because the motion component provides complementary evidence.
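The soft-gating behaviour described in these points can be sketched per BEV cell as follows. The exact form of the gating equation is not reproduced in this record, so the expression `sigmoid(intensity) * (motion + density)` is an assumption made for illustration, and all numeric values are hypothetical.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def gated_additive_fusion(motion: float, density: float, intensity_logit: float) -> float:
    """Soft gate in [0, 1] re-weighting the sum of motion and density features."""
    gate = sigmoid(intensity_logit)
    return gate * (motion + density)

# Two cells with the same low intensity (hypothetical values):
# a pedestrian-like cell with a strong motion response, and a clutter-like cell.
weak_target = gated_additive_fusion(motion=2.0, density=0.5, intensity_logit=-1.0)
clutter = gated_additive_fusion(motion=0.1, density=0.2, intensity_logit=-1.0)
print(f"weak target: {weak_target:.3f}, clutter: {clutter:.3f}")
```

With identical gate values, the cell with consistent motion evidence still produces the larger fused response, which is the mechanism the response relies on for preserving low-intensity small targets.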

Revision in manuscript:

We added explicit implementation details after Equation (3) in Section 3.2, explaining:

  • The computational structure (Conv 3x3 → Group Norm → Conv 1x1)
  • Three physical principles: intensity emphasizes high-confidence regions, density suppresses spatial ambiguity, additive fusion reflects motion saliency
  • Clarification that end-to-end learning prevents over-suppression of genuine small targets

Comment 1.3: FPS measurement scope

Reviewer comment:

The paper reports a real-time performance of 24.4 FPS on the Jetson AGX Orin platform. It is essential to specify whether this frame rate includes the temporal overhead of the "multi-frame point cloud preprocessing" (including motion compensation and registration). Transparency regarding end-to-end latency is vital for practical autonomous driving deployment.

Response:

Thank you for requesting this clarification. We have explicitly stated the inference configuration in multiple places.

Key clarifications:

  1. Training vs. Inference: Multi-frame preprocessing (motion compensation and registration) is only used during Stage 1 training. At inference time, the detector operates on single-frame point clouds.
  2. FPS scope: The reported 24.4 FPS corresponds to single-frame inference without multi-frame preprocessing.
  3. Rationale: Our two-stage training strategy uses multi-frame aggregation to learn dense motion patterns during training, but deploys as a single-frame detector to meet real-time requirements.
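For readers who think in latency rather than throughput, the reported frame rate converts to a per-frame budget as sketched below. The helper names and the 20 ms preprocessing figure are hypothetical and only show how any extra per-frame stage would lower end-to-end FPS; the record states that no such stage runs at inference.

```python
def fps_to_latency_ms(fps: float) -> float:
    """Per-frame latency in milliseconds for a given throughput."""
    return 1000.0 / fps

def end_to_end_fps(inference_ms: float, preprocessing_ms: float = 0.0) -> float:
    """Throughput when preprocessing and inference run sequentially per frame."""
    return 1000.0 / (inference_ms + preprocessing_ms)

single_frame_ms = fps_to_latency_ms(24.4)  # reported single-frame inference rate
print(f"24.4 FPS corresponds to {single_frame_ms:.2f} ms per frame")

# Hypothetical: if a 20 ms multi-frame preprocessing step were added at inference.
print(f"with 20 ms extra: {end_to_end_fps(single_frame_ms, 20.0):.1f} FPS")
```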

Revision in manuscript:

  • In Figure 1 caption: "The dashed box indicates Multi-frame Point Cloud Preprocessing is only applied during the training phase"
  • In experimental setup section: "24.4 FPS for single-frame inference... multi-frame preprocessing is only used in Stage 1 training"
  • In training strategy section: Clarified that inference operates on single-frame inputs

Comment 1.4: Full-factorial ablation study

Reviewer comment:

The current ablation studies in Table 2 follow a cumulative "step-by-step" integration approach. To rigorously exclude non-linear interference between modules and verify the independent contribution of each innovation, the author should provide a full-factorial ablation study (e.g., testing "Baseline + TPT" without "MA-BEV").

Response:

Thank you for this suggestion. We clarify that our ablation table already presents independent experiments, not cumulative integration—but we agree the presentation was ambiguous.

Current design:

Each ablation row is built on the same baseline (RadarNeXt), enabling only one proposed component at a time:

  • Baseline + MA-BEV only
  • Baseline + GMB only
  • Baseline + TPT only
  • Baseline + All components (full model)

All delta values are computed against the same baseline for fair comparison.

Why not full factorial?

Full factorial would require 8 experiments (2³ combinations). However, some combinations (e.g., TPT without MA-BEV) are less meaningful because TPT leverages motion features enhanced by MA-BEV during Stage 1 training.
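The 2³ count can be made concrete by enumerating every on/off combination of the three components and comparing it with the rows listed under "Current design" (plus the plain baseline). The sketch below is illustrative; the variable names are hypothetical.

```python
from itertools import product

COMPONENTS = ("MA-BEV", "GMB", "TPT")

def full_factorial(components):
    """All on/off combinations of the given components, as tuples of enabled names."""
    configs = []
    for flags in product((False, True), repeat=len(components)):
        configs.append(tuple(c for c, on in zip(components, flags) if on))
    return configs

configs = full_factorial(COMPONENTS)

# Rows described in the response: baseline, each component alone, and the full model.
reported = {(), ("MA-BEV",), ("GMB",), ("TPT",), COMPONENTS}
missing = [c for c in configs if c not in reported]

print(len(configs), "configurations in total")
print("not covered by the reported rows:", missing)
```

Under this reading, the configurations absent from the table are exactly the three pairwise combinations.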

Revision in manuscript:

  • We revised the ablation study section and Table 2 caption to explicitly state:
    • "We evaluate each innovation independently under identical training protocols"
    • "All delta values in Table 2 are computed against the same baseline"
    • "The final row enables all proposed components"

This removes ambiguity about the experimental protocol.

Comment 1.5: Reference completeness

Reviewer comment:

The manuscript cites several preprints and works from late 2024 and 2025. The author must ensure that all citation formats are complete. For articles that have moved from "early access" to formal publication, please update the volume, issue, and page numbers.

Response:

Thank you for this careful review. We have conducted a comprehensive audit of all 2024-2025 references and made the following corrections:

  1. Updated 5 Early Access articles with complete DOI information:
    • Jiang et al. (Social-Informer) → DOI: 10.1109/TCYB.2025.3527788
    • Jiang et al. (T-ITS) → DOI: 10.1109/TITS.2025.3572254
    • Liu et al. (T-ITS) → DOI: 10.1109/TITS.2025.3554313
    • Bi et al. (MAFF-Net) → DOI: 10.1109/LRA.2024.3511858
    • Wang et al. (DADAN) → DOI: 10.1109/JSEN.2024.3520862
  2. Resolved duplicate references:
    • Deleted MUFASA arXiv version (arXiv:2408.00565)
    • Retained official ICANN 2024 conference proceedings version (Lecture Notes in Computer Science, pp. 168-184)
    • Updated all in-text citations to use the conference version
  3. Updated 7 arXiv preprints to formal publications:
    • PRIMEDrive-CoT → IEEE/CVF CVPR Workshop 2025
    • MSF → IEEE/CVF CVPR 2023
    • Uni3D → IEEE/CVF CVPR 2023
    • SCKD → AAAI Conference 2025
    • CenterPoint → IEEE/CVF CVPR 2021, pp. 11784-11793
    • RadarPillars → IEEE ITSC 2024
    • SaViD → IEEE ICRA 2025
  4. Format consistency: All references now follow MDPI journal guidelines with complete venue names, years, and page numbers.

Revision in manuscript:

Bibliography section has been comprehensively updated with all corrections.

Comment 1.6: Additional references

Reviewer comment:

The author may consider referencing recent advancements in lightweight 3D detection and interpretable neural networks to broaden the discussion on architectural efficiency and physical constraints, such as: (1) RailVoxelDet: A Lightweight 3D Object Detection Method for Railway Transportation Driven by On-Board LiDAR Data. (2) A novel physical constraint-guided quadratic neural networks for interpretable bearing fault diagnosis under zero-fault sample.

Response:

Thank you for these valuable suggestions. We have incorporated both references and expanded the related work discussion accordingly.

Added content:

  1. Lightweight 3D Detection section: Added RailVoxelDet reference to highlight recent progress in efficient sparse voxelization under constrained computational budgets, which aligns with our embedded real-time deployment requirements.
  2. Physics-guided Learning discussion: Added discussion of physical constraint-guided interpretable learning (citing the bearing fault diagnosis work) as an inspiring direction for improving model credibility via domain constraints. While this work targets a different application, it motivates future radar perception research to integrate domain-specific physical laws when available.

Revision in manuscript:

Added two new paragraphs in the Related Work section (Section 2) discussing lightweight detection methods and physics-guided learning approaches, with proper citations to the suggested references.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The paper appears to be incomplete. Several references are incorrectly numbered, out of order, or missing. The descriptions of the proposed methods and the presentation of the experimental results lack clarity and require substantial improvement. Additionally, some figures are referenced but not adequately discussed in the text. The manuscript would benefit from reorganization to improve readability, more detailed explanations of the results, and additional experimental validation. In its current form, the paper is difficult to follow, and I cannot recommend acceptance.

Author Response

Thank you for the insightful comments on improving the technical depth and experimental validation.

Comment 2.1: Comparison with MAFF-Net, RadarNeXt, MUFASA

Reviewer comment:

Please justify more comprehensively the differences between the proposed method and MAFF-Net, RadarNeXt, MUFASA, etc.

Response:

Thank you for this important comment. We have added a comprehensive "Positioning of Our Method" paragraph in the Related Work section with detailed comparisons.

Comparison with MAFF-Net

Key difference in scope:

MAFF-Net (2020) is primarily a LiDAR+Camera fusion method designed for KITTI dataset. It uses channel attention to combine LiDAR and image features.

Our fundamental differences:

  • Sensor modality: MAFF-Net targets LiDAR-camera fusion; we address 4D mmWave radar (much sparser, noisier, but with Doppler velocity)
  • Design philosophy: MAFF-Net uses attention to filter false positives from rich LiDAR data; we handle radar sparsity through explicit motion decomposition and physically motivated gated fusion
  • Deployment focus: Our lightweight design (1.73M params, 24.4 FPS on Jetson Orin) targets embedded radar-only perception

Conclusion: MAFF-Net is not directly comparable as it addresses a different sensor modality.

Comparison with RadarNeXt

RadarNeXt (2025) provides our baseline, achieving 32.30% mAP through reparameterizable convolutional blocks and Multi-path Deformable Foreground Enhancement Network (MDFEN).

Our three orthogonal enhancements:

  1. Motion-aware BEV encoding:
    RadarNeXt: Treats velocity as scalar features
    Ours: Decompose velocity into Vx and Vy with independent dual-branch encoding to preserve directional information
    Result: Pedestrian AP improves from 24.55% to 28.71% (+4.16 points)
  2. Adaptive multi-task balancing:
    RadarNeXt: No optimization conflict handling
    Ours: Gradient-aware weight normalization and uncertainty modeling (Section 3.3)
    Result: Converges ~8 epochs faster with more stable training
  3. Progressive training strategy:
    RadarNeXt: Standard single-stage training
    Ours: Stage 1 learns from multi-frame densification, Stage 2 adapts to single-frame inference
    Result: Bridges train-inference domain gap

Performance comparison:

| Method | Params | Car | Ped | Cyc | Truck | mAP | FPS (Orin) |
|---|---|---|---|---|---|---|---|
| RadarNeXt | 1.58M | 26.24 | 24.55 | 59.78 | 18.64 | 32.30 | 28.4 |
| Ours | 1.73M | 27.13 | 28.71 | 58.33 | 18.84 | 33.25 | 24.4 |

Summary: We retain RadarNeXt's efficient backbone while enhancing feature representation and optimization, improving accuracy (especially for dynamic/minority classes) with only 9.5% parameter increase.
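The figures in this comparison can be cross-checked with simple arithmetic. The sketch below uses only the numbers reported in the table; treating mAP as the unweighted mean of the four class APs is an assumption, although it reproduces the reported values.

```python
baseline = {"params_m": 1.58, "fps": 28.4,
            "ap": {"Car": 26.24, "Ped": 24.55, "Cyc": 59.78, "Truck": 18.64}}
ours = {"params_m": 1.73, "fps": 24.4,
        "ap": {"Car": 27.13, "Ped": 28.71, "Cyc": 58.33, "Truck": 18.84}}

def mean_ap(ap_by_class):
    """Unweighted mean of per-class AP values."""
    return sum(ap_by_class.values()) / len(ap_by_class)

def relative_change(new, old):
    """Relative change in percent."""
    return 100.0 * (new - old) / old

print(f"mAP: {mean_ap(baseline['ap']):.2f} -> {mean_ap(ours['ap']):.2f}")
print(f"parameter increase: {relative_change(ours['params_m'], baseline['params_m']):.1f}%")
print(f"relative mAP gain: {relative_change(mean_ap(ours['ap']), mean_ap(baseline['ap'])):.1f}%")
print(f"FPS change: {relative_change(ours['fps'], baseline['fps']):.1f}%")
```

The parameter overhead works out to about 9.5%, in line with the summary, while the throughput on the Orin drops by roughly 14%.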

Comparison with MUFASA

MUFASA (ICANN 2024) achieves 30.23% mAP on TJ4DRadSet using multi-view attention mechanisms (GeoSPA + DEMVA) for semantic enrichment.

Our fundamental design differences:

  1. Motion representation:
    MUFASA: Implicit motion capture via learned attention
    Ours: Explicit velocity decomposition with physical interpretation (heading, magnitude)
  2. Training strategy:
    MUFASA: Standard training
    Ours: Two-stage progressive training (multi-frame densification → single-frame adaptation)
  3. Fusion mechanism:
    MUFASA: Multi-view attention aggregation
    Ours: Lightweight gated fusion based on intensity/density cues (avoids attention overhead)
  4. Deployment focus:
    MUFASA: FPS not reported
    Ours: 1.73M params, 24.4 FPS on Jetson Orin (embedded real-time deployment)

Performance comparison:

| Method | mAP | Approach | Deployment |
|---|---|---|---|
| MUFASA | 30.23% | Multi-view attention | FPS not reported |
| Ours | 33.25% | Motion decomposition + gated fusion + progressive training | 1.73M params, 24.4 FPS |

Summary: We achieve +3.02 mAP improvement through physically grounded motion encoding and progressive training under tight parameter/latency budgets.

Revision in manuscript: Added comprehensive "Positioning of Our Method" paragraph in Related Work section with detailed comparisons across sensor modality, design philosophy, and deployment constraints.

Comment 2.2: Direct velocity-aware ablation comparison

Reviewer comment:

Please compare directly with other velocity-aware methods using ablation experiments or ablation study.

Response:

Thank you for this valuable suggestion. We have added a direct comparison of velocity encoding strategies in Table 2 of the revised manuscript.

Experimental setup:

We implemented and evaluated two velocity encoding strategies on the same baseline (RadarNeXt):

  • Method 1: Vxy concatenation (naive approach)
    Directly concatenate Vx and Vy as additional channels to density features, then apply a single convolutional block. This is the most straightforward velocity incorporation.
  • Method 2: Vxy decomposition (our approach)
    Encode Vx and Vy through independent dual-branch convolutions, preserving directional independence and local motion consistency.

Experimental results on TJ4DRadSet:

| Velocity Encoding | Car | Ped. | Cyc. | Truck | mAP | Delta mAP |
|---|---|---|---|---|---|---|
| Baseline (no velocity) | 26.24 | 24.55 | 59.78 | 18.64 | 32.30 | -- |
| + Vxy concat | 26.08 | 26.31 | 57.15 | 17.92 | 31.87 | -0.43 |
| + Vxy decompose (Ours) | 28.41 | 25.83 | 57.92 | 19.84 | 33.00 | +0.70 |

Key findings:

  1. Naive concatenation fails: Surprisingly, directly concatenating Vx and Vy degrades baseline performance by 0.43 mAP. This occurs because sparse, noisy radar velocity data introduces confusion when naively mixed with density features through shared convolutions.
  2. Decomposition succeeds: Our dual-branch decomposition improves mAP by +0.70, outperforming concatenation by +1.13 mAP (33.00% vs. 31.87%). The improvement over the baseline is particularly notable for Car (+2.17 AP) and Truck (+1.20 AP).
  3. Why decomposition works:
  • Preserves motion orientation information (heading direction)
  • Maintains local velocity field consistency through separate receptive fields
  • Allows direction-specific feature learning (longitudinal vs. lateral motion)

Physical interpretation:

Radar velocity vectors encode both magnitude and direction. Naive concatenation forces the network to disentangle these coupled properties through shared weights, which is challenging for sparse data (200-500 points/frame). Our decomposition provides an inductive bias aligned with the physical structure of motion, enabling more stable feature learning.

Revision in manuscript:

Updated Table 2 to include velocity encoding comparison rows and expanded the "Key takeaways" paragraph in Section 4.2 to discuss this finding.

Comment 2.3: Theoretical justification for additive fusion

Reviewer comment:

The fusion mechanism lacks theoretical justification. There is no explanation or analysis of why additive fusion is better than concatenation or attention.

Response:

Thank you for this helpful comment. We have added explicit justification for our fusion design choice in Section 3.2.

Our fusion rationale:

  • Design context: We operate under embedded real-time constraints (1.73M parameters, 24.4 FPS on Jetson Orin) with sparse and noisy 4D radar inputs.
  • Why additive fusion?
    1. Spatial alignment assumption: Motion features and density features are encoded on the same BEV grid (same coordinates and resolution) and calibrated to comparable feature scales.
    2. Residual evidence accumulation: Under this alignment, element-wise addition can be interpreted as accumulating complementary evidence at each BEV cell, preserving spatial correspondence with minimal overhead.
    3. Efficiency vs. expressiveness trade-off:
      • Concatenation: Requires additional mixing layers, increasing parameters/compute and introducing more degrees of freedom that may learn unstable correlations in low-SNR, highly sparse regimes
      • Attention: More expressive but incurs non-trivial computational overhead, reducing real-time throughput on edge devices
      • Our gated addition: Lightweight gate provides spatially adaptive re-weighting while keeping fusion operator simple and efficient
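The parameter side of this trade-off can be illustrated with a back-of-the-envelope count. The channel width of 64 is a hypothetical value, and the attention entry counts only the four query/key/value/output projections of a generic single-head block; neither is taken from the manuscript.

```python
def conv_params(c_in: int, c_out: int, k: int = 1, bias: bool = True) -> int:
    """Learnable parameters of a k x k convolution layer."""
    return c_in * c_out * k * k + (c_out if bias else 0)

def fusion_param_overhead(channels: int) -> dict:
    """Extra parameters needed to fuse two feature maps of `channels` channels each."""
    return {
        # Element-wise addition has no learnable parameters.
        "additive": 0,
        # Concatenation doubles the channels, so a 1x1 conv must mix 2C back to C.
        "concat + 1x1 mixing conv": conv_params(2 * channels, channels, k=1),
        # Generic single-head attention: Q, K, V and output projections (C -> C each).
        "attention (Q/K/V/O projections)": 4 * conv_params(channels, channels, k=1),
    }

overhead = fusion_param_overhead(64)  # hypothetical channel width
for name, count in overhead.items():
    print(f"{name}: {count} parameters")
```

Parameter count is only part of the cost: attention additionally computes pairwise interactions between BEV cells, which this sketch does not model.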

Our design:

We use intensity-based gating with additive fusion. The intensity gate provides soft confidence modulation (values in [0,1]) to emphasize high-reliability regions without heavy computation.

Not claiming universal superiority:

We emphasize that our goal is not to prove additive fusion is universally optimal, but rather that it provides a favorable accuracy-efficiency trade-off for sparse, noisy radar inputs under strict embedded deployment constraints.

Revision in manuscript:

Added a comprehensive paragraph after the fusion equation in Section 3.2 explaining:

  • Spatial alignment assumption
  • Residual evidence accumulation interpretation
  • Comparison with concatenation and attention (computational cost vs. expressiveness)
  • Why this design suits radar-specific challenges and deployment constraints

Comment 2.4: Training stability analysis

Reviewer comment:

Please add training stability analysis (loss curves, gradient variance, and clearly measured convergence speed) to support the proposed method.

Response:

Thank you for this important suggestion. We have conducted comprehensive multi-seed training experiments and added a new subsection in the revised manuscript.

Experimental setup:

We trained both baseline (RadarNeXt) and our method using identical settings across 5 random seeds ([42, 123, 2048, 2025, 2026]), tracking:

  • Training loss convergence
  • Validation mAP curves
  • Final performance metrics
  • Convergence speed (epochs to optimal performance)

Visual results:

We provide three visualization figures in Section 4.3:

  1. Training Loss Convergence: Shows our method converges faster (≈24 epochs) and reaches lower final loss (~0.4 vs. ~0.6), demonstrating more stable optimization
  2. Validation mAP Convergence: Shows consistent superiority across all 5 seeds with tighter confidence bands, indicating stable training dynamics
  3. Final Performance Comparison: Bar chart with error bars showing low variance across seeds (std ≤ 0.32 points), demonstrating strong reproducibility

Statistical summary (mean ± std across 5 seeds):

| Method | mAP | Car | Pedestrian | Cyclist | Truck | Convergence (epochs) |
|---|---|---|---|---|---|---|
| Baseline | 32.30 ± 0.16 | 26.24 ± 0.13 | 24.55 ± 0.18 | 59.78 ± 0.28 | 18.64 ± 0.19 | 32.0 ± 1.4 |
| Ours | 33.25 ± 0.19 | 27.13 ± 0.17 | 28.71 ± 0.21 | 58.33 ± 0.32 | 18.84 ± 0.17 | 24.0 ± 1.3 |

Key observations:

  1. Low variance: Standard deviation ≤ 0.32 points for all accuracy metrics, demonstrating stable training and robust architecture design
  2. Faster convergence: Our method converges ~8 epochs earlier (24 vs. 32 epochs), validating that adaptive multi-task balancing loss successfully addresses optimization conflicts
  3. Consistent improvements: Performance gains are reproducible across all 5 seeds, with particularly stable Pedestrian AP improvement (+4.16 ± 0.21)
  4. Training stability: Loss curves show smooth convergence without oscillations or plateau issues, indicating well-balanced gradient flow from our gradient-aware weight normalization
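A multi-seed summary of this kind can be computed as sketched below. The per-seed mAP values and the validation curve are hypothetical placeholders, since the record reports only aggregated statistics. The record also does not say how "epochs to optimal performance" was measured, so the tolerance-based criterion in `epochs_to_converge` and the use of the sample standard deviation are assumptions.

```python
from statistics import mean, stdev

def summarize(values):
    """Mean and sample standard deviation across seeds."""
    return mean(values), stdev(values)

def epochs_to_converge(val_map_curve, tolerance=0.1):
    """First epoch (1-indexed) whose validation mAP is within `tolerance` of the best."""
    best = max(val_map_curve)
    for epoch, value in enumerate(val_map_curve, start=1):
        if value >= best - tolerance:
            return epoch

# Hypothetical final mAP for five seeds; not the authors' per-seed results.
per_seed_map = [33.05, 33.15, 33.25, 33.35, 33.45]
m, s = summarize(per_seed_map)
print(f"mAP = {m:.2f} ± {s:.2f}")

# Hypothetical validation curve for one run.
curve = [10.0, 20.0, 30.0, 32.95, 33.0, 33.0]
print("converged at epoch", epochs_to_converge(curve))
```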

Physical insight:

The superior convergence speed and lower variance stem from our adaptive balancing loss, which dynamically adjusts task weights based on gradient magnitudes and uncertainty. This prevents regression tasks from dominating classification during early training, allowing the network to learn robust representations more efficiently.
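One widely used way to let uncertainty set task weights is to give each task a learnable log-variance `s_i` and minimize the sum of `exp(-s_i) * L_i + s_i`. The sketch below illustrates that general formulation only; the record does not give the authors' exact loss from Section 3.3, and the gradient-aware weight normalization is not modeled here. The loss values are hypothetical.

```python
import math

def uncertainty_weighted_loss(task_losses, log_vars):
    """Sum of exp(-s_i) * L_i + s_i over tasks, with s_i a learnable log-variance."""
    total = 0.0
    for loss, s in zip(task_losses, log_vars):
        total += math.exp(-s) * loss + s  # the + s term keeps s from growing without bound
    return total

# Hypothetical early-training losses: classification 0.5, box regression 4.0.
losses = [0.5, 4.0]

equal = uncertainty_weighted_loss(losses, [0.0, 0.0])              # both weights are 1
rebalanced = uncertainty_weighted_loss(losses, [0.0, math.log(4)])  # regression weight is 1/4

print(f"equal weights: {equal:.3f}")
print(f"regression down-weighted: {rebalanced:.3f}")
```

Raising the log-variance of the regression task shrinks its weight from 1 to 0.25, so the large regression loss no longer dominates the classification term.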

Revision in manuscript:

  • Added new subsection "Training Stability Analysis" in Section 4.3 including:
  • Figure showing three visualization plots (training loss, validation mAP, final performance)
  • Table presenting multi-seed statistical summary
  • Discussion of key observations highlighting convergence speed, low variance, and gradient balancing effectiveness

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

Please justify more comprehensively the differences between the proposed method and MAFF-Net, RadarNeXt, MUFASA, etc.

Please compare directly with other velocity-aware methods using ablation experiments or ablation study.

The fusion mechanism lacks theoretical justification. There is no explanation or analysis of why additive fusion is better than concatenation or attention.

Please add training stability analysis (loss curves, gradient variance, and clearly measured convergence speed) to support the proposed method.

Author Response

Thank you for the detailed feedback on improving manuscript completeness and clarity.

Comment 3.1: Reference corrections and completeness

Reviewer comment:

The paper appears to be incomplete. Several references are incorrectly numbered, out of order, or missing.

Response:

Thank you for this careful review. We have conducted a systematic audit of all references and made comprehensive corrections (same as detailed in Response to Reviewer 1, Comment 1.5):

Summary of corrections:

  • Updated 5 Early Access articles with complete DOI information
  • Resolved 1 duplicate reference (MUFASA arXiv vs. conference version)
  • Updated 7 arXiv preprints to formal conference publications
  • Standardized formatting for all references following MDPI guidelines
  • Verified all DOIs and publication details

Revision in manuscript:

Bibliography section has been comprehensively updated. All references are now complete, consistently ordered, and up-to-date as of January 2026.

Comment 3.2: Method descriptions and clarity

Reviewer comment:

The descriptions of the proposed methods and the presentation of the experimental results lack clarity and require substantial improvement.

Response:

Thank you for this feedback. We have made extensive revisions to improve clarity throughout the manuscript:

  1. Added explicit symbol table (Section 3):
    We now provide a dedicated paragraph defining all key variables with dimensions and meanings:
    • Motion features: dimensions, meaning, generation process
    • Density features: spatial representation, computation method
    • Intensity features: physical basis, implementation details
    • All mathematical operators and their purposes
  2. Enhanced module descriptions:
    For each major subsection, we now explicitly state:
    • Input: What data the module receives
    • Processing: What transformation is applied
    • Output: What representation is produced
    • Purpose: Why this module is needed (connection to overall goal)
  3. Improved navigability (Introduction):
    Added a paragraph explicitly linking each contribution to its supporting evidence:
    • Contribution 1 → Table 2 (ablation results)
    • Contribution 2 → Table 1 (comparison with state-of-the-art)
    • Contribution 3 → Section 4.3 (training stability analysis)
  4. Clarified experimental setup:
    • Added clear statements about training vs. inference configurations
    • Specified hardware platforms and measurement conditions
    • Defined all evaluation metrics and their interpretations

Revision in manuscript:

  • Section 3: Added symbol definitions and dimension specifications
  • All method subsections: Added Input-Process-Output-Purpose structure
  • Introduction: Added contribution-to-evidence linking paragraph
  • Experimental section: Enhanced setup and metric descriptions

Comment 3.3: Figure discussions

Reviewer comment:

Additionally, some figures are referenced but not adequately discussed in the text.

Response:

Thank you for pointing this out. We have systematically reviewed every figure reference and added interpretive text:

  • Figure 1 (System overview): Added discussion summarizing end-to-end design rationale: Vx/Vy decomposition → lightweight gated fusion → multi-stage training. Clarified that dashed box indicates training-only components.
  • Figure 2 (Platform setup): Added clarification on sensor configuration, coordinate systems, and multi-modal calibration procedure.
  • Figures 3-4 (LiDAR vs. Radar comparison): Added discussion emphasizing density gap: LiDAR ~30,000 points/frame vs. radar ~200-500 points/frame, motivating our sparsity-handling strategies.
  • Figure 5 (Visualization results): Added discussion on spatial consistency improvements and clutter suppression effectiveness, with specific examples from different scenarios.
  • Figure 6 (Class distribution): Added discussion linking class imbalance (long-tail distribution) to our gradient-aware loss design motivation, explaining how minority classes benefit from adaptive weighting.
  • Figure 7 (Training stability - NEW): Added comprehensive caption and in-text discussion of multi-seed training curves, convergence speed, and variance analysis.

Revision in manuscript:

  • All figure captions: Enhanced with more detailed descriptions
  • All figure references: Followed by interpretive paragraphs explaining key observations
  • New figure added: Multi-seed training analysis (Section 4.3)

Comment 3.4: Results presentation and takeaways

Reviewer comment:

The manuscript would benefit from more detailed explanations of the results.

Response:

Thank you for this suggestion. We have added "Key Takeaways" paragraphs after both main result tables:

After Table 1 (State-of-the-art comparison):

Key takeaways now summarize:

  • Largest improvement: Which metrics improved most (Pedestrian AP +4.16 points)
  • Why it matters: How this aligns with our motion-aware design for dynamic objects
  • Trade-offs: Parameter overhead (1.73M vs. 1.58M) and FPS (24.4 vs. 28.4)
  • Cost-effectiveness: 9.5% parameter increase for a 2.9% relative mAP improvement

After Table 2 (Ablation study):

Key takeaways now explain:

  • Individual contributions: MA-BEV provides largest single gain (+0.70 mAP)
  • Synergistic effects: Combined components achieve +0.95 mAP total
  • Class-specific impacts: Pedestrian benefits most from motion encoding (+4.16 AP)
  • Design validation: Each component addresses a specific challenge (sparsity, optimization conflict, domain gap)

Throughout experimental section:

Added interpretive paragraphs connecting quantitative results back to:

  • Design motivations stated in Introduction
  • Physical principles explained in Method
  • Deployment requirements (real-time, lightweight)

Revision in manuscript:

  • Section 4.2: Added detailed "Key Takeaways" after both main tables
  • Section 4.3: Added analysis paragraphs connecting ablation results to design choices
  • Throughout: Enhanced result-to-motivation linkage

Author Response File: Author Response.pdf

Reviewer 4 Report

Comments and Suggestions for Authors

In the manuscript under review, the authors propose a novel motion-aware framework, and verify it through theoretical description and experiments, to address key challenges in 4D millimeter-wave radar detection for autonomous driving, namely sparse point clouds and dynamic object characterization, in a setting that requires real-time perception of the surrounding environment through a multimodal sensor system. The authors' approach introduces three key innovations: 1) a Bird’s Eye View (BEV) fusion network incorporating velocity vector decomposition and dynamic gating mechanisms, which encodes motion patterns through independent XY-component convolutions; 2) a gradient-aware multi-task balancing scheme using learnable uncertainty parameters and dynamic weight normalization to resolve optimization conflicts between classification and regression tasks; 3) a two-phase progressive training strategy combining multi-frame pre-training with sparse single-frame refinement. The result is a motion-aware spatiotemporal joint feature-learning framework that achieves breakthrough improvements in millimeter-wave radar detection performance through a deep coupling mechanism between velocity fields and spatial features. The experimental comparison of the obtained parameters shows the effectiveness of the proposed approach.

In my opinion, the topic is original and relevant to the field. The manuscript is rather well written, with a clear application of realistic scenarios for system construction and performance analysis, a sound description of the essence, correctness, and advantages of the proposed approach, and good presentation quality and readability. I particularly note Section 2 (Related Work), which describes in detail the development of LiDAR and radar detection. Also, the conclusions are consistent with the evidence and arguments presented. However, while reading it, I had a number of minor comments related to improving readability and eliminating typos. For example:

1) BEV. The abbreviation is introduced in the title and is first expanded on page 7 (line 187)

2) In my opinion, the manuscript's readability would be improved if additional references were included in the Introduction section when describing the current level of development of millimeter-wave radars.

3) Lines 148, 156, 160. Typos: a question mark has been inserted instead of the reference number.

4) There are no references to Figs. 1-4 in the text.

5) In Fig. 2, the caption is almost invisible.

6) It is generally accepted that references are numbered in ascending order. However, in the manuscript, they begin with [43, 44] (line 32) and proceed out of order thereafter.

7) The abbreviation “AP” is not deciphered either in the text or in the tables.

8) In my opinion, during experimental confirmation (Section 4), it is worth indicating the specific frequency at which the measurements were taken.

Therefore, I recommend accepting this manuscript after minor revision.

Author Response

Thank you for the positive assessment of our manuscript and for the helpful minor comments that improve readability and presentation quality. We have carefully revised the manuscript accordingly.

Comment 4.1: BEV abbreviation introduced too late

Reviewer comment: The abbreviation BEV is introduced in the title but is first expanded later in the manuscript.

Response: We agree. In the revised manuscript, we expand BEV as Bird’s Eye View at its first appearance in the main text (early in the Introduction) and ensure consistent usage thereafter. We also checked other abbreviations to avoid late or missing definitions.

Comment 4.2: Add more references in the Introduction on mmWave radar development

Reviewer comment: The Introduction would benefit from additional references when describing the current development level of millimeter-wave radars.

Response: We agree and have added additional citations in the Introduction to better contextualize the evolution of automotive mmWave/4D imaging radar and radar-based 3D detection. The new references cover both foundational radar perception benchmarks and recent radar-only detection frameworks, improving the completeness of the background discussion.

Comment 4.3: Typos where '?' appears instead of reference numbers

Reviewer comment: In several places, a question mark appears instead of the reference number.

Response: Thank you for spotting this. We have corrected these citation/formatting artifacts and verified that all in-text citations are properly resolved and displayed as numbered references in the compiled manuscript.

Comment 4.4: No references to Figs. 1–4 in the text

Reviewer comment: There are no references to Figures 1–4 in the main text.

Response: We agree. We have added explicit in-text references to Figs. 1–4 at the appropriate locations and added short explanatory sentences to connect each figure with the corresponding discussion (e.g., system overview, platform setup, and data characteristics).

Comment 4.5: Fig. 2 caption is almost invisible

Reviewer comment: In Fig. 2, the caption is almost invisible.

Response: We have revised the figure/caption styling to improve readability (caption font size/contrast and overall layout). The updated Fig. 2 caption is now clearly visible in the revised PDF.

Comment 4.6: References should be numbered in ascending order

Reviewer comment: References are generally numbered in ascending order, but the manuscript begins with [43, 44] and then proceeds out of order.

Response: Thank you for this important formatting correction. We have re-audited the bibliography and ensured that references are numbered in ascending order according to first appearance in the text, consistent with MDPI style requirements.

Comment 4.7: Abbreviation “AP” is not deciphered

Reviewer comment: The abbreviation “AP” is not explained in the text or the tables.

Response: We agree. We now explicitly define AP as Average Precision (and mAP as mean Average Precision) when first introduced, and we also clarified the metric definition in the evaluation/experimental section and table captions to ensure the notation is self-contained.

Comment 4.8: Specify the measurement frequency in experiments

Reviewer comment: In the experimental section, please indicate the specific frequency at which the measurements were taken.

Response: Thank you for this suggestion. We have clarified the measurement conditions in the experimental setup by explicitly stating the radar operating band/frequency information reported by the data acquisition platform/dataset description.

We have made substantial revisions across all sections to address reviewer concerns:

Completeness

  • Fixed all reference issues (5 Early Access updates, 7 arXiv→conference updates, 1 duplicate removed)
  • Added missing implementation details
  • Completed all figure discussions

Clarity

  • Added explicit symbol table with dimensions
  • Enhanced module descriptions with Input-Process-Output-Purpose structure
  • Added "Key Takeaways" paragraphs for all main results

Technical Depth

  • Added comprehensive comparisons with MAFF-Net, RadarNeXt, MUFASA
  • Added direct velocity encoding ablation study
  • Added theoretical justification for fusion design
  • Added complete training stability analysis with multi-seed experiments

Experimental Validation

  • New Section 4.3: Training Stability Analysis
  • New ablation: Vxy concatenation vs. decomposition comparison
  • New visualizations: Training curves across 5 random seeds
  • Enhanced statistical reporting: Mean ± std for all metrics
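
Reporting mean ± std over repeated runs can be sketched as follows. This is a minimal illustration using Python's standard library, assuming the sample standard deviation (n − 1 denominator); the function name `summarize` and the five mAP values are hypothetical placeholders, not results from the manuscript.

```python
from statistics import mean, stdev

def summarize(values: list[float]) -> str:
    """Format per-seed scores as 'mean ± sample std' with two decimals."""
    return f"{mean(values):.2f} ± {stdev(values):.2f}"

# Hypothetical mAP values from five seeds (placeholders, not paper results).
seed_map = [33.0, 33.5, 33.2, 33.4, 32.9]
print(summarize(seed_map))
```

Whichever convention is used (sample or population standard deviation), it is worth stating alongside the tables so that the reported spread can be reproduced.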

We believe these revisions have significantly improved the manuscript's completeness, clarity, and technical rigor. We are grateful for the reviewers' constructive feedback which has strengthened our work substantially.

 

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

No further comments.

Reviewer 2 Report

Comments and Suggestions for Authors

The paper has been improved by the authors.

Reviewer 3 Report

Comments and Suggestions for Authors

The article can be accepted. The authors have revised the manuscript based on my previous suggestions.
