Article
Peer-Review Record

ADSAP: An Adaptive Speed-Aware Trajectory Prediction Framework with Adversarial Knowledge Transfer

Electronics 2025, 14(12), 2448; https://doi.org/10.3390/electronics14122448
by Cheng Da, Yongsheng Qian *, Junwei Zeng, Xuting Wei and Futao Zhang
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3:
Reviewer 4: Anonymous
Submission received: 17 May 2025 / Revised: 11 June 2025 / Accepted: 14 June 2025 / Published: 16 June 2025
(This article belongs to the Special Issue Advances in AI Engineering: Exploring Machine Learning Applications)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

1. The student model’s average inference time is quoted as 15 ms but the CPU/GPU type, batch size, and tensor precision are omitted, making it impossible to judge real‑time viability. 

2. Figure 2 could be more clear.

3. The abstract asserts consistent performance across weather conditions and densities ranging 0.05–0.85 veh m⁻¹, yet no dedicated cross‑weather experiments or stratified density plots are shown. 

4. Section 3.1.2 describes paired t-tests with Bonferroni correction, but the results tables omit p‑values and confidence intervals. 

5. Include more qualitative figures highlighting where ADSAP succeeds.

6. Although NGSIM is pre‑processed in detail, INTERACTION and highD appear only in passing. Either conduct full‑scale experiments on all three datasets or justify why NGSIM alone suffices for the claimed “diverse driving scenarios.”

Author Response

Comments 1. The student model’s average inference time is quoted as 15 ms but the CPU/GPU type, batch size, and tensor precision are omitted, making it impossible to judge real‑time viability. 

Response1: Thank you for pointing out the missing hardware specifications. We have added comprehensive technical details in Section 4.1. The student model achieves 15ms average inference time on NVIDIA RTX 3080 GPU (batch size=32, FP16 precision) and 45ms on Intel i7-10700K CPU. For real-time viability assessment, our model meets the typical 50ms latency requirement for autonomous driving applications. We have also included statistical significance testing with p-values (<0.001) and 95% confidence intervals in all result tables, confirming the reliability of our performance claims.
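For readers reproducing such latency figures, per-batch inference time is typically averaged over many runs after a warm-up phase. The sketch below is a generic, framework-agnostic illustration; `measure_latency`, the stand-in model, and the batch contents are placeholders, not the authors' code:

```python
import time

def measure_latency(infer_fn, batch, n_warmup=10, n_runs=100):
    """Average per-batch inference latency in milliseconds.

    infer_fn: any callable taking one batch (a stand-in for the model).
    Warm-up iterations are excluded so one-time setup costs
    (allocation, JIT compilation, cache warming) do not skew the average.
    """
    for _ in range(n_warmup):
        infer_fn(batch)
    start = time.perf_counter()
    for _ in range(n_runs):
        infer_fn(batch)
    elapsed = time.perf_counter() - start
    return elapsed / n_runs * 1000.0

# Stand-in "model": sums a batch of 32 feature vectors.
dummy_batch = [[float(i)] * 8 for i in range(32)]
latency_ms = measure_latency(lambda b: [sum(v) for v in b], dummy_batch)
```

On a GPU, asynchronous kernel launches mean a device synchronization call must precede each timestamp, and mixed-precision (e.g., FP16) runs should be timed after autocast warm-up; the same warm-up-then-average structure applies.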

 

Comments 2. Figure 2 could be more clear.

Response2: We have significantly improved Figure 2 based on your feedback. The revised figure now includes: (1) clearer component labels with enhanced typography, (2) explicit data flow arrows showing information propagation between modules, (3) detailed legend explaining all symbols and color codes, (4) improved visual hierarchy distinguishing teacher and student architectures, and (5) dimensional annotations for key feature maps. The new figure substantially enhances understanding of ADSAP's architecture and the interaction between ADSP and AKDM modules.

 

Comments 3. The abstract asserts consistent performance across weather conditions and densities ranging 0.05–0.85 veh m⁻¹, yet no dedicated cross‑weather experiments or stratified density plots are shown. 

Response3: We have conducted comprehensive cross-weather experiments and added detailed results in Section 4.2.2. Table 3 presents performance across clear (ADE: 1.65±0.07m), rainy (ADE: 1.78±0.09m), and foggy (ADE: 1.91±0.10m) conditions. Additionally, we provide stratified density analysis showing consistent performance across density ranges: 0.05-0.25 veh/m (ADE: 1.58±0.08m), 0.25-0.55 veh/m (ADE: 1.65±0.07m), and 0.55-0.85 veh/m (ADE: 1.75±0.09m). These results validate ADSAP's robustness across diverse environmental conditions.

 

Comments 4. Section 3.1.2 describes paired t-tests with Bonferroni correction, but the results tables omit p‑values and confidence intervals. 

Response4: We have comprehensively addressed the statistical reporting gap. All result tables now include p-values from paired t-tests with Bonferroni correction and 95% confidence intervals computed via bootstrap resampling (1000 iterations). For example, ADSAP vs. iNATran comparison shows p<0.001 with confidence interval [1.58, 1.72] for ADE. The statistical analysis confirms significant improvements over all baseline methods.
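The statistical procedure described — a percentile bootstrap confidence interval over paired per-sample differences, with a Bonferroni-adjusted significance level — can be sketched in a few lines. The error differences below are illustrative values, not figures from the paper:

```python
import random
import statistics

def bootstrap_ci(diffs, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of paired differences."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_boot):
        # Resample the paired differences with replacement.
        sample = [rng.choice(diffs) for _ in diffs]
        means.append(statistics.fmean(sample))
    means.sort()
    lo = means[int(n_boot * alpha / 2)]
    hi = means[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi

def bonferroni_alpha(alpha, n_comparisons):
    """Bonferroni correction: per-test significance level."""
    return alpha / n_comparisons

# Illustrative paired ADE differences (baseline minus ADSAP), in metres.
diffs = [0.12, 0.09, 0.15, 0.11, 0.08, 0.14, 0.10, 0.13, 0.07, 0.12]
lo, hi = bootstrap_ci(diffs, n_boot=1000)
```

With 1000 resamples this matches the iteration count quoted in the response; the Bonferroni step simply divides the significance level by the number of baseline comparisons before declaring a paired t-test result significant.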

 

Comments 5. Include more qualitative figures highlighting where ADSAP succeeds.

Response5: We have added Section 4.2.4 with extensive qualitative analysis including Figure 5 showing trajectory visualizations across challenging scenarios: lane changing (23% error reduction), merging situations (18% improvement), and emergency braking (21% better performance). The visualizations clearly demonstrate ADSAP's superior prediction accuracy in complex interactions, with ground truth in red, ADSAP predictions in blue, and baseline predictions in green.

 

Comments 6. Although NGSIM is pre‑processed in detail, INTERACTION and highD appear only in passing. Either conduct full‑scale experiments on all three datasets or justify why NGSIM alone suffices for the claimed “diverse driving scenarios.”

Response6: We have conducted full-scale experiments on all three datasets. Table 4 presents comprehensive results: NGSIM (ADE: 1.65±0.07m, FDE: 3.25±0.14m), INTERACTION (ADE: 1.89±0.09m, FDE: 3.78±0.18m), and highD (ADE: 1.71±0.08m, FDE: 3.42±0.15m). The consistent superior performance across all datasets validates ADSAP's generalization capability and justifies our "diverse driving scenarios" claim through empirical evidence rather than relying solely on NGSIM data.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The manuscript presents a novel framework, ADSAP, for trajectory prediction in autonomous driving, combining adaptive speed-aware pooling and adversarial knowledge distillation. While the work is technically sound and addresses important challenges, several sections require clarification, methodological rigor, and better organization. Below are detailed critiques and suggestions for improvement.

Title and Abstract

The abstract is overly dense with technical terms, which may hinder readability for a broader audience. Simplify the abstract by breaking it into shorter sentences.

Moreover, claims of "robust generalization" are not quantified (e.g., no metrics for cross-dataset validation). Add specific metrics for generalization (e.g., performance on unseen datasets like INTERACTION).

Introduction

The literature review lacks critical analysis of recent works (e.g., how ADSAP compares to Transformer-based methods like [48]). Please add a table comparing ADSAP’s components (ADSP, AKDM) with prior art.

Moreover, the "gap" in knowledge is not explicitly contrasted with prior work (e.g., how ADSP differs from deformable convolutions [23]). Please clearly state the limitations of GANs/VAEs in trajectory prediction (e.g., mode collapse, computational cost).

Materials and Methods

Regarding ADSP Mechanism, the speed map (Equation 2) lacks justification for the Gaussian term. Why not use a learned attention mechanism? Ablate the Gaussian term in Equation 2 to show its necessity.

Regarding AKDM, the adversarial loss (Equation 6) is standard GAN loss. How does it improve over traditional distillation (e.g., KL divergence alone)? Compare AKDM with vanilla distillation (e.g., performance vs. inference time).

Regarding Multi-Scale Feature Aggregation, the fusion weights (Equation 17) are heuristic. Why not use cross-attention?

In addition, GRU-SE and GATv2 are described but not motivated (e.g., why not use a Transformer?). Justify architectural choices (e.g., GRU-SE over LSTM or Transformer).

Experiments

Regarding the datasets, only NGSIM is used for main results. INTERACTION is mentioned but not evaluated.

Regarding the metrics, TSE is novel but not compared to other temporal metrics (e.g., jerk).

Regarding results, ADSAP outperforms baselines, but the margin over [48] (6.3%) is modest. Is this statistically significant? Report p-values for significance testing.

Finally, missing critical experiments (e.g., ADSP vs. deformable convs, AKDM vs. TA [26]). Add ablation on knowledge distillation variants.

Conclusions

Future work is generic (e.g., "multi-modal context"). No mention of limitations (e.g., reliance on NGSIM’s highway data). Please, discuss limitations (e.g., performance in urban scenarios). Also, propose specific extensions (e.g., integrating HD maps).

Author Response

Comments 1: Title and Abstract

The abstract is overly dense with technical terms, which may hinder readability for a broader audience. Simplify the abstract by breaking it into shorter sentences.

Moreover, claims of "robust generalization" are not quantified (e.g., no metrics for cross-dataset validation). Add specific metrics for generalization (e.g., performance on unseen datasets like INTERACTION).

Response1: We have substantially simplified the abstract while maintaining technical precision. The revised abstract uses shorter, more accessible sentences and quantifies generalization claims with specific cross-dataset metrics: INTERACTION dataset ADE of 1.89m and FDE of 3.78m. We've restructured the abstract into three clear sections: problem statement, methodology, and results, improving readability for broader audiences while preserving essential technical information.

 

Comments 2: Introduction

The literature review lacks critical analysis of recent works (e.g., how ADSAP compares to Transformer-based methods like [48]). Please add a table comparing ADSAP’s components (ADSP, AKDM) with prior art.

Moreover, the "gap" in knowledge is not explicitly contrasted with prior work (e.g., how ADSP differs from deformable convolutions [23]). Please clearly state the limitations of GANs/VAEs in trajectory prediction (e.g., mode collapse, computational cost).

Response2: We have enhanced the introduction with Table 1 providing detailed component comparison between ADSAP and existing methods, clearly showing our unique combination of speed-aware pooling, adaptive attention, and adversarial knowledge distillation. We've explicitly contrasted ADSP with deformable convolutions, highlighting that ADSP incorporates velocity-dependent dynamic adjustment rather than purely spatial deformations. Additionally, we've clearly stated GAN/VAE limitations in trajectory prediction: mode collapse leading to limited trajectory diversity and computational costs hindering real-time deployment.

 

Comments 3: Materials and Methods

Regarding ADSP Mechanism, the speed map (Equation 2) lacks justification for the Gaussian term. Why not use a learned attention mechanism? Ablate the Gaussian term in Equation 2 to show its necessity.

Regarding AKDM, the adversarial loss (Equation 6) is standard GAN loss. How does it improve over traditional distillation (e.g., KL divergence alone)? Compare AKDM with vanilla distillation (e.g., performance vs. inference time).

Regarding Multi-Scale Feature Aggregation, the fusion weights (Equation 17) are heuristic. Why not use cross-attention?

In addition, GRU-SE and GATv2 are described but not motivated (e.g., why not use a Transformer?). Justify architectural choices (e.g., GRU-SE over LSTM or Transformer).

Response3: We have provided theoretical justification for the Gaussian term in Equation 2, showing it provides spatial locality bias essential for realistic interaction modeling. Ablation experiments demonstrate removing this term causes 4.2% ADE degradation. We've added comprehensive AKDM vs. traditional distillation comparison (Table 5) showing 8.5% ADE improvement over Hinton's method. Regarding architectural choices, we justify GRU-SE over Transformer due to computational efficiency requirements in autonomous driving, achieving 60% parameter reduction while maintaining comparable performance. Cross-attention for fusion weights was tested but showed only marginal improvement (1.2%) at 40% higher computational cost.

 

Comments 4: Experiments

Regarding the datasets, only NGSIM is used for main results. INTERACTION is mentioned but not evaluated.

Regarding the metrics, TSE is novel but not compared to other temporal metrics (e.g., jerk).

Regarding results, ADSAP outperforms baselines, but the margin over [48] (6.3%) is modest. Is this statistically significant? Report p-values for significance testing.

Finally, missing critical experiments (e.g., ADSP vs. deformable convs, AKDM vs. TA [26]). Add ablation on knowledge distillation variants.

Response4: We have expanded experiments to include INTERACTION dataset evaluation with ADE: 1.89m and FDE: 3.78m. We've introduced jerk metric for temporal consistency (ADSAP: 2.45 m/s³ vs. best baseline: 3.18 m/s³). Statistical significance is confirmed with p-values (<0.001) for all comparisons. We've added critical ablation experiments comparing ADSP vs. deformable convolutions (12% better performance) and AKDM vs. traditional distillation variants (8.5% improvement), demonstrating the necessity of our novel components.
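As a point of reference for the jerk figures quoted, jerk is the third time derivative of position and can be estimated from sampled trajectory points with third-order finite differences. A minimal sketch; the sampling interval and trajectory are illustrative:

```python
def mean_abs_jerk(xs, dt):
    """Mean absolute jerk (m/s^3) from positions sampled every dt seconds.

    Jerk, the third time derivative of position, is approximated
    here with third-order forward differences.
    """
    jerks = [
        (xs[i + 3] - 3 * xs[i + 2] + 3 * xs[i + 1] - xs[i]) / dt ** 3
        for i in range(len(xs) - 3)
    ]
    return sum(abs(j) for j in jerks) / len(jerks)

# Sanity check: x(t) = t^3 has constant jerk d^3x/dt^3 = 6 m/s^3,
# and the third difference is exact for a cubic.
dt = 0.1
xs = [(i * dt) ** 3 for i in range(20)]
jerk = mean_abs_jerk(xs, dt)
```

Lower mean absolute jerk over predicted trajectories indicates smoother, more physically plausible motion, which is what the quoted 2.45 vs. 3.18 m/s³ comparison measures.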

 

Comments 5: Conclusions

Future work is generic (e.g., "multi-modal context"). No mention of limitations (e.g., reliance on NGSIM’s highway data). Please, discuss limitations (e.g., performance in urban scenarios). Also, propose specific extensions (e.g., integrating HD maps).

Response5: We have extensively discussed limitations in the conclusion: ADSAP's current focus on highway scenarios may limit urban performance, and dependency on NGSIM's driving patterns requires geographic adaptation. We propose specific future extensions: HD map integration for semantic understanding, multi-modal context incorporation (weather, traffic lights), and cross-domain adaptation techniques for geographic generalization. These concrete directions address the generic future work criticism.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

This paper proposes ADSAP, a trajectory prediction model that combines an adaptive speed-aware pooling mechanism with adversarial knowledge distillation (AKDM). The authors conduct extensive experiments using real-world datasets such as NGSIM, demonstrating improvements in prediction accuracy and computational efficiency. However, the theoretical novelty of the proposed method remains unclear throughout the manuscript, and the overall impression is that the approach is primarily a combination of existing techniques. Furthermore, given the model’s complexity, its interpretability and practical implications for real-world deployment are insufficiently discussed. The following comments highlight the main concerns:

  1. The proposed techniques, such as ADSP and AKDM, largely appear to be a straightforward integration of existing methods including Deformable Convolution, Knowledge Distillation, GANs, Squeeze-and-Excitation blocks, and Shift-Window Attention. The authors should provide a clear theoretical justification for why this specific combination is fundamentally advantageous, and explain why such performance could not be achieved with conventional methods.

  2. The superiority of AKDM over existing distillation methods (e.g., Hinton’s original Knowledge Distillation) is not demonstrated. While the ablation study includes a comparison with and without AKDM, a fair benchmark against standard distillation techniques is necessary to validate its claimed advantages.

  3. The motivation and concrete effects of introducing ADSP remain insufficiently explained. While the idea of pooling deformations based on velocity is theoretically interesting, the paper should describe more specifically how speed contributes to the pooling mechanism in practice. Visual examples such as attention maps would help in clarifying this point.

  4. Although the paper mentions a 3.2× improvement in inference speed, it lacks details on model size (e.g., number of parameters and memory usage), training time, and hardware requirements. For practical deployment scenarios, it is important to include discussions on power consumption and the feasibility of on-board implementation.

Author Response

Comments 1. The proposed techniques, such as ADSP and AKDM, largely appear to be a straightforward integration of existing methods including Deformable Convolution, Knowledge Distillation, GANs, Squeeze-and-Excitation blocks, and Shift-Window Attention. The authors should provide a clear theoretical justification for why this specific combination is fundamentally advantageous, and explain why such performance could not be achieved with conventional methods.

Response1: We have added comprehensive theoretical analysis in Section 2.4 explaining why ADSAP's component combination is fundamentally advantageous. Mathematical derivations show speed-aware pooling provides tighter error bounds: E[|Ŷ - Y|²] ≤ C₁·σᵥ² + C₂·εₛₚₐₜᵢₐₗ, which is superior to traditional pooling methods. The adversarial training minimizes Wasserstein distance between teacher-student distributions, providing theoretical guarantees for knowledge transfer quality. Ablation studies support these theoretical advantages with empirical evidence.

 

Comments 2. The superiority of AKDM over existing distillation methods (e.g., Hinton’s original Knowledge Distillation) is not demonstrated. While the ablation study includes a comparison with and without AKDM, a fair benchmark against standard distillation techniques is necessary to validate its claimed advantages.

Response2: We have conducted direct comparison between AKDM and Hinton's original knowledge distillation in Table 5, demonstrating 8.5% ADE improvement and 7.2% FDE enhancement. The adversarial mechanism enhances robustness by learning distribution-level features rather than point estimates, leading to better generalization under domain shift. Our analysis shows AKDM maintains performance degradation within 2% under distribution shift compared to 8% for traditional distillation.
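For context, the "traditional distillation" baseline being compared against — Hinton-style knowledge distillation — minimizes the KL divergence between temperature-softened teacher and student outputs. A generic sketch of that baseline loss, not the paper's AKDM implementation:

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over logits softened by a temperature."""
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(teacher_logits, student_logits, temperature=4.0):
    """Hinton-style distillation loss: KL(teacher || student) on
    temperature-softened distributions, scaled by T^2 so gradient
    magnitudes stay comparable across temperatures."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return temperature ** 2 * kl

# A student matching the teacher exactly incurs zero loss;
# any mismatch yields a positive penalty.
zero = kd_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0])
mismatch = kd_loss([2.0, 0.5, -1.0], [0.0, 0.0, 0.0])
```

The adversarial variant described in the response replaces (or augments) this pointwise divergence with a discriminator that judges whole feature distributions, which is what the claimed robustness under distribution shift rests on.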

 

Comments 3. The motivation and concrete effects of introducing ADSP remain insufficiently explained. While the idea of pooling deformations based on velocity is theoretically interesting, the paper should describe more specifically how speed contributes to the pooling mechanism in practice. Visual examples such as attention maps would help in clarifying this point.

Response3: We have extensively clarified ADSP's speed contribution mechanism in Section 2.1 with detailed mathematical formulations showing how velocity influences sampling weights and offset computations. Figure 3 provides visual evidence through attention heatmaps at different speeds: low speed (5 m/s), medium speed (15 m/s), and high speed (25 m/s), clearly demonstrating adaptive focus based on velocity patterns. The speed-dependent weighting function enables dynamic attention allocation based on interaction urgency.
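The speed-dependent weighting described can be illustrated with a toy Gaussian kernel whose bandwidth widens with ego speed, so that faster vehicles spread attention over longer distances. The functional form and constants below are illustrative assumptions, not the paper's Equation 2:

```python
import math

def interaction_weight(distance, speed, sigma0=5.0, k=0.8):
    """Illustrative speed-aware spatial weight.

    A Gaussian kernel over inter-vehicle distance whose bandwidth
    sigma = sigma0 + k * speed grows with speed, so a faster ego
    vehicle assigns more weight to neighbours farther away.
    """
    sigma = sigma0 + k * speed
    return math.exp(-distance ** 2 / (2.0 * sigma ** 2))

# At the same 30 m gap, a fast ego vehicle (25 m/s) weights the
# neighbour more heavily than a slow one (5 m/s).
w_slow = interaction_weight(30.0, 5.0)
w_fast = interaction_weight(30.0, 25.0)
```

This is the qualitative behaviour the response's attention heatmaps at 5, 15, and 25 m/s are meant to demonstrate: attention mass shifts outward as interaction horizons lengthen with speed.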

 

Comments 4. Although the paper mentions a 3.2× improvement in inference speed, it lacks details on model size (e.g., number of parameters and memory usage), training time, and hardware requirements. For practical deployment scenarios, it is important to include discussions on power consumption and the feasibility of on-board implementation.

Response4: We have added comprehensive computational analysis in Section 4.1.3 including detailed specifications: teacher model (8.2M parameters, 245MB memory), student model (2.3M parameters, 89MB memory), representing 72% parameter and 64% memory reduction. Training time analysis shows 24 hours for teacher model and 8 hours for student model on RTX 3080. Power consumption analysis indicates student model requires 35W vs. 95W for teacher model, making on-board deployment feasible with current automotive computing platforms.

Author Response File: Author Response.pdf

Reviewer 4 Report

Comments and Suggestions for Authors

A promising approach with substantial empirical support and an understanding of the field's existing limits is presented in this study. However, to meet the standards set by prestigious journals or conferences, clarity and conciseness—particularly in the abstract and early sections—need to be improved.

1- The abstract's lengthy, technical language makes it harder to understand, since it tries to accomplish too much at once.

2- To support robustness claims, include or validate on other datasets (such as Argoverse, INTERACTION, and Lyft Level 5).

3- To gauge the impact of changing or eliminating important elements from the ADSAP model, incorporate ablation experiments.

4- Evaluate runtime performance against baseline models, or at the very least, discuss the computational feasibility.

5- Performance metrics should be accompanied by a thorough table of p-values or confidence ranges.

6- Make sure measurements match the classification of ADSAP as a unimodal or multimodal predictor.

7- Use trajectory visualizations to illustrate the advantages and disadvantages of ADSAP in a qualitative manner.

8- Discuss the potential for integrating scene context or agent interaction modeling in future work.

Author Response

Comments 1- The abstract's lengthy, technical language makes it harder to understand, since it tries to accomplish too much at once.

Response1: We have completely restructured the abstract to improve clarity and conciseness. The revised version uses shorter sentences, eliminates excessive technical jargon, and organizes information in three clear sections: problem definition, methodology overview, and key results. This improves accessibility for broader audiences while maintaining essential technical content.

 

Comments 2- To support robustness claims, include or validate on other datasets (such as Argoverse, INTERACTION, and Lyft Level 5).

Response2: We have validated ADSAP on multiple datasets beyond NGSIM. Comprehensive evaluation includes INTERACTION (ADE: 1.89±0.09m, FDE: 3.78±0.18m), highD (ADE: 1.71±0.08m, FDE: 3.42±0.15m), and Argoverse preliminary results (ADE: 2.12±0.11m). Cross-dataset validation demonstrates consistent superior performance, supporting our robustness claims with quantitative evidence across diverse driving scenarios and geographic regions.

 

Comments 3- To gauge the impact of changing or eliminating important elements from the ADSAP model, incorporate ablation experiments.

Response3: We have incorporated comprehensive ablation experiments examining each ADSAP component's impact. Table 6 shows removing ADSP increases ADE by 8.5%, removing AKDM increases ADE by 6.7%, and removing multi-scale aggregation increases ADE by 5.4%. These systematic experiments quantify each component's contribution and validate the necessity of the complete framework architecture.

 

Comments 4- Evaluate runtime performance against baseline models, or at the very least, discuss the computational feasibility.

Response4: We have added detailed runtime performance analysis comparing ADSAP against baseline models. Our student model achieves 15ms inference time compared to iNATran (32ms), MATF-GAN (45ms), and Social-GAN (38ms), representing 47-67% speedup. Computational viability analysis confirms real-time deployment feasibility with standard automotive hardware configurations.

 

Comments 5- Performance metrics should be accompanied by a thorough table of p-values or confidence ranges.

Response5: All performance metrics now include comprehensive statistical analysis with p-values from paired t-tests with Bonferroni correction and 95% confidence intervals computed via bootstrap resampling. For example, ADSAP vs. best baseline shows p<0.001 with CI [1.58, 1.72] for ADE, confirming statistical significance of all reported improvements.

 

Comments 6- Make sure measurements match the classification of ADSAP as a unimodal or multimodal predictor.

Response6: We have clarified ADSAP as a multimodal predictor generating K=20 diverse trajectory hypotheses. Evaluation includes both unimodal metrics (ADE, FDE) and multimodal metrics (minADE₂₀, minFDE₂₀) showing superior performance in both categories. The multimodal capability provides comprehensive uncertainty quantification essential for autonomous driving safety.
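For reference, the multimodal metrics mentioned take the best of K hypotheses: minADE_K averages pointwise displacement for the closest hypothesis, while minFDE_K scores only its endpoint. A generic sketch with illustrative trajectories, not data from the paper:

```python
import math

def ade(pred, gt):
    """Average displacement error over all timesteps."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)

def fde(pred, gt):
    """Final displacement error at the last timestep."""
    return math.dist(pred[-1], gt[-1])

def min_ade_fde(hypotheses, gt):
    """minADE_K / minFDE_K: error of the best of K hypotheses."""
    return min(ade(h, gt) for h in hypotheses), min(fde(h, gt) for h in hypotheses)

gt = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
hypotheses = [
    [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)],  # offset 1 m laterally throughout
    [(0.0, 0.0), (1.0, 0.0), (2.0, 0.5)],  # accurate except the final step
]
min_a, min_f = min_ade_fde(hypotheses, gt)
```

With K = 20 hypotheses, as in the response, a multimodal predictor is rewarded when any one of its modes captures the true manoeuvre, while unimodal ADE/FDE score a single trajectory.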

 

Comments 7- Use trajectory visualizations to illustrate the advantages and disadvantages of ADSAP in a qualitative manner.

Response7: We have added extensive trajectory visualizations in Section 4.2.4 illustrating ADSAP's advantages and limitations. Figure 5 shows successful predictions in lane changes, merging, and emergency braking scenarios with quantified improvements (18-23% error reduction). We also discuss failure cases in complex intersections where multi-agent interactions exceed current model capacity.

 

Comments 8- Discuss the potential for integrating scene context or agent interaction modeling in future work.

Response8: We have expanded the conclusion with specific future work proposals for scene context integration and agent interaction modeling. Planned extensions include HD map semantic understanding, traffic light state incorporation, multi-agent coordination mechanisms, and hierarchical interaction modeling for dense urban scenarios. These concrete directions address autonomous driving's evolving requirements.

Author Response File: Author Response.pdf

Round 2

Reviewer 2 Report

Comments and Suggestions for Authors

The authors have considered the previous reviews; for my part, I suggest acceptance of the article.

Reviewer 3 Report

Comments and Suggestions for Authors

Thank you for your revised manuscript. The points I had suggested were clarified and improved.

Reviewer 4 Report

Comments and Suggestions for Authors

The authors have addressed all the required comments.
