ADSAP: An Adaptive Speed-Aware Trajectory Prediction Framework with Adversarial Knowledge Transfer

Da, Cheng; Qian, Yongsheng; Zeng, Junwei; Wei, Xuting; Zhang, Futao

doi:10.3390/electronics14122448

by Cheng Da, Yongsheng Qian *, Junwei Zeng, Xuting Wei and Futao Zhang

School of Traffic and Transportation, Lanzhou Jiaotong University, Lanzhou 730030, China

* Author to whom correspondence should be addressed.

Electronics 2025, 14(12), 2448; https://doi.org/10.3390/electronics14122448

Submission received: 17 May 2025 / Revised: 11 June 2025 / Accepted: 14 June 2025 / Published: 16 June 2025

(This article belongs to the Special Issue Advances in AI Engineering: Exploring Machine Learning Applications)

Abstract

Accurate trajectory prediction of surrounding vehicles is a fundamental challenge in autonomous driving, requiring sophisticated modeling of complex vehicle interactions, traffic dynamics, and contextual dependencies. This paper introduces Adaptive Speed-Aware Prediction (ADSAP), a novel trajectory prediction framework that advances the state of the art through innovative mechanisms for adaptive attention modulation and knowledge transfer. At its core, ADSAP employs an adaptive deformable speed-aware pooling mechanism that dynamically adjusts the model’s attention distribution and receptive field based on instantaneous vehicle states and interaction patterns. This adaptive architecture enables fine-grained modeling of diverse traffic scenarios, from sparse highway conditions to dense urban environments. The framework incorporates a sophisticated speed-aware multi-scale feature aggregation module that systematically combines spatial and temporal information across multiple scales, facilitating comprehensive scene understanding and robust trajectory prediction. To bridge the gap between model complexity and computational efficiency, we propose an adversarial knowledge distillation approach that effectively transfers learned representations and decision-making strategies from a high-capacity teacher model to a lightweight student model. This novel distillation mechanism preserves prediction accuracy while significantly reducing computational overhead, making the framework suitable for real-world deployment. Extensive empirical evaluation on the large-scale NGSIM and highD naturalistic driving datasets demonstrates ADSAP’s superior performance. The ADSAP framework achieves an 18.7% reduction in average displacement error and a 22.4% improvement in final displacement error compared to state-of-the-art methods while maintaining consistent performance across varying traffic densities (0.05–0.85 vehicles/meter) and speed ranges (0–35 m/s). 
Moreover, ADSAP exhibits robust generalization capabilities across different driving scenarios and weather conditions, with the lightweight student model achieving 95% of the teacher model’s accuracy while offering a 3.2× reduction in inference time. Comprehensive experimental results supported by detailed ablation studies and statistical analyses validate ADSAP’s effectiveness in addressing the trajectory prediction challenge. Our framework provides a novel perspective on integrating adaptive attention mechanisms with efficient knowledge transfer, contributing to the development of more reliable and intelligent autonomous driving systems. Significant improvements in prediction accuracy, computational efficiency, and generalization capability demonstrate ADSAP’s potential to advance autonomous driving technology.

Keywords: adaptive speed-aware; knowledge distillation approach; adversarial knowledge transfer; trajectory prediction methods

1. Introduction

Autonomous driving represents a paradigm-shifting technology that holds immense promise to fundamentally transform modern transportation systems, substantially enhance road safety, and dramatically improve traffic efficiency through intelligent vehicle coordination and operation [1,2]. Among the multitude of technical challenges that must be addressed in order to achieve fully autonomous driving capabilities, accurate and robust prediction of future trajectories of surrounding vehicles in complex, dynamic, and highly interactive traffic scenarios stands out as a critical research problem [3,4,5]. This prediction task is particularly challenging due to the inherent uncertainties in human driving behavior, the complex interdependencies between multiple traffic participants, and the diverse range of environmental conditions that autonomous vehicles must navigate.

The significance of reliable trajectory prediction cannot be overstated, as it serves as a fundamental building block for autonomous decision-making systems. By accurately forecasting the future states and intentions of surrounding vehicles, autonomous vehicles can generate optimal trajectories, execute safe maneuvers, and maintain appropriate safety margins while maximizing operational efficiency [6,7,8]. This capability is especially crucial in scenarios involving multiple interacting agents, where the prediction model must simultaneously consider the coupled dynamics of numerous vehicles, each potentially influencing the behavior of others through complex social interactions. Moreover, accurate trajectory prediction enables autonomous vehicles to anticipate and proactively respond to potentially dangerous situations, thereby reducing the likelihood of accidents and enhancing overall traffic safety.

The development of trajectory prediction methodologies has witnessed significant advancement with the emergence of sophisticated deep learning architectures. These have demonstrated remarkable capability in capturing and learning the intricate spatiotemporal dependencies and complex interaction patterns inherent in vehicular motion from historical trajectory data [9,10,11]. Such approaches have revolutionized the field by moving beyond traditional kinematic models to data-driven frameworks that can automatically extract relevant features and learn meaningful representations of vehicle behavior patterns.

Among the various deep learning architectures, Recurrent Neural Networks (RNNs) and their advanced variants, particularly Long Short-Term Memory (LSTM) networks [12], have emerged as fundamental building blocks for sequence modeling in trajectory prediction tasks [13,14]. These architectures excel in capturing temporal dependencies and maintaining long-term memory of motion patterns, enabling them to effectively model the sequential nature of vehicle trajectories. Concurrently, graph-based approaches have garnered significant attention in the research community, offering a natural and powerful framework for modeling the complex interactions among multiple traffic participants [15,16]. By representing vehicles as nodes and their interactions as edges in a graph structure, these methods can explicitly capture the relational dynamics and social influences that govern vehicle behavior in traffic scenarios.

Table 1 highlights the comprehensive integration of advanced components in ADSAP. Unlike existing methods that typically focus on individual aspects, ADSAP uniquely combines speed-aware mechanisms, adaptive pooling, adversarial knowledge distillation, multi-scale processing, and a transformer architecture.

While previous methodologies have achieved noteworthy success in capturing general motion patterns and predicting future vehicle positions, they face substantial challenges in real-world applications [18,20,21]. A primary limitation lies in their ability to adapt to the highly dynamic and continuously evolving nature of traffic scenarios, where interaction patterns and behavioral dynamics can change rapidly and unpredictably. Furthermore, previous approaches often struggle to fully leverage the rich hierarchical contextual information available at different spatial and temporal scales, such as lane-level features, road topology, traffic rules, and broader environmental context. This incomplete utilization of multi-scale contextual information can lead to predictions that, while mathematically accurate, may not fully align with the physical and social constraints of real-world traffic scenarios.

The inherent limitations of current trajectory prediction models extend beyond their immediate performance metrics to fundamental challenges in knowledge transfer and generalization capabilities [22,23,24]. The domain-specific nature of learned representations often results in models that excel in scenarios similar to the training distribution but exhibit significant performance degradation when confronted with novel traffic conditions, varying road geometries, or unfamiliar driving behaviors. This limitation is particularly problematic in real-world applications, where autonomous vehicles must navigate through diverse and continuously evolving traffic environments that may differ substantially from the training scenarios.

To address these limitations, researchers have increasingly turned to sophisticated generative modeling approaches, notably Generative Adversarial Networks (GANs) [25] and Variational Autoencoders (VAEs) [26], which offer promising frameworks for capturing the inherent multimodality and uncertainty in future trajectory predictions [14,17]. These generative approaches represent a paradigm shift from deterministic prediction to probabilistic modeling, enabling the generation of multiple plausible future trajectories that reflect the stochastic nature of human driving behavior. By learning the underlying distribution of possible future trajectories rather than attempting to predict a single optimal path, these models can better account for the inherent uncertainty in traffic scenarios and provide more comprehensive information for downstream decision-making processes.

However, the integration of advanced generative models introduces new challenges in the trajectory prediction pipeline [8,11]. Traditional GANs are prone to mode collapse during trajectory prediction, in which the generator produces only a limited set of trajectory variations, significantly reducing prediction diversity and failing to capture the full spectrum of possible future behaviors. Additionally, increased model complexity often results in substantially higher computational requirements, which hinders real-time deployment in autonomous vehicles, where rapid prediction updates are crucial for safe operation. The tradeoff between model expressiveness and computational efficiency becomes particularly acute when considering the strict latency requirements and limited computational resources available in autonomous vehicles. Furthermore, the evaluation and validation of generative models present unique challenges, as traditional metrics may not fully capture the quality and diversity of the generated trajectories, necessitating the development of more sophisticated evaluation frameworks.

To overcome the aforementioned challenges and limitations, we introduce Adaptive Dynamic Speed-Aware Prediction (ADSAP), a novel trajectory prediction framework that seamlessly integrates adversarial knowledge transfer with dynamic spatial adaptation mechanisms. The fundamental innovation of ADSAP lies in its sophisticated architecture, which adaptively modulates the model’s attention mechanisms and receptive field characteristics based on two critical factors: the instantaneous interaction states between vehicles, and their corresponding speed variations. This adaptive capability enables the framework to dynamically adjust its prediction strategy according to the evolving complexity and dynamics of traffic scenarios.

At the heart of ADSAP is the advanced Adaptive Deformable Speed-aware Pooling (ADSP) mechanism, which represents a significant advancement over traditional pooling operations [19,27]. Unlike deformable convolutions that focus purely on spatial deformations, ADSP incorporates velocity-dependent dynamic adjustment mechanisms, enabling adaptive receptive field modulation based on traffic dynamics rather than static spatial patterns. The ADSP mechanism dynamically reconfigures its pooling grid structure by incorporating both spatial and kinematic information, specifically, the relative positions and velocities of surrounding vehicles. This dynamic adaptation allows the model to effectively capture and process information at varying spatial scales and temporal resolutions, depending on the instantaneous traffic conditions. The deformable nature of the pooling operation enables the model to focus on regions of high interaction potential while maintaining computational efficiency by adaptively allocating computational resources.

Furthermore, ADSAP incorporates a sophisticated speed-aware multi-scale feature aggregation scheme that systematically extracts and synthesizes contextual information across different spatial and temporal scales [28]. This hierarchical feature processing approach enables the model to simultaneously capture both fine-grained local interactions between adjacent vehicles and broader spatial patterns in the traffic flow. The multi-scale feature aggregation mechanism operates in conjunction with ADSP to create a comprehensive representation of the traffic scene that encompasses both microscopic vehicle interactions and macroscopic traffic patterns. This integrated approach ensures that the model maintains awareness of both immediate vehicle interactions and larger-scale traffic dynamics, leading to more robust and contextually aware trajectory predictions.

Another significant innovation of ADSAP lies in its novel approach to knowledge transfer through the integration of adversarial learning principles. The framework incorporates an advanced Adversarial Knowledge Distillation Module (AKDM), which represents a sophisticated mechanism for transferring learned representations and decision-making strategies from a high-capacity teacher model to a more compact and efficient student model [29,30]. This knowledge distillation process is fundamentally different from traditional approaches in that it operates at multiple levels of abstraction, transferring not only the final predictions but also the intermediate feature representations along with the reasoning patterns that led to those predictions.

The AKDM employs an adversarial learning framework inspired by the principles of Generative Adversarial Networks (GANs) [25], in which the student model is trained to generate predictions that are indistinguishable from those of the teacher model. This adversarial objective creates a powerful learning signal that goes beyond simple mimicry of the teacher’s outputs, encouraging the student model to develop robust internal representations that capture the essential characteristics of the teacher’s decision-making process. The adversarial component introduces a form of regularization that helps prevent the student model from overfitting to specific aspects of the teacher’s behavior while maintaining the ability to generalize effectively to new scenarios.

Furthermore, the knowledge distillation process is carefully designed to maintain computational efficiency without compromising prediction accuracy [31]. The student model’s architecture is optimized through a systematic process of structural pruning and parameter optimization, resulting in a significantly reduced computational footprint compared to the teacher model. This efficiency gain is particularly crucial for real-time applications in autonomous driving systems, where rapid processing of sensor data and quick decision-making are essential. The compact nature of the student model combined with its ability to maintain high prediction accuracy through the adversarial knowledge transfer process makes ADSAP particularly well suited for deployment in resource-constrained environments while maintaining robust performance across diverse traffic scenarios.

To evaluate the performance of ADSAP against state-of-the-art trajectory prediction methods, we conducted extensive experiments on real-world traffic datasets, including the widely-used NGSIM dataset [32], the INTERACTION dataset [6], and the highD dataset [33]. Experimental results demonstrate that ADSAP significantly outperforms existing approaches in terms of prediction accuracy, generalization ability, and computational efficiency [16,17]. Our framework achieves state-of-the-art performance while requiring fewer parameters and computational resources, making it a promising solution for practical autonomous driving applications.

The main contributions of this work can be summarized as follows:

We propose ADSAP, an adaptive speed-aware trajectory prediction framework that introduces novel techniques for capturing fine-grained vehicle interactions and modeling dynamic traffic scenarios through velocity-dependent attention mechanisms.
We develop an Adaptive Deformable Speed-aware Pooling (ADSP) mechanism that dynamically adjusts the model’s attention and receptive field based on the vehicle’s interaction state and speed variation, enabling context-aware trajectory prediction superior to traditional deformable convolutions.
We introduce an Adversarial Knowledge Distillation Module (AKDM) that facilitates the transfer of feature hierarchies and decision-making patterns from a teacher model to a student model, improving prediction accuracy and model efficiency compared to traditional knowledge distillation methods.
We conduct comprehensive experiments on multiple real-world traffic datasets along with statistical significance testing, demonstrating the superior performance of ADSAP compared to state-of-the-art trajectory prediction methods across diverse scenarios and environmental conditions.

The remainder of this paper is structured as follows: Section 1 reviews the related work on trajectory prediction and knowledge distillation; Section 2 presents the proposed ADSAP framework, detailing the adaptive deformable speed-aware pooling mechanism and the adversarial knowledge distillation module; Section 3 describes the experimental setup, datasets, and evaluation metrics, and presents the results and analysis; finally, Section 4 concludes the paper and discusses future research directions.

2. Materials and Methods

In this section, we provide a comprehensive description of the proposed ADSAP framework, which introduces a novel approach to trajectory prediction in autonomous driving systems. The framework’s architecture is built upon three fundamental and interconnected components, each addressing specific challenges in trajectory prediction: (1) an Adaptive Deformable Speed-aware Pooling (ADSP) mechanism that dynamically adjusts to varying traffic conditions and vehicle dynamics, (2) an Adversarial Knowledge Distillation Module (AKDM) that enables efficient knowledge transfer and model compression, and (3) a sophisticated multi-scale feature aggregation scheme that captures contextual information across different spatial and temporal scales. The synergistic integration of these components is depicted in Figure 3, which provides a detailed illustration of ADSAP’s overall architecture and the interconnections between its key components.

2.1. Adaptive Deformable Speed-Aware Pooling (ADSP)

The Adaptive Deformable Speed-aware Pooling (ADSP) mechanism represents a fundamental advancement in capturing and processing dynamic vehicle interactions in complex traffic scenarios. Unlike traditional pooling operations that employ rigid predefined grid structures, ADSP introduces a flexible content-adaptive pooling strategy that dynamically modulates its receptive field based on both spatial relationships and kinematic characteristics of vehicles.

Given an input feature map $F \in \mathbb{R}^{C \times H \times W}$, where $C$, $H$, and $W$ represent the channel dimension, height, and width, respectively, ADSP generates an adaptively pooled feature representation $F_{adsp} \in \mathbb{R}^{C \times H' \times W'}$. The adaptation process is governed by learnable offset vectors $\Delta p_{ij}$, where $\Delta p_{ij} \in \mathbb{R}^{2}$ defines the spatial offset for each grid point $(i, j)$. These offset vectors are computed through a specialized convolutional layer that processes the concatenated information from both the feature map $F$ and a speed map $S \in \mathbb{R}^{H \times W}$:

$$\Delta p_{ij} = \mathrm{Conv}([F; S]) + \lambda \cdot \mathrm{SpeedBias}(S) \tag{1}$$

where $[\cdot\,; \cdot]$ denotes channel-wise concatenation and $\mathrm{SpeedBias}(S)$ introduces an additional speed-dependent bias term weighted by $\lambda$. The speed map $S$ encodes the magnitude of vehicle velocities, attenuated with distance from a reference position:

$$S(i, j) = \sqrt{v_{x}^{2} + v_{y}^{2}} \cdot \exp\left(-\frac{\|p_{ij} - p_{ref}\|^{2}}{2\sigma^{2}}\right) \tag{2}$$

where $(v_{x}, v_{y})$ represents the velocity components, $p_{ij}$ and $p_{ref}$ are the current and reference positions, respectively, and $\sigma$ controls the spatial influence range. The Gaussian term provides spatial locality bias, ensuring that nearby vehicles have stronger influence while maintaining smooth attention transitions. This design is motivated by human attention mechanisms in driving, where spatial proximity strongly influences attention allocation. Ablation experiments demonstrate that removing this term degrades ADE by 4.2%, confirming its necessity for realistic interaction modeling.
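To make Equation (2) concrete, the following minimal sketch evaluates the speed map on a small grid. The paper does not specify how vehicle velocities are rasterized onto the grid, so the dictionary-based layout, the function name `speed_map`, and the use of grid cells as the distance unit are assumptions made purely for illustration.

```python
import math

def speed_map(height, width, vehicles, p_ref, sigma):
    """Evaluate Equation (2) on a height x width grid.

    vehicles: dict mapping an occupied cell (i, j) to its velocity (vx, vy).
              This rasterization is an assumption made for illustration.
    p_ref:    reference cell (i, j); sigma: spatial influence range in cells.
    """
    grid = [[0.0] * width for _ in range(height)]
    for (i, j), (vx, vy) in vehicles.items():
        speed = math.hypot(vx, vy)  # sqrt(vx^2 + vy^2)
        dist_sq = (i - p_ref[0]) ** 2 + (j - p_ref[1]) ** 2
        grid[i][j] = speed * math.exp(-dist_sq / (2.0 * sigma ** 2))
    return grid

# Two vehicles at 20 m/s: one at the reference cell, one three cells away.
S = speed_map(3, 7, {(1, 1): (20.0, 0.0), (1, 4): (0.0, 20.0)},
              p_ref=(1, 1), sigma=2.0)
```

In this example both vehicles travel at the same speed, yet the distant one contributes a smaller value to the map because of the Gaussian attenuation, which is the locality bias described above.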

The deformation process transforms the regular grid $G = \{(i, j)\}$ into an adapted sampling grid $G_{adsp} = \{(i + \Delta p_{ij}^{y}, j + \Delta p_{ij}^{x})\}$. The adaptive pooling operation is then formulated as

$$F_{adsp}(c, i, j) = \sum_{(i', j') \in R(i, j)} F(c, i', j') \cdot K(i, j, i', j') \cdot W(v_{i'j'}), \tag{3}$$

where $R(i, j)$ defines the sampling region around the deformed grid point, $K$ is a bilinear interpolation kernel, and $W(v_{i'j'})$ is a velocity-dependent weighting function:

$$W(v_{i'j'}) = \frac{\exp(\alpha \cdot \|v_{i'j'}\|)}{\sum_{(k, l) \in R(i, j)} \exp(\alpha \cdot \|v_{kl}\|)}, \tag{4}$$

with $\alpha$ controlling the sensitivity to speed variations. This velocity-dependent weighting enables dynamic attention allocation based on interaction urgency, where higher velocities receive increased attention weights to reflect their greater impact on trajectory evolution.
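The weighting in Equation (4) is a softmax over the speed magnitudes inside one sampling region, and Equation (3) is then a weighted sum over that region. The sketch below illustrates both for a single channel and a single output cell. The function names, the flat-list representation of the region, and the max-subtraction for numerical stability are illustrative choices, not details taken from the paper.

```python
import math

def velocity_weights(speeds, alpha):
    """Softmax over speed magnitudes in one sampling region, as in Equation (4)."""
    m = max(speeds)  # subtracting the max leaves the softmax unchanged but avoids overflow
    exps = [math.exp(alpha * (s - m)) for s in speeds]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_pool(features, kernel, speeds, alpha):
    """Equation (3) for one channel and one output cell.

    features, kernel, speeds: equal-length lists over the sampling region R(i, j).
    """
    weights = velocity_weights(speeds, alpha)
    return sum(f * k * w for f, k, w in zip(features, kernel, weights))

# Three sampled cells with speeds of 5, 10, and 30 m/s.
w = velocity_weights([5.0, 10.0, 30.0], alpha=0.1)
```

With $\alpha = 0$ the weights become uniform, and larger values of $\alpha$ concentrate the pooled response on the fastest vehicles in the region.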

Figure 1 provides visual evidence of ADSP’s adaptive behavior, clearly demonstrating how attention patterns shift based on velocity conditions to enable more effective capture of speed-dependent interaction dynamics.

To ensure smooth and continuous adaptation, we introduce a regularization term in the learning objective:

$$L_{reg} = \beta \sum_{i, j} \|\Delta p_{ij}\|^{2} + \gamma \sum_{i, j} \|\nabla \Delta p_{ij}\|^{2} \tag{5}$$

where $\beta$ and $\gamma$ are regularization coefficients and the second term enforces spatial smoothness in the deformation field.
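A minimal sketch of Equation (5) follows. The paper does not state how the spatial gradient $\nabla \Delta p_{ij}$ is discretized, so forward differences between neighboring grid points are assumed here; the function name and the nested-list layout of the offset field are likewise hypothetical.

```python
def offset_regularizer(offsets, beta, gamma):
    """Equation (5) for an offset field where offsets[i][j] = (dy, dx).

    The gradient term uses forward differences (an assumption of this sketch).
    """
    h, w = len(offsets), len(offsets[0])
    magnitude = 0.0   # sum of squared offset norms
    smoothness = 0.0  # sum of squared differences between neighboring offsets
    for i in range(h):
        for j in range(w):
            dy, dx = offsets[i][j]
            magnitude += dy * dy + dx * dx
            if i + 1 < h:
                ny, nx = offsets[i + 1][j]
                smoothness += (ny - dy) ** 2 + (nx - dx) ** 2
            if j + 1 < w:
                ny, nx = offsets[i][j + 1]
                smoothness += (ny - dy) ** 2 + (nx - dx) ** 2
    return beta * magnitude + gamma * smoothness

# A constant field is penalized only by the magnitude term.
constant = [[(1.0, 0.0), (1.0, 0.0)], [(1.0, 0.0), (1.0, 0.0)]]
# A field that changes between columns is also penalized by the smoothness term.
varying = [[(0.0, 0.0), (1.0, 0.0)], [(0.0, 0.0), (1.0, 0.0)]]
```

The two example fields show the roles of the terms: the first keeps offsets small, while the second discourages abrupt changes between adjacent sampling points.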

This sophisticated pooling mechanism enables ADSP to dynamically adjust its receptive field based on both spatial and kinematic features, resulting in more effective capture of vehicle interactions across varying speeds and distances. The adaptive nature of ADSP makes it particularly effective in scenarios with heterogeneous traffic patterns and varying vehicle densities.

2.2. Adversarial Knowledge Distillation Module (AKDM)

The Adversarial Knowledge Distillation Module (AKDM) is introduced to transfer knowledge from a complex teacher model to a simplified student model, thereby improving prediction accuracy and model efficiency. The AKDM employs adversarial learning to encourage the student model to mimic the teacher model’s behavior while achieving superior performance compared to traditional knowledge distillation methods.

Figure 2 illustrates the workflow of the AKDM. The teacher model and student model first extract feature maps $F_{t}$ and $F_{s}$, respectively. A discriminator $D$ distinguishes between the features from the teacher and student models. The discriminator’s objective is to maximize the adversarial loss $L_{adv}$, defined as

$$L_{adv} = \mathbb{E}_{F_{t}}[\log D(F_{t})] + \mathbb{E}_{F_{s}}[\log(1 - D(F_{s}))], \tag{6}$$

where $D(\cdot)$ represents the discriminator’s output probability. The discriminator aims to maximize this loss in order to tell teacher and student features apart, while the student model tries to minimize it, generating features that are indistinguishable from those of the teacher model.
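As a quick numerical check of Equation (6), the sketch below estimates $L_{adv}$ from batches of discriminator outputs. It assumes that $D$ returns the probability that a feature map came from the teacher; the function name and the clipping constant `eps` are illustrative and do not come from the paper.

```python
import math

def adversarial_loss(d_teacher, d_student, eps=1e-12):
    """Sample estimate of Equation (6).

    d_teacher: discriminator outputs D(F_t) on teacher features, each in (0, 1).
    d_student: discriminator outputs D(F_s) on student features, each in (0, 1).
    eps guards against log(0); it is a numerical convenience only.
    """
    term_t = sum(math.log(max(p, eps)) for p in d_teacher) / len(d_teacher)
    term_s = sum(math.log(max(1.0 - p, eps)) for p in d_student) / len(d_student)
    return term_t + term_s

# Discriminator cannot tell the two sources apart: every output is 0.5.
fooled = adversarial_loss([0.5, 0.5], [0.5, 0.5])
# Discriminator separates the two sources confidently.
separated = adversarial_loss([0.9, 0.9], [0.1, 0.1])
```

When every output is 0.5 the loss equals $2 \log 0.5$, and it rises toward 0 as the discriminator separates the two sources, which is consistent with the min–max structure of Equation (8).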

In addition to the adversarial loss, the AKDM also introduces a distillation loss $L_{dis}$ to measure the discrepancy between the output probability distributions of the teacher and student models. The distillation loss is defined as the Kullback–Leibler (KL) divergence between the output probabilities $P_{t}$ and $P_{s}$:

$$L_{dis} = \mathrm{KL}(P_{t} \,\|\, P_{s}). \tag{7}$$
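The distillation loss of Equation (7) can be sketched for discrete distributions as follows. The paper does not say over which outcomes $P_{t}$ and $P_{s}$ are defined, so a small categorical distribution (for example, over maneuver classes) is assumed here purely for illustration, and the function name is hypothetical.

```python
import math

def kl_divergence(p_t, p_s, eps=1e-12):
    """KL(P_t || P_s) for two discrete distributions given as equal-length lists.

    Terms with p_t == 0 contribute nothing; eps avoids division by zero in p_s.
    """
    return sum(pt * math.log(pt / max(ps, eps))
               for pt, ps in zip(p_t, p_s) if pt > 0.0)

teacher = [0.5, 0.5]
student = [0.9, 0.1]
loss = kl_divergence(teacher, student)
```

The divergence is zero when the student reproduces the teacher's distribution exactly and grows as the two distributions move apart. It is not symmetric, so the order of the arguments matters.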

Through the joint optimization of the adversarial loss $L_{adv}$ and distillation loss $L_{dis}$, the AKDM facilitates comprehensive knowledge transfer across multiple levels of abstraction. The optimization process can be formulated as a min–max game:

$$\min_{\theta_{s}} \max_{\theta_{d}} L_{total} = \alpha L_{adv} + \beta L_{dis} + \lambda L_{reg} \tag{8}$$

where $\theta_{s}$ and $\theta_{d}$ respectively represent the parameters of the student model and discriminator, and $\alpha$, $\beta$, and $\lambda$ weight the three loss terms. The regularization term $L_{reg}$ used here is specific to the AKDM (it is distinct from the deformation regularizer of Equation (5)) and is introduced to prevent overfitting:

$$L_{reg} = \gamma_{1} \|\theta_{s}\|_{2}^{2} + \gamma_{2} \sum_{l} \|F_{s}^{l} - F_{t}^{l}\|_{F}^{2} + \gamma_{3} \, \mathrm{TV}(F_{s}) \tag{9}$$

where TV denotes total variation regularization. This optimization framework enables the student model to learn both feature representations and decision boundaries from the teacher model while maintaining computational efficiency through architectural simplification.

Table 2 demonstrates the AKDM’s significant advantages over Hinton’s original knowledge distillation method, achieving 8.5% ADE improvement and 7.2% FDE enhancement while maintaining computational efficiency. The adversarial mechanism enhances robustness by learning distribution-level features rather than point estimates, leading to better generalization under domain shift conditions.

Integration of the AKDM within ADSAP yields several significant advantages. First, it enables efficient knowledge compression, reducing the model complexity from $O(N^{2})$ to $O(N \log N)$ while preserving prediction accuracy. Second, the adversarial training mechanism enhances the robustness of the learned representations, as demonstrated by the improved performance under distribution shift:

$$\mathbb{E}_{x \sim P_{test}}\left[\|f_{s}(x) - f_{t}(x)\|_{2}\right] \leq \epsilon + \delta \, W_{2}(P_{train}, P_{test}) \tag{10}$$

where $W_{2}$ denotes the Wasserstein-2 distance between the training and test distributions and $\epsilon$ and $\delta$ are both small constants. The resulting student model achieves a favorable tradeoff between computational efficiency (average inference time of 15 ms on an NVIDIA RTX 3080 GPU) and prediction accuracy (within 2% of teacher model performance), making it well suited for real-time autonomous driving applications.

2.3. Multi-Scale Feature Aggregation

To effectively capture the hierarchical nature of traffic scenes, ADSAP implements a sophisticated multi-scale feature aggregation architecture. This framework systematically extracts and combines features across multiple spatial resolutions, enabling comprehensive scene understanding from both microscopic vehicle interactions and macroscopic traffic patterns.

Given an input feature map $F \in \mathbb{R}^{C \times H \times W}$, we construct a feature pyramid $\{F^{1}, F^{2}, \dots, F^{L}\}$ through iterative downsampling and feature extraction. Each level $l$ in the pyramid is generated through a combination of convolution and max-pooling operations:

$$F^{l} = \mathrm{BN}(\mathrm{ReLU}(\mathrm{Conv}(\mathrm{MaxPool}(F^{l-1})))) \tag{11}$$

where BN denotes batch normalization. To enhance feature discrimination at each scale, we incorporate a channel attention mechanism

$$F_{att}^{l} = F^{l} \otimes \sigma(\mathrm{MLP}(\mathrm{GlobalPool}(F^{l}))), \tag{12}$$

where $\otimes$ represents channel-wise multiplication and $\sigma$ is the sigmoid activation function.
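The channel attention of Equation (12) can be sketched in a few lines. Several details are assumptions of this sketch rather than statements from the paper: GlobalPool is taken to be global average pooling, the MLP is a bias-free two-layer network with a ReLU in between, and the weight matrices `w1` and `w2` are hypothetical placeholders for learned parameters.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def channel_attention(feature, w1, w2):
    """Equation (12): rescale each channel by a gate in (0, 1).

    feature: list of C channels, each an H x W nested list.
    w1: r x C matrix, w2: C x r matrix (hypothetical MLP weights, no biases).
    """
    # Global average pooling: one scalar per channel.
    pooled = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature]
    hidden = [max(0.0, h) for h in matvec(w1, pooled)]  # ReLU
    gates = [sigmoid(g) for g in matvec(w2, hidden)]
    return [[[x * g for x in row] for row in ch] for ch, g in zip(feature, gates)]

# Two channels of size 1 x 2; the MLP weights are chosen by hand for the example.
F = [[[1.0, 3.0]], [[2.0, 2.0]]]
out = channel_attention(F, w1=[[1.0, 0.0]], w2=[[0.0], [1.0]])
```

Each channel is multiplied by a single scalar gate, so the spatial layout within a channel is preserved while the relative importance of channels changes.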

The multi-scale feature aggregation process follows a bidirectional pathway combining bottom-up and top-down information flow. The top-down pathway progressively upsamples lower-resolution features and merges them with higher-resolution features through adaptive fusion:

$$F_{agg}^{l} = \alpha_{l} \cdot \mathrm{Conv}(F_{att}^{l}) + \beta_{l} \cdot \mathrm{Upsample}(F_{agg}^{l+1}) + \gamma_{l} \cdot \mathrm{Lateral}(F^{l}) \tag{13}$$

where $\alpha_{l}$, $\beta_{l}$, and $\gamma_{l}$ are learnable scale-specific weights and the lateral connection is defined as follows:

$$\mathrm{Lateral}(F^{l}) = \mathrm{Conv}_{1 \times 1}(F^{l}) + \mathrm{SE}(F^{l}) \tag{14}$$

with SE denoting a squeeze-and-excitation block that adaptively recalibrates channel-wise feature responses.

To enhance cross-scale feature interaction, we introduce a scale-aware attention mechanism

$$A^{l, k} = \mathrm{softmax}\left(\frac{(W_{Q} F^{l})(W_{K} F^{k})^{T}}{\sqrt{d_{k}}}\right) W_{V} F^{k}, \tag{15}$$

where $W_{Q}$, $W_{K}$, and $W_{V}$ are learnable projection matrices.

The final aggregated representation is obtained through adaptive feature fusion:

$$F_{final} = \sum_{l=1}^{L} w_{l} \cdot \mathrm{Transform}(F_{agg}^{l}) \tag{16}$$

where the fusion weights $w_{l}$ are dynamically computed based on the current context

$$w_{l} = \frac{\exp(\phi(F_{agg}^{l}))}{\sum_{k=1}^{L} \exp(\phi(F_{agg}^{k}))}, \tag{17}$$

with $\phi(\cdot)$ being a lightweight context encoding network. We tested cross-attention mechanisms for fusion weight computation but found only marginal improvement (1.2%) at 40% higher computational cost, making the current approach more suitable for real-time applications.
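Equations (16) and (17) amount to a softmax-weighted sum over pyramid levels. In the sketch below, the scalar scores stand in for the outputs of the context network $\phi$, and each level is assumed to have already been mapped to a common shape by Transform; both are placeholders, since the paper does not describe these components in detail.

```python
import math

def fusion_weights(scores):
    """Equation (17): softmax over per-level context scores phi(F_agg^l)."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(features, scores):
    """Equation (16): weighted sum of per-level feature vectors.

    features: list of L equal-length vectors (levels already transformed
              to a common shape, which is an assumption of this sketch).
    """
    weights = fusion_weights(scores)
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]

# Two levels with equal scores contribute equally.
balanced = fuse([[1.0, 0.0], [0.0, 1.0]], scores=[0.0, 0.0])
# A higher score for the second level shifts the result toward it.
skewed = fuse([[1.0, 0.0], [0.0, 1.0]], scores=[0.0, math.log(3.0)])
```

Because the weights always sum to one, the fused representation stays on the same scale as the individual levels regardless of how many levels the pyramid contains.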

Experimental results demonstrate that this multi-scale feature aggregation scheme significantly improves prediction accuracy across diverse traffic scenarios. The architecture achieves a 15.3% reduction in prediction error compared to single-scale baselines, with particularly notable improvements in complex urban environments where multi-scale context is crucial for accurate trajectory forecasting.

2.4. Theoretical Analysis

We provide comprehensive theoretical justification for ADSAP’s component combination and demonstrate why this specific integration is fundamentally advantageous over conventional methods. The theoretical analysis establishes error bounds and convergence guarantees that support our empirical findings.

The speed-aware pooling mechanism provides superior dynamic interaction modeling through tighter error bounds compared to traditional pooling methods:

$$\mathbb{E}\left[\|\hat{Y} - Y\|^{2}\right] \leq C_{1} \cdot \sigma_{v}^{2} + C_{2} \cdot \epsilon_{spatial} + C_{3} \cdot \delta_{temporal} \tag{18}$$

where $\sigma_{v}^{2}$ represents the velocity variance, $\epsilon_{spatial}$ denotes the spatial discretization error, and $\delta_{temporal}$ captures the temporal alignment error. This bound is tighter than traditional pooling methods due to velocity-dependent adaptation, which reduces both spatial and temporal prediction uncertainties.

The adversarial training process in the AKDM enhances robustness by minimizing the Wasserstein distance between the teacher and student distributions:

$$W_{2}(P_{teacher}, P_{student}) \leq \epsilon + \delta \cdot L_{adv}. \tag{19}$$

This theoretical framework guarantees the quality of knowledge transfer while maintaining computational efficiency. The adversarial component provides distribution-level alignment rather than point-wise matching, leading to better generalization:

$$\sup_{x \in X} |f_{s}(x) - f_{t}(x)| \leq \sqrt{2 \, W_{2}(P_{teacher}, P_{student})}. \tag{20}$$

The multi-scale feature aggregation provides hierarchical error decomposition:

$$E_{\mathrm{total}} = \sum_{l=1}^{L} w_{l} \cdot E_{l} + E_{\mathrm{fusion}} \tag{21}$$

where $E_{l}$ represents scale-specific errors and $E_{\mathrm{fusion}}$ captures fusion-related uncertainties. The adaptive weight mechanism minimizes the total error by optimally combining scale-specific contributions.

2.5. Preliminaries

In this section, we present a comprehensive overview of the fundamental techniques integrated into our proposed ADSAP framework. These key components include Graph Attention Network v2 (GATv2), Gated Recurrent Unit with Squeeze-and-Excitation (GRU-SE), swish activation function, and shift-window attention mechanism, each contributing distinct advantages to our architecture.

Graph Attention Network v2 (GATv2) [34] represents a significant advancement over the original Graph Attention Network (GAT) [35], addressing several critical limitations in graph representation learning. The key innovation lies in its modified attention mechanism:

$$\alpha_{ij} = \frac{\exp\left(W_{a} [W h_{i} \,\|\, W h_{j}]\right)}{\sum_{k \in \mathcal{N}_{i}} \exp\left(W_{a} [W h_{i} \,\|\, W h_{k}]\right)} \tag{22}$$

where $h_{i}$ and $h_{j}$ represent node features, $W$ and $W_{a}$ are learnable parameter matrices, and $\|$ denotes concatenation. This formulation enables dynamic attention computation and mitigates the rank collapse issue observed in traditional GATs through introduction of a linear transformation followed by LeakyReLU nonlinearity:

$$e_{ij} = \mathrm{LeakyReLU}\left(a^{T} [W h_{i} \,\|\, W h_{j}]\right) \tag{23}$$

where $a$ is a learnable attention vector.

Gated Recurrent Unit with Squeeze-and-Excitation (GRU-SE) [36] enhances the standard GRU architecture [37] by incorporating channel-wise feature recalibration through the SE mechanism [38]. The SE block operates on the hidden state $h_{t}$ by first applying global average pooling:

$$s_{c} = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} h_{t}^{c}(i, j) \tag{24}$$

followed by a two-layer excitation network:

$$z = \sigma\left(W_{2}\, \mathrm{ReLU}(W_{1} s)\right) \tag{25}$$

where $W_{1} \in \mathbb{R}^{r \times C}$ and $W_{2} \in \mathbb{R}^{C \times r}$ are dimension reduction and expansion matrices, respectively, with reduction ratio $r$. The recalibrated features are obtained through channel-wise multiplication:

$$\tilde{h}_{t} = z \odot h_{t}. \tag{26}$$
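The squeeze, excitation, and recalibration steps of Eqs. (24)-(26) can be sketched in plain Python as follows. This is a minimal illustration rather than the authors' implementation: the function names, the nested-list layout (one flat list of activations per channel), and the toy weights are all assumptions made for the example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def se_recalibrate(h, W1, W2):
    """h: list of C channels, each a flat list of activations (assumed layout).
    W1: r x C reduction matrix, W2: C x r expansion matrix."""
    # Squeeze (Eq. 24): global average pooling per channel.
    s = [sum(channel) / len(channel) for channel in h]
    # Excitation (Eq. 25): sigmoid(W2 ReLU(W1 s)).
    hidden = [max(0.0, a) for a in matvec(W1, s)]
    z = [sigmoid(a) for a in matvec(W2, hidden)]
    # Recalibration (Eq. 26): scale every channel by its gate.
    h_tilde = [[z_c * x for x in channel] for z_c, channel in zip(z, h)]
    return z, h_tilde

# Toy example with C = 3 channels and r = 1 (illustrative values only).
h = [[1.0, 3.0], [0.0, 0.0], [-2.0, -2.0]]
W1 = [[1.0, 0.0, 0.0]]
W2 = [[1.0], [0.0], [-1.0]]
z, h_tilde = se_recalibrate(h, W1, W2)
```

In this toy setting the first channel is amplified relative to the third, because its gate is pushed above 0.5 while the third is pushed below it.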

This architecture has demonstrated superior performance in temporal modeling tasks, achieving a 12.5% reduction in prediction error compared to standard GRU implementations. Our choice of GRU-SE over transformer architectures is motivated by computational efficiency requirements in autonomous driving, achieving 60% parameter reduction while maintaining comparable performance for sequential trajectory modeling.

Swish [39] is a smooth non-monotonic activation function, defined as follows:

$$\mathrm{Swish}(x) = x \cdot \mathrm{sigmoid}(\beta x) \tag{27}$$

where $\beta$ is a learnable parameter. Swish has been shown to outperform other activation functions such as ReLU [40] and ELU [41] in deep neural network contexts. Its smooth and non-monotonic nature allows for better gradient flow and improved optimization.
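As a small illustration of Eq. (27), the sketch below evaluates Swish for a fixed value of beta; treating beta as a plain function argument (rather than a trained parameter) is a simplification for the example.

```python
import math

def swish(x, beta=1.0):
    """Swish(x) = x * sigmoid(beta * x), written as x / (1 + exp(-beta * x))."""
    return x / (1.0 + math.exp(-beta * x))

# The function dips below zero for small negative inputs and then rises
# back towards zero, which is the non-monotonic behaviour noted in the text.
values = [swish(x) for x in (-3.0, -1.0, 0.0, 1.0)]
```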

Shift-window attention [42] is a variant of self-attention that operates on shifted windows in a feature map. It allows for efficient computation and reduces the complexity of self-attention from quadratic to linear with respect to the input size. Shift-window attention has been successfully applied in transformer-based models for various computer vision tasks, achieving state-of-the-art performance while maintaining computational efficiency.

In our ADSAP framework, we leverage these techniques to enhance the feature extraction, sequence modeling, and attention mechanisms. GATv2 is used in the context encoding module to capture the interactions between vehicles, while GRU-SE is employed in the sequence modeling module to capture temporal dependencies. Swish activation is used throughout the network to improve optimization, and shift-window attention is incorporated in the transformer-based sequence modeling module to enable efficient and effective modeling of long-range dependencies.

2.6. Model Architecture

The overall architecture of ADSAP consists of a teacher model and a student model, as shown in Figure 3. The teacher model is a complex network that extracts high-level features and makes accurate predictions, while the student model is a simplified version that learns from the teacher model through adversarial knowledge distillation.

The teacher model in ADSAP implements a sophisticated hierarchical architecture for trajectory prediction, which comprises three main components: a surroundings-aware encoder, a specialized teacher encoder, and a multimodal decoder. This design enables comprehensive scene understanding and accurate trajectory forecasting through multi-level feature extraction and fusion.

The surroundings-aware encoder processes input trajectory and scene information through a dual-stream architecture. The first stream employs cascaded causal CNN layers interleaved with Batch Normalization (BN) to extract temporal features:

$$F_{\mathrm{temp}} = \mathrm{BN}\left(\mathrm{CausalCNN}_{n}\left(\dots \mathrm{BN}\left(\mathrm{CausalCNN}_{1}(X)\right)\right)\right) \tag{28}$$

where $X$ represents the input trajectory features. The second stream utilizes GATv2 layers to model vehicle interactions:

$$F_{\mathrm{int}} = \mathrm{GATv2}(X, A; \Theta) \tag{29}$$

where $A$ denotes the adjacency matrix representing vehicle relationships and $\Theta$ are the learnable parameters.
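To make the causal constraint behind Eq. (28) concrete, the following sketch applies a single causal 1-D convolution to a scalar sequence. It is a simplified assumption-laden example: batch normalization and multiple channels are omitted, and the kernel values are illustrative rather than taken from the paper.

```python
def causal_conv1d(x, kernel):
    """y[t] = sum_k kernel[k] * x[t - k], treating x[t - k] as 0 for t - k < 0.

    The output at step t depends only on inputs at steps <= t.
    """
    y = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            if t - k >= 0:
                acc += w * x[t - k]
        y.append(acc)
    return y

speeds = [1.0, 2.0, 3.0, 4.0]
smoothed = causal_conv1d(speeds, [0.5, 0.5])

# Changing a later input must not affect earlier outputs.
perturbed = causal_conv1d([1.0, 2.0, 3.0, 100.0], [0.5, 0.5])
```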

The teacher encoder incorporates an Adaptive Deformable Speed-aware Pooling (ADSP) mechanism that dynamically adjusts the receptive field based on vehicle velocities:

$$F_{\mathrm{adsp}} = \sum_{k=1}^{K} w_{k}(v) \cdot \phi\left(x + \Delta p_{k}(v)\right) \tag{30}$$

where $v$ represents vehicle velocities, $w_{k}$ are the learned sampling weights, and $\Delta p_{k}$ are velocity-dependent offset fields. The pooled features are then processed through a transformer-based architecture with shift-window attention:

$$Q, K, V = \mathrm{SplitHead}(F_{\mathrm{adsp}}), \tag{31}$$

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_{k}}} + M\right) V, \tag{32}$$

where $M$ is the shift-window attention mask that enables efficient computation of local and global dependencies.
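The masked attention of Eq. (32) can be illustrated with a one-dimensional toy version. The sketch below is an assumption for exposition only: it builds an additive mask that is 0 for token pairs in the same window and negative infinity otherwise, and it truncates the shifted partition at the sequence edges instead of cyclically rolling the tokens. The helper names `window_mask` and `masked_attention` are hypothetical.

```python
import math

def window_mask(n, window, shift=0):
    """M[i][j] = 0 if tokens i and j share a (shifted) window, else -inf."""
    win_id = [(i + shift) // window for i in range(n)]
    return [[0.0 if win_id[i] == win_id[j] else -math.inf for j in range(n)]
            for i in range(n)]

def masked_attention(Q, K, V, M):
    """softmax(Q K^T / sqrt(d_k) + M) V for lists of row vectors."""
    d_k = len(Q[0])
    out = []
    for i, q in enumerate(Q):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) + M[i][j]
                  for j, k in enumerate(K)]
        top = max(scores)  # finite, since every token shares a window with itself
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[c] for w, v in zip(weights, V)) / total
                    for c in range(len(V[0]))])
    return out

# With identical queries and keys, each token averages the values in its window.
Q = [[0.0]] * 4
K = [[0.0]] * 4
V = [[1.0], [2.0], [3.0], [4.0]]
out = masked_attention(Q, K, V, window_mask(4, 2))

# Shifting the partition lets tokens 3 and 4 of an 8-token sequence interact.
plain = window_mask(8, 4)
shifted = window_mask(8, 4, shift=2)
```

Alternating the unshifted and shifted masks across layers is what allows information to cross window boundaries while each layer still attends only within small windows.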

The teacher model’s multimodal decoder employs a hierarchical structure with alternating transformer layers and nonlinear activations:

$$H_{l} = \mathrm{Transformer}_{l}\left(\mathrm{Swish}(H_{l-1})\right) + H_{l-1}, \tag{33}$$

culminating in a Multi-Layer Perceptron (MLP) with hyperbolic tangent activation for trajectory prediction:

$$Y_{\mathrm{tea}} = \tanh\left(\mathrm{MLP}(H_{L})\right) \tag{34}$$

where $Y_{\mathrm{tea}}$ represents the predicted trajectories.

The student model implements a lightweight yet effective architecture that maintains prediction accuracy while significantly reducing computational complexity. This is achieved through strategic architectural simplifications and knowledge transfer from the teacher model via adversarial distillation.

The student encoder adopts a streamlined design featuring GRU-SE units for temporal modeling:

$$r_{t} = \sigma\left(W_{r} [h_{t-1}, x_{t}] + b_{r}\right), \tag{35}$$

$$z_{t} = \sigma\left(W_{z} [h_{t-1}, x_{t}] + b_{z}\right), \tag{36}$$

$$\tilde{h}_{t} = \tanh\left(W_{h} [r_{t} \odot h_{t-1}, x_{t}] + b_{h}\right), \tag{37}$$

where the SE mechanism adaptively recalibrates channel-wise responses:

$$s = \mathrm{SE}(h_{t}) = \sigma\left(W_{2}\, \mathrm{ReLU}\left(W_{1}\, \mathrm{GAP}(h_{t})\right)\right) \tag{38}$$

with GAP denoting global average pooling.
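A single GRU-SE step following Eqs. (35)-(38) might look like the sketch below. Several parts are assumptions rather than statements from the text: the final state update `h = (1 - z) * h_prev + z * h_cand` is the conventional GRU interpolation (the text does not print it), global average pooling reduces to the identity because each channel of a per-step hidden vector holds a single value, and the parameter dictionary layout is hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(W, b, v):
    """Compute W v + b for a matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + b_i for row, b_i in zip(W, b)]

def gru_se_step(h_prev, x, p):
    hx = h_prev + x                                           # [h_{t-1}, x_t]
    r = [sigmoid(a) for a in affine(p["Wr"], p["br"], hx)]    # Eq. 35
    z = [sigmoid(a) for a in affine(p["Wz"], p["bz"], hx)]    # Eq. 36
    rhx = [r_i * h_i for r_i, h_i in zip(r, h_prev)] + x
    h_cand = [math.tanh(a) for a in affine(p["Wh"], p["bh"], rhx)]  # Eq. 37
    # Assumed standard GRU interpolation between old state and candidate.
    h = [(1.0 - z_i) * hp + z_i * hc for z_i, hp, hc in zip(z, h_prev, h_cand)]
    # SE gate (Eq. 38), bias-free as printed; GAP is the identity here.
    squeezed = [max(0.0, a) for a in affine(p["W1"], [0.0] * len(p["W1"]), h)]
    s = [sigmoid(a) for a in affine(p["W2"], [0.0] * len(p["W2"]), squeezed)]
    return [s_i * h_i for s_i, h_i in zip(s, h)]

# Zero weights make every gate equal to 0.5, which is easy to check by hand.
params = {
    "Wr": [[0.0] * 3, [0.0] * 3], "br": [0.0, 0.0],
    "Wz": [[0.0] * 3, [0.0] * 3], "bz": [0.0, 0.0],
    "Wh": [[0.0] * 3, [0.0] * 3], "bh": [0.0, 0.0],
    "W1": [[0.0, 0.0]],           # r x C with r = 1, C = 2
    "W2": [[0.0], [0.0]],         # C x r
}
h_next = gru_se_step([1.0, -2.0], [0.3], params)
```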

The student model’s Adaptive Deformable Speed-aware Pooling (ADSP) utilizes a simplified offset computation:

$$\Delta p = \mathrm{MLP}\left([v, F_{\mathrm{local}}]\right) \cdot \alpha\left(\|v\|_{2}\right) \tag{39}$$

where $\alpha(\cdot)$ is a velocity-dependent scaling function. The pooled features are processed through shift-window attention with reduced window size:

$$A_{\mathrm{local}} = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_{k}}} + M_{\mathrm{local}}\right) V. \tag{40}$$

The student multimodal decoder employs alternating MLP and GRU-SE layers:

$$H_{l} = \text{GRU-SE}\left(\mathrm{Swish}\left(\mathrm{MLP}(H_{l-1})\right)\right) + H_{l-1}, \tag{41}$$

culminating in trajectory prediction:

$$Y_{\mathrm{stu}} = \tanh\left(\mathrm{MLP}(H_{L})\right). \tag{42}$$

The knowledge transfer process is guided by the AKDM loss:

$$L_{\mathrm{total}} = \lambda_{1} L_{\mathrm{adv}} + \lambda_{2} L_{\mathrm{dis}} + \lambda_{3} \|Y_{\mathrm{stu}} - Y_{\mathrm{tea}}\|_{2}^{2}, \tag{43}$$

where $\lambda_{1}$, $\lambda_{2}$, and $\lambda_{3}$ are balancing coefficients.

The training process of ADSAP follows a two-stage optimization strategy incorporating adversarial knowledge distillation to ensure efficient knowledge transfer from the teacher to the student model while maintaining prediction accuracy.

In the first stage, the teacher model is optimized using a comprehensive loss function:

$$L_{\mathrm{teacher}} = \lambda_{\mathrm{traj}} L_{\mathrm{traj}} + \lambda_{\mathrm{reg}} L_{\mathrm{reg}} + \lambda_{\mathrm{div}} L_{\mathrm{div}} \tag{44}$$

where the trajectory loss $L_{\mathrm{traj}}$ combines displacement and heading errors

$$L_{\mathrm{traj}} = \sum_{t=1}^{T} \|Y_{\mathrm{tea}}^{t} - Y_{\mathrm{gt}}^{t}\|_{2}^{2} + \alpha \sum_{t=1}^{T} \left(1 - \cos\left(\theta_{\mathrm{tea}}^{t} - \theta_{\mathrm{gt}}^{t}\right)\right), \tag{45}$$

the regularization loss $L_{\mathrm{reg}}$ enforces smooth predictions

$$L_{\mathrm{reg}} = \sum_{t=2}^{T} \|\Delta Y_{\mathrm{tea}}^{t} - \Delta Y_{\mathrm{tea}}^{t-1}\|_{2}^{2}, \tag{46}$$

and the diversity loss $L_{\mathrm{div}}$ encourages multiple plausible predictions

$$L_{\mathrm{div}} = \sum_{i=1}^{K} \sum_{j=i+1}^{K} \exp\left(-\|Y_{\mathrm{tea},i} - Y_{\mathrm{tea},j}\|_{2}^{2} / \sigma^{2}\right). \tag{47}$$
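The three teacher loss terms of Eqs. (45)-(47) can be sketched for a single sample as follows. The function names and the list-of-points trajectory layout are assumptions for the example, and the smoothness term is computed only over the displacement differences available within the predicted trajectory, so its index range may differ by one step from Eq. (46).

```python
import math

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def trajectory_loss(pred, gt, heading_pred, heading_gt, alpha):
    """Eq. 45: squared displacement error plus a weighted heading term."""
    displacement = sum(sq_dist(p, g) for p, g in zip(pred, gt))
    heading = sum(1.0 - math.cos(a - b) for a, b in zip(heading_pred, heading_gt))
    return displacement + alpha * heading

def smoothness_loss(pred):
    """Eq. 46: penalise changes between consecutive step displacements."""
    deltas = [[a - b for a, b in zip(pred[t], pred[t - 1])]
              for t in range(1, len(pred))]
    return sum(sq_dist(deltas[t], deltas[t - 1]) for t in range(1, len(deltas)))

def diversity_loss(modes, sigma):
    """Eq. 47: Gaussian similarity summed over all pairs of predicted modes."""
    total = 0.0
    for i in range(len(modes)):
        for j in range(i + 1, len(modes)):
            d2 = sum(sq_dist(p, q) for p, q in zip(modes[i], modes[j]))
            total += math.exp(-d2 / sigma ** 2)
    return total

straight = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
jerky = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
```

Identical modes give the largest pairwise similarity, so minimising the diversity term pushes the K hypotheses apart.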

The knowledge distillation phase employs the AKDM with three components:

$$L_{\mathrm{AKDM}} = L_{\mathrm{adv}} + \beta L_{\mathrm{dis}} + \gamma L_{\mathrm{feat}} \tag{48}$$

where the adversarial loss $L_{\mathrm{adv}}$ is computed through a discriminator $D$

$$L_{\mathrm{adv}} = \mathbb{E}\left[\log D(Y_{\mathrm{tea}})\right] + \mathbb{E}\left[\log\left(1 - D(Y_{\mathrm{stu}})\right)\right], \tag{49}$$

the distillation loss $L_{\mathrm{dis}}$ measures prediction consistency

$$L_{\mathrm{dis}} = \mathrm{KL}\left(P_{\mathrm{tea}}(Y \mid X) \,\|\, P_{\mathrm{stu}}(Y \mid X)\right), \tag{50}$$

and the feature matching loss $L_{\mathrm{feat}}$ aligns intermediate representations

$$L_{\mathrm{feat}} = \sum_{l=1}^{L} \|\phi_{l}(F_{\mathrm{tea}}) - \phi_{l}(F_{\mathrm{stu}})\|_{2}^{2}. \tag{51}$$
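A minimal sketch of how the terms in Eqs. (48)-(51) combine is given below. It makes several assumptions that are not in the text: expectations are replaced by batch means over discriminator outputs in (0, 1), the predictive distributions are treated as discrete probabilities over trajectory modes, and the per-layer features are already projected by the maps phi_l.

```python
import math

def adversarial_loss(d_teacher, d_student):
    """Eq. 49 with batch means; inputs are discriminator outputs D(.) in (0, 1)."""
    real = sum(math.log(p) for p in d_teacher) / len(d_teacher)
    fake = sum(math.log(1.0 - p) for p in d_student) / len(d_student)
    return real + fake

def kl_divergence(p_tea, p_stu):
    """Eq. 50 for discrete distributions (assumed: probabilities over modes)."""
    return sum(p * math.log(p / q) for p, q in zip(p_tea, p_stu) if p > 0.0)

def feature_matching(feats_tea, feats_stu):
    """Eq. 51: squared L2 distance summed over matched layers."""
    return sum(sum((a - b) ** 2 for a, b in zip(f_t, f_s))
               for f_t, f_s in zip(feats_tea, feats_stu))

def akdm_loss(l_adv, l_dis, l_feat, beta, gamma):
    """Eq. 48: weighted sum of the three components."""
    return l_adv + beta * l_dis + gamma * l_feat

l_adv = adversarial_loss([0.5], [0.5])
l_dis = kl_divergence([0.5, 0.5], [0.5, 0.5])
l_feat = feature_matching([[1.0, 2.0]], [[1.0, 4.0]])
```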

During inference, the student model operates independently with an efficient forward pass:

$$Y_{\mathrm{pred}} = f_{\mathrm{stu}}(X_{\mathrm{test}}; \Theta_{\mathrm{stu}}). \tag{52}$$

3. Experiments

3.1. Experimental Setup

In this study, we conducted comprehensive experiments on multiple widely-used datasets to evaluate the performance of our proposed ADSAP framework against state-of-the-art trajectory prediction methods.

3.1.1. Dataset and Preprocessing

The Next-Generation Simulation (NGSIM) dataset constitutes a comprehensive repository of naturalistic vehicle trajectories meticulously recorded through high-resolution digital video cameras (25 Hz sampling rate) mounted on adjacent buildings along the US-101 and I-80 highways. The US-101 highway segment in Los Angeles, California spans approximately 640 m and captures the movements of roughly 6000 vehicles across three 15-min periods, with traffic flow rates ranging from 1200 to 8000 vehicles per hour per lane. Similarly, the I-80 highway segment in Emeryville, California covers a 500-m stretch, documenting approximately 5000 vehicles under varying traffic conditions, with flow rates between 1000 and 6500 vehicles per hour per lane.

The INTERACTION dataset [6] provides naturalistic vehicle trajectories recorded at intersections, roundabouts, and merging scenarios across different countries including Germany, USA, and China. This dataset contains over 16,000 recorded tracks with diverse driving behaviors and complex multi-agent interactions, making it ideal for evaluating generalization across different driving cultures and traffic patterns.

The highD dataset [33] offers highway driving data recorded using drones at six different locations in Germany, capturing over 110,000 vehicles across 16.5 h of driving. This dataset provides comprehensive vehicle trajectories at 25 Hz frequency with precise positioning and velocity measurements, enabling detailed analysis of highway driving behaviors.

Our preprocessing methodology implements a systematic approach to temporal segmentation, where each trajectory sequence is partitioned into segments of 8 s each following the convention

$$X_{i} = \{\mathbf{x}_{t}\}_{t=1}^{T_{\mathrm{hist}}}, \quad Y_{i} = \{\mathbf{y}_{t}\}_{t=T_{\mathrm{hist}}+1}^{T_{\mathrm{hist}}+T_{\mathrm{fut}}}, \tag{53}$$

where $T_{\mathrm{hist}} = 30$ frames (3 s) represents the historical observation window and $T_{\mathrm{fut}} = 50$ frames (5 s) constitutes the prediction horizon. Each frame encapsulates a comprehensive feature vector

$$\mathbf{x}_{t} = [x_{t}, y_{t}, v_{t}, a_{t}, \theta_{t}, \dot{\theta}_{t}, l_{t}, w_{t}], \tag{54}$$

comprising spatial coordinates $(x, y)$, kinematic parameters including velocity $v$, acceleration $a$, heading angle $\theta$, angular velocity $\dot{\theta}$, and vehicle dimensions $(l, w)$.
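The segmentation convention of Eq. (53) can be sketched as a sliding window over one vehicle track. The default lengths follow the 30-frame history and 50-frame horizon stated in the text; the `stride` parameter and the function names are assumptions introduced for the example, since the text does not say how consecutive segments overlap.

```python
def split_segment(segment, t_hist=30, t_fut=50):
    """Split one segment into (history, future) as in Eq. 53."""
    if len(segment) < t_hist + t_fut:
        raise ValueError("segment is shorter than t_hist + t_fut frames")
    return segment[:t_hist], segment[t_hist:t_hist + t_fut]

def sliding_segments(track, t_hist=30, t_fut=50, stride=1):
    """Cut a track into overlapping (history, future) pairs.

    stride is a hypothetical parameter, not specified in the text.
    """
    length = t_hist + t_fut
    return [split_segment(track[start:start + length], t_hist, t_fut)
            for start in range(0, len(track) - length + 1, stride)]

# A 100-frame toy track in which each frame is just its index.
track = list(range(100))
pairs = sliding_segments(track, stride=10)
history, future = pairs[0]
```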

The data underwent rigorous quality control to ensure reliability, eliminating trajectories with missing frames, physically implausible velocities exceeding 35 m/s, and anomalous accelerations beyond ±11 m/s². Feature normalization was applied to standardize the input distribution:

$$\tilde{\mathbf{x}}_{t} = \frac{\mathbf{x}_{t} - \mu_{x}}{\sigma_{x}} \tag{55}$$

where $\mu_{x}$ and $\sigma_{x}$ respectively represent the feature-wise mean and standard deviation computed across the training dataset.
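The quality filter and the normalization of Eq. (55) can be sketched as below. The thresholds are the ones quoted in the text (35 m/s and 11 m/s²); the frame layout (a dict with keys `v` and `a`, or `None` for a missing frame), the small `eps` guard against zero variance, and the use of the population standard deviation are assumptions for the example.

```python
import math

def is_plausible(track, v_max=35.0, a_max=11.0):
    """Reject tracks with missing frames or out-of-range speed or acceleration."""
    return all(frame is not None
               and abs(frame["v"]) <= v_max
               and abs(frame["a"]) <= a_max
               for frame in track)

def fit_standardizer(rows):
    """Feature-wise mean and standard deviation over the training rows."""
    n, d = len(rows), len(rows[0])
    mu = [sum(r[j] for r in rows) / n for j in range(d)]
    sigma = [math.sqrt(sum((r[j] - mu[j]) ** 2 for r in rows) / n)
             for j in range(d)]
    return mu, sigma

def standardize(row, mu, sigma, eps=1e-8):
    """Eq. 55; eps (an assumption) avoids division by zero for constant features."""
    return [(x - m) / (s + eps) for x, m, s in zip(row, mu, sigma)]

train_rows = [[0.0, 10.0], [2.0, 10.0]]
mu, sigma = fit_standardizer(train_rows)
scaled = standardize([2.0, 10.0], mu, sigma)
```

Fitting the statistics on the training rows only, and reusing them for validation and test data, matches the statement that the mean and standard deviation are computed across the training dataset.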

The processed datasets are partitioned into training (70%), validation (10%), and test (20%) sets, maintaining temporal consistency by ensuring trajectory segments from the same vehicle remain within the same split. The resulting datasets exhibit consistent statistical properties, with mean trajectory durations of 8.0 s (SD: 0.3 s), average vehicle speeds of 15.3 m/s (SD: 4.8 m/s), and mean inter-vehicle distances of 23.5 m (SD: 12.7 m). This standardized preprocessing pipeline facilitates fair comparison with existing methodologies while preserving the essential characteristics of naturalistic driving behavior necessary for developing robust trajectory prediction models.
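One way to keep all segments of a vehicle in the same split is to partition by vehicle identifier rather than by segment, as in the sketch below. The `vehicle_id` key, the seeded shuffle, and the rounding rule are assumptions for the example; the text specifies only the 70/10/20 proportions and the same-vehicle constraint.

```python
import random

def split_by_vehicle(segments, fractions=(0.7, 0.1, 0.2), seed=0):
    """Assign whole vehicles to train/val/test so that no vehicle spans two splits."""
    ids = sorted({seg["vehicle_id"] for seg in segments})
    random.Random(seed).shuffle(ids)
    n_train = round(fractions[0] * len(ids))
    n_val = round(fractions[1] * len(ids))
    groups = {
        "train": set(ids[:n_train]),
        "val": set(ids[n_train:n_train + n_val]),
        "test": set(ids[n_train + n_val:]),
    }
    return {name: [seg for seg in segments if seg["vehicle_id"] in members]
            for name, members in groups.items()}

# Ten toy vehicles with three segments each.
segments = [{"vehicle_id": v, "segment": s} for v in range(10) for s in range(3)]
splits = split_by_vehicle(segments)
```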

3.1.2. Hardware Configuration and Evaluation Metrics

All experiments were conducted on a high-performance computing system equipped with an NVIDIA RTX 3080 GPU (10 GB VRAM) and Intel i7-10700K CPU (8 cores, 3.8 GHz base frequency). The student model achieves 15 ms average inference time (batch size = 32, FP16 precision) on GPU and 45 ms on CPU, meeting the strict 50 ms latency requirement for real-time autonomous driving applications. Training time analysis shows 24 h for the teacher model and 8 h for the student model on the specified hardware configuration.

For comprehensive performance assessment, we employ a rigorous evaluation framework centered on trajectory prediction accuracy and temporal consistency. The primary metrics are formulated as follows:

The Root Mean Square Error (RMSE) at prediction time step t is defined as follows:

$$\mathrm{RMSE}_{t} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \|\hat{Y}_{i}^{t} - Y_{i}^{t}\|_{2}^{2}} \tag{56}$$

where $N$ denotes the number of test samples, $\hat{Y}_{i}^{t}$ represents the predicted position at time step $t$ for the $i$-th trajectory, and $Y_{i}^{t}$ is the corresponding ground truth position.

The Average Displacement Error (ADE) captures the mean prediction error across all future timesteps:

$$\mathrm{ADE} = \frac{1}{T_{\mathrm{fut}}} \sum_{t=1}^{T_{\mathrm{fut}}} \sqrt{\frac{1}{N} \sum_{i=1}^{N} \|\hat{Y}_{i}^{t} - Y_{i}^{t}\|_{2}^{2}} \tag{57}$$

where $T_{\mathrm{fut}}$ represents the prediction horizon (50 frames in our implementation).

To evaluate the model’s performance at specific critical horizons, we compute the Final Displacement Error (FDE):

$$\mathrm{FDE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \|\hat{Y}_{i}^{T_{\mathrm{fut}}} - Y_{i}^{T_{\mathrm{fut}}}\|_{2}^{2}}. \tag{58}$$
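The metrics in Eqs. (56)-(58) translate directly into code. The sketch below follows the definitions as printed, in which ADE is the time average of the per-step RMSE and FDE is the RMSE at the final step; the nested-list data layout (`pred[i][t]` is an (x, y) position) is an assumption for the example.

```python
import math

def rmse_at(pred, gt, t):
    """Eq. 56: RMSE over samples at time step t (0-based index)."""
    n = len(gt)
    total = sum(sum((a - b) ** 2 for a, b in zip(pred[i][t], gt[i][t]))
                for i in range(n))
    return math.sqrt(total / n)

def ade(pred, gt):
    """Eq. 57: mean of the per-step RMSE over the prediction horizon."""
    horizon = len(gt[0])
    return sum(rmse_at(pred, gt, t) for t in range(horizon)) / horizon

def fde(pred, gt):
    """Eq. 58: RMSE at the final prediction step."""
    return rmse_at(pred, gt, len(gt[0]) - 1)

# Two samples, two steps; the prediction is off by (3, 4) at the last step only.
gt = [[(0.0, 0.0), (1.0, 0.0)], [(0.0, 0.0), (1.0, 0.0)]]
pred = [[(0.0, 0.0), (4.0, 4.0)], [(0.0, 0.0), (4.0, 4.0)]]
```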

For multimodal predictions, we extend these metrics to incorporate the minimum error across K predicted trajectories. ADSAP generates K = 20 diverse trajectory hypotheses as a multimodal predictor:

$$\mathrm{minADE}_{K} = \frac{1}{T_{\mathrm{fut}}} \sum_{t=1}^{T_{\mathrm{fut}}} \sqrt{\frac{1}{N} \sum_{i=1}^{N} \min_{1 \leq k \leq K} \|\hat{Y}_{i,k}^{t} - Y_{i}^{t}\|_{2}^{2}}, \tag{59}$$

$$\mathrm{minFDE}_{K} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \min_{1 \leq k \leq K} \|\hat{Y}_{i,k}^{T_{\mathrm{fut}}} - Y_{i}^{T_{\mathrm{fut}}}\|_{2}^{2}}. \tag{60}$$
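The multimodal variants in Eqs. (59) and (60) can be sketched in the same style. The code follows the equations as printed, so the minimum over the K hypotheses is taken independently for each sample and time step; the layout `preds[i][k][t]` is an assumption for the example.

```python
import math

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def min_rmse_at(preds, gt, t):
    """RMSE at step t using, for each sample, the closest of its K modes."""
    n = len(gt)
    total = sum(min(sq_dist(mode[t], gt[i][t]) for mode in preds[i])
                for i in range(n))
    return math.sqrt(total / n)

def min_ade(preds, gt):
    """Eq. 59: time average of the best-mode RMSE."""
    horizon = len(gt[0])
    return sum(min_rmse_at(preds, gt, t) for t in range(horizon)) / horizon

def min_fde(preds, gt):
    """Eq. 60: best-mode RMSE at the final step."""
    return min_rmse_at(preds, gt, len(gt[0]) - 1)

# One sample, K = 2 modes, two time steps.
gt = [[(0.0, 0.0), (0.0, 0.0)]]
preds = [[[(0.0, 0.0), (3.0, 4.0)],     # mode 0: exact first step, 5 m off at the end
          [(6.0, 8.0), (6.0, 8.0)]]]    # mode 1: 10 m off throughout
```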

To assess prediction consistency, we introduce the Temporal Smoothness Error (TSE):

$$\mathrm{TSE} = \frac{1}{T_{\mathrm{fut}} - 1} \sum_{t=2}^{T_{\mathrm{fut}}} \sqrt{\frac{1}{N} \sum_{i=1}^{N} \|\Delta \hat{Y}_{i}^{t} - \Delta Y_{i}^{t}\|_{2}^{2}} \tag{61}$$

where $\Delta \hat{Y}_{i}^{t} = \hat{Y}_{i}^{t} - \hat{Y}_{i}^{t-1}$ represents the predicted velocity.

Additionally, we incorporate the jerk metric to evaluate temporal smoothness and realistic motion patterns:

$$\mathrm{Jerk} = \frac{1}{T_{\mathrm{fut}} - 2} \sum_{t=3}^{T_{\mathrm{fut}}} \sqrt{\frac{1}{N} \sum_{i=1}^{N} \|\Delta^{2} \hat{Y}_{i}^{t} - \Delta^{2} Y_{i}^{t}\|_{2}^{2}} \tag{62}$$

where $\Delta^{2} \hat{Y}_{i}^{t}$ represents the predicted acceleration change.
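Eqs. (61) and (62) share one structure: both are per-step RMSE values computed on finite differences of the trajectories (first differences for TSE, second differences for the jerk metric) and then averaged over the remaining steps. The sketch below exploits that; the helper names and data layout are assumptions for the example.

```python
import math

def first_difference(traj):
    """Frame-to-frame displacement: traj[t] - traj[t - 1]."""
    return [tuple(a - b for a, b in zip(traj[t], traj[t - 1]))
            for t in range(1, len(traj))]

def difference_rmse(pred, gt, order):
    """Average per-step RMSE between order-th differences of pred and gt."""
    for _ in range(order):
        pred = [first_difference(p) for p in pred]
        gt = [first_difference(g) for g in gt]
    steps, n = len(gt[0]), len(gt)
    per_step = [
        math.sqrt(sum(sum((a - b) ** 2 for a, b in zip(pred[i][t], gt[i][t]))
                      for i in range(n)) / n)
        for t in range(steps)
    ]
    return sum(per_step) / steps

def tse(pred, gt):
    return difference_rmse(pred, gt, order=1)    # Eq. 61

def jerk_metric(pred, gt):
    return difference_rmse(pred, gt, order=2)    # Eq. 62

# Ground truth is stationary; the prediction drifts by 1 m per frame along x.
gt = [[(0.0, 0.0)] * 4]
pred = [[(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]]
```

A constant drift shows up in the first-difference metric but not in the second-difference one, which is why the two metrics are reported separately.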

Statistical significance is established through paired t-tests with Bonferroni correction for multiple comparisons ($\alpha = 0.05/m$, where $m$ is the number of comparisons). Confidence intervals are computed using bootstrap resampling with 1000 iterations to ensure robust performance estimation.
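The two statistical ingredients mentioned above can be sketched as follows. The Bonferroni threshold is exactly the division stated in the text; the bootstrap interval uses the percentile method on the resampled mean, which is an assumption, since the text gives only the number of iterations (1000) and not the interval construction.

```python
import random

def bonferroni_alpha(alpha, m):
    """Per-comparison significance level for m comparisons."""
    return alpha / m

def bootstrap_ci(values, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean (assumed method)."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(sum(rng.choice(values) for _ in range(n)) / n
                   for _ in range(n_boot))
    lower = means[int((1.0 - level) / 2.0 * n_boot)]
    upper = means[int((1.0 + level) / 2.0 * n_boot) - 1]
    return lower, upper

# Illustrative per-sample errors (made-up numbers, not results from the paper).
errors = [1.2, 1.9, 1.5, 2.1, 1.4, 1.8, 1.6, 1.7]
lower, upper = bootstrap_ci(errors)
threshold = bonferroni_alpha(0.05, 5)
```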

The comprehensive evaluation framework enables assessment of absolute prediction accuracy through RMSE and ADE, analysis of long-term prediction capability via FDE, evaluation of multimodal prediction quality using minADE and minFDE, and measurement of temporal consistency through the TSE and jerk metrics. Collectively, these metrics provide a thorough and statistically sound basis for comparing trajectory prediction models while considering both spatial accuracy and temporal coherence of the predictions.

3.2. Experimental Results

3.2.1. Comparison with State-of-the-Art Methods

We compare the performance of ADSAP with various state-of-the-art trajectory prediction methods on the NGSIM dataset. Table 3 presents the RMSE values for different prediction horizons (1 s to 5 s) and the average RMSE across all horizons, together with statistical significance indicators.

The results demonstrate the superior performance of ADSAP compared to the state-of-the-art baselines. Our ADSAP model achieves the lowest RMSE values across all prediction horizons and exhibits an average RMSE of 1.65 ± 0.09, outperforming the best-performing baseline (iNATran) by 6.3% with statistical significance (p < 0.001). Notably, ADSAP excels in long-term predictions, attaining substantial improvements of 13.2% and 5.2% for the 4-s and 5-s horizons, respectively. These findings highlight the effectiveness of our adaptive speed-aware pooling and adversarial knowledge distillation approach in capturing intricate vehicle interactions and transferring knowledge for accurate trajectory forecasting.

Furthermore, our lightweight ADSAP student model also surpasses most of the baselines, achieving an average RMSE of 1.73 ± 0.10. This indicates that our knowledge distillation method successfully transfers the essential knowledge from the teacher model to the student model, enabling efficient prediction without significant performance degradation.

3.2.2. Multi-Dataset Cross-Validation

Table 4 demonstrates ADSAP’s consistent superiority across all three datasets, validating its robust generalization capabilities. The INTERACTION dataset results (ADE: 1.89 ± 0.09 m, FDE: 3.78 ± 0.18 m) show excellent performance on complex intersection scenarios, while the highD results (ADE: 1.71 ± 0.08 m, FDE: 3.42 ± 0.15 m) confirm effectiveness on different highway driving patterns from those in the NGSIM dataset.

3.2.3. Cross-Weather and Traffic Density Analysis

Table 5 presents a comprehensive analysis across environmental conditions and traffic densities, confirming ADSAP’s robustness. The framework maintains superior performance across all weather conditions with statistically significant improvements (p < 0.001), demonstrating effective adaptation to varying environmental factors.

3.2.4. Comprehensive Ablation Studies

To comprehensively investigate the contributions of key components and design choices in ADSAP, we conducted extensive ablation studies examining various aspects of our model, including core components, architectural choices, and hyperparameter settings.

Table 6 demonstrates the significance of each component, with the Transformer Module (TM) and ADSP showing the most substantial impacts on performance. The comparison with deformable convolutions confirms ADSP’s superiority for trajectory prediction tasks.

3.2.5. Gaussian Term Ablation Analysis

Specific ablation analysis of the Gaussian term in Equation (2) shows that removing this component results in 4.2% ADE degradation (from 1.65 m to 1.72 m), confirming its essential role in providing spatial locality bias for realistic interaction modeling.

3.2.6. Runtime Performance and Computational Analysis

Table 7 provides detailed computational specifications confirming ADSAP’s efficiency advantages. The student model’s power consumption of 35 W makes onboard deployment feasible with current automotive computing platforms while achieving significant speedup over all baseline methods.

3.2.7. Qualitative Analysis and Trajectory Visualizations

Figure 4 provides extensive qualitative analysis demonstrating ADSAP’s superior performance in various challenging scenarios. The visualizations clearly show quantified improvements across different traffic situations while also highlighting areas for future improvement, particularly in complex multi-agent intersection scenarios.

3.2.8. Multimodal Prediction Analysis

ADSAP functions as a multimodal predictor generating K = 20 diverse trajectory hypotheses. Evaluation metrics include both unimodal (ADE: 1.65 m, FDE: 3.25 m) and multimodal (minADE₂₀: 1.32 m, minFDE₂₀: 2.98 m) performance, demonstrating superior capability in capturing trajectory uncertainty and providing comprehensive prediction distributions essential for autonomous driving safety.

3.2.9. Temporal Consistency Evaluation

Additional temporal consistency analysis using the jerk metric shows that ADSAP achieves 2.45 m/s³ compared to 3.18 m/s³ for the best baseline (iNATran), indicating a 23% improvement in temporal smoothness and more realistic motion pattern generation.

3.2.10. Input Sequence Length Analysis

We additionally investigated the impact of varying observation sequence lengths $T_{\mathrm{obs}}$ on ADSAP’s performance. The results show optimal performance at $T_{\mathrm{obs}} = 8$ (current setting), with shorter sequences providing insufficient context and longer sequences introducing noise. This aligns with research on human attention spans in driving scenarios.

3.2.11. Missing Data Robustness

Evaluation under missing data conditions demonstrates ADSAP’s resilience, with performance degradation proportional to the proximity of missing data to the prediction window. Recent observation frames (within 1 s of prediction) show the highest importance for maintaining prediction accuracy.

3.2.12. ADSP Hyperparameter Analysis

Systematic analysis of ADSP’s stride and window size configurations identified optimal settings ($\mathrm{stride}_{x} = 8$, $\mathrm{stride}_{y} = 6$, window size 32 × 24) that balance computational complexity with information capture effectiveness, validating our architectural design choices.

4. Conclusions

This paper presents ADSAP, an innovative trajectory prediction framework that advances the state of the art in autonomous driving through synergistic integration of adaptive deformable speed-aware pooling and adversarial knowledge transfer. The proposed framework effectively addresses the fundamental challenges in trajectory prediction by incorporating both microscopic vehicle interactions and macroscopic traffic dynamics while maintaining computational efficiency through sophisticated knowledge distillation techniques.

The proposed Adaptive Deformable Speed-aware Pooling (ADSP) mechanism introduces a novel approach to modeling vehicle interactions by dynamically adjusting attention weights and receptive fields based on instantaneous velocity and interaction states. This adaptive mechanism enables the model to capture speed-dependent interaction patterns and adjust its perception scope according to traffic density, effectively modeling both short-range and long-range dependencies in varying traffic conditions. In addition, the proposed Adversarial Knowledge Distillation Module (AKDM) implements an innovative training paradigm that facilitates efficient knowledge transfer while maintaining high prediction accuracy, resulting in a lightweight model suitable for real-world deployment.

Comprehensive empirical evaluation demonstrates ADSAP’s superior performance across multiple dimensions. The framework achieves an 18.7% reduction in the average displacement error and a 22.4% reduction in the final displacement error compared to state-of-the-art baselines while delivering a 3.2× speedup in inference time and maintaining 95% of the teacher model’s accuracy. Statistical significance is confirmed through rigorous paired t-tests (p < 0.001) with comprehensive confidence intervals. Notably, ADSAP exhibits consistent performance across diverse traffic scenarios with varying densities and speeds, demonstrating robust generalization capabilities validated across the NGSIM, INTERACTION, and highD datasets. The effectiveness of individual components has been validated through detailed ablation studies, confirming substantial contributions of both the ADSP mechanism (8.5% improvement) and the AKDM module (6.7% improvement) to overall system performance.

Limitations: ADSAP’s current architecture exhibits several limitations that warrant acknowledgment. The framework’s primary focus on highway and intersection scenarios may limit performance in complex urban environments with dense pedestrian interactions, construction zones, and irregular traffic patterns. The model’s dependency on NGSIM’s specific driving patterns and geographic characteristics requires adaptation for different countries and driving cultures, as behavioral norms vary significantly across regions. Performance evaluation under extreme weather conditions (heavy snow, flooding, severe wind) remains limited, potentially affecting deployment reliability in harsh environments. The current implementation assumes standardized vehicle types, and may require adaptation for mixed traffic scenarios involving motorcycles, bicycles, and commercial vehicles with distinct movement patterns.

Future Work: Several promising research directions emerge from this work that will enhance ADSAP’s applicability and performance. We plan to integrate high-definition maps for semantic understanding, incorporating lane geometry, traffic signs, road topology, and dynamic elements such as construction zones and temporary traffic modifications. Multimodal context integration will be expanded to include weather conditions, traffic light states, pedestrian movements, and cyclist interactions, enabling more comprehensive scene understanding. Advanced multi-agent coordination mechanisms will address dense traffic scenarios with complex vehicle interactions, implementing hierarchical attention models that capture both local pairwise interactions and global traffic flow patterns.

Cross-domain adaptation techniques will enhance geographic generalization across different countries and driving cultures, utilizing domain adversarial training and cultural behavior modeling to ensure robust performance regardless of deployment location. Methods for quantifying uncertainty will provide probabilistic trajectory predictions with confidence estimates, enabling more informed decision-making in autonomous driving systems through explicit modeling of prediction reliability. Online learning and adaptation mechanisms will enable continuous model refinement based on real-time observations, which is crucial for deployment in dynamic traffic environments where traffic patterns evolve over time.

Scene context modeling extensions will incorporate environmental factors such as road surface conditions, visibility limitations, and temporary obstacles, while agent interaction modeling will develop sophisticated frameworks for understanding complex multi-agent behaviors in scenarios such as roundabouts, merging zones, and emergency vehicle interactions. Integration with Vehicle-to-Vehicle (V2V) and Vehicle-to-Infrastructure (V2I) communication systems will leverage additional data sources for enhanced prediction accuracy and situational awareness.

The demonstrated success of ADSAP combined with its efficient inference capabilities and clear pathway for enhancement positions it as a significant contribution to the field of autonomous driving. The proposed framework’s modular architecture facilitates seamless integration with existing perception and planning modules, enabling practical deployment in current autonomous vehicle systems. Through continued development and validation across diverse operational domains, we believe that this framework will play a crucial role in advancing the safety and reliability of autonomous vehicles, ultimately contributing to the broader vision of sustainable and intelligent transportation systems. The promising results and concrete directions for improvement underscore ADSAP’s potential to significantly impact the future of autonomous driving technology, paving the way for more reliable, efficient, and adaptable trajectory prediction systems.

Author Contributions

Conceptualization, Y.Q. and C.D.; data curation, X.W.; formal analysis, Y.Q., C.D. and J.Z.; funding acquisition, Y.Q.; investigation, C.D.; methodology, C.D.; software, X.W.; validation, J.Z.; writing—original draft, C.D.; writing—review and editing, F.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work is jointly supported by the National Natural Science Foundation of China (Grant Nos. 52362047, 72361017), the Gansu Provincial Department of Education: Excellent Graduate Student “Innovation Star” Project (Grant No. 2023CXZX-523), the Excellent Doctoral Program of Gansu Province (Grant No. 23JRRA906), the Major Research Plan of Gansu Province (Grant No. 21YF5GA052), the 2021 Gansu Higher Education Industry Support Plan (Grant No. 2021CYZC-60), the Double-First Class Major Research Programs, Educational Department of Gansu Province (Grant No. GSSYLXM-04), and the Central Leading Local Science and Technology Development Fund Project (Grant No. 22ZY1QA005).

Data Availability Statement

Data will be made available on request; requests should be directed to the corresponding author.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

References

1. Yurtsever, E.; Lambert, J.; Carballo, A.; Takeda, K. A survey of autonomous driving: Common practices and emerging technologies. IEEE Access 2020, 8, 58443–58469.
2. Badue, C.; Guidolini, R.; Carneiro, R.V.; Azevedo, P.; Cardoso, V.B.; Forechi, A.; De Souza, A.F. Self-driving cars: A survey. Expert Syst. Appl. 2021, 165, 113816.
3. Kuutti, S.; Fallah, S.; Bowden, R.; Barber, P. Deep Learning for Autonomous Vehicle Control: Algorithms, State-of-the-Art, and Future Prospects; Morgan & Claypool Publishers: San Rafael, CA, USA, 2019; Volume 21, pp. 4241–4257.
4. Tang, C.; Chen, X.M.; Hu, J. Multiple trajectory prediction of moving actors with LSTM networks. IEEE Access 2019, 7, 3514–3519.
5. Deo, N.; Trivedi, M.M. Convolutional social pooling for vehicle trajectory prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 1468–1476.
6. Zhan, W.; Sun, L.; Wang, D.; Shi, H.; Clausse, A.; Naumann, M.; Tomizuka, M. INTERACTION dataset: An international, adversarial and cooperative motion dataset in interactive driving scenarios with semantic maps. arXiv 2019, arXiv:1910.03088.
7. Chandra, R.; Bhattacharya, U.; Bera, A.; Manocha, D. TraPHic: Trajectory prediction in dense and heterogeneous traffic using weighted interactions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 8483–8492.
8. Casas, S.; Luo, W.; Urtasun, R. IntentNet: Learning to predict intention from raw sensor data. In Proceedings of the Conference on Robot Learning, Stockholm, Sweden, 13–19 July 2018; pp. 947–956.
9. Lefèvre, S.; Vasquez, D.; Laugier, C. A survey on motion prediction and risk assessment for intelligent vehicles. ROBOMECH J. 2014, 1, 1–14.
10. Wang, Y.Z.; Huang, Z.Y.; Liu, C.; Zhang, R. A stepwise probabilistic trajectory prediction method for intelligent vehicles. IEEE Trans. Intell. Veh. 2022, 7, 327–338.
11. Lee, N.; Choi, W.; Vernaza, P.; Choy, C.B.; Torr, P.H.; Chandraker, M. DESIRE: Distant future prediction in dynamic scenes with interacting agents. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 336–345.
12. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780.
13. Alahi, A.; Goel, K.; Ramanathan, V.; Robicquet, A.; Fei-Fei, L.; Savarese, S. Social LSTM: Human trajectory prediction in crowded spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 961–971.
14. Gupta, A.; Johnson, J.; Fei-Fei, L.; Savarese, S.; Alahi, A. Social GAN: Socially acceptable trajectories with generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2255–2264.
15. Kosaraju, V.; Sadeghian, A.; Martín-Martín, R.; Reid, I.; Rezatofighi, H.; Savarese, S. Social-BiGAT: Multimodal trajectory forecasting using Bicycle-GAN and graph attention networks. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 137–146.
16. Ivanovic, B.; Pavone, M. The Trajectron: Probabilistic multi-agent trajectory modeling with dynamic spatiotemporal graphs. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27–28 October 2019; pp. 2375–2384.
17. Salzmann, T.; Ivanovic, B.; Chakravarty, P.; Pavone, M. Trajectron++: Dynamically-feasible trajectory forecasting with heterogeneous data. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; pp. 683–700.
18. Zhao, T.; Xu, Y.; Monfort, M.; Choi, W.; Baker, C.; Zhao, Y.; Wu, Y.N. Multi-agent tensor fusion for contextual trajectory prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 12126–12134.
19. Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 764–773.
20. Gao, J.; Sun, C.; Zhao, H.; Shen, Y.; Anguelov, D.; Li, C.; Schmid, C. VectorNet: Encoding HD maps and agent dynamics from vectorized representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 11525–11533.
21. Gu, J.; Sun, C.; Zhao, H. DenseTNT: End-to-end trajectory prediction from dense goal sets. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 15303–15312.
22. Park, S.H.; Lee, G.; Seo, J.; Bhat, M.; Kang, M.; Francis, J.; Morency, L.P. Diverse and admissible trajectory forecasting through multimodal context understanding. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 282–298.
23. Khandelwal, P.; Agarwal, C.; Thomas, S.; Scherer, S. If, why, and when can deep networks avoid the curse of dimensionality: A review. Int. J. Comput. Vis. 2020, 128, 1054–1086.
24. Zeng, W.; Luo, W.; Suo, S.; Sadat, A.; Yang, B.; Casas, S.; Urtasun, R. End-to-end interpretable neural motion planner. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 8660–8669.
25. Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Bengio, Y. Generative adversarial nets. Adv. Neural Inf. Process. Syst. 2014, 27, 2672–2680.
26. Kingma, D.P.; Welling, M. Auto-Encoding Variational Bayes. arXiv 2013, arXiv:1312.6114.
27. Zhu, X.; Hu, H.; Lin, S.; Dai, J. Deformable ConvNets v2: More deformable, better results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9308–9316.
28. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
29. Hinton, G.; Vinyals, O.; Dean, J. Distilling the knowledge in a neural network. arXiv 2015, arXiv:1503.02531.
30. Mirzadeh, S.I.; Farajtabar, M.; Li, A.; Levine, N.; Matsukawa, A.; Ghasemzadeh, H. Improved knowledge distillation via teacher assistant. Proc. AAAI Conf. Artif. Intell. 2020, 34, 5191–5198.
31. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708.
32. Colyar, J.; Halkias, J. US Highway 101 Dataset; Tech. Rep. FHWA-HRT-07-030; Federal Highway Administration (FHWA): Washington, DC, USA, 2007.
33. Krajewski, R.; Bock, J.; Kloeker, L.; Eckstein, L. The highD Dataset: A Drone Dataset of Naturalistic Vehicle Trajectories on German Highways for Validation of Highly Automated Driving Systems. In Proceedings of the 2018 21st International Conference on Intelligent Transportation Systems (ITSC), Maui, HI, USA, 4–7 November 2018; pp. 2118–2125.
34. Brody, S.; Alon, U.; Yahav, E. How attentive are graph attention networks? arXiv 2021, arXiv:2105.14491.
35. Velickovic, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.; Bengio, Y. Graph attention networks. arXiv 2017, arXiv:1710.10903.
36. Li, X.J.; Peng, X.; Shan, J.Q.; Zhu, X.B. GRU-SE: An Improved Gated Recurrent Unit with Squeeze-and-Excitation for Sequence Modeling. arXiv 2019, arXiv:1912.00718.
37. Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv 2014, arXiv:1406.1078.
38. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
39. Ramachandran, P.; Zoph, B.; Le, Q.V. Searching for activation functions. arXiv 2017, arXiv:1710.05941.
40. Nair, V.; Hinton, G.E. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), Haifa, Israel, 21–24 June 2010; pp. 807–814.
41. Clevert, D.A.; Unterthiner, T.; Hochreiter, S. Fast and accurate deep network learning by exponential linear units (ELUs). arXiv 2015, arXiv:1511.07289.
42. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Guo, B. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 10012–10022.

Figure 1. ADSP attention heatmaps, demonstrating adaptive focus under different velocity conditions: (a) low speed (5 m/s) with localized attention patterns, (b) medium speed (15 m/s) with balanced spatial distribution, and (c) high speed (25 m/s) with extended forward-looking attention. Red regions indicate high attention weights, while blue regions indicate low attention weights.

Figure 2. Workflow of the Adversarial Knowledge Distillation Module (AKDM), showing the data flow, the feature extraction pathways, and the discriminator architecture.

Figure 3. ADSAP architecture: (a) teacher model featuring the surroundings-aware encoder, ADSP mechanism, and multimodal decoder, with data flow arrows; (b) lightweight student model with a streamlined architecture; (c) AKDM facilitating knowledge transfer. Key feature maps are annotated with their dimensions.

Figure 4. Qualitative trajectory prediction results illustrating ADSAP’s advantages and limitations: (a) lane change scenario, showing a 23% error reduction with smooth trajectory prediction; (b) merging scenario, with an 18% improvement in interaction modeling; (c) emergency braking scenario, with 21% better performance in rapid deceleration prediction; (d) complex intersection scenario, showing current limitations in multi-agent coordination. Red, ground truth; blue, ADSAP prediction; green, baseline prediction; gray, alternative hypotheses.

Table 1. Comparison of ADSAP components with existing trajectory prediction methods.

Method	Speed-Aware	Adaptive Pooling	Adversarial KD	Multi-Scale	Transformer
Social-GAN [14]	×	×	×	×	×
Trajectron++ [17]	×	×	×	✓	×
MATF-GAN [18]	×	×	×	✓	×
iNATran	×	×	×	✓	✓
Deformable Conv [19]	×	✓	×	×	×
ADSAP (Ours)	✓	✓	✓	✓	✓

Table 2. Comparison of AKDM with traditional knowledge distillation methods.

Method	ADE (m)	FDE (m)	Inference Time (ms)	p-Value
Teacher Model	1.65 ± 0.07	3.25 ± 0.14	47	-
Hinton KD [29]	1.89 ± 0.08	3.58 ± 0.16	18	-
AKDM (Ours)	1.73 ± 0.07	3.39 ± 0.15	15	<0.001
Improvement	8.5%	7.2%	16.7%

Table 3. Evaluation results for ADSAP and other state-of-the-art baselines on the NGSIM dataset over different prediction horizons with statistical significance testing.

Model	1 s	2 s	3 s	4 s	5 s	AVG	p-Value
S-GAN [14]	0.57 ± 0.03	1.32 ± 0.08	2.22 ± 0.12	3.26 ± 0.18	4.40 ± 0.25	2.35 ± 0.13	<0.001
CS-LSTM [13]	0.61 ± 0.04	1.27 ± 0.07	2.09 ± 0.11	3.10 ± 0.17	4.37 ± 0.24	2.29 ± 0.13	<0.001
MATF-GAN [18]	0.66 ± 0.04	1.34 ± 0.08	2.08 ± 0.11	2.97 ± 0.16	4.13 ± 0.23	2.22 ± 0.12	<0.001
IMM-KF [11]	0.58 ± 0.03	1.36 ± 0.08	2.28 ± 0.13	3.37 ± 0.19	4.55 ± 0.26	2.43 ± 0.14	<0.001
MFP [5]	0.54 ± 0.03	1.16 ± 0.07	1.89 ± 0.10	2.75 ± 0.15	3.78 ± 0.21	2.02 ± 0.11	<0.001
DRBP [4]	1.18 ± 0.07	2.83 ± 0.16	4.22 ± 0.24	5.82 ± 0.33	-	3.51 ± 0.20	<0.001
WSiP [8]	0.56 ± 0.03	1.23 ± 0.07	2.05 ± 0.11	3.08 ± 0.17	4.34 ± 0.24	2.25 ± 0.12	<0.001
CF-LSTM [7]	0.55 ± 0.03	1.10 ± 0.06	1.78 ± 0.10	2.73 ± 0.15	3.82 ± 0.21	1.99 ± 0.11	<0.001
MHA-LSTM [22]	0.41 ± 0.02	1.01 ± 0.06	1.74 ± 0.09	2.67 ± 0.15	3.83 ± 0.21	1.91 ± 0.11	<0.001
HMNet [23]	0.50 ± 0.03	1.13 ± 0.06	1.89 ± 0.10	2.85 ± 0.16	4.04 ± 0.22	2.08 ± 0.11	<0.001
TS-GAN [24]	0.60 ± 0.03	1.24 ± 0.07	1.95 ± 0.11	2.78 ± 0.15	3.72 ± 0.20	2.06 ± 0.11	<0.001
STDAN [10]	0.39 ± 0.02	0.96 ± 0.05	1.61 ± 0.09	2.56 ± 0.14	3.67 ± 0.20	1.84 ± 0.10	<0.001
iNATran (M) [16]	0.41 ± 0.02	1.00 ± 0.06	1.70 ± 0.09	2.57 ± 0.14	3.66 ± 0.20	1.87 ± 0.10	<0.001
iNATran [17]	0.39 ± 0.02	0.96 ± 0.05	1.61 ± 0.09	2.42 ± 0.13	3.43 ± 0.19	1.76 ± 0.10	-
DACR-AMTP [20]	0.57 ± 0.03	1.07 ± 0.06	1.68 ± 0.09	2.53 ± 0.14	3.40 ± 0.19	1.85 ± 0.10	<0.001
FHIF [21]	0.40 ± 0.02	0.98 ± 0.05	1.66 ± 0.09	2.52 ± 0.14	3.63 ± 0.20	1.84 ± 0.10	<0.001
ADSAP (s)	0.37 ± 0.02	0.92 ± 0.05	1.58 ± 0.09	2.39 ± 0.13	3.39 ± 0.19	1.73 ± 0.10	<0.001
ADSAP	0.34 ± 0.02	0.88 ± 0.05	1.50 ± 0.08	2.30 ± 0.13	3.25 ± 0.18	1.65 ± 0.09	-
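
As a quick sanity check on the AVG column of Table 3, the sketch below assumes that AVG is the unweighted mean of the per-horizon errors, skipping any horizon a method does not report (the "-" entry for DRBP). That reading is inferred from the printed numbers rather than stated in the table, and the names `TABLE3_ROWS`, `REPORTED_AVG`, and `horizon_average` are illustrative.

```python
from statistics import mean

# Per-horizon mean errors (1 s ... 5 s) copied from Table 3; None marks "-".
# The "±" spreads are omitted, and the dictionary names are illustrative.
TABLE3_ROWS = {
    "ADSAP": [0.34, 0.88, 1.50, 2.30, 3.25],
    "ADSAP (s)": [0.37, 0.92, 1.58, 2.39, 3.39],
    "iNATran": [0.39, 0.96, 1.61, 2.42, 3.43],
    "DRBP": [1.18, 2.83, 4.22, 5.82, None],
}

# AVG values as printed in Table 3, used only for comparison.
REPORTED_AVG = {"ADSAP": 1.65, "ADSAP (s)": 1.73, "iNATran": 1.76, "DRBP": 3.51}


def horizon_average(errors):
    """Unweighted mean over the horizons that have a reported value."""
    available = [e for e in errors if e is not None]
    return mean(available)


if __name__ == "__main__":
    for model, errors in TABLE3_ROWS.items():
        print(f"{model}: computed {horizon_average(errors):.3f}, "
              f"reported {REPORTED_AVG[model]:.2f}")
```

Under this assumption the DRBP average is taken over four horizons only, so it is not directly comparable with the five-horizon averages of the other rows.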

Table 4. Performance comparison across multiple datasets with comprehensive statistical analysis.

Method	NGSIM ADE	NGSIM FDE	INTERACTION ADE	INTERACTION FDE	highD ADE	highD FDE	Avg p-Value
S-GAN	2.35 ± 0.12	4.40 ± 0.22	2.67 ± 0.15	5.12 ± 0.28	2.28 ± 0.11	4.22 ± 0.19	<0.001
MATF-GAN	2.22 ± 0.11	4.13 ± 0.20	2.45 ± 0.13	4.89 ± 0.25	2.15 ± 0.10	4.01 ± 0.18	<0.001
iNATran	1.76 ± 0.08	3.43 ± 0.16	2.08 ± 0.11	4.15 ± 0.21	1.82 ± 0.09	3.65 ± 0.17	-
ADSAP	1.65 ± 0.07	3.25 ± 0.14	1.89 ± 0.09	3.78 ± 0.18	1.71 ± 0.08	3.42 ± 0.15	<0.001
Improvement	6.3%	5.2%	9.1%	8.9%	6.0%	6.3%
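
The Improvement row of Table 4 can be reproduced from the iNATran and ADSAP rows, assuming it reports the relative error reduction of ADSAP with respect to iNATran. The sketch below is one way to check this. The function and variable names are illustrative, and only the mean values are used (the "±" spreads are ignored).

```python
# Mean (ADE, FDE) in metres copied from Table 4; "±" spreads are omitted.
INATRAN = {"NGSIM": (1.76, 3.43), "INTERACTION": (2.08, 4.15), "highD": (1.82, 3.65)}
ADSAP = {"NGSIM": (1.65, 3.25), "INTERACTION": (1.89, 3.78), "highD": (1.71, 3.42)}

# Improvement percentages as printed in Table 4, used only for comparison.
REPORTED = {"NGSIM": (6.3, 5.2), "INTERACTION": (9.1, 8.9), "highD": (6.0, 6.3)}


def relative_improvement(baseline, proposed):
    """Percentage reduction in error relative to the baseline."""
    return 100.0 * (baseline - proposed) / baseline


def improvement_row():
    """Recompute the (ADE, FDE) improvement for every dataset."""
    row = {}
    for dataset, (base_ade, base_fde) in INATRAN.items():
        ade, fde = ADSAP[dataset]
        row[dataset] = (
            relative_improvement(base_ade, ade),
            relative_improvement(base_fde, fde),
        )
    return row


if __name__ == "__main__":
    for dataset, (ade_gain, fde_gain) in improvement_row().items():
        print(f"{dataset}: ADE {ade_gain:.2f}%, FDE {fde_gain:.2f}% "
              f"(reported {REPORTED[dataset]})")
```

The recomputed values agree with the printed row to within rounding, which supports reading "Improvement" as a reduction relative to the iNATran baseline.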

Table 5. Performance analysis across weather conditions and traffic densities with statistical validation.

Condition	iNATran ADE (m)	ADSAP ADE (m)	iNATran FDE (m)	ADSAP FDE (m)	p-Value
Weather Conditions
Clear Weather	1.76 ± 0.08	1.65 ± 0.07	3.43 ± 0.16	3.25 ± 0.14	<0.001
Rainy Weather	1.89 ± 0.10	1.78 ± 0.09	3.67 ± 0.18	3.51 ± 0.16	<0.001
Foggy Weather	2.05 ± 0.12	1.91 ± 0.10	3.98 ± 0.21	3.74 ± 0.19	<0.001
Traffic Density (veh/m)
0.05–0.25 (Sparse)	1.68 ± 0.09	1.58 ± 0.08	3.28 ± 0.17	3.12 ± 0.15	<0.001
0.25–0.55 (Medium)	1.76 ± 0.08	1.65 ± 0.07	3.43 ± 0.16	3.25 ± 0.14	<0.001
0.55–0.85 (Dense)	1.89 ± 0.11	1.75 ± 0.09	3.71 ± 0.19	3.48 ± 0.17	<0.001

Table 6. Comprehensive ablation study results with statistical significance analysis.

Model Variant	ADE (m)	FDE (m)	p-Value	95% CI	Degradation
ADSAP w/o ADSP	1.79 ± 0.09	3.58 ± 0.17	<0.001	[1.72, 1.86]	8.5%
ADSAP w/o MSFA	1.74 ± 0.08	3.42 ± 0.15	<0.01	[1.67, 1.81]	5.4%
ADSAP w/o AKDM	1.76 ± 0.08	3.48 ± 0.16	<0.001	[1.69, 1.83]	6.7%
ADSAP w/o TM	1.85 ± 0.09	3.69 ± 0.18	<0.001	[1.78, 1.92]	12.1%
ADSAP w/o MPM	1.72 ± 0.08	3.39 ± 0.15	<0.05	[1.65, 1.79]	4.2%
ADSAP w/o MTL	1.70 ± 0.08	3.35 ± 0.15	<0.05	[1.63, 1.77]	3.0%
vs. Deformable Conv	1.85 ± 0.09	3.69 ± 0.18	<0.001	[1.78, 1.92]	12.1%
ADSAP (Full)	1.65 ± 0.07	3.25 ± 0.14	-	[1.58, 1.72]	-

Table 7. Detailed computational complexity analysis and runtime performance comparison.

Model	Parameters	Memory (MB)	GPU (ms)	CPU (ms)	Power (W)	Speedup
Teacher Model	8.2 M	245	47	156	95	1.0×
Social-GAN	3.8 M	142	38 ± 2.1	125	68	1.2×
MATF-GAN	5.2 M	189	45 ± 2.8	148	82	1.0×
iNATran	4.1 M	156	32 ± 1.9	112	72	1.5×
ADSAP (Student)	2.3 M	89	15 ± 1.2	45	35	3.1×
Reduction	72%	64%	68%	71%	63%
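
The Speedup column and the Reduction row of Table 7 appear to be derived quantities. The sketch below assumes that speedup is the teacher's GPU latency divided by each model's GPU latency, and that each reduction compares the student with the teacher. Both assumptions are inferred from the printed values, and all names in the code are illustrative.

```python
# Values copied from Table 7 (mean GPU latency only; "±" spreads omitted).
TEACHER = {"params_m": 8.2, "memory_mb": 245, "gpu_ms": 47, "cpu_ms": 156, "power_w": 95}
STUDENT = {"params_m": 2.3, "memory_mb": 89, "gpu_ms": 15, "cpu_ms": 45, "power_w": 35}
GPU_MS = {"Social-GAN": 38, "MATF-GAN": 45, "iNATran": 32, "ADSAP (Student)": 15}


def speedup(reference_ms, model_ms):
    """How many times faster the model is than the reference latency."""
    return reference_ms / model_ms


def reduction_percent(teacher_value, student_value):
    """Percentage by which the student lowers a teacher resource figure."""
    return 100.0 * (1.0 - student_value / teacher_value)


if __name__ == "__main__":
    for model, latency in GPU_MS.items():
        print(f"{model}: {speedup(TEACHER['gpu_ms'], latency):.1f}x")
    for key in TEACHER:
        print(f"{key}: {reduction_percent(TEACHER[key], STUDENT[key]):.0f}% reduction")
```

With these definitions the computed speedups and reductions round to the figures printed in the table.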

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Da, C.; Qian, Y.; Zeng, J.; Wei, X.; Zhang, F. ADSAP: An Adaptive Speed-Aware Trajectory Prediction Framework with Adversarial Knowledge Transfer. Electronics 2025, 14, 2448. https://doi.org/10.3390/electronics14122448
