Article

PLMT-Net: A Physics-Aware Lightweight Network for Multi-Agent Trajectory Prediction in Interactive Driving Scenarios

School of Mechatronic Engineering, Guilin University of Electronic Technology, Guilin 541004, China
*
Author to whom correspondence should be addressed.
Drones 2025, 9(12), 826; https://doi.org/10.3390/drones9120826 (registering DOI)
Submission received: 9 October 2025 / Revised: 19 November 2025 / Accepted: 26 November 2025 / Published: 28 November 2025
(This article belongs to the Section Innovative Urban Mobility)

Highlights

What are the main findings?
  • A physics-aware, modular and vectorized architecture (PLMT-Net) achieves competitive accuracy on the Argoverse 1 dataset while reducing model parameters and inference time.
  • Embedding physics priors at two levels—interaction-target filtering and physics-guided graph attention—improves the physical plausibility and stability of multi-agent trajectory prediction.
What are the implications of the main findings?
  • The method offers a practical path to real-time, resource-constrained deployment without sacrificing prediction quality.
  • Lightweight physics priors integrated into learned attention provide a generalizable design pattern that can enhance robustness and physical feasibility in future trajectory-forecasting systems.

Abstract

Accurate and efficient multi-agent trajectory prediction remains a core challenge for autonomous driving, particularly in modeling complex interactions while maintaining physical plausibility and computational efficiency. Many existing methods, especially those based on large transformer architectures, tend to overlook physical constraints, leading to unrealistic predictions and high deployment costs. In this work, we propose a lightweight trajectory prediction framework that integrates physical information to enhance interaction modeling and runtime performance. Our method introduces two physically inspired strategies: (1) a constraint-guided mechanism that filters irrelevant or distracting neighbors, and (2) a physics-aware attention module that steers attention weights toward physically plausible interactions. The overall architecture adopts a modular and vectorized design, reducing model complexity and inference latency. Experiments on the Argoverse 1 dataset, comparing against multiple existing methods, demonstrate that our approach achieves a favorable balance among accuracy, physical feasibility, and efficiency, running in real time on a commodity desktop GPU. Future work will focus on validating its performance on embedded hardware.

1. Introduction

Accurate prediction of the future motion of surrounding traffic participants is important for safe and efficient decision-making in autonomous driving systems (Figure 1). Vehicles often show complex spatial interactions and behavior patterns in multi-agent scenarios. As a result, traditional single-agent prediction methods are not enough to meet the needs of real-world perception and planning. At the same time, as autonomous driving moves from the lab to real roads, prediction models face limits in onboard computing power and must respond quickly. This situation poses new challenges to the design of lightweight models. Also, if physical feasibility is not considered, then the predicted trajectories may be unrealistic, thereby potentially reducing the reliability of the planning module. For these reasons, a lightweight multi-agent trajectory prediction method that includes physical information needs to be developed. Such a method can help improve the usefulness and deployment ability of autonomous driving systems.
Multi-agent and multi-modal trajectory prediction faces several challenges, including interpretable modeling, interaction modeling among agents, and multi-modal reasoning in complex and dynamic driving scenarios. Early physics-based methods are not effective in handling these problems because of the randomness and subjectivity of driving behavior. By contrast, learning-based methods use data-driven models to predict trajectories. Unlike physics-based approaches, they can learn patterns from data to better adapt to complex scenarios [1]. Learning-based multi-agent multi-modal trajectory prediction aims to infer multiple possible future trajectories based on the historical motion of traffic participants and scene context. Doing so helps improve prediction accuracy and reliability in real-world conditions [2]. Learning-based approaches show clear advantages over physics-based methods in vehicle trajectory prediction. They can automatically extract features, model complex interactions in traffic scenes, support multi-modal prediction, and generalize well across different situations.
In recent years, many studies have focused on improving multi-agent interaction modeling through deep learning methods. For example, graph neural networks (GNNs) and transformer-based architectures have been used to capture long-range dependencies [3,4,5]. However, most of these methods focus only on semantic-level behavior modeling, thereby limiting their ability to ensure physical plausibility. Some works adopt rasterization to represent scene context in an intuitive way [6,7]. While effective in some cases, raster-based methods can be computationally expensive, especially when processing large-scale data. To address this issue, recent studies have introduced vectorized representations [8,9], where models learn the interactions between vectorized entities (such as trajectories and lane segments) using GNN or transformer architectures [9,10,11,12,13]. Compared with earlier methods based on recurrent neural networks (RNNs) [14] or long short-term memory (LSTM) networks [15], these newer approaches offer stronger interaction modeling capabilities and achieve higher prediction accuracy. However, they still face two main limitations: (1) Large numbers of parameters and high computational complexity increase training costs and inference latency, making real-time deployment difficult; (2) lack of explicit physical modeling can lead to predictions that violate basic kinematic or dynamic constraints.
An important way to address the above challenges is to design lightweight motion prediction models that can operate efficiently under limited computational resources while incorporating physical information to ensure the feasibility and plausibility of predicted trajectories. However, balancing lightweight design and physical plausibility in multi-agent prediction remains a challenging and largely unresolved issue. Learning-based models often suffer from limited interpretability and lack explicit physical meaning. In addition, they require extensive parameter tuning across different scenarios, which limits their generalization and real-world applicability. By contrast, physics-based models offer simplicity and strong interpretability by modeling traffic behaviors through mathematical laws. To combine the strengths of both approaches, we introduce physical constraints into the interaction modeling process, aiming to ensure both physical plausibility and learnability. Inspired by the work of Zikang Zhou et al. [13], we propose PLMT-Net, a lightweight multi-agent motion prediction framework that incorporates physical information. This approach ensures the accuracy and physical reasonableness of trajectory prediction while significantly reducing model complexity.
The primary contributions of this paper are summarized as follows:
(1) The overall architecture adopts a modular and vectorized design, enabling reduced model complexity and improved inference efficiency without compromising prediction performance.
(2) We propose a physical-information-based strategy to filter interaction targets, eliminating irrelevant and disruptive agents.
(3) We propose a graph attention network that incorporates physical priors to guide attention weights toward physically plausible interaction targets, effectively improving the realism and stability of interaction modeling.
(4) We conduct experiments on the Argoverse 1 dataset. The proposed method strikes a favorable balance between prediction accuracy, physical feasibility, and computational efficiency, demonstrating promising performance and potential for practical applications.

2. Related Work

2.1. Multi-Agent Trajectory Prediction

With the development of autonomous driving and intelligent transportation systems, motion prediction has gradually evolved from single-agent forecasting to multi-agent interaction modeling [16,17,18,19,20,21]. Early approaches typically relied on RNNs or LSTMs to capture temporal interaction dynamics among agents for trajectory prediction [14,15].
In recent years, GNNs and Transformers have been widely adopted in multi-agent prediction due to their advantages in explicitly modeling dynamic interactions, capturing long-range spatiotemporal dependencies, and supporting parallel computation. VectorNet [9] introduced a vectorized graph structure to model interactions between trajectories and map elements. LaneGCN [8] further enhanced the fusion of trajectory and lane information through a local receptive field mechanism. HiVT [13] employed a hierarchical vectorized transformer to improve the efficiency and accuracy of large-scale multi-agent prediction.
Although these methods have made notable progress in modeling complex interactions, most of them focus solely on data-driven interaction modeling and lack explicit incorporation of physical motion constraints.

2.2. Physical Information Fusion Modeling

Recently, prior knowledge has attracted increasing attention as a key component of successful prediction models [22]. Some studies have started to incorporate physical information—such as motion continuity and dynamic constraints—as prior knowledge into trajectory prediction to improve model plausibility and interpretability.
PINNs [23] are a pioneering effort to combine neural networks with physical modeling. They explicitly embed physical priors into the training process, encouraging the model to follow physical laws while fitting observed data. This feature enables better generalization, especially in small-sample scenarios. SF-GRU [24] integrates data-driven and physics-driven approaches to improve prediction accuracy and interpretability, but it does not include environmental information. Trajectron++ [25] introduces physical priors during training to improve physical plausibility, but the constraints are implemented as soft penalties in the loss function, which may lead to inconsistency during inference. Smooth-Trajectron++ [26] adds a trajectory smoothing regularization term to Trajectron++, effectively reducing sudden changes and oscillations in predicted trajectories. However, it still applies physical modeling only as a soft constraint during training, and the model lacks explicit physical reasoning at inference time.
In summary, existing physics-aware methods have several limitations. First, most of them focus on single-agent prediction. Second, physical constraints are often added only through loss functions, leading to weak coupling with the end-to-end learning process. Lastly, incorporating physical modeling often increases model complexity, resulting in larger parameter sizes and longer inference paths, which limits their potential for real-time deployment. In contrast, our approach incorporates lightweight physics priors that guide the inference process in a computationally efficient manner, avoiding additional model complexity or training-stage coupling.

2.3. Lightweight Prediction Model

In multi-agent trajectory prediction tasks, inference efficiency and resource consumption are critical for practical use in real-world systems, especially in scenarios with strict real-time and resource constraints, such as autonomous driving systems and embedded platforms.
In recent years, a number of methods based on complex architectures similar to Transformers and GNNs, such as mmTransformer [10], Trajectron++ [25], MOST [27], and Multi-modal Transformer [28], have achieved significant improvements in prediction accuracy. However, these models typically suffer from large parameter sizes and high inference latency, thereby limiting their applicability in real-world systems.
To address these issues, some studies have explored lightweight model design. For example, G. Hinton et al. and S. Han et al. [29,30] applied knowledge distillation and pruning to compress high-accuracy but complex models into lightweight versions for deployment. A. Howard et al. [31,32,33] reduced parameter size and computational cost through structural optimization. J. Tan et al. [34] focused on compact feature modeling by pruning the attention mechanism to retain key information while reducing redundant computation.
However, most lightweight models mainly focus on compression or acceleration, without considering physical plausibility or consistency in interaction modeling. Without physical constraints, such models are more likely to produce unrealistic predictions. Therefore, a key research direction is to design efficient and lightweight prediction networks that maintain both physical consistency and interaction modeling capability, particularly for on-device or embedded applications.

2.4. Our Method

Current multi-agent motion prediction methods have not yet achieved a simultaneous balance among three key aspects: physically plausible modeling, efficient lightweight inference, and complex multi-agent interaction modeling.
Motivated by this gap, this paper proposes a novel approach that integrates physical priors with a lightweight design to predict multi-agent motion. Unlike existing methods, the proposed model explicitly encodes physical properties of agents within the network, providing stronger physical constraints during prediction. At the same time, it adopts a lightweight architecture, ensuring advantages in inference speed and computational efficiency for practical deployment.

3. Method

3.1. PLMT-Net Structure

The overall architecture of PLMT-Net is illustrated in Figure 2.
First, the raw traffic scene is represented as a set of structured vectorized entities, including the historical trajectories of traffic participants and their corresponding scene context. On the basis of this representation, the model performs hierarchical aggregation and modeling of spatiotemporal information in the scene.
Next, the model encodes the local context of each target agent by incorporating physical information. This process integrates the agent’s own motion history, the dynamic behavior of nearby vehicles, and surrounding map features. It provides rich semantic support and physical consistency for understanding the motion state of each individual agent.
Then, a proposed global interaction module aggregates local representations from all agents and updates their features. By capturing long-range dependencies and scene-level interaction dynamics, this module enhances each agent’s awareness of the global environment.
Finally, on the basis of the fused multi-level semantic representations, the model generates multi-modal future trajectory predictions. By capturing the diversity and uncertainty of traffic behaviors, the model improves the overall accuracy and robustness of trajectory prediction.
Figure 2 shows a four-step pipeline. Step 1 only vectorizes the scene (no neighbor filtering). Step 2 (Local Encoder) contains our two physics-aware contributions: (2a) we form an adaptive neighborhood $N_i$ by scoring neighbors with $A_{ij} = -a\,d_{ij} - b\,\Delta v_{ij} + c\cos\theta_{ij}$ and selecting the top K; (2b) we inject a physics term $w_{ij}$ into the attention logits, which is equivalent to a multiplicative physics prior on attention; and (2c) a light temporal module yields the local state $h_i$. Step 3 aggregates $h_i$ into the global context $\tilde{h}_i$, and Step 4 decodes future trajectories. A complete flow summary and pseudocode are provided in Appendix A.1.

3.2. Data Processing Module

In traffic scenarios, absolute positions (e.g., longitude and latitude) are often redundant. The key information lies in the relative spatial relationships between entities. Therefore, encoding the geometric properties of entities (such as vehicles and lanes) as relative vectors can eliminate the need to store global positions, reduce data dimensionality, and minimize redundant information. This approach also encourages the model to focus more on local geometric structures rather than on global coordinates, thereby improving generalization ability.
Relative position vectors can further eliminate the influence of global translations on scene representation, making the model robust to coordinate shifts [35]. In this work, we introduce relative position vectors to supplement the trajectory and lane vectors, compensating for the loss of spatial relationship information when only using individual entity vectors.
The historical trajectory of agent $i$ is represented by the position differences between adjacent time steps as $\{p_i^t - p_i^{t-1}\}_{t=1}^{T}$, where $p_i^t \in \mathbb{R}^2$ is the absolute position of agent $i$ at time step $t$, and $T$ is the total number of historical time steps. The geometric attribute of lane segment $\xi$ is represented by the vector from the starting point to the ending point as $p_\xi^1 - p_\xi^0$, where $p_\xi^0, p_\xi^1 \in \mathbb{R}^2$ are the coordinates of the starting point and ending point of the lane segment, respectively.
The relative position of agent to agent and the relative position of agent to lane are obtained by
$$\Delta p_{ij}^t = p_j^t - p_i^t, \qquad \Delta p_{i\xi}^t = p_i^t - p_\xi^0$$
where $\Delta p_{ij}^t$ characterizes dynamic interactions by representing the position of agent $j$ relative to agent $i$ at time step $t$, and $\Delta p_{i\xi}^t$ models static environmental constraints by representing the position of agent $i$ relative to the starting point of lane $\xi$.
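The relative-vector representation above can be illustrated with a minimal sketch in plain Python. The function names and the toy coordinates are illustrative assumptions and are not taken from the authors' implementation, which is described as vectorized.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def sub(a: Point, b: Point) -> Point:
    """Component-wise difference a - b of two 2D points."""
    return (a[0] - b[0], a[1] - b[1])


def displacements(track: List[Point]) -> List[Point]:
    """Trajectory vectors p^t - p^(t-1) for t = 1..T (track holds p^0..p^T)."""
    return [sub(track[t], track[t - 1]) for t in range(1, len(track))]


def agent_to_agent(track_i: List[Point], track_j: List[Point]) -> List[Point]:
    """Position of agent j relative to agent i at every time step."""
    return [sub(pj, pi) for pi, pj in zip(track_i, track_j)]


def agent_to_lane(track_i: List[Point], lane_start: Point) -> List[Point]:
    """Position of agent i relative to the lane start point."""
    return [sub(pi, lane_start) for pi in track_i]


# Toy scene with hypothetical values: two agents and one lane segment.
track_i = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
track_j = [(0.0, 3.0), (1.0, 3.0), (2.5, 3.0)]
lane_start, lane_end = (0.0, 0.0), (10.0, 0.0)

print(displacements(track_i))               # per-step motion of agent i
print(agent_to_agent(track_i, track_j))     # agent-to-agent relative positions
print(agent_to_lane(track_i, lane_start))   # agent-to-lane relative positions
print(sub(lane_end, lane_start))            # lane segment vector
```

Because every feature is a difference of positions, shifting the whole scene by a constant offset leaves the displacement and relative-position vectors unchanged, which is the translation robustness described above.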

3.3. Physics-Aware Local Encoder Module

3.3.1. Physics-Aware Interaction Encoding

The rotation matrix $R_i$ is calculated with reference to the direction of the latest trajectory segment of the agent
$$R_i = \begin{bmatrix} \cos\theta_i & -\sin\theta_i \\ \sin\theta_i & \cos\theta_i \end{bmatrix}$$
where $\theta_i$ is the direction angle of the latest displacement $p_i^T - p_i^{T-1}$, that is, the motion direction angle of agent $i$ at the latest time step.
For each agent $i$, we first rotate its historical trajectory to align with its current motion direction (velocity direction). This step removes the influence of the scene's absolute orientation. Next, we concatenate the aligned trajectory coordinates with the agent's semantic attribute vector $a_i$ (e.g., agent type, relative position). The combined vector is then passed through a multilayer perceptron (MLP). The MLP encodes the motion features (implicit in the trajectory) and semantic attributes of agent $i$ itself, generating a compact, self-aware feature vector $z_i^t$
$$z_i^t = \phi_{center}\left(\left[R_i\left(p_i^t - p_i^{t-1}\right), a_i\right]\right)$$
where $\phi_{center}(\cdot)$, together with $\phi_{ner}(\cdot)$, $\phi_{lane}(\cdot)$, and $\phi_{rel}(\cdot)$ introduced below, are a set of shared-parameter MLPs used to encode geometric relations into vector representations of uniform dimensionality. The feature $z_i^t$ serves as the foundational input that represents the self-state of agent $i$ in the subsequent attention mechanism. Its primary role is to explicitly incorporate the agent's intent, which is inferred from both motion features and semantic attributes. This enables the attention mechanism to capture how the agent's intent may influence surrounding agents during interaction modeling.
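The alignment step can be sketched as follows, under the assumption that the agent frame is obtained by rotating world-frame vectors by the negative heading angle so that the latest motion direction maps onto the positive x-axis. The helper names are hypothetical, and the MLP and the semantic attributes are omitted.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def heading_angle(track: List[Point]) -> float:
    """Direction angle of the latest segment p^T - p^(T-1)."""
    (x0, y0), (x1, y1) = track[-2], track[-1]
    return math.atan2(y1 - y0, x1 - x0)


def to_agent_frame(vec: Point, theta: float) -> Point:
    """Rotate vec by -theta so that the agent's heading maps onto +x
    (sign convention assumed for this sketch)."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * vec[0] + s * vec[1], -s * vec[0] + c * vec[1])


track = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]  # agent moving along the diagonal
theta = heading_angle(track)
steps = [(b[0] - a[0], b[1] - a[1]) for a, b in zip(track, track[1:])]
aligned = [to_agent_frame(v, theta) for v in steps]
print(aligned)  # every step is approximately (sqrt(2), 0) in the agent frame
```

After alignment, the same maneuver produces the same input features regardless of the direction in which the agent travels in world coordinates.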
In conventional multi-agent interaction modeling, neighboring agents are typically selected on the basis of Euclidean distance or within a fixed radius, as shown in Figure 3a. This approach often leads to two main issues: (1) Some nearby vehicles in the opposite lane (e.g., Vehicle 1) are unlikely to influence the target agent yet are still included as interaction neighbors, which results in redundant interaction modeling and introduces irrelevant information into the learning process. (2) By contrast, certain distant vehicles moving in a similar direction (e.g., Vehicle 2) may have a higher potential for future interaction but are mistakenly excluded.
To improve the effectiveness of interaction modeling, we introduce physical information, including relative distance, relative velocity, and motion direction, as constraints in the selection of neighboring agents. Let the relative distance be $d_{ij} = \lVert p_i - p_j \rVert$ (m), the relative speed be $\Delta v_{ij} = \lVert v_i - v_j \rVert$ (m/s), and the relative direction cosine be $\cos\theta_{ij} = \frac{v_i \cdot v_j}{\lVert v_i \rVert \lVert v_j \rVert} \in [-1, 1]$. For each target agent $i$, interaction weights $A_{ij}$ with its surrounding agents are computed (see Figure 3b) and used to dynamically select interaction neighbors as
$$A_{ij} = -a\,d_{ij} - b\,\Delta v_{ij} + c\cos\theta_{ij}, \qquad a, b, c \ge 0.$$
This formulation imposes clear monotonic relationships: $A_{ij}$ decreases with distance and velocity mismatch, and increases with directional alignment, providing interpretable physical priors. In practice, we evaluate common selection strategies from the GNN literature and adopt a top-K rule with $K = 0.8\,|C_i|$, where $C_i$ denotes the candidate set for agent $i$; the top K agents ranked by $A_{ij}$ are retained. Equivalently, this is a thresholding process with a data-dependent cutoff $\tau_i$ (the K-th largest score), yielding $N_i = \{\, j : A_{ij} \ge \tau_i \,\}$. Combining with Equation (4), we can express an equivalent adaptive radius constraint as
$$d_{ij} \le \frac{c\cos\theta_{ij} - \tau_i - b\,\Delta v_{ij}}{a}$$
which analytically reveals that the effective interaction radius expands for co-directional, speed-matched agents and shrinks otherwise, aligning with physical intuition. For numerical stability, $d_{ij}$ and $\Delta v_{ij}$ are normalized to comparable scales, and $a$, $b$, and $c$ are kept non-negative.
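The step from the score threshold to the radius bound can be spelled out as a short derivation. It assumes $a > 0$ strictly, so that dividing by $a$ preserves the direction of the inequality.

```latex
\begin{aligned}
A_{ij} \ge \tau_i
&\iff -a\,d_{ij} - b\,\Delta v_{ij} + c\cos\theta_{ij} \ge \tau_i \\
&\iff a\,d_{ij} \le c\cos\theta_{ij} - \tau_i - b\,\Delta v_{ij} \\
&\iff d_{ij} \le \frac{c\cos\theta_{ij} - \tau_i - b\,\Delta v_{ij}}{a},
\qquad a > 0 .
\end{aligned}
```

The right-hand side grows as $\cos\theta_{ij}$ increases and as $\Delta v_{ij}$ decreases, which is the expansion of the effective radius for co-directional, speed-matched agents.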
This selection strategy not only significantly reduces redundant computations but also highlights neighbors with actual interaction relevance, thereby enhancing the model's dynamic modeling capability. To improve efficiency, we implement the interaction weight calculation by using vectorized operations, enabling parallel processing on the GPU and greatly reducing computational overhead. Implementation details are provided in Appendix A.2.
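A minimal sketch of the scoring and top-K selection is given below. It is written as a plain Python loop for readability rather than as the vectorized GPU implementation described above. The coefficient values, the use of small values of a and b in place of explicit normalization, the rounding of K with a ceiling, and the toy scene are all assumptions made for illustration.

```python
import math
from typing import Dict, List, Tuple

Vec = Tuple[float, float]
State = Dict[str, Vec]  # {"p": position in m, "v": velocity in m/s}


def interaction_score(i: State, j: State, a: float, b: float, c: float,
                      eps: float = 1e-6) -> float:
    """A_ij = -a * d_ij - b * dv_ij + c * cos(theta_ij)."""
    d = math.hypot(i["p"][0] - j["p"][0], i["p"][1] - j["p"][1])
    dv = math.hypot(i["v"][0] - j["v"][0], i["v"][1] - j["v"][1])
    speed_product = math.hypot(*i["v"]) * math.hypot(*j["v"])
    dot = i["v"][0] * j["v"][0] + i["v"][1] * j["v"][1]
    cos_theta = dot / max(speed_product, eps)  # eps avoids division by zero
    return -a * d - b * dv + c * cos_theta


def select_neighbors(target: State, candidates: Dict[str, State],
                     a: float = 0.05, b: float = 0.1, c: float = 1.0,
                     ratio: float = 0.8) -> Tuple[List[str], float]:
    """Keep the top-K candidates by score, with K = ceil(ratio * |C_i|).

    Returns the retained names (best first) and the cutoff tau_i,
    which is the K-th largest score.
    """
    if not candidates:
        return [], float("-inf")
    scores = {name: interaction_score(target, s, a, b, c)
              for name, s in candidates.items()}
    k = max(1, math.ceil(ratio * len(scores)))
    kept = sorted(scores, key=scores.get, reverse=True)[:k]
    return kept, scores[kept[-1]]


# Hypothetical scene: the target drives along +x at 10 m/s.
target = {"p": (0.0, 0.0), "v": (10.0, 0.0)}
candidates = {
    "oncoming_near": {"p": (5.0, 3.5), "v": (-10.0, 0.0)},
    "same_dir_far": {"p": (40.0, 0.0), "v": (10.0, 0.0)},
    "leader": {"p": (15.0, 0.0), "v": (9.0, 0.0)},
    "adjacent": {"p": (8.0, 3.5), "v": (11.0, 0.0)},
    "crossing": {"p": (20.0, -20.0), "v": (0.0, 8.0)},
}
kept, tau = select_neighbors(target, candidates)
print(kept, round(tau, 3))
```

In this toy scene the oncoming vehicle is the closest candidate by Euclidean distance, yet it receives the lowest score and is the one dropped, while the distant vehicle traveling in the same direction is retained.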
The selected neighboring agents are embedded into features. For each neighbor agent $j$, its trajectory segment, position relative to agent $i$, and semantic attributes $a_j$ are rotated to align and then input into an MLP. We encode the neighbor's independent motion, spatial relationship with agent $i$, and semantic attributes, providing an environment-aware feature $z_{ij}^t$ for the attention mechanism
$$z_{ij}^t = \phi_{ner}\left(\left[R_i\left(p_j^t - p_j^{t-1}\right), R_i\left(p_j^t - p_i^t\right), a_j\right]\right)$$

3.3.2. Physics-Guided Contextual Aggregation

The strength of interaction between participants is not solely determined by learned features but is also closely related to their physical states. Vehicles moving at similar speeds and directions tend to interact through maneuvers such as merging or following. By contrast, vehicles traveling in opposite directions and at long distances have little practical impact, even if their features appear similar. Traditional attention mechanisms are highly sensitive to training data. Incorporating physical priors can stabilize the attention distribution and improve robustness. As shown in Figure 4, to effectively integrate interactions between an agent and its neighbors, we design a graph attention mechanism with physical priors that encodes common traffic knowledge into the model. By explicitly introducing physical priors, the model gains a reliable basis for decision-making. This approach also addresses the limitation of attention mechanisms that rely solely on feature similarity scores, making the attention more consistent with real physical laws.
Specifically, when computing the attention weights for neighboring agents, we explicitly incorporate physical information, such as relative distance, relative angle, and relative velocity, as priors to guide the attention distribution. On the basis of the conventional graph attention network (GAN), we introduce a physical weighting term $w_{ij}$ into the attention scoring function. This term increases the focus on neighbors that are close in distance, similar in speed, and aligned in direction during feature fusion. This mechanism not only improves the interpretability of the attention distribution but also enhances the model's ability to perceive physical interactions in real traffic scenarios.
We transform the feature embedding of agent $i$ into a query vector and use the embeddings of its filtered neighboring agents to compute key and value vectors:
$$q_i^t = W_Q^{space} z_i^t, \qquad k_{ij}^t = W_K^{space} z_{ij}^t, \qquad v_{ij}^t = W_V^{space} z_{ij}^t$$
To incorporate physical priors into the attention mechanism, we introduce a physics-guided term:
$$w_{ij} = -\alpha\,d_{ij} - \beta\,\Delta v_{ij} + \lambda\cos\theta_{ij}, \qquad \alpha, \beta, \lambda \ge 0,$$
which captures the influence of relative distance, velocity difference, and directional alignment between agents.
The attention logits are then modified as:
$$l_{ij}^t = \frac{(q_i^t)^{\top} k_{ij}^t}{\sqrt{d_k}} + w_{ij}$$
yielding the normalized attention weights
$$\alpha_{ij}^t = \mathrm{softmax}_j\left(l_{ij}^t\right) = \frac{e^{w_{ij}}\, e^{(q_i^t)^{\top} k_{ij}^t / \sqrt{d_k}}}{\sum_{s} e^{w_{is}}\, e^{(q_i^t)^{\top} k_{is}^t / \sqrt{d_k}}} \propto e^{w_{ij}}\, \mathrm{softmax}_j\left(\frac{(q_i^t)^{\top} k_{ij}^t}{\sqrt{d_k}}\right)$$
where $W_Q^{space}$, $W_K^{space}$, and $W_V^{space} \in \mathbb{R}^{d_k \times d_h}$ are learnable projection matrices, and $d_k$ denotes the latent dimension.
Equation (9) shows that adding $w_{ij}$ to the attention logits is analytically equivalent to multiplying the standard attention distribution by a physics prior factor $e^{w_{ij}}$. This formulation biases attention toward physically plausible interactions during inference, without introducing any additional parameters or auxiliary training branches. The projection layers $W_Q^{space}$, $W_K^{space}$, and $W_V^{space}$ remain unchanged, ensuring that model complexity and parameter count are preserved. For numerical stability, $d_{ij}$ and $\Delta v_{ij}$ are normalized to comparable scales, and $\alpha$, $\beta$, and $\lambda$ are constrained to be non-negative.
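The equivalence between the additive logit bias and the multiplicative prior can be checked numerically with a small sketch. The content logits, the normalized physical quantities, and the unit coefficients below are hypothetical values chosen for illustration.

```python
import math
from typing import List


def softmax(x: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(x)
    exps = [math.exp(v - peak) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def physics_bias(d: float, dv: float, cos_theta: float,
                 alpha: float, beta: float, lam: float) -> float:
    """w_ij = -alpha * d_ij - beta * dv_ij + lam * cos(theta_ij)."""
    return -alpha * d - beta * dv + lam * cos_theta


# Hypothetical scaled dot-product scores for three neighbors of one agent.
content_logits = [0.2, 0.2, 0.1]
# (normalized distance, normalized speed gap, direction cosine) per neighbor.
physics = [(0.1, 0.05, 1.0), (0.1, 0.9, -1.0), (0.8, 0.1, 0.9)]
w = [physics_bias(d, dv, c, alpha=1.0, beta=1.0, lam=1.0)
     for d, dv, c in physics]

# Additive form: bias the logits, then normalize.
biased = softmax([l + wi for l, wi in zip(content_logits, w)])

# Multiplicative form: reweight standard attention by exp(w), then renormalize.
plain = softmax(content_logits)
unnormalized = [math.exp(wi) * p for wi, p in zip(w, plain)]
reweighted = [u / sum(unnormalized) for u in unnormalized]

print([round(v, 4) for v in plain])
print([round(v, 4) for v in biased])
print([round(v, 4) for v in reweighted])  # agrees with `biased`
```

The first two neighbors have identical content scores, so plain attention cannot distinguish them. The physics term shifts weight toward the close, co-directional neighbor and away from the oncoming one.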
This mechanism fuses information through weighted summation in a fixed manner and does not introduce nonlinear decision-making capability. Therefore, we incorporate a gating mechanism that allows the model to dynamically adjust weights based on the current context, enabling flexible selection of key information. This dynamic adjustment improves the model’s adaptability to different scenarios and its ability to represent complexity. During local encoding, the gate determines how much of the agent’s own historical trajectory features (such as acceleration intent) are retained while absorbing influences from surrounding vehicles (such as avoidance behavior).
First, neighbor information is aggregated by weighted summation to generate a context-aware intermediate representation $m_i^t$
$$m_i^t = \sum_{j \in N_i} \alpha_{ij}^t\, v_{ij}^t$$
Then, a gating mechanism dynamically adjusts the fusion ratio between the agent's own features and neighbor information. This prevents over-reliance on external information or neglect of environmental constraints.
$$g_i^t = \mathrm{sigmoid}\left(W_{gate}\left[z_i^t, m_i^t\right]\right)$$
$$\hat{z}_i^t = g_i^t \odot W_{self}\, z_i^t + \left(1 - g_i^t\right) \odot m_i^t$$
where $W_{gate}$ and $W_{self}$ are learnable parameters, $\odot$ denotes element-wise multiplication, and $g_i^t$ is the gating value that controls the fusion weight between environmental information and self-features.
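The gated fusion can be sketched in a few lines. The two-dimensional features and the hand-picked weight matrices are hypothetical and stand in for learned parameters.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(W: Matrix, x: Vector) -> Vector:
    """Matrix-vector product for small dense matrices."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(z: Vector, m: Vector, W_gate: Matrix, W_self: Matrix) -> Vector:
    """z_hat = g * (W_self z) + (1 - g) * m, with g = sigmoid(W_gate [z, m])."""
    g = [sigmoid(v) for v in matvec(W_gate, z + m)]  # z + m concatenates lists
    z_self = matvec(W_self, z)
    return [gi * zi + (1.0 - gi) * mi for gi, zi, mi in zip(g, z_self, m)]


identity = [[1.0, 0.0], [0.0, 1.0]]
z, m = [1.0, 0.0], [0.0, 1.0]  # self feature and aggregated neighbor context
balanced = gated_fusion(z, m, [[0.0] * 4, [0.0] * 4], identity)
self_dominant = gated_fusion(z, m, [[10.0, 0.0, 0.0, 0.0]] * 2, identity)
print(balanced)        # gate is 0.5 everywhere: self and context are averaged
print(self_dominant)   # gate is close to 1: output stays near W_self z
```

Since the gate lies in (0, 1) for every feature dimension, each output element is a convex combination of the projected self feature and the neighbor context.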
As shown in Figure 5, multi-head attention is the core mechanism of the transformer. Its main idea is to parallelize multiple attention heads, allowing the model to focus on different subspace features of the input simultaneously, thereby enhancing its representational capacity. Each attention head independently learns a specific type of interaction, and their outputs are concatenated and fused. In this work, we adopt the standard transformer encoder structure to perform spatial feature fusion among agents. Each attention module is followed by layer normalization and a residual connection to ensure training stability and enable parallel modeling. The spatial embedding of agent $i$ at time step $t$ is computed as follows:
$$H^0 = \hat{z}_i^t, \qquad H^1 = \mathrm{LayerNorm}\left(H^0 + \mathrm{MultiHead}\left(H^0\right)\right), \qquad H^2 = \mathrm{LayerNorm}\left(H^1 + \mathrm{FFN}\left(H^1\right)\right), \qquad s_i^t = H_i^2$$

3.3.3. Temporal Modeling

To enhance the model's ability to capture the evolution of trajectories, we explicitly introduce a temporal dependency modeling mechanism. $s_i^t \in \mathbb{R}^{d_h}$ represents the spatial embedding of agent $i$ at time step $t \in [1, T]$. To further extract information from the entire trajectory sequence, we append a learnable temporal token $s^{T+1} \in \mathbb{R}^{d_h}$ at the end of each agent's time series. This token aggregates the dynamic representation of the full sequence and provides a more comprehensive trajectory encoding for subsequent prediction [36].
$$\text{Raw Sequence} = \left[s_i^1, s_i^2, \ldots, s_i^T, s^{T+1}\right] \in \mathbb{R}^{(T+1) \times d_h}$$
To enable the model to perceive the temporal order within trajectories, we introduce a learnable positional encoding $E_{pos} \in \mathbb{R}^{(T+1) \times d_h}$ for each time step in the sequence. This approach provides an explicit temporal prior. Such a design helps the model capture dynamic changes over time, thereby improving the accuracy and semantic consistency of trajectory modeling. It also demonstrates stronger representational capability in recognizing complex traffic behavior patterns.
$$S_i = \text{Raw Sequence} + E_{pos} = \left[s_i^1 + E_{pos}^1, \ldots, s^{T+1} + E_{pos}^{T+1}\right] \in \mathbb{R}^{(T+1) \times d_h}$$
In prediction scenarios, the model must infer future states solely on the basis of historical information. If future information is leaked during training, then it may lead to overestimated performance and poor generalization. To ensure that predictions at any time step $t$ rely only on observed historical trajectories, we incorporate a causal masking mechanism into the transformer encoder. This mechanism applies an upper-triangular mask to the attention matrix during sequence processing, restricting each time step $t$ to access only features from the current and previous steps. This design effectively prevents future information leakage. It also ensures that the prediction process aligns with real-world reasoning logic, enhancing the interpretability and reliability of the model during deployment.
$$Q_i = S_i W_Q^{time}, \qquad K_i = S_i W_K^{time}, \qquad V_i = S_i W_V^{time}$$
$$\hat{S}_i = \mathrm{softmax}\left(\frac{Q_i K_i^{\top}}{\sqrt{d_k}} + M\right) V_i$$
$$M(t_q, t_k) = \begin{cases} 0, & t_k \le t_q \\ -\infty, & t_k > t_q \end{cases}$$
where $W_Q^{time}$, $W_K^{time}$, and $W_V^{time} \in \mathbb{R}^{d_h \times d_k}$ are learnable matrices; $t_q$ denotes the current (query) time step; and $t_k$ denotes the time step being attended to (key), so that only keys with $t_k \le t_q$ receive non-zero attention.
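The effect of the causal mask can be illustrated with a small sketch that uses uniform placeholder scores instead of learned query-key products. The sequence length and helper names are assumptions made for illustration.

```python
import math
from typing import List

NEG_INF = float("-inf")


def causal_mask(length: int) -> List[List[float]]:
    """M[t_q][t_k] = 0 if t_k <= t_q, else -inf (future keys are blocked)."""
    return [[0.0 if tk <= tq else NEG_INF for tk in range(length)]
            for tq in range(length)]


def masked_softmax(scores: List[List[float]],
                   mask: List[List[float]]) -> List[List[float]]:
    """Row-wise softmax of scores + mask; masked entries get weight 0."""
    out = []
    for row, mask_row in zip(scores, mask):
        logits = [s + m for s, m in zip(row, mask_row)]
        peak = max(logits)  # finite, because the diagonal is never masked
        exps = [math.exp(l - peak) for l in logits]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


T = 3  # observed steps; the last index holds the appended temporal token
mask = causal_mask(T + 1)
scores = [[0.0] * (T + 1) for _ in range(T + 1)]  # uniform placeholder scores
weights = masked_softmax(scores, mask)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row spreads its attention only over the current and earlier positions. Because the appended token occupies the last position, it can attend to every observed step under this mask, which is what lets it summarize the full sequence.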
To fully capture the temporal dependencies within trajectory sequences, we adopt a standard transformer encoder to represent each agent's trajectory. We alternately stack multi-head attention modules and feed-forward network modules and apply residual connections and layer normalization after each sub-module. This design improves training stability and efficiency for long-sequence modeling. The final output $S_i^T$ is an updated sequence of trajectory representations, which serves as the input for subsequent agent-lane interaction modeling. This enables a progressive modeling process from the temporal dimension to the spatial interaction dimension.

3.3.4. Agent and Lane Interaction

To effectively leverage lane structure information for trajectory prediction, we design a lane encoding module that embeds geometric and semantic attributes into a high-dimensional vector space. Lane structures provide rich prior knowledge, such as direction, type (e.g., straight or turning), and boundary constraints, all of which directly influence a vehicle’s driving intent. For example, if a vehicle is on a left-turn lane, then its future trajectory is highly likely to follow a left-turn pattern. Therefore, the model needs to perceive the geometric relationship between the lane and the target agent and incorporate semantic attributes for joint modeling.
For each target agent  i , its relationship with lane segment  ξ  is characterized by the following vector features:
$$z_{i\xi} = \phi_{\mathrm{lane}}\!\left(\left[R_i\left(p_\xi^{1} - p_\xi^{0}\right),\; R_i\left(p_\xi^{0} - p_i^{T}\right),\; a_\xi\right]\right)$$
where $a_\xi$ is the semantic attribute vector of the lane segment, $R_i\left(p_\xi^{1} - p_\xi^{0}\right)$ is the vector representation of the lane segment direction in the agent’s local coordinate system, and $R_i\left(p_\xi^{0} - p_i^{T}\right)$ is the relative position of the lane start point with respect to the agent’s current position.
This encoding approach not only enables explicit modeling of spatial relationships but also integrates semantic priors from lane information, enhancing the model’s ability to capture the compatibility between target trajectories and road structures. A graph attention network (GAN) fuses the lane features by computing the influence weight of each lane segment on each agent, and a gating mechanism then dynamically integrates the agent features with the lane information. The resulting representation $h_i$ fuses interactions between agents, temporal dynamics, and lane constraints, and is used for downstream trajectory prediction.
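$$Q_i = W^{Q} S_i^{T}, \quad K_{i\xi} = W^{K} z_{i\xi}, \quad V_{i\xi} = W^{V} z_{i\xi}$$
$$\alpha_i = \mathrm{softmax}\!\left(\frac{Q_i^{\top} K_{i\xi}}{\sqrt{d_k}}\right)$$
$$m_i = \sum_{\xi} \alpha_{i\xi} V_{i\xi}$$
$$g_i = \mathrm{sigmoid}\!\left(W^{\mathrm{gate}} \left[S_i^{T}, m_i\right]\right)$$
$$\hat{h}_i = g_i \odot W^{\mathrm{self}} S_i^{T} + \left(1 - g_i\right) \odot m_i$$
$$h_i = \mathrm{MLP}\!\left(\hat{h}_i\right) \in \mathbb{R}^{d_h}$$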
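As a rough illustration of the attention and gating steps above, the following NumPy sketch fuses one agent feature with a handful of lane-segment features. The function name, the square weight shapes, and the random inputs are assumptions for illustration, and the final MLP is omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def agent_lane_fusion(s, z, W_q, W_k, W_v, W_gate, W_self):
    """s: [d] agent feature; z: [L, d] encoded lane segments for this agent.

    W_q, W_k, W_v, W_self: [d, d]; W_gate: [d, 2d] (assumed shapes).
    """
    q = W_q @ s                                    # query from the agent
    K = z @ W_k.T                                  # one key per lane segment
    V = z @ W_v.T                                  # one value per lane segment
    alpha = softmax(K @ q / np.sqrt(q.shape[0]))   # lane influence weights
    m = alpha @ V                                  # aggregated lane message
    g = sigmoid(W_gate @ np.concatenate([s, m]))   # element-wise gate in (0, 1)
    h_hat = g * (W_self @ s) + (1.0 - g) * m       # gated fusion
    return h_hat, alpha, g

# Hypothetical inputs, used only to exercise the function.
rng = np.random.default_rng(0)
d, L = 4, 3
s = rng.normal(size=d)
z = rng.normal(size=(L, d))
W_q, W_k, W_v, W_self = (rng.normal(size=(d, d)) for _ in range(4))
W_gate = rng.normal(size=(d, 2 * d))
h_hat, alpha, g = agent_lane_fusion(s, z, W_q, W_k, W_v, W_gate, W_self)
```

The gate interpolates per feature dimension between the agent's own projected state and the lane message, so a gate value near 1 keeps the agent feature and a value near 0 defers to the lane context.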

3.4. Global Interaction Module

To address the limitations of local interaction modeling in terms of perception range and representational capacity, we design a global interaction module to capture long-range dependencies within traffic scenes. This module constructs a context association mechanism across spatial regions, enabling the model to effectively learn semantic relationships among distant agents. As a result, the model gains a better understanding of global traffic context and achieves improved prediction accuracy in complex scenarios.
In global interaction, we introduce a relative geometric vector $e_{ij}$ to explicitly describe the relative spatial relationship and the motion direction difference between any two agents $j$ and $i$. This representation is illustrated in Figure 6 and defined as
$$e_{ij} = \phi_{\mathrm{rel}}\!\left(\left[R_i\left(p_j^{T} - p_i^{T}\right),\; \cos\Delta\theta_{ij},\; \sin\Delta\theta_{ij}\right]\right)$$
where $p_j^{T} - p_i^{T}$ denotes the global position difference of agent $j$ relative to agent $i$ at the current time step $T$, and $\Delta\theta_{ij} = \theta_j - \theta_i$ is the motion direction angle of agent $j$ relative to agent $i$.
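The raw inputs to $\phi_{\mathrm{rel}}$ can be sketched as follows. This is an illustrative NumPy example rather than the authors' code; in particular, it assumes that $R_i$ rotates global offsets into a frame aligned with agent $i$'s heading, and the function name and test values are hypothetical.

```python
import numpy as np

def relative_geometry(p_i, p_j, theta_i, theta_j):
    """Raw relative features for the ordered pair (i, j), before phi_rel."""
    c, s = np.cos(theta_i), np.sin(theta_i)
    # Assumption: R_i rotates by -theta_i, i.e. into agent i's heading frame.
    R_i = np.array([[c, s], [-s, c]])
    offset = R_i @ (p_j - p_i)
    d_theta = theta_j - theta_i
    return np.concatenate([offset, [np.cos(d_theta), np.sin(d_theta)]])

# Agent i at the origin heading north; agent j is 5 m ahead, same heading.
p_i, p_j = np.array([0.0, 0.0]), np.array([0.0, 5.0])
theta_i = theta_j = np.pi / 2
e = relative_geometry(p_i, p_j, theta_i, theta_j)

# Rotating and translating the whole scene leaves the features unchanged.
phi = 0.7
R_phi = np.array([[np.cos(phi), -np.sin(phi)], [np.sin(phi), np.cos(phi)]])
e_moved = relative_geometry(R_phi @ p_i + 3.0, R_phi @ p_j + 3.0,
                            theta_i + phi, theta_j + phi)
```

Under this convention the feature depends only on the relative pose of the two agents, which is the property the text relies on when it says the vector alleviates alignment issues between local coordinate systems.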
This feature provides geometric priors during global information propagation, helping alleviate alignment issues caused by differences in local coordinate systems. It ensures spatial consistency in modeling, thereby enhancing the representation of long-range dependencies in complex scenarios. To capture global bidirectional interactions, we adopt the same encoding strategy used for local interactions. An MLP is then applied to produce the global representation $\tilde{h}_i$ for each agent $i$.
$$\tilde{q}_i = W^{Q}_{\mathrm{global}} h_i, \quad \tilde{k}_{ij} = W^{K}_{\mathrm{global}} \left[h_j, e_{ij}\right], \quad \tilde{v}_{ij} = W^{V}_{\mathrm{global}} \left[h_j, e_{ij}\right]$$
$$\alpha_{ij} = \mathrm{softmax}\!\left(\frac{\tilde{q}_i^{\top} \tilde{k}_{ij}}{\sqrt{d_k}}\right)$$
$$m_i = \sum_{j} \alpha_{ij} \tilde{v}_{ij}$$
$$g_i = \mathrm{sigmoid}\!\left(W^{\mathrm{gate}} \left[h_i, m_i\right]\right)$$
$$\hat{h}_i = g_i \odot h_i + \left(1 - g_i\right) \odot m_i$$
$$\hat{h}_i = \mathrm{LayerNorm}\!\left(h_i + \hat{h}_i\right)$$
$$\tilde{h}_i = \mathrm{MLP}\!\left(\hat{h}_i\right)$$

3.5. Multimodal Predictions Module

To model the multi-modal distribution of future trajectories for traffic participants, we introduce a Laplacian mixture model to represent the potential motion patterns of each agent. This approach enhances the model’s ability to capture trajectory uncertainty and diversity while maintaining prediction accuracy. Compared with Gaussian mixture models, the Laplacian distribution exhibits a sharper peak and heavier tails, making it more suitable for modeling highly dynamic behaviors such as sudden direction changes and sharp turns in real-world traffic scenarios. Specifically, for agent  i  in the scene, its future trajectory distribution is modeled as a combination of  F  Laplacian mixture components
$$P\!\left(y_i^{t}\right) = \sum_{f=1}^{F} \pi_{i,f} \cdot \mathrm{Laplacian}\!\left(y_i^{t} \mid \mu_{i,f}^{t}, b_{i,f}^{t}\right)$$
where $\mu_{i,f}^{t} \in \mathbb{R}^{2}$ denotes the predicted mean trajectory of the $f$-th mixture component at time step $t$; $b_{i,f}^{t} \in \mathbb{R}^{2}$ represents the trajectory scale at each time step $t$; and $\pi_{i,f} \in [0, 1]$ is the normalized probability weight of the $f$-th mixture component, satisfying $\sum_{f} \pi_{i,f} = 1$.
These parameters are learned via neural networks, as shown in Figure 7. An MLP is used to predict the position and scale parameters of each mixture component at each future time step. The output tensor has the shape $(F, N, H, 4)$, where $F$ is the number of mixture components, $N$ is the number of agents, $H$ is the number of future time steps for prediction, and 4 represents the 2D mean trajectory and trajectory scale. Another MLP is used to predict the mixture weight vector, $\pi_i \in \mathbb{R}^{F}$, for each agent, with a shape of $(N, F)$. The weights are normalized using a softmax layer and are used to fuse multiple motion modes. This mechanism improves the model’s ability to capture trajectory uncertainty and enhances its robustness and interpretability in complex scenarios.
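To make the output layout concrete, the sketch below slices a tensor of shape $(F, N, H, 4)$ into means and scales and evaluates the mixture log-density at one point. It is an illustration under stated assumptions: the two coordinates are treated as independent Laplace variables, the first two channels are taken to be the mean, and an exponential is used to keep scales positive. None of these choices is confirmed by the text, and all names are hypothetical.

```python
import numpy as np

def log_softmax(x):
    x = x - x.max()
    return x - np.log(np.exp(x).sum())

def laplace_mixture_log_prob(y, mu, b, logits):
    """Log-density of a 2D point y under an F-component Laplace mixture.

    y: [2]; mu, b: [F, 2] with b > 0; logits: [F] unnormalized weights.
    """
    log_pi = log_softmax(logits)                                      # [F]
    # Assumption: x and y coordinates are independent Laplace variables.
    log_comp = (-np.log(2.0 * b) - np.abs(y - mu) / b).sum(axis=-1)   # [F]
    joint = log_pi + log_comp
    m = joint.max()
    return m + np.log(np.exp(joint - m).sum())                        # log-sum-exp

# Hypothetical decoder output: 6 modes, 3 agents, 30 steps (3 s at 10 Hz).
F_modes, N, H = 6, 3, 30
rng = np.random.default_rng(0)
raw = rng.normal(size=(F_modes, N, H, 4))
mu = raw[..., :2]              # assumed channel order: mean first
b = np.exp(raw[..., 2:])       # assumption: exp keeps the scales positive
logits = rng.normal(size=(N, F_modes))

# Log-density of mode 0's mean for agent 0 at the first future step.
lp = laplace_mixture_log_prob(mu[0, 0, 0], mu[:, 0, 0], b[:, 0, 0], logits[0])
```

For a single component with zero mean and unit scale, the density at the origin is $(1/2)^2$, which gives a quick sanity check of the formula.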

3.6. Loss Function

A diversity-aware supervision mechanism is introduced to enhance the model’s ability to model the uncertainty of future trajectories. This mechanism employs a best-match trajectory selection strategy, which supervises only the predicted trajectory that is closest to the ground truth among all generated candidates. As a result, the averaging effect that can suppress diversity is effectively avoided. In addition, a joint regression and classification loss is designed to ensure prediction accuracy while improving the distinctiveness and selectivity of the multi-modal outputs.
For each target agent, the model predicts $F$ candidate trajectories over a three-second horizon. Let the ground-truth trajectory be denoted as $p_i^{T+1}, \ldots, p_i^{T+H}$. The model first computes the average displacement error between each predicted trajectory and the ground truth across the entire prediction horizon, resulting in an error matrix $E \in \mathbb{R}^{F \times N}$, where $E_{f,i}$ represents the distance error between the $f$-th predicted trajectory of agent $i$ and the ground truth. Subsequently, for each target, the trajectory $k_i^{*}$ with the minimum error to the ground truth is selected as the supervision target, and gradient updates are performed only on this trajectory. This strategy effectively avoids the prediction ambiguity problem caused by traditional multi-trajectory average error losses, thereby encouraging the model to learn more distinguishable motion patterns.
Model training employs a joint loss function, encompassing a regression loss for trajectory fitting and a classification loss for multimodal probability modeling, with equal weights assigned to both components.
The regression loss  L r e g  utilizes the negative log-likelihood of the Laplacian distribution to model the uncertainty of the prediction model
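$$\mathcal{L}_{\mathrm{reg}} = -\frac{1}{NH} \sum_{i=1}^{N} \sum_{t=T+1}^{T+H} \log P\!\left(R_i\left(p_i^{t} - p_i^{T}\right) \mid \hat{\mu}_i^{t}, \hat{b}_i^{t}\right)$$
where $\hat{\mu}_i^{t}$ is the predicted position mean and $\hat{b}_i^{t}$ is the predicted scale parameter of the selected best-matching trajectory, and the outer sum runs over the $N$ agents.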
The classification loss employs cross-entropy loss to encourage the model to learn accurate assignment of mixture mode probabilities, ensuring that the optimal trajectory has the highest probability
$$\mathcal{L}_{\mathrm{cls}} = -\log\!\left(\hat{p}_{k_i^{*}}\right)$$
This joint design not only improves the accuracy of trajectory prediction but also enhances the diversity and interpretability of the prediction distribution, enabling the model to have stronger generalization capabilities in complex traffic scenarios.
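The best-match supervision described in this section can be sketched in NumPy as follows. This is a simplified illustration, not the authors' training code: the ground truth is assumed to be already expressed in each agent's local frame, coordinates are treated as independent Laplace variables, the classification term is averaged over agents, and all names are hypothetical.

```python
import numpy as np

def winner_takes_all_loss(mu, b, logits, gt):
    """mu, b: [F, N, H, 2]; logits: [N, F]; gt: [N, H, 2] (agent-local frame)."""
    # E[f, i]: average displacement error of mode f for agent i.
    E = np.linalg.norm(mu - gt[None], axis=-1).mean(axis=-1)       # [F, N]
    k_star = E.argmin(axis=0)                                       # [N]
    idx = np.arange(gt.shape[0])
    mu_best, b_best = mu[k_star, idx], b[k_star, idx]               # [N, H, 2]
    # Laplace negative log-likelihood of the best mode only.
    nll = np.log(2.0 * b_best) + np.abs(gt - mu_best) / b_best
    l_reg = nll.sum(axis=-1).mean()                                 # mean over N * H
    # Cross-entropy with the best mode as the target class.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_pi = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    l_cls = -log_pi[idx, k_star].mean()
    return l_reg + l_cls, k_star                                    # equal weights

# Hypothetical example: 3 modes, 2 agents, 4 future steps.
rng = np.random.default_rng(0)
gt = rng.normal(size=(2, 4, 2))
offsets = np.array([3.0, 2.0, 1.0])[:, None, None, None]
mu = gt[None] + offsets          # mode 2 is closest for every agent ...
mu[0, 1] = gt[1]                 # ... except agent 1, where mode 0 is exact
b = np.ones_like(mu)
logits = np.zeros((2, 3))
loss, k_star = winner_takes_all_loss(mu, b, logits, gt)
```

In a differentiable framework, only the selected mode's parameters would receive regression gradients, which is what prevents the modes from collapsing toward their average.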

4. Experiments

4.1. Experimental Setup

4.1.1. Dataset

This paper uses the Argoverse Motion Forecasting Dataset (V1) for experimental validation [37]. Provided by Argo AI, this dataset is a widely used standard multi-agent trajectory prediction dataset in the autonomous driving field. It contains 323,557 real driving scenarios, covering agents’ motion trajectories and high-definition map information, and is divided into 205,942 training scenarios, 39,472 validation scenarios, and 78,143 test scenarios. Trajectory data are 2D coordinate points with a sampling frequency of 10 Hz, accompanied by rich high-definition map information (lane centerlines, turning attributes, etc.) to facilitate modeling physical constraints and behavioral priors. All training and validation scenarios are five-second continuous sequences, including annotations of the central vehicle and up to 60 surrounding traffic participants’ historical trajectories (first 2 s) and future trajectories (next 3 s). The test set releases only the first two seconds of observation data; the real trajectories of the next three seconds in the test set are used for Argoverse’s official evaluation only to protect data privacy.
This paper follows the official data split, using only the historical two-second trajectory as input to predict the central vehicle’s three-second future motion trajectory. The dataset’s scale and authenticity provide a reliable benchmark for verifying the model’s generalization capability, and its time-sensitive design of two-second input and three-second output also meets the real-time prediction requirements in autonomous driving scenarios.

4.1.2. Evaluation Index

The experiment employs standard evaluation metrics in the field of motion prediction, including minimum average displacement error (minADE), minimum final displacement error (minFDE), and miss rate (MR), to comprehensively quantify the model’s prediction performance.
minADE: The model generates up to six candidate trajectories for each agent, and the trajectory with the smallest endpoint error is selected as the optimal prediction result. The average distance between this trajectory and the ground-truth trajectory across future time steps is defined as the minimum average displacement error.
$$\mathrm{minADE} = \frac{1}{T} \sum_{t=1}^{T} \left\lVert \hat{p}_t^{(\mathrm{best})} - p_t^{(\mathrm{gt})} \right\rVert_2$$
where $\hat{p}_t^{(\mathrm{best})}$ is the position of the optimal prediction trajectory at time step $t$, $p_t^{(\mathrm{gt})}$ is the ground-truth trajectory position, and $T$ is the total number of future time steps.
minFDE: It measures the endpoint error of the optimal prediction trajectory at the final time step. This metric directly reflects the model’s prediction accuracy for the target endpoint position, which is crucial for the decision-making safety of autonomous driving.
$$\mathrm{minFDE} = \left\lVert \hat{p}_T^{(\mathrm{best})} - p_T^{(\mathrm{gt})} \right\rVert_2$$
MR: It is the proportion of scenarios where the distance between the endpoint of the optimal prediction trajectory and the ground-truth endpoint exceeds 2 m. This metric quantifies the model’s failure risk at key positions: the lower the miss rate, the higher the model’s reliability.
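The three metrics can be computed together, as in the following NumPy sketch. It follows the definitions above (best trajectory chosen by endpoint error, 2 m miss threshold); the function name and the toy data are illustrative assumptions.

```python
import numpy as np

def displacement_metrics(pred, gt, miss_threshold=2.0):
    """pred: [S, K, H, 2] candidate trajectories; gt: [S, H, 2] ground truth."""
    dist = np.linalg.norm(pred - gt[:, None], axis=-1)   # [S, K, H]
    fde = dist[..., -1]                                   # endpoint error per mode
    best = fde.argmin(axis=1)                             # smallest endpoint error
    s = np.arange(pred.shape[0])
    min_ade = dist[s, best].mean(axis=-1).mean()
    min_fde = fde[s, best].mean()
    miss_rate = (fde[s, best] > miss_threshold).mean()
    return min_ade, min_fde, miss_rate

# Two toy scenarios, two candidates each, three future steps along the x-axis.
gt = np.tile(np.array([[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]), (2, 1, 1))
lateral = np.array([[1.0, 3.0],      # scenario 0: offsets of 1 m and 3 m
                    [3.0, 4.0]])     # scenario 1: offsets of 3 m and 4 m
pred = np.repeat(gt[:, None], 2, axis=1)
pred[..., 1] += lateral[:, :, None]
min_ade, min_fde, miss_rate = displacement_metrics(pred, gt)
```

In this toy case the best candidates have constant lateral errors of 1 m and 3 m, so minADE and minFDE both average to 2 m and one of the two scenarios counts as a miss.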

4.1.3. Implementation Details

All training, inference, and profiling experiments were conducted on a single cloud-based virtual machine equipped with 15 vCPU (Intel® Xeon® Platinum 8358P, 2.60 GHz), manufactured by Intel Corporation (Santa Clara, CA, USA), 90 GB RAM, and NVMe SSD storage (30 GB for system and 50 GB for data). The system also included an NVIDIA GeForce RTX 3090 GPU (24 GB VRAM), manufactured by NVIDIA Corporation (Santa Clara, CA, USA) and ran Ubuntu 18.04 with Python 3.8 and PyTorch 1.8.1.
The model was optimized using AdamW with an initial learning rate of $5 \times 10^{-4}$, a batch size of 32, and 64 total training epochs. Dropout (0.1) and weight decay ($1 \times 10^{-4}$) were applied to prevent overfitting, and a cosine annealing scheduler dynamically adjusted the learning rate to balance training stability and convergence. Each training epoch took approximately 50 min.
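For reference, a cosine annealing schedule with the stated initial learning rate and epoch count can be written in closed form. The sketch below assumes per-epoch updates and a minimum learning rate of zero, neither of which is specified in the text.

```python
import math

def cosine_annealing_lr(epoch, base_lr=5e-4, total_epochs=64, min_lr=0.0):
    """Learning rate at a given epoch under cosine annealing.

    min_lr = 0 and per-epoch stepping are assumptions, not stated settings.
    """
    cos_term = 1.0 + math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * cos_term

schedule = [cosine_annealing_lr(e) for e in range(65)]
```

Under these assumptions the rate starts at $5 \times 10^{-4}$, passes through half that value at epoch 32, and decays smoothly toward zero at epoch 64.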
All attention modules in the model were configured with eight heads to enhance feature fusion across different scales. No ensemble methods or data augmentation were used during experiments to purely evaluate the model’s inherent performance. For hyperparameters, the interactive neighbor screening stage involved three physical attribute weight coefficients: relative distance ($a = 1$), relative velocity ($b = 1$), and relative direction ($c = 0.1$). Additionally, the feature fusion stage included three learnable weighting parameters with initial values $\alpha = 1$, $\beta = 1$, and $\lambda = 1$, used to regulate the fusion intensity of interaction features between the ego vehicle and neighboring vehicles.

4.2. Ablation Studies

4.2.1. Interactive Neighbor Filtering Strategy

In GNNs, the DropEdge technique is employed to reduce overfitting, enhance robustness, and strengthen graph structure modeling. In this paper, we comprehensively consider physical information to construct the interaction weight $A_{vu}$, which is used to determine which neighbors to retain. Top-k selection and min-score thresholding are two common selection strategies in the GNN field. Top-k selection retains the edges with the highest scores to ensure important interaction information is not lost but may discard some marginal yet critical interactions. Min-score thresholding retains interactions with scores greater than a set minimum, preserving all high-quality interactions but potentially leading to unstable edge counts.
To evaluate the impact of different screening strategies on model performance while keeping other experimental conditions identical, we introduced these two typical screening mechanisms for an ablation study comparison.
Based on empirical tuning, top-k selection retains the top 80% of interaction neighbors by weight, while min-score thresholding retains interactions with weights greater than 0.1.
As shown in Table 1, top-k selection outperforms min-score thresholding in all three evaluation metrics. Our analysis suggests that this result may be attributed to top-k selection’s more stable control over the number of edges, enabling the model to retain critical edges during training and avoid excessive pruning. We further evaluated model performance under different top-k retention rates (70% and 90%). Ultimately, 80% was selected as the default strategy due to its optimal performance across multiple metrics, and all subsequent experiments adopt this screening strategy.
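The difference between the two screening strategies can be seen in a short NumPy sketch. The interaction weights below are made-up values, and the function names are hypothetical; only the 80% retention rate and the 0.1 threshold come from the text.

```python
import numpy as np

def top_k_filter(scores, keep_ratio=0.8):
    """Keep the ceil(keep_ratio * n) highest-scoring neighbors."""
    k = int(np.ceil(keep_ratio * len(scores)))
    order = np.argsort(-scores, kind="stable")
    return np.sort(order[:k])

def min_score_filter(scores, threshold=0.1):
    """Keep every neighbor whose score exceeds the threshold."""
    return np.flatnonzero(scores > threshold)

# Hypothetical interaction weights for six candidate neighbors.
scores = np.array([0.9, 0.05, 0.4, 0.2, 0.02, 0.6])
kept_top_k = top_k_filter(scores)          # always 5 of 6 neighbors
kept_min_score = min_score_filter(scores)  # count depends on the scores

# A scene with uniformly weak interactions: thresholding keeps nothing.
weak = np.array([0.08, 0.05, 0.03, 0.02, 0.01, 0.06])
kept_weak_top_k = top_k_filter(weak)
kept_weak_min_score = min_score_filter(weak)
```

Top-k keeps a fixed fraction of edges regardless of the score distribution, whereas thresholding can leave an agent with no neighbors at all, which is one way to read the "unstable edge counts" noted above.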

4.2.2. Importance of Introducing Physical Information

To validate the effectiveness of the proposed physical information integration strategy in our local interaction module, we designed three ablation experiments to progressively evaluate two key innovations and their combination:
w/o physics in neighbor selection: Neighbor agents are selected within a fixed radius (r = 50 m) without using physical constraints (relative distance, relative velocity, relative direction).
w/o physics in attention: Physical factors are removed from the softmax computation in the attention mechanism, using only standard self-attention.
w/o physics overall: Both improvements are removed, representing a version without any physical information integration.
As shown in Table 2, removing either module leads to performance degradation, indicating that integrating physical information into both neighbor selection and attention allocation positively affects model effectiveness. Notably, the physical factors in the attention mechanism contribute more significantly to performance gains, highlighting the critical role of physics-aware attention in modeling effective interactions.

4.2.3. Importance of Each Module

To evaluate the contributions of individual modules within the overall architecture, we systematically removed each key component and analyzed its impact on model prediction performance. By alternately removing specific modules and conducting evaluations under identical settings, we validated their respective contributions to performance. Table 3 presents the ablation results for each module. Removing the agent–agent interaction module led to a significant performance drop, indicating its indispensable role in modeling fine-grained short-range behavioral interactions. This module captures motion coordination and conflicts between agents and their neighbors at the current time step; without it, the model struggles to accurately characterize local dynamics, particularly in dense scenarios.
Removing the temporal modeling module had the most pronounced impact on performance. This finding highlights that accurately inferring future trajectories in highly dynamic traffic environments heavily relies on modeling historical motion patterns. The module effectively captures each agent’s temporal evolution features, aiding in predicting subsequent behavioral trends.
Removing the agent–lane module also degraded performance, indicating that the module contributes positively. By fusing lane geometry and semantic attributes with each agent’s features (Section 3.3.4), this module helps the model keep its predictions compatible with the road structure.
Removing the global interaction module similarly degraded performance, confirming its positive contribution. By integrating contextual information across all agents, this module strengthens the model’s capacity to model long-range dependencies, which is especially valuable for handling sparse interactions or rare trajectory patterns.
In summary, each module plays an irreplaceable role in modeling multi-agent traffic behavior, validating that the modular design of the proposed framework effectively balances accuracy and efficiency.

4.3. Inference Efficiency and Memory Footprint

To substantiate the lightweight design beyond parameter count, we profiled the end-to-end inference efficiency of PLMT-Net on the same server described in Section 4.1.3 (NVIDIA RTX 3090, PyTorch 1.8.1, torch.no_grad mode). Latency was measured using wall-clock timing (time.perf_counter) with explicit device synchronization over 1000 valid samples, and both per-batch and per-sample latency were recorded. Throughput was computed as samples per second, while the peak GPU memory was obtained from the maximum allocated and reserved values during the full forward process (neighbor scoring, top-K selection, physics-guided attention, and decoding).
Table 4 shows that PLMT-Net achieves real-time inference in an online setting (60.6 ms per sample, below the 100 ms budget of a 10 Hz loop) while maintaining a small memory footprint (<100 MB at B = 1). With mini-batching, the throughput scales to approximately 65 samples/s with modest memory growth. The efficiency stems from formulating the physics priors as closed-form terms, namely the top-K gating coefficients $A_{ij}$ and the additive logit weights $w_{ij}$, which introduce negligible computational overhead and avoid any auxiliary heavy branches, thereby preserving the lightweight design.
These measurements confirm that the proposed PLMT-Net can meet real-time requirements for online trajectory prediction in typical autonomous driving scenarios.
Measurement protocol. Inference was benchmarked for the full forward path (neighbor scoring $A_{ij}$, top-K selection, physics-guided attention $w_{ij}$, and decoding), using the same input configuration as in Section 4.1 (2 s history and 3 s future horizon, $K = \lceil 0.8\,|C_i| \rceil$). Measurements were taken in model.eval() and torch.no_grad() modes. Latency was recorded via CUDA events with device synchronization, averaged over 1000 valid samples after a short warm-up (excluded from averages). Peak GPU memory was recorded using torch.cuda.reset_peak_memory_stats(), followed by torch.cuda.max_memory_allocated() and torch.cuda.max_memory_reserved().
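The structure of such a measurement (warm-up, synchronization, averaging) can be sketched with the standard library alone. This is a generic harness, not the authors' profiling script: the `benchmark` function, its parameters, and the CPU stand-in workload are hypothetical. On a GPU, the `sync` argument would be a device synchronization call such as `torch.cuda.synchronize`.

```python
import time

def benchmark(fn, n_warmup=10, n_runs=100, sync=None):
    """Average wall-clock latency (ms) and throughput (calls/s) of fn()."""
    for _ in range(n_warmup):          # warm-up runs are excluded from timing
        fn()
    if sync is not None:
        sync()                         # wait for pending asynchronous work
    start = time.perf_counter()
    for _ in range(n_runs):
        fn()
    if sync is not None:
        sync()                         # make sure timed work has finished
    elapsed = time.perf_counter() - start
    latency_ms = 1000.0 * elapsed / n_runs
    return latency_ms, 1000.0 / latency_ms

# CPU stand-in for a model forward pass.
latency_ms, throughput = benchmark(lambda: sum(range(10_000)))

# Converting the reported per-sample latency to a rate: 60.6 ms -> ~16.5 Hz.
reported_rate_hz = 1000.0 / 60.6
```

Synchronizing before reading the clock matters on GPUs because kernel launches return before the work completes; without it the measured latency would be too low.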

4.4. Results

4.4.1. Comparison with Existing Methods

As shown in Table 5, PLMT-Net achieves the best results across all three metrics (minADE, minFDE, MR) while using the fewest parameters, demonstrating both high prediction accuracy and a compact model size. The baselines include a transformer-based architecture (HiVT [13]), a graph neural network (LaneGCN [8]), and a hybrid design with attention modules (DenseTNT [38]), representing three representative families of current trajectory-prediction methods.
The superior performance of PLMT-Net is due to two key innovations: (1) the integration of physical information to optimize interaction partners, and (2) the fusion of interaction features with explicit physical priors in both neighbor selection and attention allocation during inference. Unlike HiVT, LaneGCN, and DenseTNT, which rely purely on learned interaction features or graph structures without explicit physical constraints at inference time, PLMT-Net directly injects physics priors into the interaction modeling. This leads to more robust and interpretable predictions, especially in complex multi-agent scenarios.
From a runtime perspective, we profile PLMT-Net and HiVT on the same GPU, obtaining latencies of 60.6 ms and 49.4 ms, respectively. For LaneGCN and DenseTNT, we cite the latencies of 55.32 ms and 138.59 ms reported in [39] (RTX 2080, batch size 1); because these were measured on different hardware, the comparison is indicative only. Overall, PLMT-Net achieves latency comparable to other lightweight baselines and substantially lower than DenseTNT, while simultaneously offering higher accuracy with fewer parameters.
The slightly higher latency of PLMT-Net compared to the fastest baselines is mainly due to the incorporation of physics-guided neighbor filtering and attention, which introduces a small computational overhead but enforces physically plausible interactions. In practice, the model still operates in real time on a commodity GPU, and this modest increase in inference time represents a reasonable trade-off for improved prediction accuracy, physical feasibility, and interpretability.

4.4.2. Visualization of Prediction Results

As shown in Figure 8, we present several validation scenarios of the model on the Argoverse dataset. Each scenario plots only the prediction results of two representative target agents to intuitively demonstrate the model’s prediction capability. Following the benchmark protocol, trajectories are overlaid on the vectorized HD-map layers (including lane centerlines and topology) provided by Argoverse, consistent with the model’s vectorized inputs. Under various complex traffic situations (such as intersections, curves, overtaking, etc.), the proposed model can generate accurate, physically plausible, and diverse future trajectory distributions for multiple targets.
The results in the figure indicate that the predicted trajectories generally cover the ground-truth trajectories well. In straight driving scenarios in particular, the model’s predictions are more stable and accurate, indicating its strong adaptability to low-dynamics scenarios. In the overtaking scenario in the upper-left corner, the model successfully predicts the vehicle’s intent to change lanes and overtake the vehicle ahead, demonstrating good behavior recognition ability. In the complex diverging intersection scenario in the lower-right corner, the model benefits from the integrated lane structure information and still outputs multimodal predictions consistent with the direction of the ground-truth trajectory, showing that it makes effective use of structural guidance.

5. Discussion

5.1. Real-Time Feasibility and Deployment Efficiency

To evaluate the real-time feasibility of PLMT-Net, we benchmarked its end-to-end inference performance on a commodity GPU (NVIDIA RTX 3090, PyTorch 1.8.1). The model achieves an average per-sample latency of 60.6 ms (≈16.5 FPS) under online settings (batch = 1) and scales to ≈65 samples/s with mini-batching (batch = 8), while maintaining a small memory footprint (<100 MB). These results confirm that PLMT-Net can operate in real time on standard hardware, meeting the 10 Hz requirement commonly adopted in autonomous driving perception–prediction–planning loops.
In terms of deployment feasibility, the model’s lightweight design (≈0.65 M parameters) and use of closed-form physics priors (Equations (4) and (8)) eliminate the need for heavy auxiliary networks or iterative optimization, which facilitates on-device or embedded deployment. The inference efficiency and memory footprint measured in Section 4.3 demonstrate that the model can be integrated into real-time pipelines, providing a practical foundation for future closed-loop simulation or embedded testing.

5.2. Robustness Under Sensor Noise and Uncertainty

In real-world deployments, the estimation of relative distance, velocity, and direction often suffers from sensor noise and fusion errors, which may degrade attention-based interaction modeling. The proposed framework assumes reasonably accurate perception inputs and does not explicitly handle uncertainty in these measurements. Robustness to such noise remains an open and practically important challenge. Future research will extend this framework by incorporating noise-aware motion semantic modeling and hierarchical graph-based denoising mechanisms to enhance prediction reliability under noisy observations.

5.3. Practical Implications and Interpretability

The interpretability of PLMT-Net arises from its explicitly physics-guided formulation. The neighbor selection (Equation (4)) and attention modulation (Equation (8)) impose monotonic relationships among distance, relative velocity, and heading alignment, allowing each interaction to be interpreted in physical terms rather than learned implicitly. This physics-based design biases the model toward dynamically feasible and socially consistent motion patterns. Consequently, PLMT-Net produces smoother and more physically realizable trajectories that align with vehicle dynamics and are easier for downstream planners to execute. Therefore, the observed performance gains are not only numerical but also practically meaningful for real-world driving systems.

5.4. Limitations and Future Work

This study mainly reports offline evaluation and profiling under open-loop settings. While the employed Argoverse dataset reflects realistic urban traffic conditions, it cannot fully capture unexpected or adversarial behaviors, rare maneuvers, or sensor dropouts encountered in real-world driving. Future work will therefore include closed-loop simulation and on-device testing to more comprehensively assess real-time performance, behavioral stability, and safety within integrated planning–control pipelines.
The current qualitative visualizations are presented on vectorized HD-map layers consistent with the model’s input representation. Incorporating a rendering or interactive simulation module would provide more intuitive, scene-level visualizations and facilitate analysis of agent behaviors, model interpretability, and potential failure modes.
Finally, systematic evaluation under domain shift and out-of-distribution (OOD) conditions remains an open challenge. Extending the framework to explicitly handle sensor noise, perception uncertainty, and novel traffic behaviors will be an important direction for enhancing robustness and generalization across diverse driving contexts, addressing practical limitations for deployment in real-world scenarios.

6. Conclusions

We propose a lightweight multi-agent trajectory prediction framework that integrates physical information, combining a physical-constraint-based interaction partner screening strategy with a physics-aware graph attention network (GAN) to effectively model the critical interaction features among traffic participants. This approach significantly compresses the model’s parameter count while enhancing prediction accuracy and physical interpretability. Experiments on the Argoverse motion forecasting benchmark demonstrate that our method remains lightweight while achieving performance comparable or superior to that of mainstream approaches, validating its effectiveness and practicality.
For future work, we will further explore the framework’s application potential in closed-loop simulation testing, multi-scenario generalization, and real-time deployment, pushing trajectory prediction models toward practical engineering implementation.

Author Contributions

Conceptualization, W.Y. and F.L.; methodology, L.Z.; software, W.Y. and M.C.; validation, W.Y., F.L. and L.Z.; formal analysis, H.L.; investigation, H.L. and L.Z.; resources, F.L.; data curation, W.Y. and M.C.; writing—original draft preparation, W.Y.; writing—review and editing, W.Y. and F.L.; visualization, W.Y.; supervision, F.L.; project administration, F.L.; funding acquisition, F.L. All authors have read and agreed to the published version of the manuscript.

Funding

Financial support was provided by the Guangxi Science and Technology Program of China (Grant No. AB23026106) and the Graduate Innovation Project of Guilin University of Electronic Technology (Grants No. 2025YCXB003 and No. 2025YCXS021).

Data Availability Statement

The data used in this study are publicly available from the Argoverse dataset (https://www.argoverse.org/ (accessed on 15 January 2025)). The training and inference source code are available at GitHub: https://github.com/image-Q/PLMT-Net (accessed on 7 November 2025).

Acknowledgments

The authors would like to thank the open-source community for their valuable datasets and tools that supported this research.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Appendix A.1

Algorithm A1 summarizes the four-step inference procedure of PLMT-Net described in Section 3.1 and Figure 2. The algorithm explicitly shows how physics-aware neighbor formation and physics-guided attention are implemented at inference time, ensuring physical plausibility with negligible computational overhead.
Algorithm A1 PLMT-Net Inference (Four-step Pipeline, per Agent i)
Inputs:
    H_i          ← historical trajectory of agent i
    C_i          ← candidate neighbors with histories {H_j}
    M            ← local map features
    Params       ← {W^Q, W^K, W^V, TemporalEncoder, Decoder}
    Hyperparams  ← {a, b, c, α, β, λ}
    K            ← ceil(0.8 × |C_i|)

1) Data vectorization:
    z_i ← EncAgent(H_i, M)
    z_ij ← EncPair(H_i, H_j, M)                          for each j ∈ C_i
    d_ij, Δv_ij, cosθ_ij ← compute_physical_cues(H_i, H_j)

2) Local encoder (our contributions):
    // (2a) Physics-aware neighbor formation
    A_ij = −a·d_ij − b·Δv_ij + c·cosθ_ij                 // Eq. (4)
    N_i = TopK_by_value(A_ij, K)                         // Eq. (5)
    // (2b) Physics-guided local attention
    q_i = W^Q·z_i; k_ij = W^K·z_ij; v_ij = W^V·z_ij
    w_ij = −α·d_ij − β·Δv_ij + λ·cosθ_ij                 // Eq. (8)
    ℓ_ij = (q_i^T·k_ij)/√d_k + w_ij
    α_ij = softmax over j ∈ N_i of ℓ_ij
    c_i = Σ_{j∈N_i} α_ij·v_ij
    // (2c) Temporal modeling
    h_i = TemporalEncoder(concat(z_i, c_i))

3) Global interaction:
    h_i = GlobalAggregator({h_*})

4) Multimodal decoding:
    {Ŷ_i^(f), π_i^(f)}_{f=1..F} = Decoder(h_i)

Notes:
    • TopK reduces local computation from O(|C_i|·d_k) to O(K·d_k).
    • softmax(x + w) ∝ exp(w)·softmax(x), interpreted as a physics prior on attention.
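Step 2 of Algorithm A1 and the identity in the notes can be checked numerically with a small NumPy sketch. It is an illustration only: the inputs are random, the encoders are replaced by given key and value arrays, and the function and variable names are hypothetical. The coefficient defaults follow Section 4.1.3.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def physics_guided_attention(q, k, v, d, dv, cos_t,
                             a=1.0, b=1.0, c=0.1,
                             alpha=1.0, beta=1.0, lam=1.0, keep_ratio=0.8):
    """q: [d_k]; k, v: [n, d_k]; d, dv, cos_t: [n] physical cues per neighbor."""
    A = -a * d - b * dv + c * cos_t                    # neighbor score, Eq. (4) form
    K = int(np.ceil(keep_ratio * len(A)))
    keep = np.argsort(-A)[:K]                          # top-K neighbors by score
    w = -alpha * d[keep] - beta * dv[keep] + lam * cos_t[keep]   # Eq. (8) form
    x = k[keep] @ q / np.sqrt(q.shape[0])              # learned attention logits
    attn = softmax(x + w)                              # physics bias added to logits
    return attn @ v[keep], keep, attn, x, w

# Hypothetical scene with six candidate neighbors.
rng = np.random.default_rng(0)
n, d_k = 6, 4
q = rng.normal(size=d_k)
k = rng.normal(size=(n, d_k))
v = rng.normal(size=(n, d_k))
dist = rng.uniform(1.0, 30.0, size=n)
dv = rng.uniform(0.0, 5.0, size=n)
cos_t = rng.uniform(-1.0, 1.0, size=n)
c_i, keep, attn, x, w = physics_guided_attention(q, k, v, dist, dv, cos_t)

# Same weights written as a prior exp(w) times the learned softmax, renormalized.
prior_form = np.exp(w - w.max()) * softmax(x)
prior_form /= prior_form.sum()
```

The two forms agree after renormalization, which is why the additive term can be read as a multiplicative prior that down-weights distant, fast-diverging, or misaligned neighbors.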

Appendix A.2

In the main model, we compute interaction weights by sequentially evaluating relative distance, relative velocity, and relative angle to screen interaction partners. The initial implementation used an edge-by-edge computation approach, which suffered from significant efficiency bottlenecks during training and inference due to its high computational complexity and numerous redundant indexing operations. To address this issue, we vectorized the weight computation process to handle physical features of all interactions in parallel, fully leveraging GPU parallelism and thereby significantly improving efficiency.
The pseudocode for the original implementation is as follows:
# Loop-based version (simplified)
for (i, j) in edges:
    d = ||pi - pj||
    dv = |vi - vj|
    theta = angle(vec_i, vec_j)
    score = -a * d - b * dv + c * cos(theta)
The vectorized implementation is as follows:
# Vectorized version (batch processing)
pi1, pi2 = positions[row, t], positions[row, t+1]
pj1, pj2 = positions[col, t], positions[col, t+1]
vi, vj = velocity[row, t], velocity[col, t]
d = ||pj1 - pi1||    (norm computed per edge over the whole batch)
dv = |vi - vj|    (element-wise abs)
theta = vectorized_angle(pi2 - pi1, pj2 - pj1)
score = -a * d - b * dv + c * cos(theta)
As shown in Figure A1, the original implementation is logically clear but achieves low GPU utilization because it relies on extensive Python loops and per-element computations, which made it a bottleneck in system operation. The vectorized approach eliminates the for-loops by indexing the features of all edge endpoints at once and using PyTorch’s broadcasting mechanism for the computation, which accelerates the workflow.
Figure A1. Vectorization Improvement.
This vectorization targets the physics-based interaction weight calculation and achieves higher computational efficiency by eliminating loops and manual indexing operations. Although the optimization does not alter the model architecture, it benefits the training speed and inference throughput of the overall system and shows the value of engineering-level optimization. We recommend adopting vectorized implementations wherever possible in similar tasks to make full use of the hardware.
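The difference between the two versions can be made concrete with a small, runnable sketch. It is an illustration under stated assumptions rather than the authors' code: NumPy stands in for PyTorch, the array shapes `(agents, time steps, 2)`, the function names, and the coefficient values are hypothetical, `dv` is taken as the norm of the velocity difference, and a small constant is added to the denominator to avoid division by zero.

```python
import numpy as np

def scores_loop(positions, velocity, edges, t, a, b, c):
    """Edge-by-edge reference: one Python iteration per (i, j) pair."""
    out = np.empty(len(edges))
    for e, (i, j) in enumerate(edges):
        d = np.linalg.norm(positions[j, t] - positions[i, t])
        dv = np.linalg.norm(velocity[i, t] - velocity[j, t])
        hi = positions[i, t + 1] - positions[i, t]   # displacement of agent i
        hj = positions[j, t + 1] - positions[j, t]   # displacement of agent j
        cos_theta = hi @ hj / (np.linalg.norm(hi) * np.linalg.norm(hj) + 1e-9)
        out[e] = -a * d - b * dv + c * cos_theta
    return out

def scores_vectorized(positions, velocity, edges, t, a, b, c):
    """Same score for all edges at once via fancy indexing and broadcasting."""
    row, col = edges[:, 0], edges[:, 1]
    pi1, pi2 = positions[row, t], positions[row, t + 1]
    pj1, pj2 = positions[col, t], positions[col, t + 1]
    vi, vj = velocity[row, t], velocity[col, t]
    d = np.linalg.norm(pj1 - pi1, axis=-1)
    dv = np.linalg.norm(vi - vj, axis=-1)
    hi, hj = pi2 - pi1, pj2 - pj1
    norms = np.linalg.norm(hi, axis=-1) * np.linalg.norm(hj, axis=-1)
    cos_theta = (hi * hj).sum(axis=-1) / (norms + 1e-9)
    return -a * d - b * dv + c * cos_theta

rng = np.random.default_rng(0)
num_agents, num_steps = 6, 5
positions = rng.normal(size=(num_agents, num_steps, 2))
velocity = rng.normal(size=(num_agents, num_steps, 2))
# All ordered pairs (i, j) with i != j, stored as an (E, 2) index array.
edges = np.array([(i, j) for i in range(num_agents)
                  for j in range(num_agents) if i != j])

loop_scores = scores_loop(positions, velocity, edges, 2, 1.0, 0.5, 0.3)
vec_scores = scores_vectorized(positions, velocity, edges, 2, 1.0, 0.5, 0.3)
```

Both functions return one score per edge and agree to floating-point tolerance; the vectorized form replaces the per-edge Python iterations with a handful of array operations, which is the property that lets a GPU process all edges in parallel.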

References

  1. Huang, Y.; Du, J.; Yang, Z.; Zhou, Z.; Zhang, L.; Chen, H. A survey on trajectory-prediction methods for autonomous driving. IEEE Trans. Intell. Veh. 2022, 7, 652–674. [Google Scholar] [CrossRef]
  2. Liu, J.; Mao, X.; Fang, Y.; Zhu, D.; Meng, M.Q.-H. A survey on deep-learning approaches for vehicle trajectory prediction in autonomous driving. In Proceedings of the 2021 IEEE International Conference on Robotics and Biomimetics (ROBIO), Sanya, China, 27–31 December 2021; pp. 978–985. [Google Scholar]
  3. Guo, X.; Jia, H.; Huang, Q.; Luo, Q.; Wang, N.; Mao, Z. A vehicle driving intention prediction method based on gated dual tower transformer model for autonomous driving. Expert Syst. Appl. 2025, 285, 128000. [Google Scholar] [CrossRef]
  4. Feng, Y.; Ye, Q.; Candela, E.; Escribano-Macias, J.J.; Hu, B.; Demiris, Y.; Angeloudis, P. Risk-Aware Stochastic Vehicle Trajectory Prediction With Spatial-Temporal Interaction Modeling. IEEE Open J. Intell. Transp. Syst. 2025, 6, 37–48. [Google Scholar] [CrossRef]
  5. Wang, Z.; Guo, J.; Hu, Z.; Zhang, H.; Zhang, J.; Pu, J. Lane Transformer: A High-Efficiency Trajectory Prediction Model. IEEE Open J. Intell. Transp. Syst. 2023, 4, 2–13. [Google Scholar] [CrossRef]
  6. Chai, Y.; Sapp, B.; Bansal, M.; Anguelov, D. Multipath: Multiple probabilistic anchor trajectory hypotheses for behavior prediction. arXiv 2019, arXiv:1910.05449. [Google Scholar] [CrossRef]
  7. Cui, H.; Radosavljevic, V.; Chou, F.-C.; Lin, T.-H.; Nguyen, T.; Huang, T.-K.; Schneider, J.; Djuric, N. Multimodal trajectory predictions for autonomous driving using deep convolutional networks. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 2090–2096. [Google Scholar]
  8. Liang, M.; Yang, B.; Hu, R.; Chen, Y.; Liao, R.; Feng, S.; Urtasun, R. Learning lane graph representations for motion forecasting. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part II 16. Springer: Cham, Switzerland, 2020; pp. 541–556. [Google Scholar]
  9. Gao, J.; Sun, C.; Zhao, H.; Shen, Y.; Anguelov, D.; Li, C.; Schmid, C. Vectornet: Encoding hd maps and agent dynamics from vectorized representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11525–11533. [Google Scholar]
  10. Liu, Y.; Zhang, J.; Fang, L.; Jiang, Q.; Zhou, B. Multimodal motion prediction with stacked transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 7577–7586. [Google Scholar]
  11. Mercat, J.; Gilles, T.; El Zoghby, N.; Sandou, G.; Beauvois, D.; Gil, G.P. Multi-head attention for multi-modal joint vehicle motion forecasting. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; pp. 9638–9644. [Google Scholar]
  12. Ye, M.; Cao, T.; Chen, Q. Tpcn: Temporal point cloud networks for motion forecasting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 11318–11327. [Google Scholar]
  13. Zhou, Z.; Ye, L.; Wang, J.; Wu, K.; Lu, K. Hivt: Hierarchical vector transformer for multi-agent motion prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 8823–8833. [Google Scholar]
  14. Zhou, M.; Qu, X.; Li, X. A recurrent neural network based microscopic car following model to predict traffic oscillation. Transp. Res. Part C Emerg. Technol. 2017, 84, 245–264. [Google Scholar] [CrossRef]
  15. Savarese, S. Social lstm: Human trajectory prediction in crowded spaces. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 961–971. [Google Scholar]
  16. Gupta, A.; Johnson, J.; Fei-Fei, L.; Savarese, S.; Alahi, A. Social gan: Socially acceptable trajectories with generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 2255–2264. [Google Scholar]
  17. Li, J.; Ma, H.; Zhang, Z.; Tomizuka, M. Social-wagdat: Interaction-aware trajectory prediction via wasserstein graph double-attention network. arXiv 2020, arXiv:2002.06241. [Google Scholar]
  18. Li, X.; Ying, X.; Chuah, M.C. Grip++: Enhanced graph-based interaction-aware trajectory prediction for autonomous driving. arXiv 2019, arXiv:1907.07792. [Google Scholar]
  19. Lee, N.; Choi, W.; Vernaza, P.; Choy, C.B.; Torr, P.H.; Chandraker, M. Desire: Distant future prediction in dynamic scenes with interacting agents. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 336–345. [Google Scholar]
  20. Sadeghian, A.; Kosaraju, V.; Sadeghian, A.; Hirose, N.; Rezatofighi, H.; Savarese, S. Sophie: An attentive gan for predicting paths compliant to social and physical constraints. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1349–1358. [Google Scholar]
  21. Chandra, R.; Guan, T.; Panuganti, S.; Mittal, T.; Bhattacharya, U.; Bera, A.; Manocha, D. Forecasting trajectory and behavior of road-agents using spectral clustering in graph-lstms. IEEE Robot. Autom. Lett. 2020, 5, 4882–4890. [Google Scholar] [CrossRef]
  22. Mozaffari, S.; Al-Jarrah, O.Y.; Dianati, M.; Jennings, P.; Mouzakitis, A. Deep learning-based vehicle behavior prediction for autonomous driving applications: A review. IEEE Trans. Intell. Transp. Syst. 2020, 23, 33–47. [Google Scholar] [CrossRef]
  23. Raissi, M.; Perdikaris, P.; Karniadakis, G. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. J. Comput. Phys. 2019, 378, 686–707. [Google Scholar] [CrossRef]
  24. Li, H.; Liao, Z.; Rui, Y.; Li, L.; Ran, B. A physical law constrained deep learning model for vehicle trajectory prediction. IEEE Internet Things J. 2023, 10, 22775–22790. [Google Scholar] [CrossRef]
  25. Salzmann, T.; Ivanovic, B.; Chakravarty, P.; Pavone, M. Trajectron++: Dynamically-feasible trajectory forecasting with heterogeneous data. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XVIII 16. Springer: Cham, Switzerland, 2020; pp. 683–700. [Google Scholar]
  26. Westerhout, F.S.; Schumann, J.F.; Zgonnikov, A. Smooth-Trajectron++: Augmenting the Trajectron++ behaviour prediction model with smooth attention. In Proceedings of the 2023 IEEE 26th International Conference on Intelligent Transportation Systems (ITSC), Bilbao, Spain, 24–28 September 2023; pp. 5423–5428. [Google Scholar]
  27. Jiang, H.; Zhao, B.; Hu, C.; Chen, H.; Zhang, X. Multi-Modal Vehicle Motion Prediction Based on Motion-Query Social Transformer Network for Internet of Vehicles. IEEE Internet Things J. 2025, 12, 28864–28875. [Google Scholar] [CrossRef]
  28. Huang, Z.; Mo, X.; Lv, C. Multi-modal motion prediction with transformer-based neural network for autonomous driving. In Proceedings of the 2022 International Conference on Robotics and Automation (ICRA), Philadelphia, PA, USA, 23–27 May 2022; pp. 2605–2611. [Google Scholar]
  29. Hinton, G.; Vinyals, O.; Dean, J. Distilling the knowledge in a neural network. arXiv 2015, arXiv:1503.02531. [Google Scholar] [CrossRef]
  30. Han, S.; Pool, J.; Tran, J.; Dally, W. Learning both weights and connections for efficient neural network. arXiv 2015, arXiv:1506.02626. [Google Scholar] [CrossRef]
  31. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar] [CrossRef]
  32. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.-C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520. [Google Scholar]
  33. Howard, A.; Sandler, M.; Chu, G.; Chen, L.-C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V. Searching for mobilenetv3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324. [Google Scholar]
  34. Tan, J.; Li, H.; Zhang, Q. Trajectory Prediction for V2X Collision Warning Using Pruned Transformer Model. In Proceedings of the 2024 6th International Conference on Intelligent Control, Measurement and Signal Processing (ICMSP), Xi’an, China, 29 November–1 December 2024; pp. 549–556. [Google Scholar]
  35. Shaw, P.; Uszkoreit, J.; Vaswani, A. Self-attention with relative position representations. Assoc. Comput. Linguist. 2018, 2, 464–468. [Google Scholar]
  36. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar]
  37. Chang, M.-F.; Lambert, J.; Sangkloy, P.; Singh, J.; Bak, S.; Hartnett, A.; Wang, D.; Carr, P.; Lucey, S.; Ramanan, D. Argoverse: 3d tracking and forecasting with rich maps. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 8748–8757. [Google Scholar]
  38. Gu, J.; Sun, C.; Zhao, H. Densetnt: End-to-end trajectory prediction from dense goal sets. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 15303–15312. [Google Scholar]
  39. Gao, X.; Jia, X.; Li, Y.; Xiong, H. Dynamic scenario representation learning for motion forecasting with heterogeneous graph convolutional recurrent networks. IEEE Robot. Autom. Lett. 2023, 8, 2946–2953. [Google Scholar] [CrossRef]
Figure 1. Schematic of multi-agent trajectory prediction.
Figure 2. Overall four-step architecture of PLMT-Net. Step-1: data vectorization. Step-2 (Local Encoder): (2a) physics-aware neighbor formation via A_ij (Equation (4)), (2b) physics-guided local attention with w_ij (Equations (7)–(10)), and (2c) temporal modeling to produce h_i. Step-3: global interaction over local representations. Step-4: multimodal trajectory decoding.
Figure 3. Interactive neighbor screening strategy. (a) Fixed radius strategy, where r is the radius centered on the target agent. (b) Weighted score strategy.
Figure 4. Graph Attention Network.
Figure 5. Transformer Encoder.
Figure 6. Global interaction. Dashed lines represent local interaction. Solid lines represent global interaction.
Figure 7. Multimodal prediction.
Figure 8. Visualization of Prediction Results. The yellow dotted line represents the historical trajectory. The red solid line represents the actual trajectory. The blue dotted line represents the predicted trajectory. The results show that PLMT-Net predicts future trajectories accurately, even in complex scenarios like overtaking and diverging intersections, by incorporating physical priors into the interaction modeling. It ensures physically plausible predictions by focusing on relevant agents and guiding attention through these priors.
Table 1. Ablation Studies on the Interactive Neighbor Filtering Strategy.
| Screening Strategy | minADE | minFDE | MR |
|---|---|---|---|
| Top-K Selection (80%) | 0.67 | 1.02 | 0.09 |
| Top-K Selection (70%) | 0.68 | 1.03 | 0.10 |
| Top-K Selection (90%) | 0.67 | 1.03 | 0.10 |
| Min-score Thresholding | 0.69 | 1.03 | 0.10 |
Table 2. Ablation Studies on Introducing Physical Information.
| Model | minADE | minFDE | MR |
|---|---|---|---|
| Full Model | 0.67 | 1.02 | 0.09 |
| w/o Physics in Selection | 0.68 | 1.03 | 0.09 |
| w/o Physics in Attention | 0.69 | 1.03 | 0.10 |
| w/o Physics Overall | 0.69 | 1.04 | 0.10 |
Table 3. Ablation Studies in Each Module.
| A-A | Temporal | A-L | Global | minADE | minFDE | MR |
|---|---|---|---|---|---|---|
|  |  |  |  | 0.71 | 1.07 | 0.11 |
|  |  |  |  | 0.87 | 1.40 | 0.17 |
|  |  |  |  | 0.77 | 1.21 | 0.13 |
|  |  |  |  | 0.71 | 1.09 | 0.11 |
|  |  |  |  | 0.67 | 1.02 | 0.09 |
√ indicates that the module is included.
Table 4. End-to-end inference efficiency and memory footprint (RTX 3090).
| Setting (Batch) | Per-Batch Latency (ms) | Per-Sample Latency (ms) | Throughput (Samples/s) | Peak VRAM Allocated (MB) | Peak VRAM Reserved (MB) | Device |
|---|---|---|---|---|---|---|
| PLMT-Net (bs = 1) | 60.6 | 60.6 | 16.51 | 68.4 | 92.0 | RTX 3090 (24 GB) |
| PLMT-Net (bs = 8) | 123.2 | 15.4 | 64.93 | 224.0 | 436.0 | RTX 3090 (24 GB) |
Table 5. Comparison Results with the Existing Methods.
| Model | minADE | minFDE | MR | # Param | Latency (ms) |
|---|---|---|---|---|---|
| PLMT-Net | 0.67 | 1.02 | 0.09 | 652K | 60.6 |
| HiVT | 0.69 | 1.04 | 0.10 | 662K | 49.4 |
| LaneGCN | 0.71 | 1.08 | 0.10 | 3701K | 55.32 |
| DenseTNT | 0.75 | 1.05 | 0.10 | 1103K | 138.59 |
# Param denotes number of parameters.

Share and Cite

MDPI and ACS Style

Yu, W.; Liu, F.; Liu, H.; Chen, M.; Zhao, L. PLMT-Net: A Physics-Aware Lightweight Network for Multi-Agent Trajectory Prediction in Interactive Driving Scenarios. Drones 2025, 9, 826. https://doi.org/10.3390/drones9120826
