Article

Frequency-Domain Trajectory Planning for Autonomous Driving in Highly Dynamic Scenarios

1 Institute of Intelligent Vehicles, Shanghai Jiao Tong University, Shanghai 200240, China
2 China National Heavy Duty Truck Group Co., Ltd., Jinan 250002, China
3 School of Mechanical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(5), 2447; https://doi.org/10.3390/app16052447
Submission received: 28 January 2026 / Revised: 25 February 2026 / Accepted: 26 February 2026 / Published: 3 March 2026
(This article belongs to the Section Robotics and Automation)

Abstract

Trajectory planning is a central problem in autonomous driving, requiring long-horizon reasoning, strict safety guarantees, and robustness to rare but critical events. Recent learning-based planners increasingly formulate planning as an autoregressive sequence generation problem, analogous to large language models, where future motions are discretized into action tokens and predicted by Transformer-based neural sequence models. Despite promising empirical results, most existing approaches adopt time-domain action representations, in which consecutive actions are highly correlated. When combined with autoregressive decoding, this design induces degenerate generation behavior in learning-based planners, encouraging local action continuation and leading to rapid error accumulation during closed-loop execution, particularly in safety-critical corner cases such as sudden pedestrian emergence. To address this limitation of time-domain autoregressive planning, we propose a unified trajectory planning framework built upon three core ideas: (1) explicit action tokenization for long-horizon planning, (2) transformation of the action space from the time domain to the frequency domain, and (3) a hybrid learning paradigm that combines imitation learning with reinforcement learning. By representing future motion using compact frequency-domain action coefficients rather than per-timestep actions, the proposed planner is encouraged to reason about global motion intent before refining local details. This change in action representation fundamentally alters the inductive bias of learning-based autoregressive planning, mitigates exposure bias, and enables earlier and more decisive responses in complex and safety-critical environments. We present the model formulation, learning objectives, and training strategy, and outline a comprehensive experimental protocol.

1. Introduction

Autonomous driving systems must continuously generate safe, feasible, and comfortable trajectories in dynamic and uncertain environments [1]. A planning module is required to anticipate interactions with surrounding agents, respect traffic rules, and respond robustly to unexpected events, all while operating under real-time constraints. Classical planning pipelines rely on explicit modeling and optimization, offering strong interpretability and safety guarantees, but often struggle to scale across diverse scenarios and complex multi-agent interactions [1,2,3].
Learning-based trajectory planning has emerged as a promising alternative by leveraging large-scale driving datasets and powerful neural sequence models [4]. In particular, recent approaches reformulate planning as a sequence modeling problem, where future motions are discretized into tokens and generated autoregressively [5,6,7]. This paradigm closely mirrors the autoregressive generation process of large language models, enabling the use of Transformer architectures to reason over long horizons and complex scene contexts.
Despite these advantages, existing planning-as-sequence-modeling methods share a critical limitation: actions are almost universally represented in the time domain. Future trajectories are discretized at fixed temporal intervals, and each timestep corresponds to an action token encoding a motion increment or control command. While intuitive, this design introduces severe temporal redundancy. In realistic driving data, consecutive actions are highly correlated, particularly at high planning frequencies [8]. As a result, time-domain autoregressive models are strongly biased toward copying previously generated actions, a behavior that minimizes short-term prediction error but undermines long-horizon reasoning [9,10].
This degeneracy manifests in two major failure modes. First, small prediction errors accumulate rapidly during closed-loop execution. Because autoregressive planners condition on their own outputs, deviations from the expert distribution compound over time, leading to unstable behavior [9]. Second, and more critically, time-domain action representations hinder the planner’s ability to revise global motion intent. Safety-critical scenarios such as sudden pedestrian emergence from occlusion, abrupt cut-ins, or unexpected rule violations require immediate and decisive trajectory modification rather than gradual local adjustment [11]. In such cases, incremental timestep-level corrections are often insufficient, as the planner must reshape its future trajectory holistically to ensure safety.
Figure 1 illustrates this limitation in a sudden hazard scenario. When a pedestrian unexpectedly emerges into the ego vehicle’s path, a time-domain autoregressive planner fails to initiate early braking. Instead, it relies on locally smooth, step-wise action updates that strongly couple consecutive timesteps, resulting in delayed response and an increased risk of collision. By contrast, a frequency-domain planner promptly revises the entire future velocity profile, enabling earlier and more decisive braking.
Beyond behavioral outcomes, the difference between time-domain and frequency-domain planning is fundamentally reflected in the autoregressive generation process itself. As illustrated in Figure 2, time-domain planning generates trajectories by predicting one incremental action at each autoregressive step, requiring many sequential decisions to form a complete trajectory. In contrast, frequency-domain planning predicts global motion coefficients, where each autoregressive step refines the entire trajectory by introducing progressively higher-frequency components. This enables rapid global intent revision under unexpected disturbances, rather than relying on slow, local corrections.
These observations indicate that the limitations of existing learning-based planners arise primarily from the choice of action representation under autoregressive modeling, rather than insufficient model capacity or imperfect supervision. To address this issue, we propose to transform the action space from the time domain to the frequency domain. By applying a discrete cosine transform (DCT) to future action sequences and tokenizing compact frequency coefficients, the planner is encouraged to reason about global trajectory structure before refining local details [8,12]. Low-frequency coefficients capture coarse motion intent, while higher-frequency components encode fine-grained adjustments, fundamentally altering the inductive bias of autoregressive planning.
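The effect of this decomposition can be illustrated with a small numerical sketch (our own example, not taken from the paper), using SciPy's orthonormal DCT-II on a hypothetical 1-D velocity profile that contains a sudden brake:

```python
import numpy as np
from scipy.fft import dct, idct

# Hypothetical velocity profile over an 8-step horizon: cruise, then a sudden brake.
v = np.array([10.0, 10.0, 10.0, 10.0, 2.0, 2.0, 2.0, 2.0])

# Orthonormal DCT-II: coefficient 0 encodes the mean (global motion intent),
# while higher-index coefficients add progressively finer temporal detail.
C = dct(v, norm='ortho')

# Keep only the first K low-frequency coefficients and invert: the coarse
# reconstruction already captures the deceleration trend across the horizon.
K = 3
C_trunc = np.zeros_like(C)
C_trunc[:K] = C[:K]
v_coarse = idct(C_trunc, norm='ortho')
```

Revising a single low-frequency coefficient here reshapes the entire profile at once, which is the mechanism the paper exploits for decisive global intent revision.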
Building on this representation, we introduce a complete trajectory planning framework that integrates frequency-domain action tokenization with a hybrid imitation and reinforcement learning strategy [13,14]. Imitation learning grounds the planner in expert behavior [15,16], while reinforcement learning refines trajectories with respect to safety, feasibility, and rule compliance [17].
The main contributions of this work are summarized as follows:
  • Autoregressive Action Tokenization for Long-Horizon Planning. We formulate trajectory planning as an autoregressive sequence modeling problem over discrete action tokens, enabling structured and scalable long-horizon plan generation.
  • Frequency-Domain Action Representation. We redesign the action space by transforming time-domain action sequences into compact frequency-domain coefficients, mitigating temporal redundancy and autoregressive degeneration in safety-critical scenarios.
  • Hybrid Imitation and Reinforcement Learning Framework. We propose a unified training paradigm that combines imitation learning for representation grounding with reinforcement learning for optimizing safety, feasibility, and rule compliance.

2. Related Work

Trajectory planning for autonomous driving has been extensively studied across learning-based and optimization-based paradigms. Learning-based approaches commonly rely on imitation learning, reinforcement learning, or hybrid training to map observations to planning-oriented outputs, with recent advances driven by end-to-end transformers and large-scale sequence modeling. In parallel, optimization-based and reactive planners formulate driving as constrained decision making and receding-horizon optimization, offering strong interpretability and explicit safety handling in safety-critical interactions. Prior work therefore spans structured-representation imitation planners, reinforcement-based long-horizon policy optimization, interaction-aware predictive control and cooperative decision making, as well as time-domain autoregressive trajectory generation. More recent studies further explore compact and structured trajectory representations that improve long-horizon reasoning and planning stability.

2.1. Optimization-Based and Reactive Planning Methods

Optimization-based and reactive planning methods remain widely deployed in industrial autonomous driving stacks due to their interpretability, explicit constraint handling, and reliable safety-critical responses. These methods typically formulate motion planning as constrained optimization, enabling direct incorporation of vehicle dynamics, collision avoidance, comfort bounds, and rule compliance, while providing predictable failure modes under rare but hazardous disturbances.
A representative line of work focuses on shared-space driving and cooperative decision making, where the vehicle must negotiate priority and intent with vulnerable road users under ambiguous right-of-way. Varga et al. propose a cooperative decision-making framework for shared spaces that explicitly models pedestrian behavior and interaction, and demonstrates improved safety compared with rule-based baselines and model predictive control variants in mixed traffic scenarios [18]. By explicitly accounting for interaction when selecting actions, cooperative formulations offer a principled mechanism to reduce conflict and improve social compliance in cluttered urban environments.
Closely related research investigates interaction-aware model predictive decision making, which couples the planning objective with predictive models of other agents so that the optimization process anticipates how surrounding traffic participants may react. Varga et al. introduce an interaction-aware model predictive decision-making approach that integrates pedestrian trajectory prediction and interaction characterization into a receding-horizon optimization scheme, and report empirical validation in simulator studies [19]. Complementary work on interaction-aware model predictive control similarly emphasizes the integration of multi-modal, interaction-aware predictions into stochastic or chance-constrained optimization, improving robustness to behavioral uncertainty while maintaining feasibility and safety guarantees [20].
Although optimization-based and reactive planners provide strong safety assurances and fast disturbance response, their performance depends on the fidelity of interaction models, handcrafted cost design, and scenario-specific tuning. These limitations motivate learning-based planners to incorporate representations and training objectives that retain the advantages of global intent revision and long-horizon reasoning, while improving stability and controllability under safety-critical interactions.

2.2. Imitation-Based Planning Methods

Imitation learning (IL) has become a dominant paradigm for learning-based autonomous driving due to its simplicity, scalability, and strong empirical performance in structured traffic environments. Recent end-to-end driving systems demonstrate that policies trained via supervised imitation can effectively map sensory observations to planning-oriented outputs, providing a data-driven alternative to modular pipelines while retaining strong closed-loop performance [21,22]. To improve robustness and controllability, many modern IL-based planners avoid directly regressing low-level control commands and instead predict intermediate planning representations, such as future waypoints or full trajectories, which decouple perception uncertainty from downstream control execution [23,24,25,26].
Several large-scale end-to-end planning frameworks extend imitation learning by integrating structured scene representations and transformer-based sequence modeling. TransFuser fuses multi-view image features with Bird’s-Eye-View representations to predict waypoints in a sensor-aligned space [23]. UniAD and VAD formulate planning as token-based reasoning over vectorized scene elements, where agent- and map-centric tokens interact through attention to produce trajectory-level plans under supervised learning [24,25]. GenAD and Think2Drive further strengthen long-horizon interaction and intent modeling by introducing instance-centric scene tokenization and explicit agent–map coupling, achieving state-of-the-art performance on challenging benchmarks such as nuPlan and Bench2Drive [26,27].
These transformer-based planners share a common characteristic in that they tokenize scene structure to facilitate interaction reasoning, while their planning outputs are still typically generated as time-domain trajectories or actions. By contrast, our work tokenizes the action or trajectory representation itself and adopts a frequency-structured parameterization to reduce temporal coupling and improve long-horizon autoregressive planning stability.
Despite their success, IL-based planners fundamentally rely on the coverage and quality of expert demonstrations. Large-scale driving datasets predominantly capture nominal driving behavior, whereas safety-critical scenarios such as rare pedestrian emergence, aggressive cut-ins, or multi-agent deadlocks are severely underrepresented [28,29]. As a result, imitation policies often suffer from distribution shift during deployment: small deviations from expert trajectories can accumulate over time, leading to compounding errors and unsafe behavior in long-horizon planning [21,27]. These limitations highlight the intrinsic difficulty of extrapolating beyond the expert data manifold and motivate the incorporation of additional learning signals beyond pure imitation, such as RL or hybrid training paradigms.

2.3. Reinforcement-Based and Hybrid Planning Methods

Reinforcement learning (RL) provides a principled framework for optimizing long-horizon objectives such as safety, efficiency, and comfort through interaction with the environment. In autonomous driving, RL has been widely investigated for decision making and motion planning, particularly in scenarios requiring interaction reasoning and constraint satisfaction [30]. However, despite its theoretical appeal, applying RL to realistic driving systems remains challenging due to the high dimensionality of state and action spaces, sparse and delayed reward signals, and the prohibitive sample complexity associated with closed-loop training in complex environments [30,31].
To address these limitations, hybrid paradigms that combine IL with RL have attracted increasing attention. A representative line of work introduces imitation-regularized RL objectives, where deviation from expert demonstrations is penalized during policy optimization, thereby stabilizing learning and mitigating unsafe exploration [32]. In particular, Lu et al. propose to jointly optimize a behavior cloning loss with a Soft Actor-Critic objective, showing that imitation regularization significantly improves robustness and safety in challenging driving scenarios compared to pure RL or IL baselines [32]. Similar ideas have also been explored in the broader robotics and control literature, where imitation priors are used to constrain policy updates and improve sample efficiency [33,34].
Another emerging direction follows a two-stage pre-training and fine-tuning paradigm inspired by large-scale sequence modeling. In this setting, a policy is first trained via IL to acquire basic driving competence and subsequently refined using RL to optimize task-level objectives and handle distributional shift [35,36]. Such approaches have been shown to improve closed-loop performance and generalization on large-scale benchmarks such as nuPlan, particularly in interactive and safety-critical scenarios [35]. More recent studies further explore RL guided by learned reward models or preference supervision, drawing inspiration from RL from human feedback [37]. While effective in aligning driving behavior with human preferences, these methods introduce additional annotation cost and subjectivity, raising concerns about scalability and reproducibility in real-world autonomous driving deployments.
Overall, hybrid IL+RL paradigms represent a practical compromise between stability and optimality. Nevertheless, most existing approaches remain tightly coupled to time-domain action representations and low-level control formulations, which limits their ability to reason over long horizons and revise global motion intent under safety-critical disturbances. This observation motivates exploring alternative action representations and optimization strategies that better align RL with long-horizon trajectory planning.

2.4. Time-Domain Autoregressive Planning

Time-domain autoregressive modeling has recently emerged as an important paradigm in autonomous driving for motion generation, agent simulation, and trajectory planning. In these approaches, future motions are generated sequentially in a causal manner, where each token represents a state, action, or motion primitive at a discrete future timestep and is predicted conditioned on all previously generated tokens. By formulating motion generation as a next-token prediction problem, autoregressive models naturally support closed-loop rollout and enable flexible modeling of multi-modal future behaviors.
A representative line of work applies time-domain autoregressive modeling to multi-agent motion generation and simulation. SMART formulates scalable multi-agent motion generation as a sequence modeling problem, where discretized motion tokens are generated autoregressively using a decoder-only transformer conditioned on historical agent states and map context [6]. This next-token prediction formulation demonstrates strong performance on large-scale simulation benchmarks and highlights the effectiveness of autoregressive decoding for modeling complex multi-agent interactions. Similarly, KiGRAS adopts an autoregressive factorization over time to generate physically plausible agent trajectories, incorporating kinematic constraints into the sequential generation process to ensure dynamic feasibility while preserving generative diversity [7].
Beyond simulation, time-domain autoregressive models have also been explored for trajectory planning. Plan-R1 explicitly formulates safe and feasible trajectory planning as a language modeling problem, where future trajectory segments are generated autoregressively in the time domain and optimized using a combination of IL and RL objectives [38]. By predicting future motion tokens sequentially, such planners can model long-horizon trajectories while maintaining causal consistency with past decisions.
Despite their expressive power, time-domain autoregressive planners exhibit inherent limitations. Because each future token is conditioned strictly on previously generated tokens, prediction errors can accumulate over time, leading to strong temporal coupling and local continuity bias. As a result, once a suboptimal trajectory prefix is generated, autoregressive planners may struggle to revise global motion intent under unexpected disturbances, such as sudden cut-ins or abrupt changes in traffic conditions. These limitations motivate the exploration of alternative representations and modeling paradigms that enable more global reasoning over long-horizon trajectories.
Notably, many recent transformer-based planning frameworks can also be interpreted through a sequence modeling lens: planning outputs are decoded from a set of interacting tokens (e.g., ego/agent/map queries) and are often generated autoregressively or causally conditioned on past context, even when the tokenization is performed on structured scene elements rather than per-timestep actions.

2.5. Frequency-Domain Trajectory Representations

Frequency-domain representations offer a structured alternative to time-domain action encoding by parameterizing long-horizon behaviors using compact basis functions or tokenized action segments, rather than per-timestep values. Instead of modeling trajectories as densely sampled time sequences, frequency-aware representations emphasize global motion patterns and suppress high-frequency redundancy, enabling models to reason about coarse structure before refining local details.
Recent advances in robotics provide concrete evidence that frequency-inspired action representations can significantly improve autoregressive sequence modeling. FAST introduces an efficient action tokenization scheme for vision-language-action models, where continuous robot actions are projected into a compact frequency-based token space using DCT, enabling autoregressive policies to predict informative action chunks rather than individual timesteps [8]. This design substantially reduces temporal redundancy and improves learning stability, especially for high-frequency and dexterous control tasks.
Building on this idea, FASTer further develops neural action tokenization for autoregressive control by learning structured action tokens that encode temporally extended motion segments [39]. By generating action sequences at the token level instead of the timestep level, FASTer demonstrates improved inference efficiency and robustness while maintaining precise control execution. These works collectively highlight that frequency-aware or chunk-level representations can fundamentally alleviate the limitations of time-domain autoregressive decoding, particularly the strong temporal coupling and error accumulation induced by next-step prediction.
While FAST and FASTer focus on robotic manipulation and vision-language control, their core insights are directly relevant to autonomous driving. Specifically, they suggest that representing long-horizon behaviors in a frequency-structured or temporally compressed space can improve both efficiency and stability of autoregressive planning. However, existing frequency-aware action tokenization methods have not yet been explored in the context of long-horizon autonomous driving planning under complex multi-agent interactions and safety constraints, leaving a clear gap for further investigation.

3. Methods

This section presents the proposed long-horizon planning framework based on frequency-domain autoregressive action generation. We first formulate trajectory planning as sequential action generation and introduce the time-domain incremental action representation used throughout the paper. We then describe the baseline time-domain autoregressive planning formulation and contrast it with our frequency-domain autoregressive planning strategy, including DCT-based parameterization, quantization, and tokenization via byte-pair encoding (BPE). Next, we present a unified autoregressive planning architecture that maps structured scene context to frequency-domain action tokens and reconstructs executable trajectories through deterministic decoding and kinematic rollout. Finally, we introduce the hybrid learning paradigm that combines IL and RL with Group Relative Policy Optimization (GRPO) to train the proposed planner.

3.1. Problem Formulation

We consider the problem of long-horizon trajectory planning for an autonomous agent operating in a dynamic environment. At each planning cycle, the agent observes a structured scene context denoted by $\mathcal{S}$, which includes static map elements, the states of surrounding agents, traffic rules, and the current ego state. The objective of the planner is to generate a future motion plan over a finite horizon that is safe, dynamically feasible, and aligned with task-level goals.
Instead of directly predicting absolute future states, we formulate planning as generating a sequence of control actions over a time horizon H. Specifically, a trajectory is represented by a sequence of incremental motions
$$A = [a_1, a_2, \ldots, a_H], \qquad a_t = (\Delta x_t, \Delta y_t, \Delta \theta_t),$$
where each $a_t$ describes the ego agent's relative displacement and heading change at step $t$. This action-based formulation is widely adopted in learning-based motion planning, as it supports explicit modeling of vehicle dynamics and feasibility constraints. Figure 3 provides a schematic illustration of the incremental action representation $(\Delta x, \Delta y, \Delta \theta)$. Throughout the paper, "action sequence" consistently refers to this time-domain incremental motion representation; planners differ only in how the sequence is produced, either by directly predicting time-domain action tokens (the baseline) or by decoding frequency-domain tokens to reconstruct the same sequence (our method).
Conventional learning-based planners operate directly in the time domain by predicting $A$ in an autoregressive manner. In contrast, we reformulate long-horizon planning by transforming the time-domain action sequence into a frequency-domain representation. Concretely, for each action dimension $(\Delta x, \Delta y, \Delta \theta)$, the sequence over the horizon $H$ is mapped to a set of frequency coefficients via a deterministic transform. This representation decomposes the action sequence into components corresponding to different temporal frequencies, where low-frequency components capture global motion trends and high-frequency components encode local refinements.
For each action dimension, applying the DCT to an $H$-step action sequence yields a full frequency coefficient sequence of length $L$, with $L = H$ before truncation. To obtain a compact planning representation, we retain only the first $K$ low-frequency coefficients, where $K \ll L$. Planning is then performed in the frequency domain by predicting a sequence of frequency-domain action tokens
$$z = [z_1, z_2, \ldots, z_K], \qquad z_k \in \mathcal{V},$$
where each token represents a discretized retained low-frequency coefficient drawn from a finite vocabulary $\mathcal{V}$. Accordingly, the prediction length is $K$, rather than the full coefficient length $L$ or the time-domain horizon $H$.
To avoid ambiguity, we distinguish among the full frequency coefficients, the retained coefficients, and the frequency-domain tokens throughout the paper. Let $C_{1:L}$ denote the full DCT coefficient sequence obtained from the complete time-domain action sequence. We retain only the first $K$ low-frequency coefficients, denoted $C_{1:K}$ with $K \ll L$, for planning. Frequency-domain tokens refer to the discrete sequence $z$ obtained by quantizing and encoding the retained coefficients $C_{1:K}$; this is the representation consumed and generated by the autoregressive transformer.
Given a predicted token sequence, a deterministic decoding process maps tokens back to continuous frequency coefficients and reconstructs the full time-domain action sequence A via an inverse transform. The final trajectory is then obtained by integrating the reconstructed actions starting from the current ego state.
This formulation shifts long-horizon planning from timestep-wise action prediction to structured sequence generation over frequency components. By operating on frequency-domain actions, the planner makes decisions at the level of global motion patterns rather than individual timesteps, mitigating error accumulation and temporal redundancy commonly observed in autoregressive time-domain planners. The specific construction of the frequency-domain action space and the associated encoding and decoding procedures are described later in this section.
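The encode–decode round trip described above can be sketched as follows. This is a minimal illustration assuming an orthonormal DCT applied independently per action dimension; quantization and tokenization, described later in this section, are omitted, and the function names are our own:

```python
import numpy as np
from scipy.fft import dct, idct

def encode(A, K):
    """Map an (H, 3) incremental-action sequence to its first K DCT
    coefficients per action dimension (the retained block C_{1:K})."""
    C = dct(A, axis=0, norm='ortho')  # full coefficients, L = H
    return C[:K]

def decode(C_K, H):
    """Zero-pad the retained coefficients back to length H and apply the
    inverse DCT to reconstruct a time-domain action sequence."""
    C_full = np.zeros((H, C_K.shape[1]))
    C_full[:C_K.shape[0]] = C_K
    return idct(C_full, axis=0, norm='ortho')

H, K = 40, 8
# Smooth synthetic (dx, dy, dtheta) increments for illustration.
A = np.random.randn(H, 3).cumsum(axis=0) * 0.01
A_rec = decode(encode(A, K), H)  # coarse reconstruction of A
```

With $K = H$ the round trip is exact; truncating to $K \ll H$ discards only the high-frequency detail while preserving the global motion trend.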

3.2. Time-Domain Autoregressive Planning via Tokenized Actions

In conventional learning-based trajectory planning, future motion is commonly parameterized in the time domain as a sequence of incremental control actions and generated in an autoregressive manner. Given the observed scene context $\mathcal{S}$, the planner predicts a sequence of relative motions over a fixed planning horizon $H$, where each action is conditioned on all previously generated actions. This leads to the following autoregressive factorization:
$$P(A \mid \mathcal{S}) = \prod_{t=1}^{H} P(a_t \mid a_{<t}, \mathcal{S}),$$
where each action $a_t = (\Delta x_t, \Delta y_t, \Delta \theta_t)$ represents the ego agent's incremental displacement and heading change at timestep $t$. As illustrated in Figure 4, each incremental motion is treated as an action token drawn from a time-domain action vocabulary $\mathcal{V}_{\text{time}}$, and tokens are generated sequentially along the temporal axis. The planning horizon $H$ therefore directly determines the length of the token sequence.
During training, the conditional distribution is optimized via maximum likelihood with teacher forcing. Rather than conditioning directly on raw action vectors, ground-truth incremental actions are first propagated through a kinematic model to obtain ego poses, and the planner conditions on these poses when predicting subsequent actions. At inference time, actions are generated sequentially in a closed-loop manner: each predicted action is integrated by the same kinematic model, and the resulting pose is fed back as input for the next prediction. The final trajectory is obtained by accumulating the predicted incremental motions over the planning horizon.
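The pose integration used in both teacher forcing and closed-loop rollout can be sketched as a simple unicycle-style accumulation, in which each body-frame increment is rotated into the global frame before being applied; the paper's exact kinematic model may differ:

```python
import math

def rollout(actions, x0=0.0, y0=0.0, theta0=0.0):
    """Accumulate incremental actions (dx, dy, dtheta), expressed in the
    current ego frame, into a sequence of global poses (x, y, theta)."""
    poses = [(x0, y0, theta0)]
    x, y, th = x0, y0, theta0
    for dx, dy, dth in actions:
        # Rotate the body-frame displacement into the global frame.
        x += dx * math.cos(th) - dy * math.sin(th)
        y += dx * math.sin(th) + dy * math.cos(th)
        th += dth
        poses.append((x, y, th))
    return poses

# Straight motion: three 1 m forward steps with no heading change
# accumulate to the global pose (3.0, 0.0, 0.0).
poses = rollout([(1.0, 0.0, 0.0)] * 3)
```

At inference time, each newly predicted action extends this rollout by one step, and the resulting pose is fed back as the conditioning input for the next prediction.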
Under this formulation, autoregressive generation proceeds at the level of fine-grained control increments. Because each action token corresponds to a local motion update and predictions are conditioned on previously generated states, long-horizon behavior is implicitly constructed through the sequential composition of these local decisions. This structural property of time-domain autoregressive planning motivates alternative representations that model long-horizon intent more directly, which we introduce next.

3.3. Frequency-Domain Autoregressive Planning

We construct the planner’s action space in the frequency domain and perform autoregressive generation over frequency components rather than timestep-wise actions. This section describes the complete transformation pipeline from time-domain incremental action sequences to discrete frequency-domain tokens, as well as the corresponding reconstruction process.

3.3.1. Action Tokenization via BPE-Based Compression

Direct discretization of action sequences in either the time domain or per-frequency coefficient space would result in an excessively large vocabulary whose size grows with planning horizon and action dimensionality. Such a vocabulary is impractical for autoregressive modeling and prone to overfitting.
To address this issue, we introduce a frequency-domain action tokenization scheme that compresses long-horizon action sequences into compact token sequences. Figure 5 illustrates the overall pipeline of the proposed approach. The key idea is to transform incremental control actions into the frequency domain and apply sub-sequence compression to obtain a reusable discrete vocabulary.
For tokenizer construction, we operate on full-length action sequences represented as incremental motions $a_t = (\Delta x_t, \Delta y_t, \Delta\theta_t)$ over the planning horizon. Given an action sequence $A$, we apply the DCT independently along each action dimension:
$C = \phi_{\mathrm{DCT}}(A) = [c_1, c_2, \ldots, c_L], \qquad c_k \in \mathbb{R}^3,$
where lower-index coefficients correspond to low-frequency components encoding global motion trends, and higher-index coefficients capture finer temporal variations of the action sequence.
The frequency coefficients are uniformly quantized to map continuous-valued coefficients into a discrete symbol space,
$\tilde{C} = \mathrm{round}(s \cdot C).$
We set the scaling factor s to map typical coefficient magnitudes to a stable integer range for tokenizer training. Concretely, s is chosen on the tokenizer training corpus using a robust statistic, and we fix s = 20 in all experiments. This discretization is necessary because byte-pair encoding (BPE) [40] operates on discrete symbol sequences and relies on frequency statistics to reuse common patterns across trajectories. After quantization, coefficients with similar magnitudes are mapped to identical symbols, enabling BPE to identify and merge frequently co-occurring contiguous subsequences. The quantized coefficients are finally flattened into a one-dimensional symbol stream, which serves as the input for BPE training.
BPE iteratively merges frequent subsequences into higher-level symbols, producing a compact discrete vocabulary V with a predefined size.
The tokenizer is trained offline on the entire dataset of action sequences and remains fixed during policy learning and inference. By training on full-length frequency representations, the tokenizer captures common structural patterns across frequency bands while decoupling the vocabulary size from the planning horizon.
For BPE training, we use a subset of $1 \times 10^{5}$ expert action sequences sampled from the nuPlan training split used in our experiments to fit the tokenizer. Starting from the integer symbol stream obtained by flattening $\tilde{C}$, we learn a BPE vocabulary with a target size $|V|$ by performing merge operations until the vocabulary budget is reached. In our implementation, we set $|V| = 2048$ and run 478 merge iterations. The learned tokenizer is fixed after offline training and is shared by all models reported in this paper.
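The BPE merge procedure on integer symbol streams can be sketched as below. This is a toy trainer for illustration only (the stream, merge budget, and helper names are assumptions, not the production tokenizer); it shows how frequent adjacent coefficient symbols are merged into reusable vocabulary entries.

```python
# Toy BPE trainer on an integer symbol stream: repeatedly merge the
# most frequent adjacent pair into a new higher-level symbol.
from collections import Counter

def train_bpe(stream, num_merges):
    """Returns the merge table (pair -> new symbol id) and the
    compressed stream after applying all merges."""
    merges = {}
    next_id = max(stream) + 1
    seq = list(stream)
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing frequent enough to merge
        merges[(a, b)] = next_id
        # apply the merge left-to-right
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == (a, b):
                out.append(next_id)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
        next_id += 1
    return merges, seq
```

Each merge both shortens typical token sequences and decouples the vocabulary from the planning horizon, since merged symbols represent recurring coefficient patterns rather than fixed timesteps.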

3.3.2. Frequency-Domain Action Parameterization

During planning, given a future action sequence $A$, we first compute its full frequency-domain representation via the DCT,
$C_{1:L} = \phi_{\mathrm{DCT}}(A),$
where $L$ is the length of the full coefficient sequence along the temporal axis. Concretely, we use the DCT-II: for each action dimension, given a length-$H$ sequence $\{a_t\}_{t=0}^{H-1}$, the $l$-th DCT coefficient is
$c_l = \alpha_l \sum_{t=0}^{L-1} a_t \cos\!\left[\frac{\pi (2t+1)(l-1)}{2L}\right], \qquad l = 1, \ldots, L,$
where $\alpha_1 = \sqrt{1/L}$ and $\alpha_l = \sqrt{2/L}$ for $l \ge 2$. For each action dimension, the full DCT coefficient sequence has the same length as the temporal horizon, so $L = H$ before coefficient truncation; throughout, $L$ denotes the full-frequency coefficient length, while $K$ denotes the number of retained low-frequency coefficients used for planning. We apply this transform independently to $\Delta x$, $\Delta y$, and $\Delta\theta$. We then retain only the first $K$ low-frequency coefficients,
$C_{1:K} = [c_1, \ldots, c_K], \qquad K \ll L.$
Intuitively, the low-frequency components capture the dominant long-horizon structure that determines the overall motion intent, whereas the discarded high-frequency terms primarily reflect short-horizon execution-level fluctuations and noise in demonstrations, rather than planning-level decisions. Empirically, keeping only C 1 : K leads to almost no loss in closed-loop accuracy, and the effect of the truncation length K is further analyzed in Section 4.6.1.
The retained coefficients are then quantized and flattened,
$q = \mathrm{flatten}\big(\mathrm{round}(s \cdot C_{1:K})\big),$
yielding an integer-valued symbol sequence q .
This sequence is subsequently mapped to a compact discrete token sequence via a pretrained BPE tokenizer,
$\tilde{z} = T_{\mathrm{BPE}}(q), \qquad z = \mathrm{PadTrunc}_K(\tilde{z}) = [z_1, z_2, \ldots, z_K], \qquad z_k \in V,$
where $T_{\mathrm{BPE}}$ denotes a deterministic tokenization/compression operator with a fixed vocabulary $V$, and $\mathrm{PadTrunc}_K(\cdot)$ converts the variable-length BPE output into a fixed-length token sequence of length $K$ by zero-padding when the output is shorter than $K$ and truncating it otherwise. We refer to $C_{1:L}$ as the full frequency coefficient sequence, $C_{1:K}$ as the retained low-frequency coefficients used for planning, and $z$ as the corresponding fixed-length discrete frequency-domain token sequence used by the autoregressive planner.
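Under the assumptions stated above (an orthonormal DCT-II, scaling factor s = 20, and K retained coefficients), the encoding path from incremental actions to the flattened symbol sequence q can be sketched as follows; the helper names are illustrative.

```python
# Sketch of the frequency-domain encoding pipeline:
# DCT-II per action dimension -> truncate to K low frequencies
# -> uniform quantization -> flatten into an integer symbol stream.
import numpy as np

def dct_ii_matrix(L):
    """Orthonormal DCT-II basis: row l (0-based) holds the l-th cosine."""
    t = np.arange(L)
    M = np.cos(np.pi * (2 * t[None, :] + 1) * np.arange(L)[:, None] / (2 * L))
    M[0] *= np.sqrt(1.0 / L)
    M[1:] *= np.sqrt(2.0 / L)
    return M

def encode_actions(A, K=16, s=20.0):
    """A: (H, 3) incremental actions (dx, dy, dtheta).
    Returns the flattened integer symbol sequence q of length K*3."""
    H = A.shape[0]
    C = dct_ii_matrix(H) @ A           # (H, 3) full coefficient matrix
    C_low = C[:K]                      # keep K low-frequency rows
    q = np.round(s * C_low).astype(int)
    return q.flatten()
```

For a constant action sequence, only the first (DC) coefficient per dimension is nonzero, which makes the "low frequencies carry global intent" property concrete.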

3.3.3. Autoregressive Modeling with Frequency-Domain Tokens

The planner models the conditional distribution over frequency-domain action tokens as
$P(z \mid S) = \prod_{k=1}^{K} P(z_k \mid z_{<k}, S),$
and generates tokens autoregressively conditioned on the observed scene context S .
Unlike timestep-wise autoregression over incremental actions, autoregressive prediction in the frequency domain operates over semantically meaningful components. Early tokens primarily determine low-frequency coefficients and thus commit to global motion intent, while later tokens progressively refine higher-frequency details. Because frequency bases are orthogonal, predicting subsequent tokens cannot be trivially reduced to copying or smoothing previous outputs.
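The orthogonality claim can be checked numerically. The snippet below builds the orthonormal DCT-II basis (a standard construction, assumed here for illustration) and verifies that its Gram matrix is the identity, i.e., no coefficient is a linear copy of another.

```python
# Numerical check that the orthonormal DCT-II rows form an
# orthonormal set: M @ M.T should equal the identity matrix.
import numpy as np

L = 16
t = np.arange(L)
M = np.cos(np.pi * (2 * t[None, :] + 1) * np.arange(L)[:, None] / (2 * L))
M[0] *= np.sqrt(1.0 / L)
M[1:] *= np.sqrt(2.0 / L)

gram = M @ M.T                        # Gram matrix of the basis rows
print(np.allclose(gram, np.eye(L)))   # True
```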

3.3.4. Trajectory Synthesis from Frequency-Domain Actions

After token generation, the discrete token sequence z is decoded back into quantized frequency coefficients via the inverse BPE mapping,
$\hat{C}_{1:K} = \mathrm{reshape}\big(T_{\mathrm{BPE}}^{-1}(z)\big) / s.$
The time-domain action sequence is then reconstructed deterministically using the inverse DCT,
$\hat{A} = \phi_{\mathrm{DCT}}^{-1}(\hat{C}_{1:K}) = [\hat{a}_1, \ldots, \hat{a}_H].$
The predicted frequency-domain tokens are deterministically decoded to reconstruct a time-domain incremental action sequence in the same action space as in Equation (1). These reconstructed incremental actions are then integrated by a kinematic model to obtain the executable ego trajectory. Operating in the frequency domain, together with BPE-based token compression, mitigates timestep-wise autoregressive degeneration. In particular, generation errors manifest as small perturbations across frequency components of the full action sequence, rather than being propagated step by step during rollout, which leads to more stable planning and more coherent long-horizon behavior.
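A minimal sketch of this decoding path, assuming the same orthonormal DCT-II convention as in the encoding (helper names are illustrative): de-quantize the retained coefficients, zero-pad the discarded high frequencies, and apply the inverse transform.

```python
# Sketch of trajectory synthesis from quantized low-frequency symbols:
# de-quantization, zero-padding of high frequencies, inverse DCT-II.
import numpy as np

def dct_ii_matrix(L):
    """Orthonormal DCT-II basis (standard construction)."""
    t = np.arange(L)
    M = np.cos(np.pi * (2 * t[None, :] + 1) * np.arange(L)[:, None] / (2 * L))
    M[0] *= np.sqrt(1.0 / L)
    M[1:] *= np.sqrt(2.0 / L)
    return M

def decode_actions(q, H, K=16, s=20.0):
    """q: flattened integer symbols of length K*3 -> (H, 3) actions."""
    C_low = q.reshape(K, 3).astype(float) / s   # de-quantize
    C_full = np.zeros((H, 3))
    C_full[:K] = C_low                          # discarded frequencies = 0
    M = dct_ii_matrix(H)
    return M.T @ C_full                         # inverse of orthonormal DCT
```

Because only low-frequency rows are populated, the reconstructed sequence is smooth by construction, matching the argument above that truncation acts as a structural smoothness prior.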

3.4. Unified Autoregressive Planning Architecture

We propose a modular trajectory planning architecture that maps structured scene context to a continuous future trajectory through an intermediate sequence of frequency-domain action tokens. As illustrated in Figure 6, the model decomposes the planning process into three components: scene context encoding, autoregressive frequency-token generation, and deterministic trajectory reconstruction. This design separates high-level intent inference from low-level trajectory realization, improving interpretability and modeling stability in long-horizon planning.

3.4.1. Scene Representation

At each planning step, the agent observes a scene context S composed of heterogeneous environment elements, including static map geometry and lane topology, traffic lights and traffic rules, the ego vehicle state, and the states of surrounding agents. These elements jointly describe the spatial layout, dynamic interactions, and driving constraints of the current environment.
Each element in S is embedded into a shared D-dimensional latent space using a type-specific embedding function. This allows the model to preserve the semantic characteristics of different scene elements while enabling unified reasoning in a common representation space.

3.4.2. Action Token Embedding

All embedded scene elements are collected into a set of scene tokens $E_{\mathrm{scene}} \in \mathbb{R}^{N \times D}$, where $N$ denotes the number of elements in the scene, including map geometry, surrounding agents, traffic lights, and the ego vehicle. Each element is represented in the same $D$-dimensional latent space.
Previously generated frequency-domain action tokens, when available, are embedded through a learned token embedding layer and mapped to the same latent dimension D. At the initial autoregressive step, no action token is provided and the input consists solely of scene tokens. From the second step onward, the embedding of the previously generated token is appended to the scene token set and jointly processed by the Transformer to predict the next frequency token.

3.4.3. Unified Transformer Blocks

The scene token embeddings and action token embeddings are concatenated to form a single input sequence,
$X_0 = \mathrm{Concat}(E_{\mathrm{scene}}, E_{\mathrm{token}}).$
This sequence is processed by a stack of $L_{\mathrm{blk}} = 6$ Transformer blocks with a shared architecture, each consisting of layer normalization, multi-head self-attention, and a position-wise feed-forward network with residual connections.
Self-attention enables joint reasoning over static map structure, dynamic agent interactions, and previously generated action tokens. Causal masking is applied to the action token positions to ensure autoregressive consistency, such that each token prediction depends only on the scene context and earlier tokens.
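A minimal single-head sketch of this masking scheme, under the assumption that scene tokens attend freely while action-token positions are causally restricted; the exact attention layout and projections of the implementation are not specified here, so identity projections are used for illustration.

```python
# Causal masking over action-token positions: each action token may
# attend to all scene tokens and to earlier action tokens only.
import numpy as np

def attention_mask(n_scene, n_action):
    """Boolean mask (True = may attend), shape (N, N), N = n_scene + n_action."""
    N = n_scene + n_action
    mask = np.ones((N, N), dtype=bool)
    act = np.arange(n_action)
    # causal constraint applies only among action-token positions
    mask[n_scene:, n_scene:] = act[:, None] >= act[None, :]
    return mask

def masked_attention(X, mask):
    """Single-head self-attention with identity Q/K/V projections (sketch)."""
    scores = X @ X.T / np.sqrt(X.shape[1])
    scores = np.where(mask, scores, -np.inf)     # forbid masked positions
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ X
```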

3.4.4. Autoregressive Token Prediction

The hidden representation corresponding to the current token position is projected onto the discrete frequency-domain token vocabulary via a linear layer followed by a softmax operation,
$P(z_k \mid z_{<k}, S) = \mathrm{softmax}(W_{\mathrm{out}} h_k),$
where h k denotes the output embedding at the current token position. The next action token is sampled from this distribution and fed back into the model for the next autoregressive step.
By modeling planning as autoregressive generation over frequency-domain action tokens conditioned on a unified scene representation, the proposed architecture naturally captures long-horizon dependencies while maintaining a compact and interpretable action space. The simplicity of the architecture facilitates stable training and efficient inference, while the Transformer-based formulation allows flexible interaction modeling across diverse scene elements.

3.4.5. Trajectory Reconstruction

Once the predicted token sequence corresponding to the retained low-frequency coefficients, z 1 : K , is generated, it is first decoded into a sequence of quantized frequency coefficients via the inverse tokenization process. After de-quantization, we obtain the continuous low-frequency coefficients C ^ 1 : K , which parameterize the future action sequence in the frequency domain. These coefficients are then mapped back to the time domain through an inverse frequency transform, yielding a sequence of incremental control actions A ^ . Because the reconstruction operates on a truncated set of low-frequency coefficients, the resulting action sequence is globally smooth by construction, reflecting coherent long-horizon motion intent rather than timestep-level corrections.
Finally, the continuous ego trajectory is obtained by integrating the reconstructed action sequence through a deterministic vehicle kinematic model,
$\hat{X} = \mathcal{K}(\hat{A}), \qquad \hat{X} = [\hat{x}_1, \ldots, \hat{x}_H],$
where $\mathcal{K}(\cdot)$ denotes the forward kinematic rollout from incremental actions to absolute ego poses.
By explicitly separating frequency-domain plan generation, action-space realization, and kinematic integration, the proposed framework decouples long-horizon decision making from low-level trajectory execution, providing a structured and physically consistent foundation for autonomous planning.

3.5. Hybrid Learning for Frequency-Domain Planning

The proposed planner is trained using a hybrid learning paradigm that combines IL and RL. IL provides a strong initialization from expert demonstrations, while RL enables explicit optimization of task objectives and safety-related criteria through closed-loop interaction. The frequency-domain trajectory representation plays a central role in stabilizing both learning stages by exposing a compact, globally meaningful action space.

3.5.1. Imitation Learning Pretraining

We first train the planner using IL on a dataset of expert demonstrations. Each demonstration consists of a scene context $S$ and an expert action sequence $A$, which specifies the incremental motions executed by the expert over the planning horizon. The expert action sequence is transformed into the frequency domain using the encoding function $\phi_{\mathrm{DCT}}$, yielding the retained low-frequency coefficients $C_{1:K}$, which compactly parameterize the long-horizon motion intent. These coefficients are subsequently quantized and tokenized into a discrete frequency-domain token sequence $z$.
The planner is trained to maximize the likelihood of expert tokens under an autoregressive policy, i.e., to minimize the negative log-likelihood loss
$\mathcal{L}_{\mathrm{IL}} = -\,\mathbb{E}_{(S,z)}\!\left[\sum_{k=1}^{K} \log P(z_k \mid z_{<k}, S)\right].$
Supervision in the frequency domain exhibits substantially lower variance than timestep-level imitation, as low-frequency action coefficients encode stable global motion intent and are less sensitive to local execution noise. As a result, imitation learning constrains the policy to a compact manifold of reasonable long-horizon action plans and provides a strong initialization for subsequent reinforcement learning.
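The imitation objective is a standard token-level negative log-likelihood; a minimal sketch follows, with `logits` standing in for the planner's pre-softmax outputs (the function name and shapes are illustrative).

```python
# Average negative log-likelihood of the expert token at each of the
# K autoregressive steps, computed with a numerically stable softmax.
import numpy as np

def il_loss(logits, targets):
    """logits: (K, V) per-step scores; targets: (K,) expert token ids."""
    z = logits - logits.max(axis=1, keepdims=True)        # stability shift
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()
```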

3.5.2. Reinforcement Learning with GRPO

After IL, we further refine the planner using RL in a closed-loop simulation environment. For a given scene context S , the policy samples a sequence of frequency-domain tokens z 1 : K P θ ( · S ) , which are deterministically decoded into a continuous trajectory and executed in the environment to obtain a trajectory-level return R. RL therefore operates directly on complete trajectory hypotheses rather than individual low-level control actions. The overall workflow of the grouped relative policy optimization (GRPO) algorithm is illustrated in Figure 7, providing a visual summary of the sampling, decoding, execution, and return computation steps.
We adopt GRPO to update the policy. For each scene context, the policy samples a group of M candidate token sequences { z ( i ) } i = 1 M , yielding corresponding returns { R ( i ) } . Advantages are computed by normalizing returns within the group,
$\hat{A}^{(i)} = \frac{R^{(i)} - \mu_R}{\sigma_R + \epsilon},$
where μ R and σ R denote the mean and standard deviation of returns in the group. This group-relative formulation eliminates the need for a learned value function and focuses policy updates on relative trajectory quality within the same scene.
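The group-relative advantage computation amounts to standardizing the returns within each group of M rollouts; a minimal sketch:

```python
# Standardize trajectory-level returns within one group of rollouts
# sampled for the same scene, replacing a learned value baseline.
import numpy as np

def group_advantages(returns, eps=1e-8):
    returns = np.asarray(returns, dtype=float)
    return (returns - returns.mean()) / (returns.std() + eps)
```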
The policy is optimized using a clipped surrogate objective,
$\mathcal{L}_{\mathrm{GRPO}} = \mathbb{E}_i\!\left[\min\!\left(r_\theta^{(i)} \hat{A}^{(i)},\; \mathrm{clip}\!\left(r_\theta^{(i)}, 1-\epsilon, 1+\epsilon\right)\hat{A}^{(i)}\right)\right],$
where
$r_\theta^{(i)} = \frac{P_\theta(z^{(i)} \mid S)}{P_{\theta_{\mathrm{old}}}(z^{(i)} \mid S)}.$
Because the policy autoregressively predicts the $K$ retained low-frequency coefficient tokens, the sequence log-probability decomposes as $\log P_\theta(z \mid S) = \sum_{k=1}^{K} \log P_\theta(z_k \mid z_{<k}, S)$.
Compared to conventional proximal policy optimization (PPO), GRPO is particularly well-suited to long-horizon trajectory planning. PPO relies on value function approximation at intermediate timesteps, which is challenging in settings where rewards are sparse and trajectory-level constraints dominate performance. In contrast, GRPO operates on complete trajectory rollouts and updates the policy based on relative ranking among multiple trajectory hypotheses generated under the same scene context. This property significantly reduces variance, avoids value function bias, and aligns naturally with the frequency-domain representation, where each action token sequence encodes a coherent global plan.
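The clipped surrogate can be sketched at the sequence level as follows; `logp_new` and `logp_old` denote summed token log-probabilities per sampled sequence, and the variable names are illustrative rather than taken from the paper's implementation.

```python
# GRPO-style clipped surrogate over one group of sampled token
# sequences, evaluated at the sequence (trajectory) level.
import numpy as np

def grpo_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    ratio = np.exp(logp_new - logp_old)               # importance ratio
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps)
    return np.minimum(ratio * advantages, clipped * advantages).mean()
```

When the new and old policies coincide, the ratio is 1 and the objective reduces to the mean group-relative advantage, which is zero by construction.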
The RL reward is defined at the trajectory level as a weighted combination of task and safety objectives,
$R = \lambda_{\mathrm{prog}} R_{\mathrm{progress}} + \lambda_{\mathrm{comfort}} R_{\mathrm{comfort}} + \lambda_{\mathrm{safety}} R_{\mathrm{safety}} + \lambda_{\mathrm{rule}} R_{\mathrm{rule}}.$
We combine progress, comfort, safety, and rule compliance in Equation (20) using fixed coefficients $\lambda_{\mathrm{prog}} = 0.15$, $\lambda_{\mathrm{comfort}} = 0.05$, $\lambda_{\mathrm{safety}} = 0.55$, and $\lambda_{\mathrm{rule}} = 0.25$. These weights reflect the priority structure of safety-critical driving. The safety term is assigned the largest weight because collision avoidance and maintaining a safe margin are non-negotiable requirements in closed-loop deployment, and unsafe behavior must be corrected even at the cost of reduced efficiency. The rule term receives the second-largest weight since violations such as running red lights, ignoring right-of-way, or illegal lane usage can trigger hazardous interactions and systematic failure modes that are not well captured by progress alone. The progress term encourages task completion and steady motion toward the route goal, but is deliberately weighted below the safety and rule terms to prevent aggressive behaviors that trade risk for short-term advancement. The comfort term is given the smallest weight because smoothness is desirable but should not override safety, feasibility, and legal compliance. The same set of coefficients is used for all experiments to ensure consistent optimization objectives across different planning representations.
Based on repeated pilot experiments during reward design, we observed that the four reward coefficients control a practical trade-off among safety, compliance, efficiency, and smoothness. Increasing $\lambda_{\mathrm{safety}}$ generally makes the policy more conservative, which tends to improve collision-related metrics and time-to-collision margins, but may reduce route progress in dense interactions due to earlier braking or more cautious yielding. Increasing $\lambda_{\mathrm{rule}}$ strengthens compliance with traffic rules and lane-usage constraints, which can reduce risky rule-violating behaviors, but may also lower short-term efficiency when aggressive maneuvers would otherwise increase progress. Increasing $\lambda_{\mathrm{prog}}$ encourages more assertive task completion and forward motion, but if set too high, it can bias the policy toward efficiency-seeking behaviors that compress safety margins. Increasing $\lambda_{\mathrm{comfort}}$ promotes smoother trajectories with lower actuation variation, but an overly large comfort weight may suppress necessary reactive maneuvers in safety-critical situations. In practice, the selected coefficients prioritize safety and legal compliance while preserving sufficient progress and maintaining reasonable comfort.
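As a direct transcription of the weighted combination with the coefficients reported above (the individual reward terms are placeholder inputs here, not the benchmark's exact reward functions):

```python
# Trajectory-level reward as a fixed weighted sum of the four terms,
# using the coefficients stated in the text.
def total_reward(progress, comfort, safety, rule):
    return (0.15 * progress + 0.05 * comfort
            + 0.55 * safety + 0.25 * rule)
```

Since the weights sum to one, each term can be read as a normalized contribution to the overall trajectory score.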
Overall, IL anchors the policy within a structured region of the frequency coefficient space corresponding to expert-like behavior, while GRPO performs localized refinement guided by explicit trajectory-level rewards. This combination yields stable policy optimization and enables effective long-horizon planning under complex safety and rule constraints.

4. Experiments

Following the nuPlan evaluation protocol, we evaluate our method using both in-house controlled comparisons and official benchmark settings. Unless otherwise stated, ablations and controlled comparisons are conducted on the nuPlan validation split, while benchmark comparisons with prior work follow the official Val14/Test14-hard/Test14-random protocol. The goal of the experimental study is threefold: (1) to examine the impact of frequency-domain action representations compared with conventional time-domain continuous actions, (2) to analyze the effect of RL, particularly GRPO, under different action spaces, and (3) to evaluate robustness in both standard driving scenarios and safety-critical corner cases.

4.1. Experimental Setup and Evaluation Protocol

The planner generates a future trajectory with a 6-s horizon. At each planning cycle, it receives updated scene context and replans in a receding-horizon manner.
Two types of action representations are evaluated. In the continuous action setting, the planner directly predicts time-domain control sequences or future states at each timestep. In the frequency-domain tokenized action setting, the planner predicts a compact sequence of discrete tokens corresponding to truncated frequency-domain trajectory coefficients, which are then deterministically decoded into continuous trajectories.
For frequency-domain methods, future trajectories are transformed using DCT. Only the first K low-frequency coefficients are retained to capture global motion structure, while higher-frequency components are discarded. These coefficients are discretized and compressed using a learned BPE tokenizer. Unless otherwise specified, we use K = 16 and a vocabulary size of 2048.
All methods share the same scene encoder to isolate the effect of action representation and learning paradigm. Training follows a two-stage procedure consisting of IL from expert demonstrations, followed by RL fine-tuning. To benchmark our approach, we fine-tune with the GRPO algorithm and contrast its performance against the widely-used PPO baseline.
Planning performance is assessed in closed-loop simulation using the official nuPlan evaluation protocol. We report both the Non-Reactive Closed-Loop Score (NR-CLS) and the Reactive Closed-Loop Score (R-CLS), which together characterize planner behavior under different interaction assumptions. NR-CLS evaluates the planned trajectory while replaying logged motions for surrounding agents, isolating the planner’s ability to follow traffic rules and generate feasible motion in a fixed environment. In contrast, R-CLS introduces interactive agents controlled by an Intelligent Driver Model (IDM), requiring the planner to respond to dynamic and reactive behaviors in a more realistic setting. Both metrics aggregate performance over 15-s closed-loop rollouts and capture multiple driving objectives, including collision avoidance, progress efficiency, and compliance with speed and road constraints. Scores are normalized to the range [ 0 , 100 ] , with higher values indicating better overall driving performance.

4.2. Computational Cost Analysis

We report the computational requirements of the time-domain and frequency-domain planners, including training cost, peak memory usage, and deployment-oriented inference latency statistics, as summarized in Table 1. Unless otherwise stated, both models use the same backbone and transformer capacity and are trained under the same hardware setup. The training-time difference mainly arises from the supervision format and decoding workload induced by the action representation. In the IL stage, the time-domain model is trained with dense timestep-level labels, whereas the frequency-domain model is supervised on truncated low-frequency coefficients that are discretized and autoregressively decoded into token sequences. This changes both the effective target length and the token-processing cost during optimization, leading to different pretraining throughput under the same hardware budget. The same representation effect carries over to RL fine-tuning, where the frequency-domain policy typically decodes more tokens per planning cycle than the time-domain baseline, resulting in higher compute per update under an identical rollout protocol.
In closed-loop driving, however, a predicted trajectory is not executed for the full nominal horizon. Replanning is triggered at every planning cycle as the scene evolves, and a new trajectory is generated from the latest observations. Therefore, practical deployment overhead is better characterized by per-cycle inference latency than by nominal horizon length alone. To reflect this, Table 1 reports not only mean inference latency but also tail-latency statistics, including P95 and maximum latency. These statistics provide a clearer characterization of runtime variability in dynamic closed-loop interactions beyond a single average latency value.

4.3. Overall Performance on the nuPlan Validation Split

We begin by evaluating planner performance on the full nuPlan validation set, which covers diverse driving conditions such as urban driving, intersections, highway merging, and dense traffic. We compare six configurations by combining two action representations, continuous actions and frequency-domain tokens, with three learning paradigms, IL, IL + PPO, and IL + GRPO.
Several consistent trends emerge from Table 2. First, replacing continuous time-domain actions with frequency-domain tokens leads to a substantial improvement across all metrics, even under pure IL. This indicates that frequency-domain representations provide a more structured and learnable action space, enabling the planner to capture long-horizon motion intent more effectively. As further evidenced by Figure 8, the frequency-domain representation exhibits smoother optimization dynamics, with consistently lower training and validation losses across training steps. Second, RL further improves performance in both action spaces. Under continuous actions, PPO reduces collision and off-road rates but still suffers from limited gains in success rate, suggesting difficulties in stable credit assignment over long action sequences. GRPO provides additional improvements by regularizing policy updates, yet the overall performance remains constrained by the time-domain representation. The training dynamics are also reflected in the learning curves: as shown in Figure 9, GRPO yields more stable reward improvement across training steps compared to PPO. In contrast, frequency-token planners benefit significantly more from RL. PPO already yields a notable increase in success rate and comfort, while GRPO further improves safety-related metrics, achieving the best Collision, TTC, Drivable, and R-CLS. Although Freq. + IL + GRPO does not achieve the highest Progress, slightly lower than Freq. + IL + PPO, it yields the best overall reactive score and safety-related metrics, indicating a better safety–efficiency trade-off under interactive closed-loop conditions. This result highlights a strong synergy between frequency-domain action representations and RL, particularly when using GRPO.
We further examine statistical robustness by running multiple random seeds and conducting significance tests on the key comparison between the strongest time-domain baseline and our frequency-domain planner. The results show that the gains are consistent across seeds and are unlikely to be explained by random initialization alone. In particular, for the overall reactive score R-CLS, our method yields a statistically significant improvement over the time-domain GRPO baseline, with p-values well below the 0.01 threshold. Similar significance is observed for core safety-related metrics such as Collision, TTC, and Drivable, indicating that the improvement is not limited to a single metric but reflects a broad enhancement in closed-loop robustness. We note that some near-saturated metrics, such as Comfort, exhibit weaker statistical significance because the absolute gap is small and the metric is close to its upper bound, where numerical differences become less discriminative. Overall, the statistical results support that the performance advantages reported in Table 2 are stable and reproducible under different random seeds.
We compare our approach with representative planning methods evaluated on the nuPlan benchmark, covering rule-based pipelines, hybrid planners with learning components, and fully learning-based end-to-end approaches. All results are reported under the official nuPlan closed-loop evaluation protocol, including both non-reactive (NR) and reactive (R) settings. In the non-reactive mode, surrounding agents follow logged trajectories, whereas in the reactive mode, other agents are controlled by an IDM-based policy, resulting in more interactive and challenging scenarios.
Table 3 presents a comprehensive comparison on the nuPlan benchmark under both non-reactive (NR) and reactive (R) closed-loop evaluation. Several important observations can be drawn from these results. First, when comparing different planning paradigms, purely rule-based methods such as IDM exhibit limited performance, particularly on the Test14-hard split, where complex interactions and long-horizon decision making are required. Although hybrid approaches with rule-based post-processing, such as PDM-Closed and PDM-Hybrid, achieve strong NR scores, their performance advantage diminishes in reactive settings. This suggests that rule-based safety layers are effective at enforcing feasibility under static assumptions but struggle to adapt to dynamically changing agent behaviors. Second, among fully learning-based planners, our method consistently outperforms prior end-to-end approaches across all splits in the reactive mode. On Test14-hard (R), our planner achieves a score of 78.19, exceeding diffusion-based planners and transformer-based baselines by a clear margin. This margin is particularly informative because diffusion planners are often designed to maintain multi-modal futures through iterative sampling, which is advantageous when multiple plausible behaviors must be preserved. In highly dynamic and abrupt interactions, however, we observed in trajectory and speed-profile visualizations that diffusion baselines can require multiple refinement steps before reflecting sudden disturbances, whereas our frequency-domain tokens enable earlier global reshaping of the planned motion, such as initiating decisive braking with reduced hesitation. This advantage is especially critical on Test14-hard, which concentrates rare, high-risk interactions where delayed or incremental reactions frequently lead to failure. The gain in reactive performance indicates that our planner is better at committing to globally coherent trajectories under strong interaction. 
Third, the gap between NR and R performance reveals important differences in policy robustness. Many learning-based planners experience a substantial drop when moving from NR to R evaluation, reflecting sensitivity to deviations in surrounding agent behavior. In contrast, our method exhibits a smaller NR–R gap, especially on Val14 and Test14-random. This robustness can be attributed to frequency-domain action modeling, where low-frequency tokens encode global motion intent and remain stable under moderate interaction-induced perturbations. Finally, compared with hybrid planners that rely on rule-based post-processing, our approach achieves competitive or superior performance without any handcrafted safety heuristics. For example, while diffusion planners with post-processing achieve strong NR scores, our model attains comparable performance in NR and clearly outperforms them in R evaluation. This suggests that the proposed frequency-domain tokenization, combined with GRPO-based optimization, enables RL to directly shape long-horizon safety behavior rather than relying on external corrective modules. Finally, we note that strongly coupled multi-agent interactions, where multiple actors change intent within a short time window, can further increase non-stationarity and remain challenging for closed-loop planning. Since nuPlan is an open benchmark with predefined splits, we do not construct an additional hand-labeled subset for this specific subcategory in this work. We leave targeted subset construction and explicit intent-uncertainty modeling with risk-aware optimization as future directions.

4.4. Performance Under Safety-Critical and Highly Dynamic Scenarios

We evaluate the proposed frequency-domain action representation under a set of safety-critical corner cases, where timely braking and rapid behavioral adaptation are essential for collision avoidance. To isolate the effect of action space design, we focus our comparison on the strongest variants of both paradigms, namely Cont. + IL + GRPO and Freq. + IL + GRPO. Both planners are optimized using the same IL initialization and GRPO fine-tuning procedure, and are evaluated under identical closed-loop reactive settings.
A key observation across all evaluated scenarios is that planners operating in the time-domain action space consistently exhibit delayed braking and evasive responses, even after GRPO fine-tuning. This behavior is not merely a consequence of insufficient training, but is fundamentally tied to the inductive bias of time-domain action representations. As discussed in earlier sections, time-domain planners tend to generate actions that are locally consistent with recent history, resulting in strong temporal autocorrelation and an implicit preference for repeating previously executed control patterns. In safety-critical situations that require abrupt global changes, such inertia significantly delays corrective actions such as emergency braking.
In contrast, planners based on frequency-domain action tokens respond to hazards in a more timely and anticipatory manner. By directly modifying the low-frequency components of the planned trajectory, the planner can reshape the global velocity and curvature profile in a single decision step, enabling earlier deceleration or smoother evasive maneuvers across the entire planning horizon. This property is particularly advantageous in corner cases with limited reaction time, where delayed local corrections are often insufficient to avoid collisions.
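The global-reshaping mechanism described above can be made concrete with a minimal, self-contained sketch. This is illustrative only and not the paper's implementation: the horizon length, cruising speed, and coefficient edits are hypothetical, and the unnormalized DCT-II/DCT-III pair is one possible transform convention. Editing just two low-frequency coefficients lowers and tilts an entire planned velocity profile in a single decision step:

```python
import math

def dct(x):
    """Unnormalized Type-II DCT of a real sequence."""
    N = len(x)
    return [sum(x[n] * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
                for n in range(N)) for k in range(N)]

def idct(X):
    """Exact inverse (Type-III with 1/N scaling) of the DCT above."""
    N = len(X)
    return [X[0] / N + (2.0 / N) * sum(
        X[k] * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
        for k in range(1, N)) for n in range(N)]

# Planned velocity profile: constant cruising at 10 m/s over a 32-step horizon.
v = [10.0] * 32
C = dct(v)

# One edit to the DC (k = 0) coefficient lowers the mean speed of the ENTIRE
# horizon; adding energy to k = 1 superimposes a slow ramp-down, so the plan
# decelerates toward the end -- a global reshaping, not a local correction.
C[0] *= 0.6
C[1] = 40.0
v_new = idct(C)

print(round(v_new[0], 2), round(v_new[-1], 2))  # prints: 8.5 3.5
```

Because each retained coefficient multiplies a basis function spanning the whole horizon, a single-token revision changes every future timestep at once, which is exactly the behavior unavailable to a step-wise time-domain decoder.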
Below, we provide a detailed case-wise analysis to further illustrate these differences.

4.4.1. Case 1: Sudden Cut-In with Rapid Intent Change

A neighboring vehicle abruptly cuts into the ego lane at a short longitudinal distance, creating an immediate collision risk that requires timely and sustained deceleration.
As illustrated in Figure 10, the time-domain planner (Cont.) exhibits a conservative and inertia-dominated response. Following the cut-in event, the planned velocity remains close to the pre-cut-in cruising speed for an extended duration, with only a mild and gradual decrease over time. Even as the situation becomes increasingly critical, deceleration is applied incrementally, resulting in a delayed reduction of speed and a relatively small instantaneous deceleration magnitude. This behavior reflects the step-wise nature of time-domain autoregressive planning, where corrective actions are accumulated gradually across timesteps.
In contrast, the frequency-domain planner (Freq.) reacts more decisively once the cut-in is detected. While maintaining a similar cruising speed in the early phase, it initiates a coordinated reduction of velocity around the onset of risk by modifying low-frequency trajectory components. This leads to a pronounced and temporally concentrated deceleration phase, during which the vehicle rapidly transitions to a significantly lower speed. Importantly, this deceleration is applied consistently over the remaining horizon, indicating a global replanning of motion intent rather than incremental local corrections.
As a result, the frequency-domain planner achieves earlier separation from the hazardous configuration and increases the available time-to-collision margin. The observed velocity profile demonstrates that frequency-domain planning enables decisive long-horizon intervention when rapid response is required, whereas time-domain planning remains biased toward smooth continuation of prior actions, even under imminent risk.

4.4.2. Case 2: Emergency Braking of the Lead Vehicle

The leading vehicle performs an emergency braking maneuver with minimal prior indication, requiring the ego vehicle to promptly reduce speed in order to maintain a safe following distance.
As shown in Figure 11, the time-domain planner (Cont.) exhibits a largely inertia-preserving behavior. Following the onset of the lead vehicle’s braking, the planned ego velocity remains close to the original cruising speed and decreases only marginally over the entire horizon. The resulting deceleration magnitude is small and temporally diffuse, indicating that the planner primarily maintains previously generated actions rather than committing to a decisive braking maneuver. This behavior reflects the difficulty of inducing a strong global response through timestep-wise autoregressive updates.
In contrast, the frequency-domain planner (Freq.) initiates a clear and sustained deceleration shortly after the emergency braking event. By adjusting low-frequency components of the action sequence, the planner globally reshapes the velocity profile, resulting in a continuous and monotonic reduction of speed over the remaining horizon. Unlike abrupt reactive braking, this deceleration is distributed over time while still achieving a substantially lower terminal speed.
This behavior enables the frequency-domain planner to rapidly increase longitudinal safety margins without relying on late-stage aggressive braking. The qualitative velocity profile demonstrates that frequency-domain planning supports early commitment to a new motion intent under emergency conditions, whereas time-domain planning remains biased toward preserving previously planned control patterns.

4.4.3. Case 3: Abrupt Obstruction by a Stopped Vehicle

The ego vehicle encounters a stationary vehicle blocking its current lane, presenting a scenario that requires either a complete stop or a lane-changing maneuver to continue forward progress.
As depicted in Figure 12, the time-domain planner (Cont.) adopts a conservative, inertia-dominated strategy. Upon detecting the obstructing vehicle, it quickly reduces speed to near-zero and maintains this minimal velocity over the planning horizon. The response reflects a local, incremental adjustment pattern, prioritizing immediate collision avoidance over proactive maneuver regeneration. The planner essentially “freezes” in place, preserving the previously established longitudinal control pattern without generating a new global intent to overcome the obstruction.
In contrast, the frequency-domain planner (Freq.) demonstrates a more decisive and globally coordinated response. Shortly after identifying the stationary obstacle, it initiates a lane-change maneuver by adjusting both lateral and longitudinal trajectory components in the frequency domain. This allows the planner to reshape the motion profile holistically, transitioning smoothly from the original lane to an adjacent free lane while maintaining forward momentum. The resulting velocity profile shows a temporary moderate speed reduction during the lane-change phase, followed by a return to cruising speed once the obstacle is circumvented.
This behavioral divergence highlights a fundamental difference in planning philosophy: the frequency-domain planner actively recomposes the motion sequence to achieve a new global objective, whereas the time-domain planner remains constrained by its step-wise autoregressive structure, tending to perpetuate prior actions even when they no longer serve the overall goal. Consequently, the frequency-domain approach enables timely and efficient obstacle avoidance through maneuver replanning, while the time-domain approach remains passive, opting to stop indefinitely rather than execute a coordinated lateral maneuver.

4.4.4. Case 4: Sudden Pedestrian Crossing at a Crosswalk

A pedestrian is legally crossing the roadway at a marked crosswalk ahead of the ego vehicle. The scenario requires a timely and smooth deceleration to yield right-of-way and avoid encroaching on the pedestrian’s path, testing the planner’s ability to proactively manage a predictable but safety-critical interaction.
As shown in Figure 13, the time-domain planner (Cont.) exhibits a characteristic incremental and delayed response. Upon perceiving the crossing pedestrian, its planned velocity initially remains near the original cruising speed. Deceleration is introduced gradually over successive timesteps, resulting in a drawn-out speed reduction profile. This leads to a later initiation of significant braking and a more extended period where the vehicle approaches the conflict zone at a relatively higher speed, compressing the available safety margin.
In contrast, the frequency-domain planner (Freq.) demonstrates a more anticipatory and decisive reaction. It promptly initiates a coordinated deceleration by adjusting the low-frequency components of its planned trajectory. This enables a smoother, more front-loaded speed reduction that begins earlier and achieves a lower speed sooner in the scenario timeline. The velocity profile reflects a global recomputation of the motion intent, prioritizing the early establishment of a safe longitudinal gap.
The observed behavioral difference stems from the fundamental representational disparity: the frequency-domain planner can efficiently reshape the entire planned speed profile to align with the new yielding objective, while the time-domain planner is constrained by its step-wise autoregressive nature, which biases it toward continuing prior motions and applying corrections locally and incrementally. Consequently, in this common urban interaction, frequency-domain planning facilitates earlier and more comfortable compliance with yielding norms, enhancing both safety and ride comfort.

4.4.5. Discussion: Robustness to Abrupt Behavioral Changes

Across all safety-critical corner cases, the qualitative behaviors are highly consistent with the quantitative closed-loop metrics. Even after GRPO fine-tuning, time-domain planners remain constrained by their strong temporal autocorrelation, leading to delayed and reactive braking. In contrast, frequency-domain action encoding enables earlier, smoother, and more globally coherent responses, which are crucial for effective RL in safety-critical driving scenarios. These results confirm that the observed performance gains stem from a fundamental representational advantage, rather than differences in optimization or training procedures.

4.5. Limitations and Failure Modes

Although frequency-domain tokenization improves long-horizon stability and reactive closed-loop performance, we observe a characteristic artifact when the ego vehicle operates in near-stop regimes. In congested stop-and-go traffic or when yielding at low speed, the planner can occasionally generate a sequence of future points that remains spatially confined while exhibiting noticeable lateral and longitudinal oscillations. Figure 14 illustrates such a case. The predicted points drift only a few centimeters overall, yet the intermediate points repeatedly deviate and return, forming a wavy, almost in-place trajectory rather than a clean stationary or monotonic creeping plan.
This behavior is closely related to global coefficient control in the frequency domain. The trajectory over the full horizon is determined by a compact set of retained low-frequency coefficients. When the ego speed is close to zero, many candidate trajectories become nearly indistinguishable in terms of short-horizon displacement and collision outcome, while differing mainly in subtle curvature and heading evolution. Under this degeneracy, small changes in low-frequency coefficients can reshape the entire horizon and manifest primarily as geometric oscillations in the reconstructed point sequence, even though the net displacement remains negligible. Because each replanning cycle revises the coefficients globally, these small oscillations can persist across cycles instead of being damped out by local corrections.
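The degeneracy argument can be illustrated with a small numerical sketch (hypothetical magnitudes; the unnormalized inverse DCT below is one possible convention). Perturbing a single low-frequency coefficient of an otherwise stationary plan produces horizon-wide lateral waviness, even though the net displacement stays essentially zero:

```python
import math

def idct(X):
    """Inverse (Type-III with 1/N scaling) of an unnormalized Type-II DCT."""
    N = len(X)
    return [X[0] / N + (2.0 / N) * sum(
        X[k] * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
        for k in range(1, N)) for n in range(N)]

N = 32
# Near-stop plan: all coefficients zero, i.e. the ego stays in place.
coeffs = [0.0] * N

# Perturb a single low-frequency coefficient by a small amount, as one
# noisy replanning cycle might; 0.5 is an illustrative magnitude.
coeffs[2] = 0.5  # k = 2: two slow lateral "waves" across the horizon

lateral = idct(coeffs)
net_displacement = sum(lateral)                 # ~0: spatially confined
peak_deviation = max(abs(y) for y in lateral)   # but every point moves
```

The perturbation cancels out in aggregate yet bends every reconstructed point, which is the wavy, almost in-place pattern observed in Figure 14: global coefficient edits have no mechanism for purely local damping.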
The training objective can further encourage this local optimum. Closed-loop rewards prioritize safety and comfort and often treat near-zero-speed trajectories as similarly safe as long as collisions and rule violations are avoided. If residual heading or curvature variations under low speed are weakly penalized, the policy may settle on a stable but unproductive solution that maintains safety while producing little progress, expressed as oscillatory point patterns as shown in Figure 14. In contrast, time-domain planners that output dense timestep actions can suppress this artifact more directly through local damping of yaw-rate updates and explicit stabilization of stopping behavior at the timestep level.
This limitation suggests that frequency-domain planning may benefit from additional near-stop mechanisms, such as low-speed heading-stability regularization, progress-aware constraints, or a lightweight residual controller that attenuates oscillatory components when the ego speed falls below a small threshold.
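One of the suggested remedies, a lightweight residual controller, could be sketched as follows. This is a hypothetical design rather than part of the proposed planner; the threshold `v_min` and damping factor `alpha` are illustrative choices:

```python
def total_variation(h):
    """Sum of absolute step-to-step changes, a proxy for oscillation."""
    return sum(abs(h[i] - h[i - 1]) for i in range(1, len(h)))

def attenuate_near_stop(headings, speeds, v_min=0.5, alpha=0.2):
    """Hypothetical near-stop stabilizer: whenever the ego speed falls
    below v_min (m/s), shrink each step-to-step heading change by factor
    alpha so residual yaw oscillations are damped rather than executed."""
    out = [headings[0]]
    for i in range(1, len(headings)):
        delta = headings[i] - out[-1]
        if speeds[i] < v_min:
            delta *= alpha
        out.append(out[-1] + delta)
    return out

# Wavy in-place plan: heading flips between +-0.1 rad at near-zero speed.
headings = [0.1 * (-1) ** i for i in range(8)]
speeds = [0.1] * 8  # m/s, well below the 0.5 m/s threshold

stabilized = attenuate_near_stop(headings, speeds)
print(round(total_variation(headings), 3),
      round(total_variation(stabilized), 3))  # prints: 1.4 0.168
```

Such a filter acts purely at the timestep level and therefore restores, in the near-stop regime only, the local damping that dense time-domain action outputs provide by construction.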

4.6. Ablation Studies on Frequency-Domain Design Choices

We conduct ablation studies to analyze the impact of key design choices in the proposed frequency-domain action framework, including frequency truncation length K, token vocabulary size, and planning time horizon. All experiments are evaluated on the nuPlan validation set under the same closed-loop protocol. Unless otherwise specified, all ablations use frequency-domain actions with GRPO fine-tuning.

4.6.1. Effect of Frequency Truncation Length K

The truncation length K controls how many low-frequency components are retained to represent a trajectory. Larger K allows finer-grained motion control but requires predicting more tokens during inference, which increases latency.
Increasing K consistently improves safety-related metrics such as collision avoidance and time to collision (TTC), as richer frequency components enable more precise trajectory shaping. However, the gains saturate beyond K = 16, while inference latency grows rapidly due to the increased number of predicted tokens. This highlights a clear performance–efficiency trade-off, and we adopt K = 16 as a balanced configuration.
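The trade-off governed by K can be reproduced in miniature. The sketch below uses a synthetic smooth signal rather than the planner's trajectories, and the unnormalized DCT pair is one possible convention; it shows that reconstruction error shrinks monotonically as more low-frequency coefficients are retained, while the token count grows linearly with K:

```python
import math

def dct(x):
    """Unnormalized Type-II DCT."""
    N = len(x)
    return [sum(x[n] * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
                for n in range(N)) for k in range(N)]

def idct(X):
    """Exact inverse (Type-III with 1/N scaling) of the DCT above."""
    N = len(X)
    return [X[0] / N + (2.0 / N) * sum(
        X[k] * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
        for k in range(1, N)) for n in range(N)]

# A smooth synthetic increment sequence standing in for planned motion:
# a decaying oscillation over a 40-step horizon.
x = [math.exp(-0.1 * n) * math.cos(0.2 * n) for n in range(40)]
C = dct(x)

errors = []
for K in (4, 8, 16, 40):
    # Keep the first K coefficients, zero the rest, and reconstruct.
    x_hat = idct([c if k < K else 0.0 for k, c in enumerate(C)])
    rms = (sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)) ** 0.5
    errors.append(rms)
    print(K, rms)
```

Because truncated reconstruction is an orthogonal projection onto the first K basis vectors, the RMS error is guaranteed to be non-increasing in K, mirroring the saturating quality gains reported above.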
Note that the latency statistics in Table 1 and Table 4 use different aggregation levels. Table 1 reports per-cycle latency distribution statistics in closed-loop evaluation, whereas the “Infer. Time” entries in Table 4 report run-level average latency with run-to-run standard deviation.

4.6.2. Effect of Token Vocabulary Size V

The vocabulary size determines the expressiveness of the frequency-domain action space. A sufficiently large vocabulary is required to capture diverse motion primitives, while overly small vocabularies limit representation capacity. Table 5 summarizes the ablation results under different vocabulary sizes.
Performance improves significantly as the vocabulary grows from 512 to 2048, indicating that capturing diverse motion patterns requires sufficient token expressiveness. Beyond this point, performance converges, suggesting that the action space has reached adequate coverage of relevant trajectory primitives.

4.6.3. Effect of Tokenization Scheme

Our frequency-domain planner discretizes truncated low-frequency DCT coefficients and converts them into a compact token sequence using a BPE tokenizer, which constrains the vocabulary size via offline subsequence merging. To examine whether the gains depend on the tokenizer choice, we perform a controlled ablation that fixes the DCT transform, truncation length K, coefficient scaling and rounding, reconstruction pipeline, and planner architecture. Only the mapping from discretized coefficients to tokens is changed. All variants follow the same training protocol on nuPlan, with IL pretraining and GRPO fine-tuning. We use K = 16 and match the vocabulary size to |V| = 2048 for fair comparison.
We compare three schemes, including BPE tokens, fixed-bin tokens without subsequence merging, and a K-means codebook that assigns each coefficient token to its nearest centroid. Table 6 reports the results. This ablation is designed to attribute gains to the frequency-domain representation rather than the choice of tokenizer. Differences among tokenization schemes are smaller than the gap between time-domain and frequency-domain planning, indicating that the main benefit comes from the frequency-domain representation. BPE remains competitive while providing a compact vocabulary and efficient decoding.

4.6.4. Effect of Planning Horizon H

We further study the effect of the planning horizon while keeping the number of predicted tokens fixed. Short horizons limit the representational advantage of frequency-domain coefficients, while overly long horizons dilute token capacity across extended trajectories. Table 7 reports the ablation results under different planning horizons.
When the horizon is too short, frequency coefficients struggle to represent meaningful global motion patterns, leading to degraded planning quality. Conversely, excessively long horizons reduce per-step representational resolution under a fixed token budget, making it difficult to accurately encode long trajectories. A moderate horizon of 6 s achieves the best balance between expressiveness and controllability.
Overall, these ablation results demonstrate that the proposed performance gains arise from carefully balanced representation-level design choices, rather than incidental architectural or optimization factors.

5. Conclusions

This paper revisits long-horizon trajectory planning from the perspective of action representation and sequence generation. While recent end-to-end planners based on transformer token reasoning have achieved strong closed-loop performance by modeling agent and map interactions through structured tokens [24,25,26], most learning-based planners still generate trajectories in the time domain, where timestep-wise autoregressive decoding suffers from strong temporal coupling and compounding errors under safety-critical disturbances. To address this limitation, we propose a frequency-domain trajectory planning framework that tokenizes future motion in a compact coefficient space, enabling autoregressive generation over temporally extended and semantically meaningful units rather than per-timestep actions.
Our approach introduces three key components. First, we represent the incremental motion sequence using a DCT-based frequency decomposition and retain only a small number of low-frequency coefficients to capture dominant long-horizon intent. Second, we discretize and compress the coefficient stream using a BPE tokenizer, producing a bounded vocabulary and an efficient token sequence for transformer decoding. Third, we adopt a two-stage training paradigm with imitation learning pretraining followed by reinforcement learning fine-tuning, allowing the policy to learn reactive long-horizon behaviors while remaining stable and sample-efficient. Together, these design choices reshape the learning and optimization landscape by distributing generation errors across global frequency components instead of propagating them step by step through time-domain rollout.
Extensive closed-loop evaluation on nuPlan demonstrates that frequency-structured action tokens yield consistently stronger reactive performance than time-domain baselines under identical training settings, improving robustness in highly dynamic scenarios that require rapid global intent revision. In particular, our method shows a smaller performance drop from non-reactive to reactive evaluation, indicating improved stability under interaction-induced perturbations. Ablations further confirm that the main gains stem from the frequency-domain representation, while the tokenizer mainly serves as a practical mechanism to control vocabulary size and decoding efficiency. Compared with hybrid pipelines that rely on rule-based post-processing or optimization-based safety layers [19,20,35,41], our method achieves competitive safety behavior without handcrafted heuristics, suggesting that long-horizon safety objectives can be directly shaped through representation-aware sequence learning and RL fine-tuning.
Despite these advantages, frequency-domain planning also exhibits characteristic limitations. For example, in near-stop regimes the global nature of coefficient control can occasionally induce oscillatory point patterns with negligible progress, motivating additional low-speed stabilization mechanisms. Looking forward, several directions remain promising, including incorporating explicit risk-sensitive objectives for strong multi-agent coupling, improving real-time deployment efficiency on embedded platforms, and combining frequency-domain tokens with complementary generative planners for multi-modality and uncertainty modeling. We hope this work encourages broader exploration of frequency-structured action representations as a principled route toward robust and reactive long-horizon planning in autonomous driving.

Author Contributions

Conceptualization, J.X. and X.W.; methodology, J.X. and Z.K.; software, J.X. and Z.K.; validation, J.X., B.S. and Y.H.; formal analysis, J.X. and B.S.; investigation, J.X. and Y.H.; resources, M.X.; data curation, B.S. and Y.H.; writing—original draft preparation, J.X. and Z.K.; writing—review and editing, X.W. and M.X.; visualization, B.S. and Y.H.; supervision, X.W. and M.X.; project administration, X.W. and M.X.; funding acquisition, X.W. and M.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant number 52472413.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

Author Zhuo Kong was employed by the company China National Heavy Duty Truck Group Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
DCT	Discrete cosine transform
IL	Imitation learning
RL	Reinforcement learning
BPE	Byte-pair encoding
GRPO	Grouped relative policy optimization
PPO	Proximal policy optimization

References

  1. Paden, B.; Cap, M.; Yong, S.Z.; Yershov, D.; Frazzoli, E. A survey of motion planning and control techniques for self-driving urban vehicles. IEEE Trans. Intell. Veh. 2016, 1, 33–55. [Google Scholar] [CrossRef]
  2. Shalev-Shwartz, S.; Shammah, S.; Shashua, A. On a formal model of safe and scalable self-driving cars. arXiv 2017, arXiv:1708.06374. [Google Scholar]
  3. Caesar, H.; Bankiti, V.; Lang, A.H.; Vora, S.; Liong, V.E.; Xu, Q.; Krishnan, A.; Pan, Y.; Baldan, G.; Beijbom, O. nuScenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 11621–11631. [Google Scholar] [CrossRef]
  4. Sun, P.; Kretzschmar, H.; Dotiwalla, X.; Chouard, A.; Patnaik, V.; Tsui, P.; Guo, J.; Zhou, Y.; Chai, Y.; Caine, B.; et al. Scalability in perception for autonomous driving: Waymo Open Dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 2446–2454. [Google Scholar]
  5. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. (NeurIPS) 2017, 30, 5998–6008. [Google Scholar]
  6. Wu, W.; Feng, X.; Gao, Z.; Kan, Y. SMART: Scalable multi-agent real-time motion generation via next-token prediction. Adv. Neural Inf. Process. Syst. (NeurIPS) 2024, 37, 114048–114071. [Google Scholar]
  7. Zhao, J.; Zhuang, J.; Zhou, Q.; Ban, T.; Xu, Z.; Zhou, H.; Wang, J.; Wang, G.; Li, Z.; Li, B. KiGRAS: Kinematic-driven generative model for realistic agent simulation. IEEE Robot. Autom. Lett. 2025, 10, 1082–1089. [Google Scholar] [CrossRef]
  8. Pertsch, K.; Stachowicz, K.; Ichter, B.; Driess, D.; Nair, S.; Vuong, Q.; Mees, O.; Finn, C.; Levine, S. FAST: Efficient action tokenization for vision-language-action models. arXiv 2025, arXiv:2501.09747. [Google Scholar]
  9. Ranzato, M.; Chopra, S.; Auli, M.; Zaremba, W. Sequence level training with recurrent neural networks. arXiv 2016, arXiv:1511.06732. [Google Scholar] [CrossRef]
  10. Bengio, S.; Vinyals, O.; Jaitly, N.; Shazeer, N. Scheduled sampling for sequence prediction with recurrent neural networks. Adv. Neural Inf. Process. Syst. (NeurIPS) 2015, 28, 1171–1179. [Google Scholar]
  11. Philion, J.; Kar, A.; Fidler, S. Learning to evaluate perception models using planner-centric metrics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 14055–14064. [Google Scholar]
  12. Oppenheim, A.V.; Schafer, R.W. Discrete-Time Signal Processing, 3rd ed.; Pearson: London, UK, 2010. [Google Scholar]
  13. Nair, A.; Gupta, A.; Dalal, M.; Levine, S. AWAC: Accelerating online reinforcement learning with offline datasets. arXiv 2020, arXiv:2006.09359. [Google Scholar]
  14. Levine, S.; Kumar, A.; Tucker, G.; Fu, J. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv 2020, arXiv:2005.01643. [Google Scholar] [CrossRef]
  15. Pomerleau, D.A. ALVINN: An autonomous land vehicle in a neural network. Adv. Neural Inf. Process. Syst. (NIPS) 1988, 1, 305–313. [Google Scholar]
  16. Ross, S.; Gordon, G.; Bagnell, D. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), Fort Lauderdale, FL, USA, 11–13 April 2011; pp. 627–635. [Google Scholar]
  17. García, J.; Fernández, F. A comprehensive survey on safe reinforcement learning. J. Mach. Learn. Res. 2015, 16, 1437–1480. [Google Scholar]
  18. Varga, B.; Yang, D.; Hohmann, S. Cooperative decision-making in shared spaces: Making urban traffic safer through human-machine cooperation. arXiv 2023, arXiv:2306.14617. [Google Scholar]
  19. Varga, B.; Br, T.; Schmitz, M.; Hashemi, E. Interaction-aware model predictive decision-making for socially-compliant autonomous driving in mixed urban traffic scenarios. arXiv 2025, arXiv:2503.01852. [Google Scholar]
  20. Wang, R.; Schuurmans, M.; Patrinos, P. Interaction-aware model predictive control for autonomous driving. In 2023 European Control Conference (ECC); IEEE: New York, NY, USA, 2023; pp. 1–6. [Google Scholar] [CrossRef]
  21. Chen, Y.; Vondrick, C.; Malik, J. Learning to drive from a world on rails. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 15590–15599. [Google Scholar]
  22. Renz, K.; Chitta, K.; Mercea, O.B.; Koepke, A.S.; Akata, Z.; Geiger, A. PlanT: Explainable planning transformers via object-level representations. arXiv 2022, arXiv:2210.14222. [Google Scholar]
  23. Chitta, K.; Prakash, A.; Jaeger, B.; Yu, Z.; Renz, K.; Geiger, A. TransFuser: Imitation with transformer-based sensor fusion for autonomous driving. arXiv 2022, arXiv:2205.15997. [Google Scholar] [CrossRef]
  24. Hu, Y.; Yang, J.; Chen, L.; Li, K.; Sima, C.; Zhu, X.; Chai, S.; Du, S.; Lin, T.; Wang, W.; et al. UniAD: Planning-oriented autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 17853–17862. [Google Scholar] [CrossRef]
  25. Jiang, B.; Chen, S.; Xu, Q.; Liao, B.; Chen, J.; Zhou, H.; Zhang, Q.; Liu, W.; Huang, C.; Wang, X. VAD: Vectorized scene representation for efficient autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 18395–18405. [Google Scholar]
  26. Zheng, W.; Song, R.; Guo, X.; Zhang, C.; Chen, L. GenAD: Generative end-to-end autonomous driving. In Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part LXV; Springer Nature: Cham, Switzerland, 2024; pp. 87–104. [Google Scholar] [CrossRef]
  27. Li, Q.; Jia, X.; Wang, S.; Yan, J. Think2Drive: Efficient reinforcement learning by thinking in latent world model for quasi-realistic autonomous driving (in CARLA-v2). In European Conference on Computer Vision (ECCV); Springer Nature: Cham, Switzerland, 2024. [Google Scholar]
  28. Ettinger, S.; Cheng, S.; Caine, B.; Liu, C.; Zhao, H.; Pradhan, S.; Chai, Y.; Sapp, B.; Qi, C.R.; Zhou, Y.; et al. Large scale interactive motion forecasting for autonomous driving: The Waymo Open Motion Dataset. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 9710–9719. [Google Scholar]
  29. Caesar, H.; Kabzan, J.; Tan, K.S.; Fong, W.K.; Wolff, E.; Lang, A.; Fletcher, L.; Beijbom, O.; Omari, S. nuPlan: A closed-loop ML-based planning benchmark for autonomous vehicles. arXiv 2021, arXiv:2106.11810. [Google Scholar]
  30. Kiran, B.R.; Sobh, I.; Talpaert, V.; Mannion, P.; Al Sallab, A.; Yogamani, S.; Pérez, P. Deep reinforcement learning for autonomous driving: A survey. IEEE Trans. Intell. Transp. Syst. 2022, 23, 4909–4926. [Google Scholar] [CrossRef]
  31. Grigorescu, S.; Trasnea, B.; Cocias, T.; Macesanu, G. A survey of deep learning techniques for autonomous driving. J. Field Robot. 2020, 37, 362–386. [Google Scholar] [CrossRef]
  32. Lu, Y.; Fu, J.; Tucker, G.; Pan, X.; Bronstein, E.; Roelofs, R.; Sapp, B.; White, B.; Faust, A.; Whiteson, S.; et al. Imitation is not enough: Robustifying imitation with reinforcement learning for challenging driving scenarios. arXiv 2023, arXiv:2212.11419. [Google Scholar] [CrossRef]
  33. Peng, X.B.; Ma, Z.; Abbeel, P.; Levine, S.; Kanazawa, A. AMP: Adversarial motion priors for stylized physics-based character control. ACM Trans. Graph. 2021, 40, 1–20. [Google Scholar] [CrossRef]
  34. Kalashnikov, D.; Varley, J.; Chebotar, Y.; Swanson, B.; Jonschkowski, R.; Finn, C.; Levine, S.; Hausman, K. MT-Opt: Continuous multi-task robotic reinforcement learning at scale. arXiv 2021, arXiv:2104.08212. [Google Scholar]
  35. Dauner, D.; Hallgarten, M.; Geiger, A.; Chitta, K. Parting with misconceptions about learning-based vehicle motion planning. In Proceedings of the 7th Conference on Robot Learning (CoRL), Atlanta, GA, USA, 6–9 November 2023; PMLR: Atlanta, GA, USA, 2023; Volume 229, pp. 1268–1281. Available online: https://proceedings.mlr.press/v229/dauner23a.html (accessed on 25 February 2026).
  36. Cheng, J.; Chen, Y.; Chen, Q. PLUTO: Pushing the limit of imitation learning-based planning for autonomous driving. arXiv 2024, arXiv:2404.14327. [Google Scholar] [CrossRef]
  37. Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al. Training language models to follow instructions with human feedback. arXiv 2022, arXiv:2203.02155. [Google Scholar] [CrossRef]
  38. Tang, X.; Kan, M.; Shan, S.; Chen, X. Plan-R1: Safe and feasible trajectory planning as language modeling. arXiv 2025, arXiv:2505.17659. [Google Scholar] [CrossRef]
  39. Liu, Y.; Zhang, S.; Dong, Z.; Ye, B.; Yuan, T.; Yu, X.; Yin, L.; Lu, C.; Shi, J.; Yu, L.-J.; et al. FASTer: Toward efficient autoregressive vision language action modeling via neural action tokenization. arXiv 2025, arXiv:2512.04952. [Google Scholar] [CrossRef]
  40. Kozma, L.; Voderholzer, J. Theoretical analysis of byte-pair encoding. arXiv 2024, arXiv:2411.08671. [Google Scholar] [CrossRef]
  41. Treiber, M.; Hennecke, A.; Helbing, D. Congested traffic states in empirical observations and microscopic simulations. Phys. Rev. E 2000, 62, 1805–1824. [Google Scholar] [CrossRef]
  42. Huang, Z.; Liu, H.; Lv, C. GameFormer: Game-theoretic modeling and learning of transformer-based interactive prediction and planning for autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 3903–3913. [Google Scholar]
  43. Zheng, Y.; Xing, Z.; Zhang, Q.; Jin, B.; Li, P.; Zheng, Y.; Xia, Z.; Zhan, K.; Lang, X.; Chen, Y.; et al. PlanAgent: A multi-modal large language agent for closed-loop vehicle motion planning. arXiv 2024, arXiv:2406.01587. [Google Scholar] [CrossRef]
  44. Zheng, Y.; Liang, R.; Zheng, K.; Zheng, J.; Mao, L.; Li, J.; Gu, W.; Ai, R.; Li, S.E.; Zhan, X.; et al. Diffusion-based planning for autonomous driving with flexible guidance. arXiv 2025, arXiv:2501.15564. [Google Scholar] [CrossRef]
  45. Zhang, D.; Liang, J.; Guo, K.; Lu, S.; Wang, Q.; Xiong, R.; Miao, Z.; Wang, Y. CarPlanner: Consistent auto-regressive trajectory planning for large-scale reinforcement learning in autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025. [Google Scholar]
  46. Scheel, O.; Bergamini, L.; Wolczyk, M.; Osiński, B.; Ondruska, P. Urban Driver: Learning to drive from real-world demonstrations using policy gradients. In Proceedings of The 5th Conference on Robot Learning (CoRL), London, UK, 8–11 November 2021; PMLR: London, UK, 2021; Volume 164, pp. 718–728. Available online: https://proceedings.mlr.press/v164/scheel22a.html (accessed on 25 February 2026).
  47. Cheng, J.; Chen, Y.; Mei, X.; Yang, B.; Li, B.; Liu, M. Rethinking imitation-based planner for autonomous driving. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA); IEEE: New York, NY, USA, 2024; pp. 14123–14130. [Google Scholar]
Figure 1. Comparison of time-domain and frequency-domain planning in a sudden hazard scenario. (a) Time-Domain Planner: Delayed Response. (b) Frequency-Domain Planner: Prompt Global Adjustment. Red vehicles denote the ego vehicle and blue vehicles denote surrounding vehicles; the pedestrian icon indicates the hazard, and the speed–time curve illustrates the braking response over time.
Figure 2. Comparison of autoregressive trajectory generation processes. (Left): time-domain autoregression generates one incremental action per step, forming a trajectory through local accumulation. (Right): frequency-domain autoregression progressively refines a global trajectory by predicting frequency coefficients from low to high.
Figure 3. Schematic of the incremental action representation ( Δ x , Δ y , Δ θ ) for trajectory planning.
Figure 4. Schematic of time-domain autoregressive planning with incremental action tokens and pose-based feedback.
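The pose-based feedback in Figure 4 amounts to integrating each incremental action (Δx, Δy, Δθ) into a global pose. The sketch below illustrates that composition under one common convention (increments expressed in the current ego frame, rotated into the global frame before accumulation); this is an illustrative assumption, not the authors' exact implementation.

```python
import math

def rollout_poses(increments, x0=0.0, y0=0.0, theta0=0.0):
    """Integrate per-step incremental actions (dx, dy, dtheta), assumed to be
    expressed in the current ego frame, into global poses (x, y, theta)."""
    poses = [(x0, y0, theta0)]
    x, y, theta = x0, y0, theta0
    for dx, dy, dtheta in increments:
        # Rotate the body-frame increment into the global frame, then translate.
        x += dx * math.cos(theta) - dy * math.sin(theta)
        y += dx * math.sin(theta) + dy * math.cos(theta)
        theta += dtheta
        poses.append((x, y, theta))
    return poses
```

Because each step depends on the pose produced by the previous one, a single erroneous increment perturbs every subsequent pose, which is exactly the error-accumulation mechanism the frequency-domain representation is designed to avoid.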
Figure 5. Pipeline of frequency-domain action tokenization using DCT and Byte Pair Encoding (BPE) [40] for compact vocabulary generation. The blue/orange/green curves and bars correspond to the three action dimensions ( d x , d y , d yaw ) , respectively. The colored blocks use the same color coding and represent the corresponding quantized DCT-coefficient tokens after flattening. The red dashed line marks the truncation boundary, and the red curve highlights the merge process applied afterward.
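The first two stages of the Figure 5 pipeline (DCT, truncation, uniform quantization) can be sketched as follows. This is a minimal illustration for a single action dimension: the DCT is a naive type-II transform, and the parameters `k_keep`, `n_bins`, and `c_max` are hypothetical; the BPE merging stage applied afterward is omitted.

```python
import math

def dct_ii(signal):
    """Naive (unnormalized) type-II DCT, adequate for short action sequences."""
    n = len(signal)
    return [sum(signal[i] * math.cos(math.pi / n * (i + 0.5) * k) for i in range(n))
            for k in range(n)]

def tokenize(signal, k_keep=4, n_bins=16, c_max=8.0):
    """Keep the first k_keep low-frequency coefficients and uniformly quantize
    each into one of n_bins integer tokens over the range [-c_max, c_max]."""
    coeffs = dct_ii(signal)[:k_keep]
    tokens = []
    for c in coeffs:
        c = max(-c_max, min(c_max, c))  # clip to the quantizer range
        tokens.append(round((c + c_max) / (2 * c_max) * (n_bins - 1)))
    return tokens
```

Truncating at `k_keep` coefficients is what makes the representation compact: the leading tokens already fix the global shape of the trajectory, and later tokens only refine local detail.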
Figure 6. Model architecture. The framework consists of three key components: (1) a multimodal scene encoder that aggregates context from maps, agents, traffic lights, and ego-state; (2) an autoregressive frequency-domain token decoder that generates a sequence of action tokens; and (3) a deterministic trajectory reconstructor that decodes tokens into frequency coefficients and reconstructs motion plans.
Figure 7. Workflow of the GRPO algorithm. The process involves: (1) generating candidate action token sequences through an autoregressive model; (2) decoding and executing these sequences in a simulated environment to compute a multi-component reward (safety, comfort, rule compliance, progress); and (3) using the reward returns to compute advantages for policy optimization.
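Step (3) of the Figure 7 workflow computes group-relative advantages: each candidate's scalar return is standardized against the mean and standard deviation of its own sampled group, which is the defining step of GRPO. A minimal sketch (the multi-component reward shaping itself is not shown):

```python
def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: standardize each candidate's return against
    the statistics of its sampled group (no learned value function needed)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

Candidates that outperform their group receive positive advantages and are reinforced; below-average candidates are suppressed, without requiring a separate critic network.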
Figure 8. Comparison of training and validation losses between time-domain and frequency-domain action representations across training steps. (a) Training loss comparison. (b) Validation loss comparison.
Figure 9. Learning curves of reward versus training steps for different RL algorithms. The line styles distinguish different methods as indicated in the legend, and the shaded regions show ± one standard deviation over N random seeds.
Figure 10. Comparative response characteristics of two motion-planning paradigms in a sudden cut-in scenario. (a) Scenario BEV (blue: frequency-domain, orange: time-domain). (b) Ego vehicle speed profiles under time-domain vs. frequency-domain planning.
Figure 11. Comparative response characteristics of two motion-planning paradigms in an emergency braking scenario. (a) Scenario BEV (blue: frequency-domain, orange: time-domain). (b) Ego vehicle speed profiles under time-domain vs. frequency-domain planning.
Figure 12. Comparative control characteristics of two motion-planning paradigms in an abrupt obstruction by a stopped vehicle scenario. (a) Scenario BEV (blue: frequency-domain, orange: time-domain). (b) Ego vehicle speed profiles under time-domain vs. frequency-domain planning.
Figure 13. Comparative control characteristics of two motion-planning paradigms in a sudden pedestrian crossing at a crosswalk scenario. (a) Scenario BEV (blue: frequency-domain, orange: time-domain). (b) Ego vehicle speed profiles under time-domain vs. frequency-domain planning.
Figure 14. Near-stop failure mode. The predicted trajectory points exhibit centimeter-level net displacement but persistent oscillatory geometry, resembling an in-place spinning or jittering behavior at near-zero speed. The blue circles denote the generated trajectory points, and the blue curve shows the resulting trajectory. The numbers indicate the sequential index of the points in the order they are generated. The arrow represents the displacement vector from the first point to the last point.
Table 1. Computational cost comparison between time-domain and frequency-domain planners under the same hardware setting. Training time is reported in wall-clock hours using 24 A100 GPUs. Peak memory is measured as the maximum GPU memory footprint during training. Inference latency is measured per planning cycle in closed-loop evaluation, and we report the latency distribution statistics (mean, P95, and maximum) over all planning cycles.
| Method | IL Time (h) | RL Time (h) | Peak Mem (GB) | Mean Lat. (ms) | P95 Lat. (ms) | Max Lat. (ms) |
|---|---|---|---|---|---|---|
| Time-domain | 30 | 9 | 76 | 72.6 | 88.4 | 103.1 |
| Freq-domain | 40 | 14 | 69 | 81.8 | 101.9 | 118.5 |
Table 2. Overall closed-loop planning performance on the nuPlan validation set reported as mean ± standard deviation over N random seeds.
| Planner | NR-CLS | Collision | TTC | Drivable | Speed | Comfort | Progress | R-CLS |
|---|---|---|---|---|---|---|---|---|
| Cont. + IL | 77.82 ± 1.05 | 92.65 ± 0.38 | 88.49 ± 0.75 | 92.80 ± 0.33 | 90.22 ± 0.70 | 98.60 ± 0.25 | 84.07 ± 0.95 | 75.06 ± 1.10 |
| Cont. + IL + PPO | 81.64 ± 1.15 | 93.92 ± 0.42 | 90.31 ± 0.85 | 94.56 ± 0.36 | 92.18 ± 0.78 | 98.77 ± 0.28 | 87.05 ± 1.05 | 78.93 ± 1.20 |
| Cont. + IL + GRPO | 85.96 ± 0.90 | 95.48 ± 0.32 | 92.74 ± 0.65 | 96.03 ± 0.28 | 94.21 ± 0.62 | 99.35 ± 0.20 | 90.37 ± 0.85 | 83.91 ± 0.95 |
| Freq. + IL | 83.87 ± 1.00 | 95.01 ± 0.36 | 91.86 ± 0.70 | 95.42 ± 0.30 | 93.67 ± 0.65 | 98.97 ± 0.22 | 88.94 ± 0.90 | 81.02 ± 1.05 |
| Freq. + IL + PPO | 88.72 ± 1.10 | 96.83 ± 0.40 | 94.25 ± 0.80 | 96.94 ± 0.34 | 96.02 ± 0.72 | 99.14 ± 0.25 | **92.88 ± 1.00** | 87.46 ± 1.15 |
| Freq. + IL + GRPO (Ours) | **91.53 ± 0.95** | **97.52 ± 0.30** | **95.27 ± 0.65** | **97.52 ± 0.28** | **99.55 ± 0.55** | **99.65 ± 0.18** | 92.24 ± 0.85 | **90.44 ± 0.95** |
Note: Bold numbers indicate the best performance in each column.
Table 3. Comparison with representative planning methods on the nuPlan benchmark. * denotes with rule-based post-processing. NR/R denote non-reactive and reactive closed-loop evaluation, respectively.
| Type | Planner | Val14 NR | Val14 R | Test14-Hard NR | Test14-Hard R | Test14-Random NR | Test14-Random R |
|---|---|---|---|---|---|---|---|
| Expert | Log-Replay | 93.53 | 80.32 | 85.96 | 68.80 | 94.03 | 75.86 |
| Rule-based & Hybrid | IDM [41] | 75.60 | 77.33 | 56.15 | 62.26 | 70.39 | 72.42 |
| | PDM-Closed* [35] | 92.84 | 92.12 | 65.08 | 75.19 | 90.05 | 91.64 |
| | PDM-Hybrid* [35] | 92.77 | 92.11 | 65.99 | 76.07 | 90.10 | 91.28 |
| | GameFormer* [42] | 79.94 | 79.78 | 68.70 | 67.05 | 83.88 | 82.05 |
| | PLUTO* [36] | 92.88 | 89.84 | 80.08 | 76.88 | 92.23 | 90.29 |
| | PlanAgent* [43] | 93.26 | 92.75 | 72.51 | 76.82 | - | - |
| | Diffusion [44] | 94.26 | 92.90 | 78.87 | 82.00 | 94.80 | 91.75 |
| | Carplanner* [45] | - | - | - | - | 94.07 | 91.10 |
| Learning-based | UrbanDriver [46] | 68.57 | 64.11 | 50.40 | 49.95 | 51.83 | 67.15 |
| | PDM-Open [35] | 53.53 | 54.24 | 33.51 | 35.83 | 52.81 | 57.23 |
| | PlanTF [47] | 84.27 | 76.95 | 69.70 | 61.61 | 85.62 | 79.58 |
| | PLUTO [36] | 88.89 | 78.11 | 70.03 | 59.74 | 89.90 | 78.62 |
| | Diffusion Planner [44] | 89.87 | 82.80 | 75.99 | 69.22 | 89.19 | 82.93 |
| | Ours (Freq. Tokens + GRPO) | 90.82 | 88.31 | 79.62 | 78.19 | 92.44 | 91.08 |
Note: Bold numbers indicate the best performance in each column.
Table 4. Effect of frequency truncation length K reported as mean ± standard deviation over N independent runs. For “Infer. Time (ms)”, the reported value is the per-run average inference latency, and the standard deviation reflects run-to-run variation rather than per-cycle tail latency.
| K | NR-CLS | Collision | TTC | Drivable | Speed | Comfort | Progress | R-CLS | Infer. Time (ms) |
|---|---|---|---|---|---|---|---|---|---|
| 8 | 84.14 ± 1.20 | 90.01 ± 0.50 | 88.25 ± 0.85 | 91.12 ± 0.45 | 94.21 ± 0.75 | 96.58 ± 0.30 | 86.42 ± 1.10 | 80.33 ± 1.25 | 42.6 ± 0.9 |
| 12 | 88.72 ± 1.05 | 94.38 ± 0.42 | 92.87 ± 0.70 | 95.04 ± 0.38 | 97.88 ± 0.60 | 99.61 ± 0.18 | 90.16 ± 0.95 | 85.61 ± 1.10 | 66.9 ± 1.4 |
| 16 | **91.53 ± 0.95** | **97.52 ± 0.30** | **95.27 ± 0.65** | **97.52 ± 0.28** | **99.55 ± 0.55** | **99.65 ± 0.18** | **92.24 ± 0.85** | **90.44 ± 0.95** | 81.8 ± 1.8 |
| 24 | 91.65 ± 1.10 | 97.81 ± 0.32 | 96.12 ± 0.60 | 98.05 ± 0.25 | 99.41 ± 0.50 | 99.61 ± 0.16 | 94.05 ± 0.90 | 91.54 ± 1.05 | 130.4 ± 2.6 |
Note: Bold numbers indicate the best performance in each column.
Table 5. Effect of token vocabulary size | V | reported as mean ± standard deviation over N runs.
| \|V\| | NR-CLS | Collision | TTC | Drivable | Speed | Comfort | Progress | R-CLS |
|---|---|---|---|---|---|---|---|---|
| 512 | 87.23 ± 1.15 | 93.02 ± 0.45 | 91.18 ± 0.80 | 94.26 ± 0.40 | 97.02 ± 0.70 | 99.60 ± 0.20 | 89.04 ± 1.00 | 83.76 ± 1.20 |
| 1024 | 89.11 ± 1.05 | 95.08 ± 0.38 | 93.62 ± 0.70 | 96.18 ± 0.32 | 98.21 ± 0.62 | 99.61 ± 0.18 | 91.26 ± 0.92 | 86.97 ± 1.10 |
| 2048 | **91.53 ± 0.95** | **97.52 ± 0.30** | **95.27 ± 0.65** | **97.52 ± 0.28** | **99.55 ± 0.55** | **99.65 ± 0.18** | **92.24 ± 0.85** | **90.44 ± 0.95** |
| 4096 | 91.55 ± 1.05 | 97.68 ± 0.32 | 95.71 ± 0.62 | 97.29 ± 0.30 | 99.63 ± 0.52 | 99.62 ± 0.16 | 92.16 ± 0.88 | 90.91 ± 1.00 |
Note: Bold numbers indicate the best performance in each column.
Table 6. Effect of tokenization scheme under the same frequency-domain representation reported as mean ± standard deviation over N runs. All frequency-domain methods use K = 16 and are fine-tuned with GRPO. The vocabulary budget is matched to | V | = 2048 whenever applicable.
| Planner | NR-CLS | Collision | TTC | Drivable | Speed | Comfort | Progress | R-CLS |
|---|---|---|---|---|---|---|---|---|
| Cont. | 85.96 ± 0.90 | 95.48 ± 0.32 | 92.74 ± 0.65 | 96.03 ± 0.28 | 94.21 ± 0.62 | 99.35 ± 0.20 | 90.37 ± 0.85 | 83.91 ± 0.95 |
| Freq. + Bin | 90.70 ± 0.95 | 96.90 ± 0.34 | 94.70 ± 0.70 | 97.10 ± 0.30 | 98.70 ± 0.58 | 99.60 ± 0.18 | 92.10 ± 0.88 | 89.50 ± 0.98 |
| Freq. + KM | 91.10 ± 0.95 | 97.20 ± 0.32 | 95.00 ± 0.68 | 97.30 ± 0.30 | 99.10 ± 0.56 | 99.62 ± 0.16 | 92.20 ± 0.86 | 90.00 ± 0.96 |
| Freq. + BPE | **91.53 ± 0.95** | **97.52 ± 0.30** | **95.27 ± 0.65** | **97.52 ± 0.28** | **99.55 ± 0.55** | **99.65 ± 0.18** | **92.24 ± 0.85** | **90.44 ± 0.95** |
Note: Bold numbers indicate the best performance in each column.
Table 7. Effect of planning horizon reported as mean ± standard deviation over N runs.
| Horizon (s) | NR-CLS | Collision | TTC | Drivable | Speed | Comfort | Progress | R-CLS |
|---|---|---|---|---|---|---|---|---|
| 4.0 | 87.46 ± 1.05 | 93.88 ± 0.40 | 92.01 ± 0.75 | 95.12 ± 0.34 | 97.84 ± 0.70 | 99.63 ± 0.18 | 88.91 ± 1.00 | 84.02 ± 1.10 |
| 6.0 | **91.53 ± 0.95** | **97.52 ± 0.30** | **95.27 ± 0.65** | **97.52 ± 0.28** | **99.55 ± 0.55** | **99.65 ± 0.18** | **92.24 ± 0.85** | **90.44 ± 0.95** |
| 8.0 | 89.73 ± 1.10 | 95.41 ± 0.38 | 94.12 ± 0.70 | 96.48 ± 0.32 | 98.02 ± 0.72 | 99.59 ± 0.20 | 91.34 ± 0.95 | 86.87 ± 1.15 |
Note: Bold numbers indicate the best performance in each column.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Xia, J.; Kong, Z.; Wu, X.; Shi, B.; Han, Y.; Xu, M. Frequency-Domain Trajectory Planning for Autonomous Driving in Highly Dynamic Scenarios. Appl. Sci. 2026, 16, 2447. https://doi.org/10.3390/app16052447
