Author Contributions
Conceptualization, Y.Y.; Methodology, Y.Y.; Software, Y.Y.; Validation, Y.Y.; Formal analysis, Y.Y.; Investigation, Y.Y.; Resources, Y.Y. and K.R.; Data curation, Y.Y.; Writing—original draft preparation, Y.Y.; Writing—review and editing, Y.Y. and K.R.; Visualization, Y.Y.; Supervision, K.R.; Project administration, K.R. All authors have read and agreed to the published version of the manuscript.
Figure 1.
Taxonomy of related work on off-road autonomous navigation. The four solid-line branches are classical geometric approaches, deep learning perception (semantic segmentation and traversability), end-to-end imitation learning, and the more recent world-model-plus-sequence-policy family. TerrainFormer sits at the leaf of the last branch. Dashed red arrows mark the prior components TerrainFormer reuses: BEV world models from urban driving [24,25], decision-transformer architectures from offline RL [9,10,26], and self-supervised traversability/segmentation targets from deep learning perception [1,21,22].
Figure 2.
High-level dataflow of TerrainFormer. A raw LiDAR point cloud (left) passes through four stages: the PointPillars encoder produces a BEV feature map, the ViT-based world model compresses it to 64 latent tokens, the decision transformer produces chunk logits, and TemporalEnsemble aggregates overlapping chunks across recent frames into a single discrete action. Component internals are deliberately omitted here and are expanded in Figure 3 (encoder), Figure 4 (world model), and Figure 5 (decision transformer). Per-stage latency and parameter counts are annotated below each block; total end-to-end inference is ∼20 ms (∼50 FPS) on an NVIDIA Quadro RTX 8000 GPU (NVIDIA Corporation, Santa Clara, CA, USA). Bottom braces show the two-phase training schedule: the encoder and world model are pretrained self-supervised on LidarDustX + GOOSE-3D (Phase 1, blue), then frozen while the decision transformer is trained with behavioral cloning on RELLIS-3D (Phase 2, green).
Figure 3.
Component view of the PointPillars LiDAR encoder used by TerrainFormer. The pipeline is read left-to-right (top row) then folded back right-to-left (bottom row). Critically, the per-point MLP operates independently on every point and is not pooled itself; max-pooling is applied to the MLP outputs within each pillar (Section 3.1). Tensor shapes are annotated below each block.
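
The pooling order described in this caption (per-point features first, max-pooling per pillar second) can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `pillar_max_pool`, the 0.5 m cell size, and the toy feature vectors are assumptions made purely for illustration, and the feature lists stand in for per-point MLP outputs.

```python
from collections import defaultdict


def pillar_max_pool(points, point_features, cell_size=0.5):
    """Group points into BEV pillars and max-pool their features per pillar.

    points: list of (x, y, z) tuples in metres.
    point_features: one equal-length feature list per point; these stand in
        for the outputs of a per-point MLP (hypothetical values).
    cell_size: pillar edge length in metres (illustrative assumption).
    """
    pillars = defaultdict(list)
    for (x, y, _z), feat in zip(points, point_features):
        # Pillar index depends only on the horizontal position of the point.
        key = (int(x // cell_size), int(y // cell_size))
        pillars[key].append(feat)
    # Element-wise max over all points that fell into the same pillar.
    return {key: [max(channel) for channel in zip(*feats)]
            for key, feats in pillars.items()}


points = [(0.1, 0.2, 0.0), (0.3, 0.4, 1.0), (1.2, 0.1, 0.5)]
features = [[1.0, -2.0], [0.5, 3.0], [2.0, 2.0]]
pooled = pillar_max_pool(points, features)
```

The first two points share a pillar, so their features are reduced channel by channel; the third point sits alone in its pillar and passes through unchanged.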
Figure 4.
Component view of the world model. The three internal modules (terrain tokenizer, dynamics transformer, and latent compression) and the four self-supervised heads used during Phase 1 pretraining are shown explicitly. Only (64 latent tokens) and are passed on to the decision transformer in Phase 2; the four prediction heads remain unused at inference, and after Phase 1, the entire world model is frozen.
Figure 5.
Component view of the decision transformer. The 80-token sequence (top) is composed of one context token, the 64 world latent tokens, 10 action-history tokens, and 5 learnable chunk-query tokens. Dashed red arrows from one chunk query illustrate the cross-modal attention pattern: every chunk query attends to all spatial (world) and temporal (action-history) tokens simultaneously. The context-token output predicts the current action and confidence; the last five output positions predict the action chunk. At inference, overlapping chunks are aggregated with exponential-decay weighting.
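
One way to aggregate overlapping chunks with exponential-decay weighting is sketched below. The caption does not give the decay rate or say whether newer or older chunks receive more weight, so the `decay` value and the choice to favour the newest chunk are assumptions, as is the function name `temporal_ensemble`.

```python
import math


def temporal_ensemble(chunk_history, decay=0.5):
    """Fuse overlapping chunk predictions for the current time step.

    chunk_history[i] is the chunk predicted i frames ago: a list of K
    probability vectors, where entry j targets the step j frames after the
    chunk was produced. Entry i of chunk i therefore targets the current step.
    decay: hypothetical decay rate; weight = exp(-decay * age).
    """
    num_actions = len(chunk_history[0][0])
    combined = [0.0] * num_actions
    total_weight = 0.0
    for age, chunk in enumerate(chunk_history):
        if age >= len(chunk):
            break  # this chunk is too old to cover the current step
        weight = math.exp(-decay * age)
        for a, p in enumerate(chunk[age]):
            combined[a] += weight * p
        total_weight += weight
    probs = [c / total_weight for c in combined]
    action = max(range(num_actions), key=probs.__getitem__)
    return action, probs


# Toy example: 3 actions, chunk length K = 2, two chunks in the history.
history = [
    [[0.2, 0.7, 0.1], [0.3, 0.3, 0.4]],  # predicted at the current frame
    [[0.1, 0.1, 0.8], [0.6, 0.3, 0.1]],  # predicted one frame ago
]
action, probs = temporal_ensemble(history)
```

Because the weights are normalized, the fused vector remains a probability distribution, and the discrete action is its argmax.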
Figure 6.
Representative LiDAR scans from each dataset, projected to BEV and colored by height (clip range m to m). From left to right: RELLIS-3D—vegetation-heavy off-road, Ouster OS1-64 (∼37 k points/scan); LidarDustX—dusty construction environment, multi-sensor capture (∼111 k points/scan with the LS128 sensor shown); and GOOSE-3D—mixed European outdoor (forests, fields, campus), Velodyne VLS128 (∼199 k points/scan). The red triangle marks the ego vehicle.
Figure 7.
Training curves for both phases. Phase 1 (left): world-model training and validation total loss vs. epoch (best validation loss at epoch 20, red star). Phase 2 (right): decision-transformer focal loss (green, left axis) and validation accuracy (red, right axis) vs. epoch (best validation accuracy at epoch 3, black star). Solid lines show training metrics; dashed lines show validation metrics. Each curve is from a single training run on a Quadro RTX 8000.
Figure 8.
Confusion matrix of the decision transformer on the RELLIS-3D test set. Left: raw counts; right: row-normalized, so each diagonal entry is the per-class recall. The dominant classes (Stop, Fwd fast) reach recall of 0.959 and 0.910; recall for the remaining classes ranges from 0.618 (Fwd med) to 0.950 (Fwd right) (Table 7).
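
The link between the two panels can be shown with a small sketch: dividing each row of a count matrix by its row sum puts per-class recall on the diagonal. The 3×3 counts below are toy values, not the paper's data, and `row_normalize` is a hypothetical helper name.

```python
def row_normalize(confusion):
    """Row-normalize a confusion matrix (rows = true class, cols = predicted)."""
    normalized = []
    for row in confusion:
        total = sum(row)
        # Guard against classes with no samples in the evaluated split.
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized


counts = [
    [8, 2, 0],
    [1, 3, 1],
    [0, 0, 5],
]
normalized = row_normalize(counts)
# The diagonal of the row-normalized matrix is the per-class recall.
recall = [normalized[i][i] for i in range(len(counts))]
```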
Figure 9.
Per-class F1 score with sample support counts. The dashed line shows the macro-averaged F1.
Figure 10.
Comparison of decision accuracy using current frames (ground truth) versus predicted future frames from the world model.
Figure 11.
Real-time simulation interface used to exercise TerrainFormer at 10 Hz on RELLIS-3D test sequences. The large left panel shows the BEV point cloud (height-colored) with the ego vehicle marker; the upper-right panel shows the predicted traversability map (computed from a height-variance proxy of the point cloud); the lower-right card shows the decision transformer's predicted action with a confidence bar and a ground-truth-match indicator; the bottom panel shows the per-action probability distribution across all 12 classes. Each predicted decision is also published over UDP for external monitoring (Section 7.7).
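
Publishing a decision over UDP could look roughly like the sketch below. The message format is not specified in this section, so the JSON layout, the field names, and the helper names `encode_decision` and `publish_decision` are all assumptions for illustration only.

```python
import json
import socket


def encode_decision(action_id, confidence, probabilities):
    """Serialize one decision as a UTF-8 JSON datagram (hypothetical schema)."""
    payload = {
        "action_id": action_id,
        "confidence": confidence,
        "probs": probabilities,
    }
    return json.dumps(payload).encode("utf-8")


def publish_decision(sock, address, action_id, confidence, probabilities):
    """Send one decision datagram to a monitoring endpoint."""
    sock.sendto(encode_decision(action_id, confidence, probabilities), address)


# Build a packet for action 3 of 12 without touching the network.
uniform_rest = [0.09 / 11] * 11
packet = encode_decision(3, 0.91, uniform_rest[:3] + [0.91] + uniform_rest[3:])
```

A monitor would create a datagram socket with `socket.socket(socket.AF_INET, socket.SOCK_DGRAM)` and call `publish_decision(sock, ("127.0.0.1", 5005), ...)`; the address and port here are placeholders. UDP is fire-and-forget, so a slow or absent listener does not stall the 10 Hz loop.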
Table 1.
Methodological comparison, Part 1 of 2: architectural backbones. “WM” = world model with latent dynamics; “DT” = decision transformer (transformer-based sequence policy); “E2E” = direct sensors-to-action backbone without an intermediate latent.
| System | Domain | Perception | Policy |
|---|---|---|---|
| BADGR [33] | off-road | RGB + IMU CNN | E2E MLP |
| TartanDrive [34] | off-road | LiDAR + RGB + IMU | E2E MLP |
| Dreamer/DreamerV3 [7,8] | cont. control | RGB CNN | WM imagination |
| MILE [24] | urban | camera → BEV | WM + planner |
| PCWM [25] | urban | LiDAR point clouds | WM |
| Decision Transformer [9] | offline RL | state vector | DT |
| Wayformer [28] | urban | multi-modal BEV | transformer |
| TerrainFormer (ours) | off-road | LiDAR → BEV (PointPillars) | WM + DT |
Table 2.
Methodological comparison, Part 2 of 2: deployment-relevant capabilities. “Cross-dataset” = perception backbone is pretrained on data disjoint from the policy-training set. “Real-time” = end-to-end inference FPS on a single GPU at the system’s stated input size.
| System | Action Repr. | Cross-Dataset | Real-Time |
|---|---|---|---|
| BADGR [33] | continuous | — | yes (10 Hz) |
| TartanDrive [34] | continuous | — | yes (10 Hz) |
| Dreamer/DreamerV3 [7,8] | continuous | no | no (planning) |
| MILE [24] | continuous traj. | no | near (CARLA) |
| PCWM [25] | — | no | no |
| Decision Transformer [9] | discrete | — | yes |
| Wayformer [28] | continuous traj. | no | yes |
| TerrainFormer (ours) | discrete (12 actions) | yes | yes (∼50 FPS) |
Table 3.
LiDAR Encoder Comparison: PointPillars (Ours) vs. PointNet++.
| Metric | PointPillars (Ours) | PointNet++ |
|---|---|---|
| Encoder latency | ∼5 ms | ∼25 ms |
| Encoder FPS | ∼200 FPS | ∼40 FPS |
| Full-system FPS | ∼50 FPS | ∼25 FPS |
| Parameters | 0.15 M | 1.48 M |
| Output format | BEV (direct) | Point features |
| BEV projection | Built-in | Requires additional |
| TensorRT support | Full | Limited |
| Real-time (10 Hz) | ✓ | × |
Table 4.
Discrete Action Space (12 Forward-Motion Actions).
| ID | Action | ID | Action |
|---|---|---|---|
| 0 | Stop | 6 | Left slight |
| 1 | Forward slow | 7 | Right slight |
| 2 | Forward medium | 8 | Right medium |
| 3 | Forward fast | 9 | Right sharp |
| 4 | Left sharp | 10 | Forward left |
| 5 | Left medium | 11 | Forward right |
Table 5.
Dataset assignment across the two training phases.
| Phase | RELLIS-3D | LidarDustX | GOOSE-3D |
|---|---|---|---|
| Phase 1 (World Model) | — | ✓ | ✓ |
| Phase 2 (Decision T.) | ✓ | — | — |
| Phase 2 (test) | ✓ | — | — |
Table 6.
Decision Transformer Overall Performance.
| Metric | Validation | Test |
|---|---|---|
| Accuracy | 90.03% | 87.31% |
| Precision (macro) | — | 0.7828 |
| Recall (macro) | — | 0.8127 |
| F1 Score (macro) | — | 0.7948 |
Table 7.
Per-Class Metrics on the Test Set (12-Action Forward-Only Space).
| ID | Action | Support | Prec. | Rec. | F1 |
|---|---|---|---|---|---|
| 0 | Stop | 677 (33.3%) | 0.991 | 0.959 | 0.975 |
| 1 | Fwd slow | 55 (2.7%) | 0.560 | 0.764 | 0.646 |
| 2 | Fwd med | 204 (10.0%) | 0.728 | 0.618 | 0.668 |
| 3 | Fwd fast | 707 (34.8%) | 0.882 | 0.910 | 0.896 |
| 4 | L sharp | 51 (2.5%) | 0.813 | 0.765 | 0.788 |
| 5 | L med | 51 (2.5%) | 0.707 | 0.804 | 0.752 |
| 6 | L slight | 78 (3.8%) | 0.733 | 0.705 | 0.719 |
| 7 | R sharp | 4 (0.2%) | 0.600 | 0.750 | 0.667 |
| 8 | R med | 63 (3.1%) | 0.908 | 0.937 | 0.922 |
| 9 | R slight | 61 (3.0%) | 0.618 | 0.689 | 0.651 |
| 10 | Fwd left | 42 (2.1%) | 0.905 | 0.905 | 0.905 |
| 11 | Fwd right | 40 (2.0%) | 0.950 | 0.950 | 0.950 |
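
The macro scores in Table 6 can be cross-checked against the per-class rows above. The sketch below copies the Table 7 values and takes unweighted means. It also computes the support-weighted mean recall, which equals overall accuracy by definition. Small deviations from the reported figures are expected because the per-class values are rounded to three decimals.

```python
# (support, precision, recall, f1) for action IDs 0..11, copied from Table 7.
per_class = [
    (677, 0.991, 0.959, 0.975),
    (55, 0.560, 0.764, 0.646),
    (204, 0.728, 0.618, 0.668),
    (707, 0.882, 0.910, 0.896),
    (51, 0.813, 0.765, 0.788),
    (51, 0.707, 0.804, 0.752),
    (78, 0.733, 0.705, 0.719),
    (4, 0.600, 0.750, 0.667),
    (63, 0.908, 0.937, 0.922),
    (61, 0.618, 0.689, 0.651),
    (42, 0.905, 0.905, 0.905),
    (40, 0.950, 0.950, 0.950),
]

n = len(per_class)
# Macro averaging: every class counts equally, regardless of support.
macro_precision = sum(p for _, p, _, _ in per_class) / n
macro_recall = sum(r for _, _, r, _ in per_class) / n
macro_f1 = sum(f for _, _, _, f in per_class) / n

# Support-weighted recall: correct predictions divided by total samples.
total_support = sum(s for s, _, _, _ in per_class)
accuracy = sum(s * r for s, _, r, _ in per_class) / total_support
```

The results agree with Table 6 to about three decimals: macro precision ≈ 0.783, macro recall ≈ 0.813, macro F1 ≈ 0.795, and accuracy ≈ 0.873 over 2033 test samples. The gap between accuracy and macro F1 reflects the two dominant classes, which carry about two-thirds of the support.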
Table 8.
Predictive evaluation: current vs. predicted future frames.
| Metric | Current | Predicted |
|---|---|---|
| Accuracy | 87.31% | 86.52% |
| Precision (macro) | 0.7828 | 0.7687 |
| Recall (macro) | 0.8127 | 0.8118 |
| F1 Score (macro) | 0.7948 | 0.7864 |

Accuracy difference (current minus predicted): 0.79 pp. Prediction agreement: 98.82%.
Table 9.
Component-wise ablation results on the RELLIS-3D test split. Each variant removes one component from the full TerrainFormer model while retaining the remaining architecture and evaluation settings. D1 removes the world-model representation and directly uses the raw BEV features; D2 disables action chunking by setting K = 1; D3 disables TemporalEnsemble while retaining action chunking; and D4 removes the strict cross-dataset pretraining protocol by including RELLIS-3D during Phase 1 pretraining. Δ Acc. denotes the change in test accuracy, measured in percentage points, relative to the full TerrainFormer baseline.
| Variant | Component Removed | Test Acc. (%) | Macro F1 | Δ Acc. (pp) |
|---|---|---|---|---|
| Full model | — (baseline) | | | — |
| D1 | World model (raw BEV) | | | |
| D2 | Action chunking (K = 1) | | | |
| D3 | TemporalEnsemble | | | |
| D4 | Cross-dataset protocol | | | |
Table 10.
Summary of ablations reported in this section.
| Ablation | Variants Compared | Empirical? | Effect |
|---|---|---|---|
| Encoder backbone | PointPillars vs. PointNet++ | yes | end-to-end throughput |
| Predictive eval | GT obs. vs. predicted obs. | yes | 0.79 pp accuracy drop, 98.82% agreement |
| Focal-loss recipe | focal vs. plain CE (early run) | partial | lifts minority F1 from ∼0 to ≥0.65 |
| Chunk size K | | no (design) | matches [12] default |
| Goal lookahead k | | no (design) | balances pose noise and horizon |
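
The focal-loss row is easier to read next to the standard definition of the loss, FL = −(1 − p_t)^γ · log(p_t), where p_t is the probability assigned to the true class. The sketch below uses that textbook form for a single sample. The focusing parameter γ = 2 is an assumption, and any class weighting used in the actual recipe is not stated in this section and is omitted here.

```python
import math


def cross_entropy(probs, target):
    """Plain cross-entropy for one sample given class probabilities."""
    return -math.log(probs[target])


def focal_loss(probs, target, gamma=2.0):
    """Focal loss for one sample; gamma = 2.0 is an illustrative default."""
    p_t = probs[target]
    # The (1 - p_t)^gamma factor shrinks the loss of well-classified samples.
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


easy = [0.05, 0.90, 0.05]  # confident and correct for target class 1
hard = [0.60, 0.10, 0.30]  # poorly classified for target class 1

easy_ratio = focal_loss(easy, 1) / cross_entropy(easy, 1)  # (1 - 0.9)^2
hard_ratio = focal_loss(hard, 1) / cross_entropy(hard, 1)  # (1 - 0.1)^2
```

The easy sample keeps about 1% of its cross-entropy loss while the hard one keeps about 81%, so the gradient is dominated by misclassified samples, which are often the minority classes.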
Table 11.
Inference Latency and FPS: TerrainFormer vs. PointNet++ Variant (NVIDIA Quadro RTX 8000).
| Component | TerrainFormer (Ours) | with PointNet++ |
|---|---|---|
| LiDAR Encoder | ∼5 ms | ∼25 ms |
| World Model | ∼10 ms | ∼10 ms |
| Decision Transformer | ∼5 ms | ∼5 ms |
| Total | ∼20 ms | ∼40 ms |
| Throughput | ∼50 FPS | ∼25 FPS |
| 10 Hz real-time | ✓ | × |
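
The Total and Throughput rows follow directly from the per-stage latencies, assuming the three stages run sequentially with no overlap. A short sketch of that arithmetic, using the approximate values from Table 11 (the dictionary keys are illustrative names):

```python
# Approximate per-stage latencies in milliseconds, copied from Table 11.
stage_latency_ms = {
    "TerrainFormer": {"encoder": 5, "world_model": 10, "decision_transformer": 5},
    "PointNet++ variant": {"encoder": 25, "world_model": 10, "decision_transformer": 5},
}


def total_and_fps(stages):
    """Sum sequential stage latencies and convert to frames per second."""
    total_ms = sum(stages.values())
    return total_ms, 1000.0 / total_ms


summary = {name: total_and_fps(stages) for name, stages in stage_latency_ms.items()}
```

This gives 20 ms (50 FPS) for TerrainFormer and 40 ms (25 FPS) for the PointNet++ variant. The whole gap comes from the encoder stage, since the world model and decision transformer are shared.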