Author Contributions
Conceptualization, M.-J.-S.W.; methodology, Y.J., D.T., and X.-J.C.; software, Z.-Y.W. and C.-W.L.; validation, Y.J. and D.T.; formal analysis, X.-J.C.; investigation, Y.J.; resources, M.-J.-S.W.; data curation, D.T.; writing—original draft preparation, Y.J. and D.T.; writing—review and editing, M.-J.-S.W. and X.-J.C.; visualization, Z.-Y.W.; supervision, M.-J.-S.W.; project administration, M.-J.-S.W.; funding acquisition, M.-J.-S.W. All authors have read and agreed to the published version of the manuscript.
Figure 1.
Motivation. Standard VLA policies are brittle in long-horizon manipulation due to error accumulation and delayed failures. The figure illustrates three key observations: (left) accumulated errors over time steps; (middle) the gap between predicted and actual states; (right) how PI-VLA’s uncertainty-aware replanning mitigates these issues. We observe that robustness hinges on preserving action–state symmetry while explicitly detecting and resolving symmetry breaking. PI-VLA addresses this via predictive modeling and interactive replanning.
Figure 2.
Algorithm architecture. PI-VLA consists of a Cognitive–Motor Synergy (CMS) core with dual discrete and continuous action heads, state prediction, and value estimation. The architecture is organized into three main components: (1) Visual-Language Encoder: Processes RGB observations through a vision encoder (SigLIP) and fuses them with language instructions via a pre-trained VLM backbone (Prismatic-7B). (2) CMS Module: Four parallel projection heads generate discrete action tokens, continuous action chunks, predicted future states, and value estimates. (3) AURD Module: Monitors cross-modal action discrepancy and state prediction error to compute uncertainty signals, triggering closed-loop replanning when thresholds are exceeded. Solid arrows indicate forward computation; dashed arrows indicate uncertainty-driven feedback.
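As a rough illustration of the AURD gating logic described in the Figure 2 caption, the sketch below combines the two monitored signals (cross-modal action discrepancy and state prediction error) into a single uncertainty value and compares it against low and high thresholds. The function name, weights, and threshold values are illustrative placeholders, not the values used in the paper.

```python
def aurd_decision(action_discrepancy, state_pred_error,
                  low_threshold=0.3, high_threshold=0.7,
                  w_action=0.5, w_state=0.5):
    """Gate execution on a combined uncertainty signal.

    Weights and thresholds here are hypothetical defaults chosen
    for illustration only.
    """
    # Weighted combination of the two uncertainty sources.
    uncertainty = w_action * action_discrepancy + w_state * state_pred_error
    if uncertainty >= high_threshold:
        mode = "replan"      # trigger closed-loop replanning
    elif uncertainty >= low_threshold:
        mode = "cautious"    # execute with elevated monitoring
    else:
        mode = "execute"     # safe execution zone
    return mode, uncertainty
```

The three return modes correspond to the red, yellow, and green regions in the AURD decision timeline (Figure 15).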
Figure 3.
Performance comparison on the LIBERO benchmark across four task suites. This figure provides a visual summary of the results in Table 2; readers are referred to the table for precise numerical values.
Figure 4.
Inference rate comparison across methods. PI-VLA maintains competitive throughput while providing uncertainty-aware planning capabilities.
Figure 5.
Per-task success rate breakdown on the LIBERO Spatial suite. PI-VLA demonstrates consistent improvements across all manipulation tasks.
Figure 6.
Real-world success rate heatmap across different objects (Block, Ball, Rock) and instructions (Away, Left, Right). PI-VLA demonstrates consistent high performance across all evaluated conditions.
Figure 7.
Radar chart visualizing generalization performance across different symmetry-breaking conditions. PI-VLA exhibits superior robustness in all evaluated scenarios.
Figure 8.
Robustness degradation under increasing levels of visual distraction. PI-VLA demonstrates superior resilience under extreme symmetry-breaking conditions.
Figure 9.
Cross-dataset generalization matrix illustrating transfer performance across different benchmarks. Diagonal entries correspond to in-distribution evaluation.
Figure 10.
Ablation study results showing the contribution of each PI-VLA component. The full model significantly outperforms all ablated variants.
Figure 11.
Incremental component contribution analysis illustrating how each PI-VLA module improves performance over the baseline.
Figure 12.
Training dynamics of PI-VLA. Left: Loss curves for the three components. Right: Validation success rate over training iterations.
Figure 13.
Uncertainty analysis over action horizon steps. The combined uncertainty signal increases with horizon length, enabling adaptive re-planning.
Figure 14.
Action horizon analysis. Left: Success rate comparison between PI-VLA’s adaptive horizon and fixed-horizon baselines. Right: Distribution of executed action horizons showing AURD’s dynamic adjustment behavior.
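The adaptive-horizon behavior summarized in the Figure 13 and Figure 14 captions (uncertainty grows with horizon length, so the executed chunk is cut short when it crosses a threshold) can be sketched as follows. The function name, threshold, and maximum horizon are illustrative assumptions, not the paper's settings.

```python
def executed_horizon(step_uncertainties, high_threshold=0.7, max_horizon=8):
    """Decide how many steps of a predicted action chunk to execute.

    Executes steps in order and stops at the first step whose
    uncertainty reaches the (hypothetical) high threshold, yielding
    a variable executed horizon instead of a fixed one.
    """
    n = 0
    for u in step_uncertainties[:max_horizon]:
        if u >= high_threshold:
            break  # replan from here instead of executing blindly
        n += 1
    return max(n, 1)  # always execute at least one action
```

Under this scheme, low-uncertainty rollouts execute the full chunk, while rising uncertainty truncates execution early, producing the horizon distribution shown in Figure 14 (right).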
Figure 15.
AURD decision timeline during task execution. The uncertainty signal (blue line) triggers re-planning events (red vertical dashed lines) when exceeding the high threshold (upper horizontal line). The low threshold (lower horizontal line) defines the safe execution zone. Green regions indicate successful action execution; yellow regions indicate cautious execution with elevated uncertainty; red markers indicate re-planning triggers. This visualization demonstrates how AURD enables adaptive behavior in uncertain situations.
Figure 16.
Parameter sensitivity analysis for key hyperparameters. Red dashed lines indicate optimal values used in our experiments.
Figure 17.
Scalability analysis. Left: Performance vs. number of training demonstrations showing superior data efficiency. Right: Performance vs. model size demonstrating consistent improvements with scale.
Figure 18.
Loss landscape visualization for the three components of the unified training objective. The imitation loss exhibits a smooth, well-behaved landscape, while the reinforcement loss shows more complex structure.
Table 1.
Comparative summary of PI-VLA and representative baseline methods. Key architectural and methodological differences are highlighted. × indicates the feature is not supported; ✓ indicates the feature is supported.
| Method | Action Type | Dual Heads | World Model | Adaptive Horizon | Params |
|---|---|---|---|---|---|
| Diffusion Policy | Continuous | × | × | × | 0.3 B |
| OpenVLA | Discrete | × | × | × | 7 B |
| OpenVLA-OFT | Continuous | × | × | × | 7 B |
| FAST | Tokenized | × | × | × | 7 B |
| HybridVLA | Hybrid | ✓ | × | × | 7 B |
| PI-VLA (Ours) | Hybrid | ✓ | ✓ | ✓ | 7 B |
Table 2.
Success rates (%) on the LIBERO benchmark. Best scores are shown in bold, and second-best results are underlined. Avg. denotes the overall success rate reported on LIBERO (computed as an official aggregate across all tasks rather than a simple arithmetic mean of the four suites).
| Method | Spatial | Object | Goal | Long | Avg. |
|---|---|---|---|---|---|
| Diffusion Policy | 65.2 | 60.1 | 58.7 | 52.3 | 59.1 |
| Octo | 68.9 | 64.3 | 62.5 | 56.8 | 63.1 |
| DiT Policy | 70.5 | 66.7 | 65.2 | 59.4 | 65.5 |
| OpenVLA | 72.1 | 68.5 | 67.8 | 61.9 | 67.6 |
| OpenVLA-OFT | 73.8 | 70.2 | 69.5 | 63.4 | 69.2 |
| EverydayVLA | 74.4 | 67.5 | 64.3 | 54.7 | 65.2 |
| PI-VLA (Ours) | 79.5 | 73.4 | 73.3 | 66.6 | 73.2 |
Table 3.
Real-world in-distribution pick-and-place success rates (%). Best results are shown in bold.
| Method | Block Away | Block Left | Block Right | Ball Away | Ball Left | Ball Right | Rock Away | Rock Left | Rock Right | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| OpenVLA | 80 | 45 | 50 | 75 | 60 | 70 | 70 | 50 | 55 | 61.7 |
| OpenVLA-OFT | 85 | 60 | 65 | 80 | 65 | 75 | 75 | 60 | 65 | 70.0 |
| EverydayVLA | 90 | 80 | 75 | 90 | 85 | 80 | 85 | 80 | 75 | 82.2 |
| PI-VLA (Ours) | 95 | 90 | 85 | 95 | 90 | 85 | 90 | 85 | 80 | 88.3 |
Table 4.
Generalization and robustness evaluation (success rate %). Bold values indicate the best result in each column.
| Method | Unseen Tasks | Unseen Env. | Static Dist. | Dynamic Dist. |
|---|---|---|---|---|
| OpenVLA | 55.2 | 58.7 | 50.1 | 45.3 |
| OpenVLA-OFT | 60.5 | 63.4 | 55.8 | 50.2 |
| EverydayVLA | 65.8 | 68.9 | 64.2 | 63.5 |
| PI-VLA (Ours) | 72.4 | 75.1 | 71.8 | 70.1 |
Table 5.
Ablation study on CMS action heads (LIBERO Spatial, success rate %). Best results are shown in bold.
| Variant | Spatial Suite (%) |
|---|---|
| PI-VLA (Full CMS) | 79.5 |
| w/ Discrete-only head | 75.3 |
| w/ Continuous-only head | 73.9 |
Table 6.
Ablation on unified training objective (LIBERO Spatial, success rate %). Bold indicates the best result.
| Training Objective Variant | Spatial Suite (%) |
|---|---|
| Full | 79.5 |
| w/o State Prediction Loss | 76.8 |
| w/o Reinforcement Loss | 78.2 |
| Imitation-only | 74.0 |
Table 7.
Ablation on AURD module (LIBERO Spatial, success rate %). Bold indicates the best result.
| Execution Strategy | Spatial Suite (%) |
|---|---|
| PI-VLA (Full with AURD) | 79.5 |
| w/o AURD (Fixed Horizon = 5) | 74.1 |
| Replace AURD with AdaHorizon | 76.4 |
| Replace AURD with ACT (Temporal Ens.) | 73.2 |
Table 8.
Extended ablation study across all LIBERO suites (success rate %). Results demonstrate that the performance gains from PI-VLA components are consistent across different task types. Bold indicates the best result.
| Variant | Spatial | Object | Goal | Long |
|---|---|---|---|---|
| PI-VLA (Full) | 79.5 | 73.4 | 73.3 | 66.6 |
| w/ Discrete-only head | 75.3 | 69.8 | 69.1 | 62.4 |
| w/ Continuous-only head | 73.9 | 68.2 | 67.5 | 60.8 |
| w/o State Prediction Loss | 76.8 | 71.0 | 70.2 | 63.5 |
| w/o Reinforcement Loss | 78.2 | 72.1 | 71.8 | 64.9 |
| w/o AURD | 74.1 | 68.5 | 68.0 | 61.2 |
Table 9.
Comprehensive ablation incrementally adding PI-VLA components. Bold indicates the best result.
| Model Configuration | Spatial Suite (%) |
|---|---|
| OpenVLA-OFT | 73.8 |
| + Dual-Heads (Discrete + Continuous) | 74.8 |
| + Unified Loss | 77.1 |
| + AURD (Full PI-VLA) | 79.5 |