Author Contributions
Conceptualization, M.-J.-S.W.; methodology, Y.J., D.T., and X.-J.C.; software, Z.-Y.W. and C.-W.L.; validation, Y.J. and D.T.; formal analysis, X.-J.C.; investigation, Y.J.; resources, M.-J.-S.W.; data curation, D.T.; writing—original draft preparation, Y.J. and D.T.; writing—review and editing, M.-J.-S.W. and X.-J.C.; visualization, Z.-Y.W.; supervision, M.-J.-S.W.; project administration, M.-J.-S.W.; funding acquisition, M.-J.-S.W. All authors have read and agreed to the published version of the manuscript.
Figure 1.
Motivation. Standard VLA policies are brittle in long-horizon manipulation due to error accumulation and delayed failures. The figure illustrates three key observations: (left) accumulated errors over time steps; (middle) the gap between predicted and actual states; (right) how PI-VLA’s uncertainty-aware replanning mitigates these issues. We observe that robustness hinges on preserving action–state symmetry while explicitly detecting and resolving symmetry breaking. PI-VLA addresses this via predictive modeling and interactive replanning.
Figure 2.
Algorithm architecture. PI-VLA consists of a Cognitive–Motor Synergy (CMS) core with dual discrete and continuous action heads, state prediction, and value estimation. The architecture is organized into three main components: (1) Visual-Language Encoder: Processes RGB observations through a vision encoder (SigLIP) and fuses them with language instructions via a pre-trained VLM backbone (Prismatic-7B). (2) CMS Module: Four parallel projection heads generate discrete action tokens, continuous action chunks, predicted future states, and value estimates. (3) AURD Module: Monitors cross-modal action discrepancy and state prediction error to compute uncertainty signals, triggering closed-loop replanning when thresholds are exceeded. Solid arrows indicate forward computation; dashed arrows indicate uncertainty-driven feedback.
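As a rough illustration of the AURD gating logic described in the Figure 2 caption, the sketch below combines the two monitored signals (cross-modal action discrepancy and state prediction error) into a single uncertainty value and compares it against low and high thresholds. The function name, weights, and threshold values are illustrative placeholders, not the values used in the paper.

```python
def aurd_decision(action_discrepancy, state_pred_error,
                  low_threshold=0.3, high_threshold=0.7,
                  w_action=0.5, w_state=0.5):
    """Gate execution on a combined uncertainty signal.

    Weights and thresholds here are hypothetical defaults chosen
    for illustration only.
    """
    # Weighted combination of the two uncertainty sources.
    uncertainty = w_action * action_discrepancy + w_state * state_pred_error
    if uncertainty >= high_threshold:
        mode = "replan"      # trigger closed-loop replanning
    elif uncertainty >= low_threshold:
        mode = "cautious"    # execute with elevated monitoring
    else:
        mode = "execute"     # safe execution zone
    return mode, uncertainty
```

The three return modes correspond to the red, yellow, and green regions in the AURD decision timeline (Figure 15).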
Figure 3.
Performance comparison on the LIBERO benchmark across four task suites. This figure provides a visual summary of the results in Table 2; readers are referred to the table for precise numerical values.
Figure 4.
Inference rate comparison across methods. PI-VLA maintains competitive throughput while providing uncertainty-aware planning capabilities.
Figure 5.
Per-task success rate breakdown on the LIBERO Spatial suite. PI-VLA demonstrates consistent improvements across all manipulation tasks.
Figure 6.
Real-world success rate heatmap across different objects (Block, Ball, Rock) and instructions (Away, Left, Right). PI-VLA demonstrates consistent high performance across all evaluated conditions.
Figure 7.
Radar chart visualizing generalization performance across different symmetry-breaking conditions. PI-VLA exhibits superior robustness in all evaluated scenarios.
Figure 8.
Robustness degradation under increasing levels of visual distraction. PI-VLA demonstrates superior resilience under extreme symmetry-breaking conditions.
Figure 9.
Cross-dataset generalization matrix illustrating transfer performance across different benchmarks. Diagonal entries correspond to in-distribution evaluation.
Figure 10.
Ablation study results showing the contribution of each PI-VLA component. The full model significantly outperforms all ablated variants.
Figure 11.
Incremental component contribution analysis illustrating how each PI-VLA module improves performance over the baseline.
Figure 12.
Training dynamics of PI-VLA. Left: Loss curves for the three components. Right: Validation success rate over training iterations.
Figure 13.
Uncertainty analysis over action horizon steps. The combined uncertainty signal increases with horizon length, enabling adaptive re-planning.
Figure 14.
Action horizon analysis. Left: Success rate comparison between PI-VLA’s adaptive horizon and fixed-horizon baselines. Right: Distribution of executed action horizons showing AURD’s dynamic adjustment behavior.
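The adaptive-horizon behavior summarized in the Figure 13 and Figure 14 captions (uncertainty grows with horizon length, so the executed chunk is cut short when it crosses a threshold) can be sketched as follows. The function name, threshold, and maximum horizon are illustrative assumptions, not the paper's settings.

```python
def executed_horizon(step_uncertainties, high_threshold=0.7, max_horizon=8):
    """Decide how many steps of a predicted action chunk to execute.

    Executes steps in order and stops at the first step whose
    uncertainty reaches the (hypothetical) high threshold, yielding
    a variable executed horizon instead of a fixed one.
    """
    n = 0
    for u in step_uncertainties[:max_horizon]:
        if u >= high_threshold:
            break  # replan from here instead of executing blindly
        n += 1
    return max(n, 1)  # always execute at least one action
```

Under this scheme, low-uncertainty rollouts execute the full chunk, while rising uncertainty truncates execution early, producing the horizon distribution shown in Figure 14 (right).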
Figure 15.
AURD decision timeline during task execution. The uncertainty signal (blue line) triggers re-planning events (red vertical dashed lines) when exceeding the high threshold (upper horizontal line). The low threshold (lower horizontal line) defines the safe execution zone. Green regions indicate successful action execution; yellow regions indicate cautious execution with elevated uncertainty; red markers indicate re-planning triggers. This visualization demonstrates how AURD enables adaptive behavior in uncertain situations.
Figure 16.
Parameter sensitivity analysis for key hyperparameters. Red dashed lines indicate optimal values used in our experiments.
Figure 17.
Scalability analysis. Left: Performance vs. number of training demonstrations showing superior data efficiency. Right: Performance vs. model size demonstrating consistent improvements with scale.
Figure 18.
Loss landscape visualization for the three components of the unified training objective. The imitation loss exhibits a smooth, well-behaved landscape, while the reinforcement loss shows more complex structure.
Table 1.
Comparative summary of PI-VLA and representative baseline methods. Key architectural and methodological differences are highlighted. × indicates the feature is not supported; ✓ indicates the feature is supported.
| Method | Action Type | Dual Heads | World Model | Adaptive Horizon | Params |
|---|---|---|---|---|---|
| Diffusion Policy | Continuous | × | × | × | 0.3 B |
| OpenVLA | Discrete | × | × | × | 7 B |
| OpenVLA-OFT | Continuous | × | × | × | 7 B |
| FAST | Tokenized | × | × | × | 7 B |
| HybridVLA | Hybrid | ✓ | × | × | 7 B |
| PI-VLA (Ours) | Hybrid | ✓ | ✓ | ✓ | 7 B |
Table 2.
Success rates (%) on the LIBERO benchmark. Best scores are shown in bold, and second-best results are underlined. Avg. denotes the overall success rate reported on LIBERO (computed as an official aggregate across all tasks rather than a simple arithmetic mean of the four suites).
| Method | Spatial | Object | Goal | Long | Avg. |
|---|---|---|---|---|---|
| Diffusion Policy | 65.2 | 60.1 | 58.7 | 52.3 | 59.1 |
| Octo | 68.9 | 64.3 | 62.5 | 56.8 | 63.1 |
| DiT Policy | 70.5 | 66.7 | 65.2 | 59.4 | 65.5 |
| OpenVLA | 72.1 | 68.5 | 67.8 | 61.9 | 67.6 |
| OpenVLA-OFT | 73.8 | 70.2 | 69.5 | 63.4 | 69.2 |
| EverydayVLA | 74.4 | 67.5 | 64.3 | 54.7 | 65.2 |
| PI-VLA (Ours) | 79.5 | 73.4 | 73.3 | 66.6 | 73.2 |
Table 3.
Real-world in-distribution pick-and-place success rates (%). Best results are shown in bold.
| Method | Block Away | Block Left | Block Right | Ball Away | Ball Left | Ball Right | Rock Away | Rock Left | Rock Right | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| OpenVLA | 80 | 45 | 50 | 75 | 60 | 70 | 70 | 50 | 55 | 61.7 |
| OpenVLA-OFT | 85 | 60 | 65 | 80 | 65 | 75 | 75 | 60 | 65 | 70.0 |
| EverydayVLA | 90 | 80 | 75 | 90 | 85 | 80 | 85 | 80 | 75 | 82.2 |
| PI-VLA (Ours) | 95 | 90 | 85 | 95 | 90 | 85 | 90 | 85 | 80 | 88.3 |
Table 4.
Generalization and robustness evaluation (success rate %). Bold values indicate the best result in each column.
| Method | Unseen Tasks | Unseen Env. | Static Dist. | Dynamic Dist. |
|---|---|---|---|---|
| OpenVLA | 55.2 | 58.7 | 50.1 | 45.3 |
| OpenVLA-OFT | 60.5 | 63.4 | 55.8 | 50.2 |
| EverydayVLA | 65.8 | 68.9 | 64.2 | 63.5 |
| PI-VLA (Ours) | 72.4 | 75.1 | 71.8 | 70.1 |
Table 5.
Ablation study on CMS action heads (LIBERO Spatial, success rate %). Best results are shown in bold.
| Variant | Spatial Suite (%) |
|---|---|
| PI-VLA (Full CMS) | 79.5 |
| w/ Discrete-only head | 75.3 |
| w/ Continuous-only head | 73.9 |
Table 6.
Ablation on unified training objective (LIBERO Spatial, success rate %). Bold indicates the best result.
| Training Objective Variant | Spatial Suite (%) |
|---|---|
| Full | 79.5 |
| w/o State Prediction Loss | 76.8 |
| w/o Reinforcement Loss | 78.2 |
| Imitation-only | 74.0 |
Table 7.
Ablation on AURD module (LIBERO Spatial, success rate %). Bold indicates the best result.
| Execution Strategy | Spatial Suite (%) |
|---|---|
| PI-VLA (Full with AURD) | 79.5 |
| w/o AURD (Fixed Horizon = 5) | 74.1 |
| Replace AURD with AdaHorizon | 76.4 |
| Replace AURD with ACT (Temporal Ens.) | 73.2 |
Table 8.
Extended ablation study across all LIBERO suites (success rate %). Results demonstrate that the performance gains from PI-VLA components are consistent across different task types. Bold indicates the best result.
| Variant | Spatial | Object | Goal | Long |
|---|---|---|---|---|
| PI-VLA (Full) | 79.5 | 73.4 | 73.3 | 66.6 |
| w/ Discrete-only head | 75.3 | 69.8 | 69.1 | 62.4 |
| w/ Continuous-only head | 73.9 | 68.2 | 67.5 | 60.8 |
| w/o State Prediction Loss | 76.8 | 71.0 | 70.2 | 63.5 |
| w/o Reinforcement Loss | 78.2 | 72.1 | 71.8 | 64.9 |
| w/o AURD | 74.1 | 68.5 | 68.0 | 61.2 |
Table 9.
Comprehensive ablation incrementally adding PI-VLA components. Bold indicates the best result.
| Model Configuration | Spatial Suite (%) |
|---|---|
| OpenVLA-OFT | 73.8 |
| + Dual-Heads (Discrete + Continuous) | 74.8 |
| + Unified Loss | 77.1 |
| + AURD (Full PI-VLA) | 79.5 |