Article

Multi-Slot Attention with State Guidance for Egocentric Robotic Manipulation

by
Sofanit Wubeshet Beyene
and
Ji-Hyeong Han
*
Department of Computer Science and Engineering, Seoul National University of Science and Technology, Seoul 01811, Republic of Korea
*
Author to whom correspondence should be addressed.
Electronics 2026, 15(7), 1365; https://doi.org/10.3390/electronics15071365
Submission received: 11 February 2026 / Revised: 17 March 2026 / Accepted: 19 March 2026 / Published: 25 March 2026
(This article belongs to the Section Artificial Intelligence)

Abstract

Visual perception is fundamental to robotic manipulation for recognizing objects, goals, and contextual details. Third-person cameras provide global views but can miss contact-rich interactions and require calibration. Wrist-mounted egocentric cameras reduce these limitations but introduce occlusion, motion blur, and partial observability, which complicate visuomotor learning. Furthermore, existing perception modules that rely solely on pixels or fuse imagery with proprioception as flat vectors do not explicitly model structured scene representations in dynamic egocentric views. To address these challenges, a multi-slot attention fusion encoder for egocentric manipulation is introduced. Learnable slot queries extract localized visual features from image tokens, and Feature-wise Linear Modulation (FiLM) conditions each slot on the robot’s joint states, producing a structured slot-based latent representation that adapts to viewpoint and configuration changes without requiring object labels or external camera priors. This representation is used as input to a Soft Actor–Critic (SAC) agent, which achieves a higher mean cumulative return than pixel-only CNN/DrQ and state-only baselines on a ManiSkill3 egocentric manipulation task. Probing experiments and real-camera evaluation further show that the learned representation remains stable under egocentric viewpoint shifts and partial occlusions, indicating robustness in practical manipulation settings.

1. Introduction

Learning directly from pixels has driven major advances in reinforcement learning (RL), from human-level performance in Atari [1] to continuous-control policies in simulated environments [2]. In robotics, visual RL has enabled agents to acquire manipulation skills such as grasping [3], pushing and stacking [2], and large-scale goal-directed control [4]. These achievements highlight the potential of end-to-end visuomotor learning, but also reveal that challenges remain in how visual input is represented and aligned with control, particularly in embodied robotic settings.
One fundamental reason for this misalignment is the choice of viewpoint. Many visuomotor policies rely on external third-person cameras, which provide broad workspace coverage but are inherently decoupled from the robot’s action space [5]. In such views, the same motor command can produce different pixel changes depending on where the object lies in the external frame. Moreover, maintaining alignment with the robot’s workspace requires careful camera calibration and consistent placement. This may suffice in static or simulated environments but becomes challenging in real-world systems, particularly for mobile or multi-arm platforms, where maintaining stable external camera calibration and synchronization is difficult [6,7].
By contrast, egocentric vision, where the wrist camera offers a perspective naturally aligned with robot motion [8], grounds visual consequences in the manipulator’s workspace [9]. However, egocentric views also introduce challenges including viewpoint shifts, narrow fields of view, and partial observability as objects enter and leave the frame [10,11]. These issues are difficult to solve by vision alone, since the system cannot distinguish between changes caused by self-motion and those caused by object motion. Proprioceptive state provides this missing context, enabling the encoder to interpret visual changes relative to the robot’s configuration.
However, most visual RL pipelines were developed for pixel-only benchmarks, where proprioceptive state was deliberately excluded to test the limits of image-based learning. Methods such as DrQ [12], CURL [13], and RAD learn data-efficient visual representations but do not incorporate robot states. Furthermore, sequence models such as the Decision Transformer [14] process states and actions as separate tokens but still rely on globally pooled image embeddings rather than conditioning visual representations on the robot’s configuration. When state information is included, as in world-model approaches like Dreamer [15,16,17], it is generally fused with visual latents through simple concatenation, without structured cross-modal interaction. Related efforts in object-centric RL [18] highlight the importance of compositional representations but again lack explicit coupling between proprioception and attention. Insights from human sensorimotor coordination suggest a path forward and reflect the principle of embodied perception, in which perception is shaped by the agent’s body configuration rather than processed independently: during reaching, the human gaze follows and adapts to hand motion, prioritizing visual features relevant to the current motor plan [19,20].
This perception–action alignment suggests that visuomotor systems should emphasize object regions and interactions that matter for the current control strategy rather than processing the entire visual scene uniformly. Following this principle, we propose a multi-slot attention architecture that couples perception and proprioception so that the robot’s body state shapes visual representation, and the resulting visual interpretation directly informs action selection, forming a closed perception–action loop.
The module first learns a set of spatially localized query vectors (slots) [21,22] that attend to image regions corresponding to task-relevant entities through cross-attention. To stabilize slot–object binding under egocentric viewpoint shifts, the visual slot encoder is initialized with a supervised perception pre-training phase. During this phase, the simulator provides ground-truth object poses, which are projected into 2D pixel coordinates to generate supervision targets. The encoder receives only RGB observations as input. The projected 2D coordinates are used solely as training targets to optimize slot attention and position prediction so that each slot aligns with a corresponding object region in the image. The policy and value networks are not trained during this phase. During subsequent Soft Actor–Critic (SAC) [23] training, the encoder receives RGB observations and the robot’s proprioceptive state. The encoder then computes slot attention over image tokens and applies Feature-wise Linear Modulation (FiLM) layers [24] to condition each slot on the robot’s proprioceptive state, producing modulated features. These modulated features serve as inputs to the policy and value networks. Gradients from the critic loss update the trainable encoder components, while the visual backbone remains frozen. This design integrates perception with control context and replaces flat pixel fusion with a slot-based representation. Slot attention and FiLM modulation maintain alignment between slots and object regions under egocentric viewpoint changes, improving robustness to viewpoint shifts and partial occlusions.
The proposed method was evaluated in the ManiSkill simulator [25] using an egocentric manipulation setup and compared against vanilla SAC (state-only), SAC with CNN, and SAC with DrQ-style image augmentation.
The main contributions are as follows:
  • A state-guided multi-slot attention encoder that produces slot-based visual features from egocentric RGB observations.
  • Proprioceptive conditioning of slot features via Feature-wise Linear Modulation (FiLM).
  • An SAC training formulation in which slot-based visual features are modulated by proprioception and updated through actor–critic losses while keeping the visual backbone frozen.
  • Empirical evaluation against SAC-based visual baselines in an egocentric ManiSkill manipulation setting.

2. Related Works

Recent visual RL research has focused on stabilizing training and improving sample efficiency in pixel-only domains. Data augmentation methods such as RAD and DrQ-v2 [12,26] improve stability by injecting invariances into pixel-level learning, while contrastive approaches such as CURL [13] shape visual encoders by aligning augmented views. Latent dynamics models, including SLAC [27] and the Dreamer family [15,16,17], learn compact latent spaces to model temporal dynamics, with Dreamer in particular using latent imagination rollouts for long-horizon control. These works established strong benchmarks for image-based reinforcement learning, but their encoders are typically trained from pixel observations alone and are not explicitly conditioned on the robot’s proprioceptive configuration. In many vision-based control frameworks, perception is processed through flattened convolutional features or globally pooled visual tokens, and any low-dimensional state is concatenated only in later MLP layers of the policy or value network, leaving the visual encoder itself blind to the robot’s configuration [2,4,13].
Before deep RL, alignment between vision and control was achieved through calibration-based pipelines. The hand–eye calibration method of [28] remains canonical, computing the rigid transformation between a robot’s end-effector and its camera. Visual servoing [29,30] extended this concept by iteratively adjusting actions to minimize image-space error. Although effective in structured settings, these methods rely on engineered geometric features and brittle calibration procedures, which limit scalability to dynamic, visually rich environments.
Learning-based visuomotor systems seek to overcome these constraints. QT-Opt [4] demonstrated large-scale off-policy learning for robotic grasping using multiple fixed third-person cameras. Transporter Networks [31] introduced spatial transport maps for pick-and-place actions but assumed static overhead views, avoiding viewpoint-shift challenges inherent to egocentric vision. More recent transformer-based architectures, including RT-1 [32], Gato [33], Decision Transformer [14], and PerAct [34], process visual observations as token sequences and condition policies on pooled embeddings. While these models scale effectively with data, they still rely on globally pooled visual representations or late-fused proprioception, rather than using it to structure the visual encoding around the robot’s embodiment. Related closed-loop manipulation frameworks further highlight the need for structured perception–control coupling in cluttered scenes, often introducing hierarchical controllers or auxiliary imitation signals to stabilize visuomotor learning [35].
To move beyond monolithic feature encoders, recent work in representation learning has explored structured visual representations, including object-centric approaches. Methods such as MONet [36], IODINE [37], and SCALOR [38] decompose images into latent slots that capture entities and spatial structure. However, these models are trained with reconstruction or prediction losses under primarily static or slowly varying camera assumptions. Relational approaches such as VIN [39], C-SWM [40], and Relational RL [18] model interactions between pre-segmented objects but do not address the problem of extracting structured representations directly from pixels under egocentric viewpoint shifts. Slot Attention [21] provides a flexible slot-based representation mechanism that can be integrated into downstream learning pipelines without requiring a dedicated reconstruction decoder. The proposed method adopts this mechanism and introduces state-guided slot attention that conditions visual feature organization on the proprioceptive state, enabling consistent perception under self-motion and dynamic views for object manipulation tasks.

3. Problem Formulation

We formulate vision-based robotic manipulation as a partially observable Markov decision process (POMDP), defined by the tuple $\mathcal{M} = (\mathcal{S}, \mathcal{O}, \mathcal{A}, T, r, \gamma)$, where $\mathcal{S}$ is the latent physical state space (joint configuration, object poses), $\mathcal{O}$ is the observation space, $\mathcal{A}$ is the continuous action space, $T(s_{t+1} \mid s_t, a_t)$ is the transition dynamics, $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function, and $\gamma \in [0, 1)$ is the discount factor.
At each timestep $t$, the agent receives a partial observation $o_t = (o_t^{\text{img}}, o_t^{\text{state}}) \sim \Omega(\cdot \mid s_t)$, generated from the latent state $s_t$ through an observation function $\Omega$, where $o_t^{\text{img}} \in \mathbb{R}^{H \times W \times 3}$ is an RGB image captured from an end-effector-mounted egocentric camera, and $o_t^{\text{state}} \in \mathbb{R}^d$ is a low-dimensional proprioceptive state including joint angles, velocities, and the end-effector pose. The agent samples an action $a_t \sim \pi(a \mid o_t)$, receives a scalar reward $r_t = r(s_t, a_t)$, and the process evolves according to the transition dynamics $s_{t+1} \sim T(\cdot \mid s_t, a_t)$. The learning objective is to find a policy that maximizes the expected discounted return:
$$\pi^* = \arg\max_{\pi}\; \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^t\, r(s_t, a_t) \right],$$
where $\pi^*$ is the optimal policy and the expectation is taken over trajectories induced by the policy $\pi$ under the environment dynamics. Since the egocentric camera provides only a partial and viewpoint-dependent observation of the workspace, the environment is inherently non-stationary from the robot’s perspective. The same motor command may produce different pixel changes depending on viewpoint, occlusion, or arm configuration. This coupling between motion and perception motivates a state-conditioned perception model, in which the encoder explicitly integrates proprioceptive state to interpret visual input relative to the robot’s configuration.
In the following sections, we introduce a multi-slot attention encoder modulated by proprioceptive state, enabling visual features to remain consistent under egocentric motion and occlusion. We then describe how this encoder is incorporated into Soft Actor–Critic to learn control-aligned representations for egocentric manipulation.

4. Proposed Method

The proposed method learns a control-oriented visual representation for egocentric robotic manipulation in which visual features extracted from the wrist-mounted camera are modulated by the robot’s proprioceptive state. The architecture consists of a visual encoder that extracts spatial features from the image observation, a slot attention module that produces structured visual tokens, and a FiLM modulation mechanism that adapts these features according to the robot’s current configuration. The resulting state-modulated representation is used as the observation input to the Soft Actor–Critic (SAC) policy. This design forms a bidirectional perception–action loop where the robot state modulates visual features and the policy acts on the resulting representation.
Formally, at each timestep $t$, the agent receives an egocentric RGB observation $o_t^{\text{img}} \in \mathbb{R}^{H \times W \times 3}$ and a proprioceptive state vector $o_t^{\text{state}} \in \mathbb{R}^d$. After standard preprocessing, we denote the image input as $I_t$. As shown in Figure 1, $I_t$ is encoded by a ResNet-18 backbone followed by a $1 \times 1$ projection into a fixed channel dimension, producing a spatial feature map that is flattened into a sequence of visual tokens with positional encoding. The proprioceptive state is projected to a feature vector through a fully connected layer and used later for FiLM conditioning.
A small set of learnable slot queries attends to the visual tokens through multihead cross-attention and produces K spatially localized slot embeddings. If an additional goal token is available during RL, it is mixed with the pooled slot features using a lightweight two-way softmax gate. FiLM conditioning is then applied to the slot features to incorporate state information, producing the final latent representation passed into SAC. During RL, the ResNet backbone and 1 × 1 conv remain frozen for stability, and gradients flow only through the trainable perception modules (beige background in Figure 1). This configuration allows the trainable fusion layers to adapt to the task reward, while the frozen backbone preserves low-level visual features learned during pre-training.

4.1. Learning Slot-Based Representations from Egocentric Vision

We begin by constructing a structured perceptual representation from egocentric images using a slot-based attention mechanism. Each slot specializes in a distinct scene region, providing the policy with separate visual features for each task-relevant object rather than a single global image embedding. The slot representation is trained in a supervised manner using 2D projections of task-relevant 3D keypoints, allowing the model to associate each slot with a consistent visual concept.
  • Visual geometry and observation model: The hand-mounted camera provides egocentric visual observations. Slot supervision labels are generated by projecting simulator object positions onto the image plane using a standard pinhole camera model:
    $$\left( u_t^{(k)},\ v_t^{(k)} \right) = \Pi\!\left( P_t\, X_t^{(k)} \right),$$
    where $P_t = K [R_t \mid t_t]$ combines the fixed camera intrinsics $K$ with the time-varying extrinsics $R_t, t_t$, and $\Pi$ denotes homogeneous division with visibility masking. The resulting 2D coordinates serve as annotation-free supervision targets for slot position regression, requiring no segmentation masks or manual labels.
  • Image encoder: The RGB image $I_t$ is first processed by a truncated ResNet-18 backbone, preserving the feature map up to the third convolutional block. This produces a spatial tensor $F_t = h_\psi(I_t) \in \mathbb{R}^{C \times H_f \times W_f}$, which is linearly projected to a feature space $\mathbb{R}^D$ through a $1 \times 1$ convolution: $\tilde{F}_t = W_{\text{proj}} F_t$. Flattening along the spatial dimensions yields a sequence of $S = H_f W_f$ image tokens $X_t^{\text{img}} = \mathrm{flatten}(\tilde{F}_t) \in \mathbb{R}^{S \times D}$. A learned 2D positional encoding $P_{2D} \in \mathbb{R}^{S \times D}$, constructed from row and column embeddings, is added to preserve spatial locality: $X_t^{\text{img}} \leftarrow X_t^{\text{img}} + \lambda P_{2D}$, where $\lambda = 0.05$ is empirically tuned to prevent the positional bias from dominating the learned features.
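As a concrete illustration, the tokenization step described above can be sketched in NumPy. The channel, grid, and embedding sizes below are illustrative assumptions rather than the paper’s exact configuration, and the $1 \times 1$ convolution is written as a per-location linear map:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed shapes (hypothetical): a ResNet block-3 style feature map.
C, Hf, Wf, D = 256, 16, 16, 128
lam = 0.05  # positional-encoding weight lambda from the text

F = rng.standard_normal((C, Hf, Wf))         # backbone feature map F_t
W_proj = rng.standard_normal((D, C)) * 0.02  # 1x1 conv == per-pixel linear map

# 1x1 projection: apply W_proj at every spatial location.
F_proj = np.einsum("dc,chw->dhw", W_proj, F)

# Flatten to S = Hf * Wf tokens of dimension D.
X = F_proj.reshape(D, Hf * Wf).T             # (S, D)

# Learned 2D positional encoding built from row and column embeddings.
row_emb = rng.standard_normal((Hf, D))
col_emb = rng.standard_normal((Wf, D))
P2d = (row_emb[:, None, :] + col_emb[None, :, :]).reshape(Hf * Wf, D)

X = X + lam * P2d                            # scaled positional bias
print(X.shape)  # (256, 128)
```

The small $\lambda$ keeps the positional term a gentle bias on top of the content features rather than the dominant signal.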

4.1.1. Learnable Slot Queries

To extract structured visual features, the model uses a set of learnable slot queries. The token sequence $X_t^{\text{img}} \in \mathbb{R}^{S \times D}$ is obtained by flattening the spatial feature map produced by the visual encoder, where $S = H_f W_f$ is the number of spatial locations. Each token corresponds to a local region in the image. Instead of collapsing these tokens into a single global embedding, the model maintains a set of $N_s$ learnable slot vectors:
$$Q_{\text{slot}} = \left[\, q^{(1)},\ q^{(2)},\ \ldots,\ q^{(N_s)} \,\right] \in \mathbb{R}^{N_s \times D},$$
which serve as queries in a multihead cross-attention module. Each slot query competes to attend to different regions of the image token grid, and auxiliary supervision during pre-training encourages each slot to specialize in a distinct task-relevant object. During RL, when a goal pixel $(u_t^{\text{goal}}, v_t^{\text{goal}})$ is available, an additional goal query is constructed by sampling its corresponding feature vector from the spatial feature map $\tilde{F}_t$. This reuses the frozen encoder’s spatial features to represent the goal location without introducing additional learned parameters.

4.1.2. Cross-Attention

The slot queries use cross-attention to attend to the visual tokens. The query set is defined as:
$$Q_t = \begin{cases} Q_{\text{slot}}, & \text{pre-training}, \\ \left[\, Q_{\text{slot}};\ Q_t^{\text{goal}} \,\right], & \text{RL}. \end{cases}$$
Given the sequence of image tokens X t img , cross-attention is computed using multihead attention
$$Y_t = \mathrm{MHA}\!\left( Q_t,\ K = X_t^{\text{img}},\ V = X_t^{\text{img}} \right).$$
Each query attends to the image tokens by computing similarity scores in the feature space and retrieving a weighted combination of spatial features. The resulting attention matrix $A_t \in [0, 1]^{|Q_t| \times S}$ satisfies $\sum_j A_t[i, j] = 1$ for each query $i$, representing a soft spatial assignment over the image tokens. Each output $y_t^{(i)} \in \mathbb{R}^D$ represents the visual features aggregated for one object location.
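The paper uses multihead cross-attention; the single-head NumPy sketch below (with illustrative sizes) makes the soft-assignment property explicit: each query’s attention row over the $S$ tokens sums to one.

```python
import numpy as np

rng = np.random.default_rng(1)
S, D, Ns = 256, 128, 2    # token count, feature dim, slot count (illustrative)

X = rng.standard_normal((S, D))        # image tokens X_t
Q_slot = rng.standard_normal((Ns, D))  # learnable slot queries

def cross_attention(Q, X):
    """Single-head cross-attention: queries attend over image tokens.

    Returns slot outputs Y (|Q| x D) and the attention matrix A whose
    rows are soft spatial assignments over the S tokens.
    """
    scores = Q @ X.T / np.sqrt(Q.shape[1])       # (|Q|, S) scaled similarity
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=1, keepdims=True)            # each row sums to 1
    Y = A @ X                                    # weighted feature retrieval
    return Y, A

Y, A = cross_attention(Q_slot, X)
print(Y.shape, A.shape)  # (2, 128) (2, 256)
```

Each row of `A` is the soft spatial assignment for one slot, and `Y` collects the features that each slot has aggregated.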

4.1.3. Permutation-Invariant Aggregation and Gating

The perception module is trained in two stages: pre-training on geometric supervision followed by RL fine-tuning through critic gradients. During pre-training, supervision uses a best-of-N matching scheme that selects the minimum-cost assignment between slots and objects at each step, allowing slots to reassign based on currently visible objects rather than maintaining a fixed slot-to-object mapping. This design improves pre-training stability under occlusion, where a fixed assignment would leave a slot without a valid supervision signal when its assigned object is out of view.
During RL fine-tuning, slot attention weights are updated by critic gradients to emphasize task-relevant features, causing attention patterns to drift from pre-training assignments and making slot swaps inevitable. Mean pooling is used as the slot aggregation function, producing a permutation-invariant representation that remains stable regardless of slot ordering:
$$z_t^{\text{slot}} = \frac{1}{N_s} \sum_{i=1}^{N_s} y_t^{(i)}.$$
When a goal query is available, the aggregated slot feature $z_t^{\text{slot}}$ and the goal feature $y_t^{\text{goal}}$ are combined through a learned soft gating mechanism:
$$z_t = \alpha_t^{(s)}\, z_t^{\text{slot}} + \alpha_t^{(g)}\, y_t^{\text{goal}}, \qquad \left[ \alpha_t^{(s)},\ \alpha_t^{(g)} \right] = \mathrm{softmax}\!\left( \left[\, s,\ g + b(\nu_t^{\text{goal}}) \,\right] \right),$$
where $s, g$ are learnable logits, and the bias term $b(\nu_t^{\text{goal}})$ assigns a large negative value when the goal lies outside the camera’s field of view. Softmax gating keeps the contributions normalized and suppresses the goal feature when it is not visible.
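A minimal NumPy sketch of this two-way gate, assuming for illustration that both logits are zero and that $b(\nu_t^{\text{goal}})$ is a large negative constant whenever the goal is off-screen:

```python
import numpy as np

rng = np.random.default_rng(2)
D = 8
z_slot = rng.standard_normal(D)   # pooled slot feature z_t^slot
y_goal = rng.standard_normal(D)   # goal feature y_t^goal

def gate(z_slot, y_goal, goal_visible, s=0.0, g=0.0, neg_bias=-1e4):
    """Two-way softmax gate; suppresses the goal branch when off-screen.

    s, g stand in for the learnable logits; neg_bias plays the role of
    b(nu_goal) when the goal is outside the field of view.
    """
    logits = np.array([s, g + (0.0 if goal_visible else neg_bias)])
    a = np.exp(logits - logits.max())
    a /= a.sum()                                  # normalized weights
    return a[0] * z_slot + a[1] * y_goal, a

z_vis, a_vis = gate(z_slot, y_goal, goal_visible=True)
z_occ, a_occ = gate(z_slot, y_goal, goal_visible=False)
print(a_vis, a_occ)  # visible: [0.5 0.5]; occluded: [1. 0.]
```

With the goal occluded, the gate routes the entire weight to the slot feature, so the fused output degrades gracefully to $z_t^{\text{slot}}$.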

4.2. State-Conditioned Modulation

In egocentric manipulation, slot features encode spatially structured visual information in image space, but the relevance of those locations for control depends on the robot’s current configuration. The same visual feature can therefore have different control significance depending on the relative pose between the arm and the object. The slot representation z t must therefore be interpreted in the context of the robot’s current state before being passed to the policy. Rather than concatenating o t state with z t , which treats state as an additional policy input rather than a signal that shapes visual interpretation, Feature-wise Linear Modulation (FiLM) conditions the visual features directly on the proprioceptive state. This allows the robot state to gate the visual features before policy inference so that control decisions operate on state-aware perceptual representations rather than raw visual embeddings.
Given the visual feature $z_t \in \mathbb{R}^D$ and the proprioceptive observation $o_t^{\text{state}} \in \mathbb{R}^{d_s}$, FiLM computes channel-wise scaling and shift parameters from the proprioceptive state and applies them to the slot features:
$$\hat{o}_t = W_s\, o_t^{\text{state}} + b_s, \qquad \gamma_t = \tanh\!\left( W_\gamma \hat{o}_t + b_\gamma \right), \qquad \beta_t = W_\beta \hat{o}_t + b_\beta,$$
$$\tilde{z}_t = (1 + \gamma_t) \odot z_t + \beta_t,$$
where ⊙ denotes element-wise multiplication.
The tanh activation bounds $\gamma_t \in [-1, 1]$, so the effective scaling factor satisfies $1 + \gamma_t \in [0, 2]$. This bounded modulation allows robot states to amplify or attenuate individual visual channels while preserving the sign and relative structure of the pre-trained slot representation. As a result, proprioceptive conditioning can adjust the relative importance of visual features according to the current arm–object configuration without distorting the learned slot representation.
The multiplicative term γ t performs channel-wise scaling, while β t applies a channel-wise shift, resulting in a feature-wise affine modulation of the slot representation. An ablation comparing FiLM variants is provided in Section 7.
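The FiLM computation can be sketched in a few lines of NumPy; the weight shapes and magnitudes below are illustrative assumptions, and the assertion checks the bounded-scaling property discussed above:

```python
import numpy as np

rng = np.random.default_rng(3)
D, ds = 8, 4                        # feature and state dims (illustrative)
z = rng.standard_normal(D)          # slot feature z_t
state = rng.standard_normal(ds)     # proprioceptive state o_t^state

# Linear maps standing in for the learned FiLM parameters.
W_s, b_s = rng.standard_normal((ds, ds)) * 0.1, np.zeros(ds)
W_g, b_g = rng.standard_normal((D, ds)) * 0.1, np.zeros(D)
W_b, b_b = rng.standard_normal((D, ds)) * 0.1, np.zeros(D)

o_hat = W_s @ state + b_s
gamma = np.tanh(W_g @ o_hat + b_g)  # bounded in [-1, 1]
beta = W_b @ o_hat + b_b
z_mod = (1.0 + gamma) * z + beta    # effective channel scale in [0, 2]

assert np.all(1.0 + gamma >= 0.0) and np.all(1.0 + gamma <= 2.0)
print(z_mod.shape)  # (8,)
```

Because the scale stays in $[0, 2]$, a channel can be muted or doubled by the state but never sign-flipped, which is the stability property the text argues for.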

4.3. Auxiliary Supervision and Pre-Training

Before policy learning, the perception module is pre-trained to establish spatial correspondence between egocentric image observations and physical object locations. This stage provides dense pixel-level supervision that would be difficult to derive from sparse task reward alone. During pre-training, the agent observes simulated rollouts generated by a random policy. For each timestep t, we collect the following:
$$\left\{\, I_t,\ o_t^{\text{state}},\ I_{t+1},\ o_{t+1}^{\text{state}},\ u_t,\ v_t,\ u_{t+1},\ v_{t+1} \,\right\}.$$
Here $I_t$ and $I_{t+1}$ are egocentric RGB images, and $o_t^{\text{state}}$ and $o_{t+1}^{\text{state}}$ are the corresponding proprioceptive observations. The vectors $u_t, v_t \in \mathbb{R}^{N_o}$ contain the projected pixel coordinates of each object at time $t$, obtained from the calibrated camera model $P_t = K [R_t \mid t_t]$.
All supervision labels are derived from the simulator state via camera projection. The encoder is optimized using three auxiliary losses targeting slot position, object visibility, and attention alignment.
  • Position regression: Each attention slot output $y_t^{(i)}$ predicts a pixel coordinate $(\hat{u}_t^{(i)}, \hat{v}_t^{(i)})$ through a small regression head. Because slots are exchangeable, we adopt a best-of-N matching scheme that aligns predicted coordinates with object labels via the minimal assignment:
    $$\mathcal{L}_{\text{pos}} = \min_{\sigma \in S_{N_s, N_o}} \frac{1}{N_o} \sum_{k=1}^{N_o} m_t^{(k)} \left\| \left( \hat{u}_t^{(\sigma(k))}, \hat{v}_t^{(\sigma(k))} \right) - \left( u_t^{(k)}, v_t^{(k)} \right) \right\|_2^2.$$
    Here $\sigma$ denotes a permutation (one-to-one assignment) from objects to slots, and $\sigma(k)$ selects the slot matched to object $k$ under that assignment. $N_s$ is the number of slots, and $N_o$ is the number of tracked objects (or keypoints) for which pixel labels are available. The set $S_{N_s, N_o}$ contains all assignments $\sigma: \{1, \ldots, N_o\} \to \{1, \ldots, N_s\}$. In our implementation, we set $N_s = N_o = 2$, so the minimization reduces to evaluating the two possible matchings (identity vs. swapped) and taking the lower error. The mask $m_t^{(k)} \in \{0, 1\}$ indicates whether object $k$ is visible in frame $t$; it excludes objects that are not visible, and the loss is normalized by the number of valid objects in the batch. This encourages each slot to specialize consistently toward one object without enforcing a fixed ordering.
  • Visibility prediction: A small visibility head takes the attention-derived features and outputs a probability $\hat{m}_t^{(k)} \in (0, 1)$ for every object $k$, indicating whether its projected pixel location lies on-screen. The corresponding supervision signal is the geometric validity mask $m_t^{(k)} \in \{0, 1\}$, derived directly from camera projection. We optimize a per-object binary cross-entropy (BCE) loss:
    $$\mathcal{L}_{\text{vis}} = -\frac{1}{N_o} \sum_{k=1}^{N_o} \left[ m_t^{(k)} \log \hat{m}_t^{(k)} + \left( 1 - m_t^{(k)} \right) \log\!\left( 1 - \hat{m}_t^{(k)} \right) \right].$$
This loss encourages the encoder to recognize when an object’s projected location is valid and to avoid allocating attention to regions that fall outside the field of view.
  • Attention alignment: Attention alignment supervises the attention mechanism to place probability mass at each object’s projected token location. Given the projected pixel location of object $k$, $\mathrm{idx}(u_t^{(k)}, v_t^{(k)})$ gives the corresponding token index on the $H_f \times W_f$ feature grid. For each object, the attention probability summed across all slots at that token index is:
    $$p_t^{(k)} = \sum_{i=1}^{N_s} A_t\!\left[ i,\ \mathrm{idx}(u_t^{(k)}, v_t^{(k)}) \right],$$
    and the alignment loss is defined as:
    $$\mathcal{L}_{\text{attn}} = -\frac{1}{N_o} \sum_{k=1}^{N_o} m_t^{(k)} \log p_t^{(k)}.$$
    In this formulation, object-level supervision is applied to the sum of slot-wise attention weights rather than a specific slot.
  • The total pre-training loss is:
    $$\mathcal{L}_{\text{pre}} = \lambda_{\text{pos}} \mathcal{L}_{\text{pos}} + \lambda_{\text{vis}} \mathcal{L}_{\text{vis}} + \lambda_{\text{attn}} \mathcal{L}_{\text{attn}}.$$
    During this stage, all components of the perception module, including the ResNet-18 backbone, slot attention, FiLM layers, and auxiliary prediction heads, are trained jointly. After pre-training, gradients to the backbone are disabled, and the encoder produces state-conditioned embeddings $\tilde{z}_t$ that are used as input to SAC.
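To make the best-of-N matching concrete, the following NumPy sketch evaluates the position loss for the paper’s $N_s = N_o = 2$ case. The coordinates are made-up values chosen so that the swapped matching wins:

```python
import numpy as np

# Best-of-N matching for Ns = No = 2: evaluate the identity and swapped
# slot-to-object assignments and keep the lower visibility-masked error.
pred = np.array([[40.0, 60.0],     # slot 0 predicted (u, v)
                 [10.0, 20.0]])    # slot 1 predicted (u, v)
target = np.array([[12.0, 18.0],   # object 0 projected (u, v)
                   [41.0, 59.0]])  # object 1 projected (u, v)
visible = np.array([1.0, 1.0])     # mask m_t^(k): both objects on-screen

def pos_loss(pred, target, visible):
    costs = []
    for sigma in ([0, 1], [1, 0]):                       # the two matchings
        err = ((pred[sigma] - target) ** 2).sum(axis=1)  # squared pixel error
        costs.append((visible * err).sum() / max(visible.sum(), 1.0))
    return min(costs)

print(pos_loss(pred, target, visible))  # 5.0 (swapped matching wins)
```

Here slot 1 sits near object 0 and slot 0 near object 1, so the swapped assignment yields error $5.0$ while the identity assignment yields over $2500$; the `min` selects the former, exactly as the loss prescribes.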

4.4. Integration with Reinforcement Learning

After pre-training, the fused representation z ˜ t is used as the input to the SAC actor and critic networks. During RL optimization, gradients from the critic losses propagate through the trainable encoder layers, allowing the visual representation to adapt to the control objective. At each timestep t, the agent receives a partial observation
$$o_t = \left( I_t,\ o_t^{\text{state}} \right),$$
where $I_t$ is the egocentric RGB image, and $o_t^{\text{state}}$ is the proprioceptive vector containing joint angles, velocities, and gripper pose. These are mapped by the pre-trained fusion encoder $f_\psi$ into a structured latent representation
$$\tilde{z}_t = f_\psi\!\left( I_t,\ o_t^{\text{state}},\ g_t \right),$$
where g t denotes a goal input when provided by the environment and is omitted otherwise. The encoder applies slot cross-attention to the egocentric image and modulates the resulting features through FiLM using the proprioceptive state.
This latent representation serves as the sole input to the SAC components. The policy is a stochastic actor $\pi_\theta(a_t \mid \tilde{z}_t)$ that outputs a Gaussian distribution over continuous actions $a_t \in \mathbb{R}^n$. Two critic networks $Q_{\phi_1}(\tilde{z}_t, a_t)$ and $Q_{\phi_2}(\tilde{z}_t, a_t)$ estimate soft Q-values, and a separate value network $V_\eta(\tilde{z}_t)$ provides the bootstrap target used for temporal-difference updates through a slowly updated target network $V_{\bar{\eta}}(\tilde{z}_t)$. Transitions $\{o_t, a_t, r_t, o_{t+1}, \text{done}_t\}$ sampled from the replay buffer $\mathcal{D}$ are used for the SAC updates. The reward function is defined in Section 4.5.
The critic target is computed as the soft Bellman backup:
$$y_t = r_t + \gamma\, (1 - \text{done}_t)\, V_{\bar{\eta}}(\tilde{z}_{t+1}),$$
where $\gamma \in [0, 1)$ is the discount factor, and $V_{\bar{\eta}}$ denotes a slowly updated target network. The critics minimize the temporal-difference loss
$$\mathcal{L}_Q(\phi_i) = \tfrac{1}{2}\, \mathbb{E}_{\mathcal{D}} \left[ \left( Q_{\phi_i}(\tilde{z}_t, a_t) - y_t \right)^2 \right], \qquad i \in \{1, 2\}.$$
For the value and policy objectives, actions are re-sampled from the current actor, $a_t \sim \pi_\theta(\cdot \mid \tilde{z}_t)$. The value network is then trained to regress toward the expected critic output under the current policy:
$$\mathcal{L}_V(\eta) = \tfrac{1}{2}\, \mathbb{E}_{\mathcal{D}} \left[ \left( V_\eta(\tilde{z}_t) - \left( \min_i Q_{\phi_i}(\tilde{z}_t, a_t) - \alpha \log \pi_\theta(a_t \mid \tilde{z}_t) \right) \right)^2 \right],$$
where α > 0 is the entropy temperature that controls the trade-off between reward maximization and policy stochasticity. The policy parameters are optimized by minimizing the entropy-regularized objective
$$\mathcal{L}_\pi(\theta) = \mathbb{E}_{\mathcal{D}} \left[ \alpha \log \pi_\theta(a_t \mid \tilde{z}_t) - \min_i Q_{\phi_i}(\tilde{z}_t, a_t) \right],$$
which encourages the actor to select actions with both high expected value and high entropy. Finally, the target parameters are updated via Polyak averaging:
$$\bar{\eta} \leftarrow \tau \eta + (1 - \tau)\, \bar{\eta},$$
with $\tau \in (0, 1)$ typically set to $0.005$. All components are optimized with the Adam optimizer using the learning rates shown in Table A1.
Algorithm 1 summarizes the overall training pipeline. At each iteration, a minibatch of transitions is sampled from the replay buffer and encoded into $\tilde{z}_t$ and $\tilde{z}_{t+1}$ by the fusion encoder. The soft Bellman backup (5) is computed using the target value network, and the twin critics are updated by minimizing the TD loss (6). The reparameterized actions $a_t \sim \pi_\theta(\cdot \mid \tilde{z}_t)$ are then used to update the value network (7) and the actor (8). Finally, the target value parameters are updated by Polyak averaging.
Algorithm 1 State-guided slot attention with SAC.
Require: Replay buffer $\mathcal{D}$, encoder $f_\psi$, actor $\pi_\theta$, critics $Q_{\phi_1}, Q_{\phi_2}$, value $V_\eta$, target value $V_{\bar{\eta}}$, batch size $B$, discount $\gamma$
  1: Initialize network parameters $\psi, \theta, \phi_1, \phi_2, \eta, \bar{\eta}$
  2: for each training iteration do
  3:     Sample batch $\{(o_t, a_t, r_t, o_{t+1}, \text{done}_t)\}_{i=1}^{B} \sim \mathcal{D}$
  4:     Encode $\tilde{z}_t = f_\psi(o_t)$ and $\tilde{z}_{t+1} = f_\psi(o_{t+1})$
  5:     Compute the critic target, Equation (5)
  6:     Update critics $Q_{\phi_1}, Q_{\phi_2}$ by minimizing Equation (6)
  7:     Backpropagate critic gradients into the trainable encoder parameters $\psi$
  8:     Sample reparameterized action $a_t \sim \pi_\theta(\cdot \mid \tilde{z}_t)$
  9:     Update value network $V_\eta$ by minimizing Equation (7) using $a_t$
10:     Update actor $\pi_\theta$ by minimizing Equation (8) using $a_t$
11:     Update target value network: Equation (9)
12: end for
The ResNet-18 backbone and 1 × 1 projection remain frozen throughout RL, while the slot attention, FiLM, and goal gate are updated only through critic gradients. Actor and value updates treat $\tilde{z}_t$ as fixed, so the visual representation is refined toward task-relevant features without destabilizing the policy.
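This gradient routing is typically implemented with a stop-gradient on the latent. A minimal PyTorch sketch under assumed stand-in modules: the linear layers below are placeholders for the actual fusion encoder, critic, and actor, not the paper's architecture.

```python
import torch

torch.manual_seed(0)
encoder = torch.nn.Linear(8, 4)   # stand-in for the slot/FiLM fusion encoder
critic = torch.nn.Linear(4, 1)    # stand-in for one critic head
actor = torch.nn.Linear(4, 2)     # stand-in for the actor

obs = torch.randn(16, 8)
z = encoder(obs)

# Critic loss backpropagates into the encoder parameters.
critic_loss = critic(z).pow(2).mean()
critic_loss.backward()
grad_after_critic = encoder.weight.grad.clone()

# Actor loss sees a detached latent: no gradient reaches the encoder.
actor_loss = actor(z.detach()).pow(2).mean()
actor_loss.backward()
# encoder.weight.grad is unchanged by the actor backward pass
```

The same `.detach()` pattern applies to the value-network update, so only the critic loss shapes the representation.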

4.5. Reward Function

In egocentric manipulation, the wrist camera moves with the arm, and the distance between the end-effector and the object changes continuously during interaction. Effective control therefore requires reward signals that guide large movements across the workspace while also supporting precise alignment near the target. To capture these different interaction phases, the reward uses a multiscale shaping design that combines exponential and inverse-distance terms. These complementary components produce reward sensitivity across both large movements in the workspace and small positional adjustments near the target. The formulation is implemented through a piecewise design for reaching and placement with gated components.
  • Reaching:
    $$ r_{\mathrm{reach}} = \begin{cases} \exp\left( -\alpha \, \lVert x_{\mathrm{obj}} - x_{\mathrm{TCP}} \rVert \right), & d > \delta_1 \\ \dfrac{1}{\lVert x_{\mathrm{obj}} - x_{\mathrm{TCP}} \rVert + \epsilon}, & \text{otherwise} \end{cases} $$
    where $d = \lVert x_{\mathrm{obj}} - x_{\mathrm{TCP}} \rVert$.
    The exponential term produces a smooth distance-based signal for global motion toward the object, while the inverse-distance term increases sensitivity when the end-effector is close to the object to support precise alignment.
  • Grasping: A binary reward marks the transition from reaching to transport upon successful grasp.
  • Placing:
    $$ r_{\mathrm{place}} = \begin{cases} \exp\left( -\alpha \, \lVert x_{\mathrm{obj}} - x_{\mathrm{goal}} \rVert \right), & d > \delta_2 \\ \dfrac{1}{\lVert x_{\mathrm{obj}} - x_{\mathrm{goal}} \rVert + \epsilon}, & \text{otherwise} \end{cases} $$
    where $d = \lVert x_{\mathrm{obj}} - x_{\mathrm{goal}} \rVert$.
  • Post-grasp guidance: Once the object is grasped, an additional shaping term encourages movement toward the placement objective during transport. The term is defined as
    $$ r_{\mathrm{pg}} = \exp(-d_{\mathrm{goal}}) \cdot \mathbb{1}[\mathrm{grasped}], $$
    where $d_{\mathrm{goal}} = \lVert x_{\mathrm{obj}} - x_{\mathrm{goal}} \rVert$ and the indicator $\mathbb{1}[\mathrm{grasped}]$ makes the term active only when the object is grasped. The exponential factor reduces the influence of this term as the object approaches the placement region so that the reward transitions smoothly to the placement objective.
  • Stability: Once placement has been achieved, this stability term discourages residual joint motion so that the robot settles into a stable configuration before termination. It is defined as
    $$ r_{\mathrm{static}} = 1 - \tanh\left( \varpi \cdot \lVert \dot{q}_{\mathrm{robot}} \rVert \right), $$
    where $\varpi$ controls the sensitivity of the stabilization term.
The switching thresholds $\delta_1 = \delta_2 = 0.025$ correspond to the environment-defined proximity threshold for the object, and $\epsilon = 10^{-5}$ is a small constant that avoids division by zero. The placement and stability components activate only when the object is grasped and placed, respectively, enforcing sequential phase ordering. The contribution of each component across training is analyzed in Section 6.3, with further ablations provided in Section 7.2.
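A minimal NumPy sketch of these shaping terms, assuming the parameter values from Table A1 (α = 10, ϖ = 5, δ₁ = δ₂ = 0.025, ε = 10⁻⁵); the grasp indicator is passed in as a boolean, and the binary grasp bonus and phase gating of the full environment reward are omitted here.

```python
import numpy as np

def reach_reward(x_obj, x_tcp, alpha=10.0, delta=0.025, eps=1e-5):
    """Piecewise reaching term: exponential shaping far from the object,
    inverse-distance shaping inside the proximity threshold."""
    d = np.linalg.norm(x_obj - x_tcp)
    return np.exp(-alpha * d) if d > delta else 1.0 / (d + eps)

def place_reward(x_obj, x_goal, alpha=10.0, delta=0.025, eps=1e-5):
    """Same multiscale shaping applied to the object-goal distance."""
    d = np.linalg.norm(x_obj - x_goal)
    return np.exp(-alpha * d) if d > delta else 1.0 / (d + eps)

def post_grasp_reward(x_obj, x_goal, grasped):
    """Transport shaping, active only while the object is grasped."""
    return np.exp(-np.linalg.norm(x_obj - x_goal)) * float(grasped)

def static_reward(q_dot, varpi=5.0):
    """Stability bonus discouraging residual joint motion after placement."""
    return 1.0 - np.tanh(varpi * np.linalg.norm(q_dot))
```

The piecewise switch is what produces the multiscale behavior: a bounded exponential signal across the workspace, and a sharply increasing inverse-distance signal once the end-effector enters the δ-neighborhood of the target.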

5. Experimental Setup

We evaluate the proposed method in a simulated robotic manipulation environment using a PickBanana task in ManiSkill [25]. The scene contains a seven-DoF robotic arm, a banana object, and a bowl positioned near a marked target region shown as a green point in Figure 2. In each episode, the agent must reach, grasp, and place the banana at the target location. The bowl serves as a movable object, introducing contact-rich interaction into the task. Table 1 summarizes the main environment specifications. The wrist-mounted camera follows the default pinhole perspective model in ManiSkill. Images were rendered at 128 × 128 resolution synchronously at a control frequency of 20 Hz. No sensor noise or lens distortion was modeled during training. At evaluation, the robustness of the learned representations was assessed under three types of visual perturbation, applied independently to the input images: random occlusion (15% of the image), additive Gaussian noise (σ = 0.05), and camera-like affine perturbation (3° yaw rotation, 3% translation, 1.05× scale).
Camera intrinsics and extrinsics were obtained directly from the ManiSkill simulator in each episode. The intrinsic matrix K defines the focal lengths (f_x, f_y) and principal point (c_x, c_y) at 128 × 128 resolution. The extrinsic matrix [R | t] (3 × 4) encodes the rigid transformation from world coordinates to the wrist camera frame, updated at every environment step to reflect the camera's pose as the arm moves. These parameters are used to project 3D object positions into pixel space for UV supervision only during perception pre-training. Complete camera specifications are provided in Table A4.
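The UV supervision targets follow the standard pinhole projection. A minimal sketch, assuming the convention in which [R | t] maps world points into a camera frame whose optical axis is +z; the intrinsics used in the usage note are illustrative, not the simulator's actual values.

```python
import numpy as np

def project_to_uv(x_world, K, Rt):
    """Project a 3D world point to pixel (u, v) via intrinsics K (3x3)
    and extrinsics [R|t] (3x4): x_cam = [R|t] x_h, (u, v) = (K x_cam)[:2] / z."""
    x_h = np.append(x_world, 1.0)   # homogeneous world coordinates
    x_cam = Rt @ x_h                # world -> camera frame
    uvw = K @ x_cam                 # camera frame -> image plane
    return uvw[:2] / uvw[2]         # perspective divide
```

With identity extrinsics and principal point (64, 64), a point on the optical axis projects exactly to the image center, which also gives a quick sanity check for the per-step extrinsics returned by the simulator.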

5.1. Implementation Details

All neural network components were implemented in PyTorch (v2.3.0, Python 3.9). The SAC agent was implemented following the original formulation by [23], which includes a stochastic actor network, two critic networks, and a value network. All SAC-based methods (state-only, CNN pixel, DrQ-v1 [41], and the proposed variants) share identical SAC hyperparameters, actor and critic architecture, and reward function. Only the visual encoder and input representation differ across methods. Three pixel-based baselines are included: a four-layer CNN with ReLU activations and linear projection; DrQ-v1 [41] following the original architecture with minor adaptations for the ManiSkill3 observation space and image resolution; and DrQ-v2 [12], following its original DDPG-based formulation and included as an additional reference baseline. Details are provided in Appendix A.5. The state-only baseline uses the proprioceptive state as input to the actor and critics without visual input. Hyperparameters are summarized in Table A1, and hardware specifications are provided in Table A2.
Visual observations are processed by a ResNet-18 backbone initialized with ImageNet-pre-trained weights. The backbone serves as a spatial feature extractor, providing low-level visual features over an 8 × 8 spatial grid for 128 × 128 inputs, and is therefore truncated after the third residual stage. The slot attention module uses N s = 2 slots, matching the number of task-relevant objects in the scene. FiLM scale and shift networks are zero-initialized, ensuring exact identity modulation at the start of RL training. The backbone is frozen during SAC optimization, while slot attention and FiLM layers remain trainable. Training is executed as a single run of 3000 epochs with 400 environment steps per epoch. During the initial data collection phase, the agent interacts with the environment for 100,000 steps to populate the replay buffer without policy optimization. The perception encoder is then pre-trained for 150,000 gradient steps using samples from this replay buffer. SAC optimization begins after pre-training and continues for the remaining epochs, with one gradient update per environment step. The encoder pre-training required approximately 100 min, and the full run required approximately 32.1 h of wall-clock time.

5.2. Evaluation Objectives and Metrics

Evaluation targets two aspects of the method. First, the fusion encoder should improve reinforcement learning performance from egocentric RGB compared to pixel-only and state-only baselines. Second, the slot-based representation should remain stable over time and specialize in task-relevant objects. Both are evaluated through policy-level and perception-level metrics.
  • Perception-level evaluation: Encoder quality is assessed during pre-training by reporting pixel-space position regression error between predicted UV coordinates and simulator-projected object locations, with visibility masking and swap-based matching, together with attention alignment NLL and visibility BCE loss. Furthermore, slot specialization is assessed qualitatively by visualizing attention maps across viewpoint changes and occlusion events.
  • Policy-level evaluation: Policy performance is measured by final return and success rate, defined as the fraction of episodes in which the banana is placed within the target region. With two slots matching two foreground objects, position and visibility losses directly reflect whether each slot consistently tracks its assigned object.

6. Results and Discussions

6.1. Perception-Level Results and Discussion

Figure 3 shows the pre-training curves for all three losses. The attention alignment loss decreases sharply early in training and approaches near-zero values, indicating that slot queries learn to consistently select object-relevant regions rather than diffuse background features.
The position regression loss drops rapidly within the first portion of training and remains low, confirming that the UV heads recover accurate pixel-space object locations. The visibility loss follows a similar downward trend, stabilizing at a low value and indicating that the encoder reliably distinguishes whether each object lies within the egocentric field of view.
Furthermore, Figure 4 shows that one slot persistently focuses on the banana while the other tracks the bowl across camera motion and contact-driven object displacement. When neither object is visible, attention concentrates on the end-effector, which remains within the field of view throughout the episode, a consistent agent-centric fallback in egocentric perception. The auxiliary heads remain consistent with these attention patterns. As shown in Figure 5a–d, when both objects are visible, the predicted UV markers lie close to the corresponding ground-truth projections, demonstrating accurate localization across varied spatial configurations and scales. Figure 5e shows a harder boundary case where objects lie near or outside the camera view, with ground-truth projections falling outside the valid region; the visibility mask appropriately down-weights their contribution.
To examine whether slot attention generalizes beyond simulation rendering, printed images of a banana and a bowl were held in front of a Samsung Galaxy Note 20 smartphone and live-streamed through the encoder in real time without fine-tuning. The smartphone camera differs substantially from the simulation camera in resolution, FOV, and lens distortion (Table A4), with additional MJPEG compression artifacts absent during training.
As shown in Figure 6, slot attention maps remained localized on the printed targets, and position and visibility predictions were consistent with the observed objects across consecutive frames. Furthermore, when the camera approached one object closely, attention shifted toward it proportionally, redistributing as the viewpoint re-centered. This observation does not involve a robotic platform or closed-loop control and is not intended as a quantitative sim-to-real benchmark. However, it confirms that the learned representation remains coherent under real visual noise and viewpoint variation beyond simulation rendering.

6.2. Policy-Level Results and Discussion

Figure 7 shows the return curves for all evaluated methods. Among the baselines, the state-only SAC exhibits a delayed but steady rise after the mid-training phase, stabilizing at moderate returns, indicating that proprioception alone supports partial task completion but saturates below methods that incorporate visual features. The DrQ-v1 baseline achieves modest returns, typically stabilizing in the 300 range, while the CNN pixel baseline fails to learn. Both results indicate that raw pixel encoders struggle to form stable representations under wrist-mounted camera motion and occlusion.
The proposed method without goal conditioning achieves the highest return, stabilizing near 3.7–3.9 k, and converges fastest, at around 1.1–1.2 k episodes. The proposed method with goal conditioning reaches the second-highest return, plateauing around 3.1–3.2 k. The lower return of the goal-conditioned variant reflects the added input complexity of goal conditioning, which slows early optimization while still producing a strong final policy.
Replacing FiLM with simple concatenation results in slower convergence and a lower final return, suggesting that channel-wise state modulation contributes beyond simple feature fusion. The variant without state (attention fusion output used alone) rises much later, peaks around 2.6–2.7 k, and then declines toward the end of training. This late collapse indicates that visual slots alone are insufficient to resolve the partial observability of egocentric views and that proprioceptive context is needed for stable control.
Figure 7 indicates that pixel-only baselines saturate at low returns, while state-guided slot fusion with FiLM conditioning achieves faster convergence and higher final performance. Figure 8 shows the full manipulation sequence: approach, grasp, transport, and placement, executed successfully. Furthermore, evaluating the final checkpoint over 50 deterministic rollouts results in a 92 % success rate, confirming that the return gains correspond to reliable task completion. Beyond task success, these results highlight practical aspects for robotic manipulation. Freezing the backbone after pre-training reduces computation during policy inference and makes the method easier to deploy on resource-limited platforms. The wrist-mounted camera provides an egocentric view that moves with the arm, so the policy observes objects from a consistent reference frame. Conditioning the representation on the robot’s proprioceptive state further aligns the policy input with the current arm configuration and supports stable control throughout the manipulation sequence.

6.3. Reward Component Analysis

The reward function defines three primary components: reach, post-grasp guidance, and place, each corresponding to a distinct stage of the pick-and-place task. To analyze how reward contributions evolved over training, the logged environment steps were divided into three equal windows: early (0–0.4 × 10⁶ steps), mid (0.4–0.8 × 10⁶ steps), and late (0.8–1.2 × 10⁶ steps). For each phase, the mean and standard deviation of each component's fractional contribution to the total reward were computed over all recorded environment steps.
As shown in Figure 9 and Table 2, the reach reward dominates early training (79.1 ± 13.4%), indicating that the agent initially focuses on approaching the object before acquiring grasping or placement behavior. The post-grasp guidance component remains near zero during the early and mid phases, increasing only after approximately 0.6 × 10⁶ steps (4.1 ± 5.0% in the late phase), consistent with the agent learning to maintain a stable grasp during object transport. The place reward shows a clear upward trend, increasing from 20.6 ± 13.4% in the early phase to 63.6 ± 25.9% in the late phase, indicating that successful object placement becomes the dominant reward signal as training progresses. The variance in the late phase reflects the step-level distribution of reward contributions across different task phases within each episode.
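The windowed analysis can be reproduced with a small helper; the sketch below assumes per-step component rewards are logged as equal-length NumPy arrays (the function and component names are illustrative).

```python
import numpy as np

def phase_fractions(components, edges):
    """Mean fractional contribution of each reward component per training window.
    components: dict of name -> per-step reward array (all the same length);
    edges: window boundaries in environment steps, e.g. [0, 400000, 800000, 1200000]."""
    total = np.sum(list(components.values()), axis=0)
    # Fraction of the total reward attributable to each component at every step.
    fracs = {k: v / np.maximum(total, 1e-8) for k, v in components.items()}
    # Average the per-step fractions inside each [lo, hi) window.
    return [
        {k: float(f[lo:hi].mean()) for k, f in fracs.items()}
        for lo, hi in zip(edges[:-1], edges[1:])
    ]
```

Averaging per-step fractions (rather than dividing window sums) matches a step-level analysis, which is also why the reported late-phase variance can be large even when placement dominates.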

6.4. Visuomotor Robustness and Efficiency Evaluation

To examine what information is encoded in the learned latent representation $\tilde{z}_t$, linear probes are trained on frozen encoder outputs collected from 10 rollout episodes across all ablations. Since the encoder is frozen and the probes are linear, strong probe performance indicates that the relevant signals are linearly decodable from $\tilde{z}_t$ and directly accessible to the policy. The same probes are applied identically across all ablations (FiLM vs. concatenation, with vs. without proprioceptive state, and image-only and state-only variants) on identical rollout frames to isolate the effect of the fusion mechanism.
  • Interaction alignment: Egocentric manipulation hinges on detecting when the robot has reached, contacted, or grasped an object. To test whether the latent representation separates interaction from non-interaction states, a linear classifier is trained to predict a binary contact label from $\tilde{z}_t$, and performance is reported as AUROC. As shown in Table 3, the full FiLM variant achieves the highest contact AUROC ($0.879 \pm 0.019$), indicating that the latent representation separates contact from non-contact states more cleanly than other variants. The state-only variant also achieves high AUROC, consistent with the fact that contact correlates with proprioceptive cues.
  • Progress and object-dynamics encoding: Contact detection alone does not capture how the end-effector approaches the object or how agent actions displace it. Two additional probes target these aspects directly. The one-step TCP-to-object distance change $\Delta d_t = d_{t+1} - d_t$ and the object displacement $\Delta \mathrm{obj}_t = \mathrm{obj}_{t+1} - \mathrm{obj}_t$ are regressed from $\tilde{z}_t$, with performance reported as $R^2$.
As shown in Table 3, all variants achieve strong $R^2$ on both probes, indicating that TCP-to-object distance change and object displacement are linearly decodable across all fusion variants. The FiLM variant maintains these high scores while also achieving the strongest contact AUROC, whereas the concatenation variant achieves comparable $R^2$ on the $\Delta$dist and $\Delta$obj probes but substantially lower contact AUROC. This indicates that concatenation preserves motion-related information but produces weaker interaction alignment than FiLM.
  • Egocentric robustness: Egocentric cameras introduce self-motion, occlusion, and sensor noise that can degrade learned representations. The trained contact probe is evaluated on perturbed latents $\tilde{z}_t^{\mathrm{occ}}$ and $\tilde{z}_t^{\mathrm{noise}}$, with robustness measured as AUROC drops: $\Delta \mathrm{AUROC}_{\mathrm{occ}} = \mathrm{AUROC}(\tilde{z}^{\mathrm{occ}}) - \mathrm{AUROC}(\tilde{z})$ and $\Delta \mathrm{AUROC}_{\mathrm{noise}} = \mathrm{AUROC}(\tilde{z}^{\mathrm{noise}}) - \mathrm{AUROC}(\tilde{z})$.
As shown in Table 3, the FiLM variant produces near-zero $\Delta \mathrm{AUROC}$ under both occlusion and noise, indicating that the contact signal remains stable under egocentric perturbations. The state-only variant produces $\Delta \mathrm{AUROC} = 0$ by construction, since it receives no image input. The image-only and concatenation variants show larger AUROC drops, indicating that pixel-based representations without proprioceptive state modulation are more sensitive to egocentric perturbations.
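The probe metrics themselves require no extra tooling; the following is a minimal sketch of rank-based AUROC (via the Mann–Whitney U statistic, assuming no tied scores) together with the robustness delta used in this evaluation.

```python
import numpy as np

def auroc(scores, labels):
    """AUROC computed from the Mann-Whitney U statistic: the probability
    that a random positive outranks a random negative (no tied scores)."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)  # ranks 1..n
    pos = labels.astype(bool)
    n_pos, n_neg = pos.sum(), (~pos).sum()
    u = ranks[pos].sum() - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)

def delta_auroc(scores_perturbed, scores_clean, labels):
    """Robustness drop: AUROC on perturbed latents minus AUROC on clean ones."""
    return auroc(scores_perturbed, labels) - auroc(scores_clean, labels)
```

A value of 1.0 means perfect separation of contact from non-contact frames, 0.5 is chance, and a negative delta quantifies how much a perturbation degrades the probe.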

7. Ablation Studies

7.1. FiLM Conditioning Variants

Three state-conditioning variants are evaluated against the proposed method to examine the role of FiLM modulation along two dimensions: where the state enters the pipeline and how it conditions the visual representation. The proposed method, FiLM with tanh, applies channel-wise multiplicative modulation to slot features as $(1 + \tanh(\gamma_t)) \, z_t + \beta_t$, where $\gamma_t$ and $\beta_t$ are predicted from the proprioceptive state. In this formulation, the state conditions the visual representation directly and multiplicatively at the slot feature level.
The first variant, FiLM linear+clamp, replaces the tanh scaling with a linear scale factor clamped to $[-1, 1]$. The conditioning mechanism and entry point remain identical to the proposed method. This variant isolates the contribution of smooth, bounded modulation from the FiLM mechanism itself.
The second variant, concat+projection, replaces FiLM with a linear projection applied to the concatenation of the state and slot features, projecting the result back to the original feature dimension. Here the state still enters early at the slot feature level but interacts with the visual features additively rather than multiplicatively. This tests whether the state must actively reshape the visual feature space channel-wise or whether simple additive mixing is sufficient.
The third variant, proposed with concat (no FiLM), appends the state to the aggregated slot features only at the policy input stage after all visual processing is complete. In this case, the state never influences the visual representation itself and is provided only as an additional input to the policy. This variant tests whether passive perception with state as a side input is sufficient without explicit visual feature conditioning.
As shown in Figure 10, the concat+projection variant fails to learn a meaningful policy, indicating that additive fusion of state and visual features at the slot level is insufficient. The late concat variant learns the task but converges to a substantially lower return than the proposed method, showing that providing the state only at the policy input without conditioning the visual features is also inadequate. In contrast, both FiLM variants successfully learn the task. The tanh-modulated FiLM converges faster and reaches a higher stable return than the clamp variant. These results indicate that proprioceptive state must condition visual features early and multiplicatively, consistent with the embodied perception principle underlying the proposed method.
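The difference between the two bounded-scale formulations is small in code; the following is a minimal NumPy sketch (feature shapes illustrative). With the zero-initialized γ and β described in Section 5.1, both reduce to the identity at the start of RL training.

```python
import numpy as np

def film_tanh(z, gamma, beta):
    """Proposed: channel-wise modulation (1 + tanh(gamma)) * z + beta,
    with the multiplicative scale smoothly confined to (0, 2)."""
    return (1.0 + np.tanh(gamma)) * z + beta

def film_linear_clamp(z, gamma, beta):
    """Variant: linear scale hard-clamped to [-1, 1] instead of tanh,
    giving the same [0, 2] scale range but with non-smooth saturation."""
    return (1.0 + np.clip(gamma, -1.0, 1.0)) * z + beta
```

Because both scales stay in [0, 2], neither can flip the sign of a slot feature; the tanh form additionally keeps the modulation differentiable everywhere, which is one plausible reason for its faster convergence in Figure 10.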

7.2. Reward Function Ablations

To evaluate the contribution and robustness of the reward design, we perform ablation experiments examining both the structural components and the sensitivity to key parameters.
  • Stability sensitivity: The scaling coefficient $\varpi$ controls the sharpness of the velocity penalty in the stability term. We vary $\varpi \in \{1, 5, 10\}$ to test the sensitivity of the bounded velocity penalty. As shown in Figure 11, all three values eventually learn the task, confirming that the method is robust to this parameter. However, $\varpi = 5$ achieves the fastest convergence and the highest stable return; $\varpi = 1$ produces the slowest and most unstable learning, while $\varpi = 10$ delays convergence relative to the proposed value. These results confirm that $\varpi = 5$ provides the most effective balance between penalizing residual motion and preserving the fine control needed during placement.
  • Stability term removal: We remove the stabilization component to evaluate whether stable placement behavior emerges without explicitly penalizing residual joint motion.
Figure 12 compares the proposed method against a variant with the stability term removed entirely. Removing the stability term results in faster initial convergence but a lower final return, with the agent saturating around 3700 compared to 3900 for the proposed method. Both variants follow a similar learning trajectory until approximately episode 1250, after which the proposed method pulls ahead and stabilizes at a higher return.
  • Reward-scale robustness: The distance shaping parameter $\alpha \in \{5, 10, 15\}$ is ablated to examine the robustness of the multiscale reward shaping formulation.
The scaling coefficient α controls the rate of exponential decay in the reaching, post-grasp, and placement reward terms. As shown in Figure 13, all three values eventually learn the task, indicating that the method is robust to this parameter. However, α = 10 achieves the fastest convergence and the highest stable return. Reducing α to 5 slows convergence and introduces instability during the learning transition, while increasing α to 15 delays the onset of learning.

7.3. Surplus Slot Analysis

In the proposed method, the number of slots was set equal to the number of task-relevant objects ($N_o = 2$) in the main experiments. This ablation examines what happens when surplus slots are introduced ($N_s = 4$ with $N_o = 2$). During pre-training, only two slots receive pixel-space supervision targets corresponding to the task-relevant objects; the remaining slots have no explicit supervision and are free to attend to any image region. This setting tests whether surplus unsupervised slots destabilize slot binding. Figure 14 and Figure 15 show representative attention maps from the two slots that produce the strongest activation responses. The remaining slots exhibit weaker, diffuse responses and are not selected by the position prediction head. Additionally, object localization remains highly accurate.

7.4. End-to-End Training Without Pre-Training

The main method initializes the visual encoder through a supervised pre-training stage before RL begins. This ablation evaluates the contribution of pre-training by training the full encoder end-to-end with SAC from random initialization, with no pre-training stage. The ResNet backbone, slot attention, and FiLM layers are all updated jointly through the critic loss from the start of training. This isolates the contribution of the pre-training stage and investigates whether the learned visual representation it provides is necessary for task performance, or whether critic gradients alone are sufficient to learn effective slot-based visual features from scratch.
Figure 16 shows the learning curve for an end-to-end variant. The return increases, indicating that the agent is learning. However, the policy never reaches successful task completion, and the return saturates below the performance achieved by the proposed approach. This behavior arises because the visual encoder must simultaneously learn object localization, stable feature binding, and control-relevant representations from the reinforcement learning objective alone. The RL signal is task-oriented and provides only indirect supervision for pixel-level representation learning, resulting in high-variance gradients that are weakly coupled to object localization and spatial reasoning. Pre-training the encoder therefore improves training efficiency by providing a stable visual representation before policy optimization begins.

8. Conclusions

This work introduced a state-guided multi-slot attention fusion architecture for egocentric robotic manipulation and integrated it with Soft Actor–Critic for end-to-end visuomotor control. In the PickBanana manipulation task, the proposed approach achieved the fastest learning and the highest final return among all compared methods. Ablations showed that the fusion framework remains strong with or without explicit goal input, FiLM-based state modulation is critical for sample-efficient and stable learning compared to naive concatenation, and proprioceptive conditioning is necessary for reliable control in partially observed egocentric views. Qualitative rollouts further confirmed that the learned representation supports coherent, temporally structured manipulation behaviors.
Overall, the findings demonstrate that slot-based representations, when anchored to the agent's state, provide an effective and practical route to egocentric manipulation policies in simulation. This establishes a foundation for future work on broader task suites and transfer to real-robot egocentric settings, where occlusion and viewpoint changes are unavoidable. The method is limited, however, by its reliance on simulator supervision during encoder pre-training, and tasks that require more complex interactions have not been considered. Future work could replace oracle simulator supervision with perception-only signals to enable real-robot transfer and extend the evaluation to more complex interactions.

Author Contributions

Conceptualization, S.W.B. and J.-H.H.; methodology, S.W.B.; software, S.W.B.; validation, S.W.B. and J.-H.H.; formal analysis, S.W.B.; investigation, J.-H.H.; resources, J.-H.H.; data curation, S.W.B.; writing—original draft preparation, S.W.B.; writing—review and editing, J.-H.H.; visualization, S.W.B.; supervision, J.-H.H.; project administration, J.-H.H.; funding acquisition, J.-H.H. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (RS-2024-00407295).

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Appendix A.1. SAC Hyperparameters

Table A1. Training hyperparameters.
Parameter | Value | Description
SAC
α | 0.0001 | Actor network learning rate
β | 0.001 | Critic and value learning rate
γ | 0.99 | Discount factor
τ | 0.005 | Target network soft update
Reward Scale | 10 | Adjusts reward magnitude
Hidden Layers | 256, 256 | Neural network layer sizes
Batch Size | 128 | Training batch size
Buffer Size | 1 M | Maximum replay buffer size
Encoder Pre-training
Learning Rate | 0.0001 | Encoder learning rate
Feature Dim | 42 | Slot and FiLM feature dimension
Attention Heads | 3 | Multihead cross-attention heads
Pre-training Steps | 150,000 | Gradient steps for encoder pre-training
RL Encoder Fine-tuning
Fusion layers lr | 0.0001 | Learning rate for slot attention and FiLM
Goal gate lr | 0.00003 | Learning rate for goal gating logits
Optimizer | Adam | Both pre-training and RL fine-tuning
RL reward
α | 10 | Scales obj–TCP distance
ϖ | 5 | Adjusts stability term
δ₁, δ₂ | 0.025 | Proximity threshold (environment set)

Appendix A.2. Computational Requirement

Table A2. Computational setup (primary setup).
Component | Details
Operating System | Ubuntu 22.04.2 (jammy)
GPU Model | NVIDIA RTX A6000
CPU | AMD EPYC 7313P (16-core)
CUDA Version | 12.6

Appendix A.3. Computational Characteristics

Table A3 summarizes the parameter count, inference latency, effective inference rate, and peak GPU memory consumption of the full architecture. The model contains 2.91 M parameters in total, with 2.83 M in the fusion encoder and 0.08 M in the actor network. All measurements were obtained on an NVIDIA RTX A6000 (48 GB) GPU with batch size one, with inference latency averaged over 100 forward passes following 10 warm-up iterations.
Table A3. Computational cost of the proposed method measured on an NVIDIA RTX A6000 (48 GB) GPU.
Metric | Value
Total parameters | 2.91 M
Mean inference latency | 2.57 ms (±0.09 ms)
Effective inference rate | 388 Hz
Control period (20 Hz) | 50 ms
Peak GPU memory | 148.8 MB

Appendix A.4. Wrist and Mobile Camera Specifications

Table A4. Comparison between the simulation wrist camera and mobile device camera for egocentric viewpoint approximation (Samsung Galaxy Note 20, main wide-angle lens).
 | Simulation Wrist Camera | Samsung Galaxy Note 20
FOV | 90° | ∼79° (main wide-angle lens at 1080p video mode)
Resolution | 128 × 128 at 20 fps | 1920 × 1080 at 25 fps (streamed via IP Webcam app, downsampled to 128 × 128 for inference)
Model | Pinhole (no distortion during training) | Real lens with distortion
Near/Far | 0.01 m / 100 m | N/A
Mount | Wrist-mounted on camera_link (end-effector link) with identity relative pose | Handheld to approximate egocentric wrist viewpoint

Appendix A.5. Additional Baseline Results

DrQ-v2 [12] was evaluated in the same egocentric manipulation setting to verify that the observed improvements are not specific to the SAC-based training framework. DrQ-v2 combines data augmentation with a deterministic actor–critic algorithm (DDPG), introducing a difference in the underlying RL algorithm relative to the SAC-based baselines in the main experiments.
As shown in Figure A1, DrQ-v2 achieves lower returns throughout training, consistent with the DrQ-v1 result in the main experiments, and a 0% success rate. Together, these results suggest that data augmentation alone is insufficient for egocentric manipulation. Both methods process visual observations passively, without access to the robot's proprioceptive state, and cannot resolve the ambiguity between object appearance and arm configuration introduced by the dynamic egocentric viewpoint. This supports the conclusion that proprioceptive conditioning of slot-based visual features, rather than the augmentation strategy or policy optimization algorithm, is the key factor driving task performance.
Figure A1. Training return across baselines and architectural variants under the same reward scaling. All methods use 400 environment steps per epoch. All runs were trained for 3000 epochs except DrQ-v2, which was trained for 2000; therefore, only the first 2000 epochs are shown for comparison.

References

  1. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.A.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef]
  2. Levine, S.; Finn, C.; Darrell, T.; Abbeel, P. End-to-End Training of Deep Visuomotor Policies. J. Mach. Learn. Res. 2016, 17, 1334–1373. [Google Scholar]
  3. Pinto, L.; Gupta, A.K. Supersizing self-supervision: Learning to grasp from 50K tries and 700 robot hours. In 2016 IEEE International Conference on Robotics and Automation (ICRA); IEEE: Piscataway, NJ, USA, 2016; pp. 3406–3413. [Google Scholar]
  4. Kalashnikov, D.; Irpan, A.; Pastor, P.; Ibarz, J.; Herzog, A.; Jang, E.; Quillen, D.; Holly, E.; Kalakrishnan, M.; Vanhoucke, V.; et al. QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation. arXiv 2018, arXiv:1806.10293. [Google Scholar]
  5. Ebert, F.; Finn, C.; Dasari, S.; Xie, A.; Lee, A.X.; Levine, S. Visual Foresight: Model-Based Deep Reinforcement Learning for Vision-Based Robotic Control. arXiv 2018, arXiv:1812.00568. [Google Scholar]
  6. Guo, H.; Song, M.; Ding, Z.; Yi, C.; Jiang, F. Vision-Based Efficient Robotic Manipulation with a Dual-Streaming Compact Convolutional Transformer. Sensors 2023, 23, 515. [Google Scholar] [CrossRef]
  7. Jangir, R.; Hansen, N.; Ghosal, S.; Jain, M.; Wang, X. Look Closer: Bridging Egocentric and Third-Person Views with Transformers for Robotic Manipulation. IEEE Robot. Autom. Lett. 2022, 7, 3046–3053. [Google Scholar] [CrossRef]
  8. Zhu, H.; Yu, J.; Gupta, A.; Shah, D.; Hartikainen, K.; Singh, A.; Kumar, V.; Levine, S. The Ingredients of Real-World Robotic Reinforcement Learning. arXiv 2020, arXiv:2004.12570. [Google Scholar] [CrossRef]
  9. Grauman, K.; Westbury, A.; Torresani, L.; Kitani, K.; Malik, J.; Afouras, T.; Ashutosh, K.; Baiyya, V.; Bansal, S.; Boote, B.; et al. Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: Piscataway, NJ, USA, 2024; pp. 19383–19400. [Google Scholar]
  10. Bandini, A.; Zariffa, J. Analysis of the Hands in Egocentric Vision: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 6846–6866. [Google Scholar] [CrossRef]
  11. Li, X.; Qiu, H.; Wang, L.; Zhang, H.; Qi, C.; Han, L.; Xiong, H.; Li, H. Challenges and Trends in Egocentric Vision: A Survey. arXiv 2025, arXiv:2503.15275. [Google Scholar] [CrossRef]
  12. Yarats, D.; Fergus, R.; Lazaric, A.; Pinto, L. Mastering Visual Continuous Control: Improved Data-Augmented Reinforcement Learning. arXiv 2021, arXiv:2107.09645. [Google Scholar] [CrossRef]
  13. Laskin, M.; Srinivas, A.; Abbeel, P. Curl: Contrastive unsupervised representations for reinforcement learning. In Proceedings of the International Conference on Machine Learning, Virtual, 13–18 July 2020; pp. 5639–5650. [Google Scholar]
  14. Chen, L.; Lu, K.; Rajeswaran, A.; Lee, K.; Grover, A.; Laskin, M.; Abbeel, P.; Srinivas, A.; Mordatch, I. Decision Transformer: Reinforcement Learning via Sequence Modeling. In Proceedings of the Neural Information Processing Systems, Virtual, 6–14 December 2021. [Google Scholar]
  15. Hafner, D.; Lillicrap, T.P.; Fischer, I.S.; Villegas, R.; Ha, D.R.; Lee, H.; Davidson, J. Learning Latent Dynamics for Planning from Pixels. arXiv 2018, arXiv:1811.04551. [Google Scholar]
  16. Hafner, D.; Lillicrap, T.P.; Ba, J.; Norouzi, M. Dream to Control: Learning Behaviors by Latent Imagination. arXiv 2019, arXiv:1912.01603. [Google Scholar]
  17. Hafner, D.; Pasukonis, J.; Ba, J.; Lillicrap, T.P. Mastering Diverse Domains through World Models. arXiv 2023, arXiv:2301.04104. [Google Scholar]
  18. Zambaldi, V.F.; Raposo, D.; Santoro, A.; Bapst, V.; Li, Y.; Babuschkin, I.; Tuyls, K.; Reichert, D.P.; Lillicrap, T.P.; Lockhart, E.; et al. Relational Deep Reinforcement Learning. arXiv 2018, arXiv:1806.01830. [Google Scholar] [CrossRef]
  19. Land, M.F.; Hayhoe, M.M. In what ways do eye movements contribute to everyday activities? Vis. Res. 2001, 41, 3559–3565. [Google Scholar] [CrossRef]
  20. Prablanc, C.; Martin, O. Automatic control during hand reaching at undetected two-dimensional target displacements. J. Neurophysiol. 1992, 67, 455–469. [Google Scholar] [CrossRef]
  21. Locatello, F.; Weissenborn, D.; Unterthiner, T.; Mahendran, A.; Heigold, G.; Uszkoreit, J.; Dosovitskiy, A.; Kipf, T. Object-Centric Learning with Slot Attention. arXiv 2020, arXiv:2006.15055. [Google Scholar] [CrossRef]
  22. Singh, G.; Wu, Y.F.; Ahn, S. Simple Unsupervised Object-Centric Learning for Complex and Naturalistic Videos. arXiv 2022, arXiv:2205.14065. [Google Scholar] [CrossRef]
  23. Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. arXiv 2018, arXiv:1801.01290. [Google Scholar]
  24. Perez, E.; Strub, F.; De Vries, H.; Dumoulin, V.; Courville, A. FiLM: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar]
  25. Tao, S.; Xiang, F.; Shukla, A.; Qin, Y.; Hinrichsen, X.; Yuan, X.; Bao, C.; Lin, X.; Liu, Y.; Chan, T.-k.; et al. ManiSkill3: GPU Parallelized Robotics Simulation and Rendering for Generalizable Embodied AI. arXiv 2024, arXiv:2410.00425. [Google Scholar] [CrossRef]
  26. Laskin, M.; Lee, K.; Stooke, A.; Pinto, L.; Abbeel, P.; Srinivas, A. Reinforcement Learning with Augmented Data. arXiv 2020, arXiv:2004.14990. [Google Scholar] [CrossRef]
  27. Lee, A.X.; Nagabandi, A.; Abbeel, P.; Levine, S. Stochastic Latent Actor-Critic: Deep Reinforcement Learning with a Latent Variable Model. arXiv 2019, arXiv:1907.00953. [Google Scholar]
  28. Tsai, R.Y. A versatile camera calibration technique for high-accuracy 3D machine vision metrology using off-the-shelf TV cameras and lenses. IEEE J. Robot. Autom. 1987, 3, 323–344. [Google Scholar] [CrossRef]
  29. Hutchinson, S.A.; Hager, G.; Corke, P. A tutorial on visual servo control. IEEE Trans. Robot. Autom. 1996, 12, 651–670. [Google Scholar] [CrossRef]
  30. Chaumette, F.; Hutchinson, S.A. Visual servo control. I. Basic approaches. IEEE Robot. Autom. Mag. 2006, 13, 82–90. [Google Scholar] [CrossRef]
  31. Zeng, A.; Florence, P.R.; Tompson, J.; Welker, S.; Chien, J.M.; Attarian, M.; Armstrong, T.; Krasin, I.; Duong, D.; Sindhwani, V.; et al. Transporter Networks: Rearranging the Visual World for Robotic Manipulation. In Proceedings of the Conference on Robot Learning, Virtual, 16–18 November 2020. [Google Scholar]
  32. Brohan, A.; Brown, N.; Carbajal, J.; Chebotar, Y.; Dabis, J.; Finn, C.; Gopalakrishnan, K.; Hausman, K.; Herzog, A.; Hsu, J.; et al. RT-1: Robotics Transformer for Real-World Control at Scale. arXiv 2022, arXiv:2212.06817. [Google Scholar]
  33. Reed, S.; Zolna, K.; Parisotto, E.; Colmenarejo, S.G.; Novikov, A.; Barth-Maron, G.; Giménez, M.; Sulsky, Y.; Kay, J.; Springenberg, J.T.; et al. A Generalist Agent. arXiv 2022, arXiv:2205.06175. [Google Scholar] [CrossRef]
  34. Shridhar, M.; Manuelli, L.; Fox, D. Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation. arXiv 2022, arXiv:2209.05451. [Google Scholar]
  35. Yu, H.F.; Altahhan, A. Hierarchical Learning for Closed-Loop Robotic Manipulation in Cluttered Scenes via Depth Vision, Reinforcement Learning, and Behaviour Cloning. Electronics 2025, 14, 3074. [Google Scholar] [CrossRef]
  36. Burgess, C.P.; Matthey, L.; Watters, N.; Kabra, R.; Higgins, I.; Botvinick, M.M.; Lerchner, A. MONet: Unsupervised Scene Decomposition and Representation. arXiv 2019, arXiv:1901.11390. [Google Scholar] [CrossRef]
  37. Greff, K.; Kaufman, R.L.; Kabra, R.; Watters, N.; Burgess, C.P.; Zoran, D.; Matthey, L.; Botvinick, M.M.; Lerchner, A. Multi-Object Representation Learning with Iterative Variational Inference. arXiv 2019, arXiv:1903.00450. [Google Scholar]
  38. Jiang, J.; Janghorbani, S.; de Melo, G.; Ahn, S. SCALOR: Generative World Models with Scalable Object Representations. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  39. Watters, N.; Zoran, D.; Weber, T.; Battaglia, P.W.; Pascanu, R.; Tacchetti, A. Visual Interaction Networks: Learning a Physics Simulator from Video. In Proceedings of the Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  40. Kipf, T.; van der Pol, E.; Welling, M. Contrastive Learning of Structured World Models. arXiv 2019, arXiv:1911.12247. [Google Scholar]
  41. Kostrikov, I.; Yarats, D.; Fergus, R. Image Augmentation Is All You Need: Regularizing Deep Reinforcement Learning from Pixels. arXiv 2020, arXiv:2004.13649. [Google Scholar]
Figure 1. Overview of the proposed proprioception-modulated architecture. Egocentric RGB input is encoded via ResNet and tokenized. Learnable slot queries attend to the tokens using a cross-attention mechanism. The resulting slot features are modulated by the robot states via FiLM and shared across all SAC components for policy learning. During reinforcement learning, the beige background modules, including FiLM conditioning, state projection, cross-attention, and slot heads, remain trainable and receive gradients from the SAC critic, while the blue background modules are frozen.
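The slot cross-attention step in Figure 1 amounts to scaled dot-product attention with learnable queries. A self-contained numpy sketch, with illustrative dimensions and random arrays standing in for the learned parameters and ResNet tokens:

```python
import numpy as np

# Sketch: learnable slot queries attend over image tokens.
rng = np.random.default_rng(0)
num_slots, num_tokens, dim = 2, 64, 32

queries = rng.standard_normal((num_slots, dim))   # learnable slot queries
tokens = rng.standard_normal((num_tokens, dim))   # tokenized ResNet features

logits = queries @ tokens.T / np.sqrt(dim)        # (num_slots, num_tokens)
attn = np.exp(logits - logits.max(axis=1, keepdims=True))
attn /= attn.sum(axis=1, keepdims=True)           # softmax over tokens per slot

slot_features = attn @ tokens                     # attention-weighted token sum
print(slot_features.shape)  # (2, 32)
```

Each row of `attn` is the per-slot heatmap visualized in the attention figures.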
Figure 2. PickBanana egocentric manipulation setup in ManiSkill. A seven-DoF robotic arm operates above a tabletop with a banana and a bowl placed near the target region (green dot). The task objective is to grasp and place the banana at the target location. The bowl is a non-goal movable object present at the target area.
Figure 3. Pre-training loss components for the egocentric slot-based encoder. (a) Swap-invariant pixel-space UV regression error (pos), (b) attention alignment loss (attnNLL), and (c) visibility prediction BCE (vis loss) over pre-training steps on PickBanana. All losses decrease rapidly and stabilize, indicating accurate pixel localization, attention centered on projected object tokens, and robust visibility handling in the wrist camera view.
Figure 4. Slot specialization from egocentric attention. Random frames illustrate the original observation (orig) and the attention heatmaps for Slot 0 and Slot 1. Across diverse viewpoints and object configurations, the two slots form a consistent decomposition in which one slot attends to the banana and the other to the bowl, with rare identity permutations expected from permutation-invariant slot learning without explicit slot–object binding. When both objects are fully occluded, attention reverts to the end-effector region as an agent-centric fallback cue.
Figure 5. Qualitative evaluation of the auxiliary pixel heads after pre-training. Predicted object UV locations (blue “x” for banana, green “x” for bowl) are overlaid with simulator-projected ground-truth pixels (orange “o” for banana, red “o” for bowl) on egocentric frames. Samples (a–d) show accurate localization when objects are visible, while sample (e) illustrates a difficult occlusion/out-of-view case handled via visibility/validity masking.
Figure 6. Egocentric qualitative check of slot attention. A phone-mounted camera on a human arm mimics the wrist viewpoint, observing printed banana and bowl targets. Each pair of panels shows Slot-0 and Slot-1 attention heatmaps. Across the shown samples, the two slots remain spatially localized and track the banana and bowl under viewpoint changes, with occasional permutation (slot swapping), reflecting exchangeable slot identities rather than loss of tracking.
Figure 7. Training return over 3000 episodes for all baselines and ablations under the same reward scaling. The proposed multi-slot attention agent consistently outperforms pixel-only and state-only alternatives, and the ablations isolate which design choices drive this gap.
Figure 8. Manipulation behavior (left to right): (a) reaching toward the banana; (b) grasping; (c) lifting, transporting, and pushing the bowl aside; (d) placing at the target position; (e) holding at the target position.
Figure 9. Evolution of normalized reward components during training. The x-axis shows environment steps, and the y-axis shows normalized component magnitude in the range [0, 1]. Each curve corresponds to a reward component (r_reach, r_postgrasp, r_place) logged during training and smoothed using exponential moving averaging.
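The exponential moving averaging used to smooth the curves in Figure 9 can be sketched as follows; the smoothing factor `alpha` is illustrative, since the paper does not report the value used.

```python
import numpy as np

# Standard exponential moving average over a logged scalar sequence.
def ema(x: np.ndarray, alpha: float = 0.1) -> np.ndarray:
    out = np.empty_like(x, dtype=float)
    out[0] = x[0]
    for t in range(1, len(x)):
        out[t] = alpha * x[t] + (1 - alpha) * out[t - 1]
    return out

print(ema(np.array([0.0, 1.0, 1.0]), alpha=0.5))
```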
Figure 10. Training curves comparing FiLM with tanh modulation (proposed), FiLM with linear+clamp, concatenation with projection, and concatenation without FiLM.
Figure 11. Training return across 2000 episodes for different velocity penalty scaling terms. The proposed setting (StaticTerm = 5) achieves faster learning and higher stable return compared to StaticTerm = 1 and StaticTerm = 10.
Figure 12. Training return across episodes comparing the proposed method with the static reward term and the variant where the static term is removed.
Figure 13. Training return across 2000 episodes for different reward scaling factors. The proposed variant (RewardScale:10) achieves faster performance improvement and a higher final return compared to RewardScale:5 and RewardScale:15.
Figure 14. Slot attention visualization when N_s = 4 and the environment contains two objects. For clarity, we visualize the two slots that exhibit the strongest localized attention responses.
Figure 15. Position prediction under the surplus slot configuration (N_s = 4). Predicted and ground-truth keypoints for the banana and bowl are overlaid across different viewpoints.
Figure 16. Training return for end-to-end learning, where the visual encoder and policy are optimized jointly without representation pre-training. The agent improves its return but never reaches successful task completion.
Table 1. Environment and observation specifications for the PickBanana task.
Component | Specification
Robot model | 7-DOF Franka Emika Panda
Camera setup | Egocentric RGB camera mounted on the end-effector (Panda wrist)
Observation vector o_t^state | 35-D proprioceptive state (joint positions, velocities, gripper pose)
Action space A | 7-D continuous action space
Objects | Banana, target location, and a bowl
Simulator | SAPIEN engine via the ManiSkill framework (panda_wristcam agent)
Table 2. Percentage contribution of reward components across training phases (mean ± standard deviation over environment steps).
Phase | Reach | Post-Grasp | Place
Early | 79.1 ± 13.4 | 0.3 ± 1.1 | 20.6 ± 13.4
Mid | 72.9 ± 18.2 | 3.5 ± 4.1 | 23.6 ± 18.3
Late | 31.9 ± 24.4 | 4.5 ± 5.0 | 63.6 ± 25.9
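Percentage contributions of this kind can be recovered from logged per-step component magnitudes by row-normalization; the sample values below are made up for illustration, not the paper's logged data.

```python
import numpy as np

# Each row holds one step's [r_reach, r_postgrasp, r_place] magnitudes.
components = np.array([
    [0.80, 0.01, 0.20],   # illustrative early-phase step
    [0.30, 0.05, 0.65],   # illustrative late-phase step
])

# Normalize each row so the components sum to 100%.
pct = 100.0 * components / components.sum(axis=1, keepdims=True)
print(pct.round(1))
```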
Table 3. Representation probing on PickBanana latents (mean ± std over three seeds). Contact AUROC measures interaction alignment. Δdist R² measures progress encoding. ΔObj R² is averaged over x, y, z object displacement probes. Occ/Noise ΔAUROC reports the change in Contact AUROC under occlusion and noise (closer to 0 indicates higher robustness).
Variant | Contact AUROC ↑ | Δdist R² ↑ | ΔObj R² (avg) ↑ | Occ ΔAUROC →0 | Noise ΔAUROC →0
FiLM (full) | 0.879 ± 0.019 | 0.841 ± 0.011 | 0.552 ± 0.029 | 0.002 ± 0.026 | 0.001 ± 0.006
State-only | 0.852 ± 0.009 | 0.813 ± 0.008 | 0.570 ± 0.025 | 0.000 ± 0.000 | 0.000 ± 0.000
Image-only | 0.638 ± 0.142 | 0.821 ± 0.022 | 0.529 ± 0.024 | 0.038 ± 0.101 | 0.020 ± 0.101
Concat (no FiLM) | 0.531 ± 0.206 | 0.846 ± 0.014 | 0.577 ± 0.035 | 0.058 ± 0.131 | +0.042 ± 0.088
Note: ↑ indicates higher values are better; →0 indicates values closer to zero are better. Bold values indicate the best results.
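Contact AUROC in Table 3 is a standard rank-based metric; the sketch below is a minimal tie-free numpy implementation of how such a probe score can be computed (an assumed implementation, not the authors' code).

```python
import numpy as np

# Rank-based AUROC (Mann-Whitney formulation); assumes no tied scores.
def auroc(scores: np.ndarray, labels: np.ndarray) -> float:
    order = np.argsort(scores)
    ranks = np.empty_like(order, dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)   # ascending ranks, 1-based
    pos = labels == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

scores = np.array([0.9, 0.8, 0.3, 0.1])   # probe outputs for contact
labels = np.array([1, 1, 0, 0])           # ground-truth contact labels
print(auroc(scores, labels))  # 1.0
```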
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Beyene, S.W.; Han, J.-H. Multi-Slot Attention with State Guidance for Egocentric Robotic Manipulation. Electronics 2026, 15, 1365. https://doi.org/10.3390/electronics15071365

