Article

Multi-Slot Attention with State Guidance for Egocentric Robotic Manipulation

by
Sofanit Wubeshet Beyene
and
Ji-Hyeong Han
*
Department of Computer Science and Engineering, Seoul National University of Science and Technology, Seoul 01811, Republic of Korea
*
Author to whom correspondence should be addressed.
Electronics 2026, 15(7), 1365; https://doi.org/10.3390/electronics15071365
Submission received: 11 February 2026 / Revised: 17 March 2026 / Accepted: 19 March 2026 / Published: 25 March 2026
(This article belongs to the Section Artificial Intelligence)

Abstract

Visual perception is fundamental to robotic manipulation for recognizing objects, goals, and contextual details. Third-person cameras provide global views but can miss contact-rich interactions and require calibration. Wrist-mounted egocentric cameras reduce these limitations but introduce occlusion, motion blur, and partial observability, which complicate visuomotor learning. Furthermore, existing perception modules that rely solely on pixels or fuse imagery with proprioception as flat vectors do not explicitly model structured scene representations in dynamic egocentric views. To address these challenges, a multi-slot attention fusion encoder for egocentric manipulation is introduced. Learnable slot queries extract localized visual features from image tokens, and Feature-wise Linear Modulation (FiLM) conditions each slot on the robot’s joint states, producing a structured slot-based latent representation that adapts to viewpoint and configuration changes without requiring object labels or external camera priors. This representation is used as input to a Soft Actor–Critic (SAC) agent, which achieves a higher mean cumulative return than pixel-only CNN/DrQ and state-only baselines on a ManiSkill3 egocentric manipulation task. Probing experiments and real-camera evaluation further show that the learned representation remains stable under egocentric viewpoint shifts and partial occlusions, indicating robustness in practical manipulation settings.

1. Introduction

Learning directly from pixels has driven major advances in reinforcement learning (RL), from human-level performance in Atari [1] to continuous-control policies in simulated environments [2]. In robotics, visual RL has enabled agents to acquire manipulation skills such as grasping [3], pushing and stacking [2], and large-scale goal-directed control [4]. These achievements highlight the potential of end-to-end visuomotor learning, but also reveal that challenges remain in how visual input is represented and aligned with control, particularly in embodied robotic settings.
One fundamental reason for this misalignment is the choice of viewpoint. Many visuomotor policies rely on external third-person cameras, which provide broad workspace coverage but are inherently decoupled from the robot’s action space [5]. In such views, the same motor command can produce different pixel changes depending on where the object lies in the external frame. Moreover, maintaining alignment with the robot’s workspace requires careful camera calibration and consistent placement. This may suffice in static or simulated environments but becomes challenging in real-world systems, particularly for mobile or multi-arm platforms, where maintaining stable external camera calibration and synchronization is difficult [6,7].
By contrast, egocentric vision, where the wrist camera offers a perspective naturally aligned with robot motion [8], grounds visual consequences in the manipulator’s workspace [9]. However, egocentric views also introduce challenges including viewpoint shifts, narrow fields of view, and partial observability as objects enter and leave the frame [10,11]. These issues are difficult to solve by vision alone, since the system cannot distinguish between changes caused by self-motion and those caused by object motion. Proprioceptive state provides this missing context, enabling the encoder to interpret visual changes relative to the robot’s configuration.
However, most visual RL pipelines were developed for pixel-only benchmarks, where proprioceptive state was deliberately excluded to test the limits of image-based learning. Methods such as DrQ [12], CURL [13], and RAD learn data-efficient visual representations but do not incorporate robot states. Furthermore, sequence models such as the Decision Transformer [14] process states and actions as separate tokens but still rely on globally pooled image embeddings rather than conditioning visual representations on the robot’s configuration. When state information is included, as in world-model approaches like Dreamer [15,16,17], it is generally fused with visual latents through simple concatenation, without structured cross-modal interaction. Related efforts in object-centric RL [18] highlight the importance of compositional representations but again lack explicit coupling between proprioception and attention. Insights from human sensorimotor coordination suggest a path forward and reflect the principle of embodied perception, in which perception is shaped by the agent’s body configuration rather than processed independently: during reaching, the human gaze follows and adapts to hand motion, prioritizing visual features relevant to the current motor plan [19,20].
This perception–action alignment suggests that visuomotor systems should emphasize object regions and interactions that matter for the current control strategy rather than processing the entire visual scene uniformly. Following this principle, we propose a multi-slot attention architecture that couples perception and proprioception so that the robot’s body state shapes visual representation, and the resulting visual interpretation directly informs action selection, forming a closed perception–action loop.
The module first learns a set of spatially localized query vectors (slots) [21,22] that attend to image regions corresponding to task-relevant entities through cross-attention. To stabilize slot–object binding under egocentric viewpoint shifts, the visual slot encoder is initialized with a supervised perception pre-training phase. During this phase, the simulator provides ground-truth object poses, which are projected into 2D pixel coordinates to generate supervision targets. The encoder receives only RGB observations as input. The projected 2D coordinates are used solely as training targets to optimize slot attention and position prediction so that each slot aligns with a corresponding object region in the image. The policy and value networks are not trained during this phase. During subsequent Soft Actor–Critic (SAC) [23] training, the encoder receives RGB observations and the robot’s proprioceptive state. The encoder then computes slot attention over image tokens and applies Feature-wise Linear Modulation (FiLM) layers [24] to condition each slot on the robot’s proprioceptive state, producing modulated features. These modulated features serve as inputs to the policy and value networks. Gradients from the critic loss update the trainable encoder components, while the visual backbone remains frozen. This design integrates perception with control context and replaces flat pixel fusion with a slot-based representation. Slot attention and FiLM modulation maintain alignment between slots and object regions under egocentric viewpoint changes, improving robustness to viewpoint shifts and partial occlusions.
The proposed method was evaluated in the ManiSkill simulator [25] using an egocentric manipulation setup and compared against vanilla SAC (state-only), SAC with CNN, and SAC with DrQ-style image augmentation.
The main contributions are as follows:
  • A state-guided multi-slot attention encoder that produces slot-based visual features from egocentric RGB observations.
  • Proprioceptive conditioning of slot features via Feature-wise Linear Modulation (FiLM).
  • An SAC training formulation in which slot-based visual features are modulated by proprioception and updated through actor–critic losses while keeping the visual backbone frozen.
  • Empirical evaluation against SAC-based visual baselines in an egocentric ManiSkill manipulation setting.

2. Related Works

Recent visual RL research has focused on stabilizing training and improving sample efficiency in pixel-only domains. Data augmentation methods such as RAD and DrQ-v2 [12,26] improve stability by injecting invariances into pixel-level learning, while contrastive approaches such as CURL [13] shape visual encoders by aligning augmented views. Latent dynamics models, including SLAC [27] and the Dreamer family [15,16,17], learn compact latent spaces to model temporal dynamics, with Dreamer in particular using latent imagination rollouts for long-horizon control. These works established strong benchmarks for image-based reinforcement learning, but their encoders are typically trained from pixel observations alone and are not explicitly conditioned on the robot’s proprioceptive configuration. In many vision-based control frameworks, perception is processed through flattened convolutional features or globally pooled visual tokens, and any low-dimensional state is concatenated only in later MLP layers of the policy or value network, leaving the visual encoder itself blind to the robot’s configuration [2,4,13].
Before deep RL, alignment between vision and control was achieved through calibration-based pipelines. The hand–eye calibration method of [28] remains canonical, computing the rigid transformation between a robot’s end-effector and its camera. Visual servoing [29,30] extended this concept by iteratively adjusting actions to minimize image-space error. Although effective in structured settings, these methods rely on engineered geometric features and brittle calibration procedures, which limit scalability to dynamic, visually rich environments.
Learning-based visuomotor systems seek to overcome these constraints. QT-Opt [4] demonstrated large-scale off-policy learning for robotic grasping using multiple fixed third-person cameras. Transporter Networks [31] introduced spatial transport maps for pick-and-place actions but assumed static overhead views, avoiding viewpoint-shift challenges inherent to egocentric vision. More recent transformer-based architectures, including RT-1 [32], Gato [33], Decision Transformer [14], and PerAct [34], process visual observations as token sequences and condition policies on pooled embeddings. While these models scale effectively with data, they still rely on globally pooled visual representations or late-fused proprioception, rather than using it to structure the visual encoding around the robot’s embodiment. Related closed-loop manipulation frameworks further highlight the need for structured perception–control coupling in cluttered scenes, often introducing hierarchical controllers or auxiliary imitation signals to stabilize visuomotor learning [35].
To move beyond monolithic feature encoders, recent work in representation learning has explored structured visual representations, including object-centric approaches. Methods such as MONet [36], IODINE [37], and SCALOR [38] decompose images into latent slots that capture entities and spatial structure. However, these models are trained with reconstruction or prediction losses under primarily static or slowly varying camera assumptions. Relational approaches such as VIN [39], C-SWM [40], and Relational RL [18] model interactions between pre-segmented objects but do not address the problem of extracting structured representations directly from pixels under egocentric viewpoint shifts. Slot Attention [21] provides a flexible slot-based representation mechanism that can be integrated into downstream learning pipelines without requiring a dedicated reconstruction decoder. The proposed method adopts this mechanism and introduces state-guided slot attention that conditions visual feature organization on the proprioceptive state, enabling consistent perception under self-motion and dynamic views for object manipulation tasks.

3. Problem Formulation

We formulate vision-based robotic manipulation as a partially observable Markov decision process (POMDP), defined by the tuple $\mathcal{M} = (\mathcal{S}, \mathcal{O}, \mathcal{A}, T, r, \gamma)$, where $\mathcal{S}$ is the latent physical state space (joint configuration, object poses), $\mathcal{O}$ is the observation space, $\mathcal{A}$ is the continuous action space, $T(s_{t+1} \mid s_t, a_t)$ is the transition dynamics, $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function, and $\gamma \in [0, 1)$ is the discount factor.
At each timestep $t$, the agent receives a partial observation $o_t = (o_t^{\text{img}}, o_t^{\text{state}}) \sim \Omega(\cdot \mid s_t)$, generated from the latent state $s_t$ through an observation function $\Omega$, where $o_t^{\text{img}} \in \mathbb{R}^{H \times W \times 3}$ is an RGB image captured from an end-effector-mounted egocentric camera, and $o_t^{\text{state}} \in \mathbb{R}^d$ is a low-dimensional proprioceptive state including joint angles, velocities, and the end-effector pose. The agent samples an action $a_t \sim \pi(a \mid o_t)$, receives a scalar reward $r_t = r(s_t, a_t)$, and the process evolves according to the transition dynamics $s_{t+1} \sim T(\cdot \mid s_t, a_t)$. The learning objective is to find a policy that maximizes the expected discounted return:
$$\pi^* = \arg\max_{\pi}\; \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^t\, r(s_t, a_t) \right],$$
where $\pi^*$ is the optimal policy and the expectation is taken over trajectories induced by the policy $\pi$ under the environment dynamics. Since the egocentric camera provides only a partial and viewpoint-dependent observation of the workspace, the environment is inherently non-stationary from the robot’s perspective. The same motor command may produce different pixel changes depending on viewpoint, occlusion, or arm configuration. This coupling between motion and perception motivates a state-conditioned perception model, in which the encoder explicitly integrates proprioceptive state to interpret visual input relative to the robot’s configuration.
In the following sections, we introduce a multi-slot attention encoder modulated by proprioceptive state, enabling visual features to remain consistent under egocentric motion and occlusion. We then describe how this encoder is incorporated into Soft Actor–Critic to learn control-aligned representations for egocentric manipulation.

4. Proposed Method

The proposed method learns a control-oriented visual representation for egocentric robotic manipulation in which visual features extracted from the wrist-mounted camera are modulated by the robot’s proprioceptive state. The architecture consists of a visual encoder that extracts spatial features from the image observation, a slot attention module that produces structured visual tokens, and a FiLM modulation mechanism that adapts these features according to the robot’s current configuration. The resulting state-modulated representation is used as the observation input to the Soft Actor–Critic (SAC) policy. This design forms a bidirectional perception–action loop where the robot state modulates visual features and the policy acts on the resulting representation.
Formally, at each timestep $t$, the agent receives an egocentric RGB observation $o_t^{\text{img}} \in \mathbb{R}^{H \times W \times 3}$ and a proprioceptive state vector $o_t^{\text{state}} \in \mathbb{R}^d$. After standard preprocessing, we denote the image input as $I_t$. As shown in Figure 1, $I_t$ is encoded by a ResNet-18 backbone followed by a $1 \times 1$ projection into a fixed channel dimension, producing a spatial feature map that is flattened into a sequence of visual tokens with positional encoding. The proprioceptive state is projected to a feature vector through a fully connected layer and used later for FiLM conditioning.
A small set of learnable slot queries attends to the visual tokens through multihead cross-attention and produces K spatially localized slot embeddings. If an additional goal token is available during RL, it is mixed with the pooled slot features using a lightweight two-way softmax gate. FiLM conditioning is then applied to the slot features to incorporate state information, producing the final latent representation passed into SAC. During RL, the ResNet backbone and 1 × 1 conv remain frozen for stability, and gradients flow only through the trainable perception modules (beige background in Figure 1). This configuration allows the trainable fusion layers to adapt to the task reward, while the frozen backbone preserves low-level visual features learned during pre-training.

4.1. Learning Slot-Based Representations from Egocentric Vision

We begin by constructing a structured perceptual representation from egocentric images using a slot-based attention mechanism. Each slot specializes in a distinct scene region, providing the policy with separate visual features for each task-relevant object rather than a single global image embedding. The slot representation is trained in a supervised manner using 2D projections of task-relevant 3D keypoints, allowing the model to associate each slot with a consistent visual concept.
  • Visual geometry and observation model: The hand-mounted camera provides egocentric visual observations. Slot supervision labels are generated by projecting simulator object positions onto the image plane using a standard pinhole camera model:
    $$\left( u_t^{(k)},\ v_t^{(k)} \right) = \Pi\!\left( P_t\, X_t^{(k)} \right),$$
    where $P_t = K [R_t \mid t_t]$ combines the fixed camera intrinsics $K$ with the time-varying extrinsics $R_t, t_t$, and $\Pi$ denotes homogeneous division with visibility masking. The resulting 2D coordinates serve as annotation-free supervision targets for slot position regression, requiring no segmentation masks or manual labels.
  • Image encoder: The RGB image $I_t$ is first processed by a truncated ResNet-18 backbone, preserving the feature map up to the third convolutional block. This produces a spatial tensor $F_t = h_\psi(I_t) \in \mathbb{R}^{C \times H_f \times W_f}$, which is linearly projected to a feature space $\mathbb{R}^D$ through a $1 \times 1$ convolution: $\tilde{F}_t = W_{\text{proj}} F_t$. Flattening along the spatial dimensions yields a sequence of $S = H_f W_f$ image tokens $X_t^{\text{img}} = \mathrm{flatten}(\tilde{F}_t) \in \mathbb{R}^{S \times D}$. A learned 2D positional encoding $P_{2D} \in \mathbb{R}^{S \times D}$, constructed from row and column embeddings, is added to preserve spatial locality: $X_t^{\text{img}} \leftarrow X_t^{\text{img}} + \lambda P_{2D}$, where $\lambda = 0.05$ is empirically tuned to prevent the positional bias from dominating the learned features.
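As a concrete illustration, the tokenization step described above can be sketched in NumPy. The channel, grid, and embedding sizes below are illustrative assumptions rather than the paper’s exact configuration, and the $1 \times 1$ convolution is written as a per-location linear map:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed shapes (hypothetical): a ResNet block-3 style feature map.
C, Hf, Wf, D = 256, 16, 16, 128
lam = 0.05  # positional-encoding weight lambda from the text

F = rng.standard_normal((C, Hf, Wf))         # backbone feature map F_t
W_proj = rng.standard_normal((D, C)) * 0.02  # 1x1 conv == per-pixel linear map

# 1x1 projection: apply W_proj at every spatial location.
F_proj = np.einsum("dc,chw->dhw", W_proj, F)

# Flatten to S = Hf * Wf tokens of dimension D.
X = F_proj.reshape(D, Hf * Wf).T             # (S, D)

# Learned 2D positional encoding built from row and column embeddings.
row_emb = rng.standard_normal((Hf, D))
col_emb = rng.standard_normal((Wf, D))
P2d = (row_emb[:, None, :] + col_emb[None, :, :]).reshape(Hf * Wf, D)

X = X + lam * P2d                            # scaled positional bias
print(X.shape)  # (256, 128)
```

The small $\lambda$ keeps the positional term a gentle bias on top of the content features rather than the dominant signal.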

4.1.1. Learnable Slot Queries

To extract structured visual features, the model uses a set of learnable slot queries. The token sequence $X_t^{\text{img}} \in \mathbb{R}^{S \times D}$ is obtained by flattening the spatial feature map produced by the visual encoder, where $S = H_f W_f$ is the number of spatial locations. Each token corresponds to a local region in the image. Instead of collapsing these tokens into a single global embedding, the model maintains a set of $N_s$ learnable slot vectors:
$$Q_{\text{slot}} = \left[\, q^{(1)},\ q^{(2)},\ \ldots,\ q^{(N_s)} \,\right] \in \mathbb{R}^{N_s \times D},$$
which serve as queries in a multihead cross-attention module. Each slot query competes to attend to different regions of the image token grid, and auxiliary supervision during pre-training encourages each slot to specialize in a distinct task-relevant object. During RL, when a goal pixel $(u_t^{\text{goal}}, v_t^{\text{goal}})$ is available, an additional goal query is constructed by sampling its corresponding feature vector from the spatial feature map $\tilde{F}_t$. This reuses the frozen encoder’s spatial features to represent the goal location without introducing additional learned parameters.

4.1.2. Cross-Attention

The slot queries use cross-attention to attend to the visual tokens. The query set is defined as:
$$Q_t = \begin{cases} Q_{\text{slot}}, & \text{pre-training}, \\ \left[\, Q_{\text{slot}};\ Q_t^{\text{goal}} \,\right], & \text{RL}. \end{cases}$$
Given the sequence of image tokens X t img , cross-attention is computed using multihead attention
$$Y_t = \mathrm{MHA}\!\left( Q_t,\ K = X_t^{\text{img}},\ V = X_t^{\text{img}} \right).$$
Each query attends to the image tokens by computing similarity scores in the feature space and retrieving a weighted combination of spatial features. The resulting attention matrix $A_t \in [0, 1]^{|Q_t| \times S}$ satisfies $\sum_j A_t[i, j] = 1$ for each query $i$, representing a soft spatial assignment over the image tokens. Each output $y_t^{(i)} \in \mathbb{R}^D$ represents the visual features aggregated for one object location.
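The paper uses multihead cross-attention; the single-head NumPy sketch below (with illustrative sizes) makes the soft-assignment property explicit: each query’s attention row over the $S$ tokens sums to one.

```python
import numpy as np

rng = np.random.default_rng(1)
S, D, Ns = 256, 128, 2    # token count, feature dim, slot count (illustrative)

X = rng.standard_normal((S, D))        # image tokens X_t
Q_slot = rng.standard_normal((Ns, D))  # learnable slot queries

def cross_attention(Q, X):
    """Single-head cross-attention: queries attend over image tokens.

    Returns slot outputs Y (|Q| x D) and the attention matrix A whose
    rows are soft spatial assignments over the S tokens.
    """
    scores = Q @ X.T / np.sqrt(Q.shape[1])       # (|Q|, S) scaled similarity
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=1, keepdims=True)            # each row sums to 1
    Y = A @ X                                    # weighted feature retrieval
    return Y, A

Y, A = cross_attention(Q_slot, X)
print(Y.shape, A.shape)  # (2, 128) (2, 256)
```

Each row of `A` is the soft spatial assignment for one slot, and `Y` collects the features that each slot has aggregated.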

4.1.3. Permutation-Invariant Aggregation and Gating

The perception module is trained in two stages: pre-training on geometric supervision followed by RL fine-tuning through critic gradients. During pre-training, supervision uses a best-of-N matching scheme that selects the minimum-cost assignment between slots and objects at each step, allowing slots to reassign based on currently visible objects rather than maintaining a fixed slot-to-object mapping. This design improves pre-training stability under occlusion, where a fixed assignment would leave a slot without a valid supervision signal when its assigned object is out of view.
During RL fine-tuning, slot attention weights are updated by critic gradients to emphasize task-relevant features, causing attention patterns to drift from pre-training assignments and making slot swaps inevitable. Mean pooling is used as the slot aggregation function, producing a permutation-invariant representation that remains stable regardless of slot ordering:
$$z_t^{\text{slot}} = \frac{1}{N_s} \sum_{i=1}^{N_s} y_t^{(i)}.$$
When a goal query is available, the aggregated slot feature $z_t^{\text{slot}}$ and the goal feature $y_t^{\text{goal}}$ are combined through a learned soft gating mechanism:
$$z_t = \alpha_t^{(s)}\, z_t^{\text{slot}} + \alpha_t^{(g)}\, y_t^{\text{goal}}, \qquad \left[ \alpha_t^{(s)},\ \alpha_t^{(g)} \right] = \mathrm{softmax}\!\left( \left[\, s,\ g + b(\nu_t^{\text{goal}}) \,\right] \right),$$
where $s, g$ are learnable logits, and the bias term $b(\nu_t^{\text{goal}})$ assigns a large negative value when the goal lies outside the camera’s field of view. Softmax gating keeps the contributions normalized and suppresses the goal feature when it is not visible.
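A minimal NumPy sketch of this two-way gate, assuming for illustration that both logits are zero and that $b(\nu_t^{\text{goal}})$ is a large negative constant whenever the goal is off-screen:

```python
import numpy as np

rng = np.random.default_rng(2)
D = 8
z_slot = rng.standard_normal(D)   # pooled slot feature z_t^slot
y_goal = rng.standard_normal(D)   # goal feature y_t^goal

def gate(z_slot, y_goal, goal_visible, s=0.0, g=0.0, neg_bias=-1e4):
    """Two-way softmax gate; suppresses the goal branch when off-screen.

    s, g stand in for the learnable logits; neg_bias plays the role of
    b(nu_goal) when the goal is outside the field of view.
    """
    logits = np.array([s, g + (0.0 if goal_visible else neg_bias)])
    a = np.exp(logits - logits.max())
    a /= a.sum()                                  # normalized weights
    return a[0] * z_slot + a[1] * y_goal, a

z_vis, a_vis = gate(z_slot, y_goal, goal_visible=True)
z_occ, a_occ = gate(z_slot, y_goal, goal_visible=False)
print(a_vis, a_occ)  # visible: [0.5 0.5]; occluded: [1. 0.]
```

With the goal occluded, the gate routes the entire weight to the slot feature, so the fused output degrades gracefully to $z_t^{\text{slot}}$.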

4.2. State-Conditioned Modulation

In egocentric manipulation, slot features encode spatially structured visual information in image space, but the relevance of those locations for control depends on the robot’s current configuration. The same visual feature can therefore have different control significance depending on the relative pose between the arm and the object. The slot representation z t must therefore be interpreted in the context of the robot’s current state before being passed to the policy. Rather than concatenating o t state with z t , which treats state as an additional policy input rather than a signal that shapes visual interpretation, Feature-wise Linear Modulation (FiLM) conditions the visual features directly on the proprioceptive state. This allows the robot state to gate the visual features before policy inference so that control decisions operate on state-aware perceptual representations rather than raw visual embeddings.
Given the visual feature $z_t \in \mathbb{R}^D$ and the proprioceptive observation $o_t^{\text{state}} \in \mathbb{R}^{d_s}$, FiLM computes channel-wise scaling and shift parameters from the proprioceptive state and applies them to the slot features:
$$\hat{o}_t = W_s\, o_t^{\text{state}} + b_s, \qquad \gamma_t = \tanh\!\left( W_\gamma \hat{o}_t + b_\gamma \right), \qquad \beta_t = W_\beta \hat{o}_t + b_\beta,$$
$$\tilde{z}_t = (1 + \gamma_t) \odot z_t + \beta_t,$$
where ⊙ denotes element-wise multiplication.
The tanh activation bounds $\gamma_t \in [-1, 1]$, so the effective scaling factor satisfies $1 + \gamma_t \in [0, 2]$. This bounded modulation allows robot states to amplify or attenuate individual visual channels while preserving the sign and relative structure of the pre-trained slot representation. As a result, proprioceptive conditioning can adjust the relative importance of visual features according to the current arm–object configuration without distorting the learned slot representation.
The multiplicative term γ t performs channel-wise scaling, while β t applies a channel-wise shift, resulting in a feature-wise affine modulation of the slot representation. An ablation comparing FiLM variants is provided in Section 7.
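The FiLM computation can be sketched in a few lines of NumPy; the weight shapes and magnitudes below are illustrative assumptions, and the assertion checks the bounded-scaling property discussed above:

```python
import numpy as np

rng = np.random.default_rng(3)
D, ds = 8, 4                        # feature and state dims (illustrative)
z = rng.standard_normal(D)          # slot feature z_t
state = rng.standard_normal(ds)     # proprioceptive state o_t^state

# Linear maps standing in for the learned FiLM parameters.
W_s, b_s = rng.standard_normal((ds, ds)) * 0.1, np.zeros(ds)
W_g, b_g = rng.standard_normal((D, ds)) * 0.1, np.zeros(D)
W_b, b_b = rng.standard_normal((D, ds)) * 0.1, np.zeros(D)

o_hat = W_s @ state + b_s
gamma = np.tanh(W_g @ o_hat + b_g)  # bounded in [-1, 1]
beta = W_b @ o_hat + b_b
z_mod = (1.0 + gamma) * z + beta    # effective channel scale in [0, 2]

assert np.all(1.0 + gamma >= 0.0) and np.all(1.0 + gamma <= 2.0)
print(z_mod.shape)  # (8,)
```

Because the scale stays in $[0, 2]$, a channel can be muted or doubled by the state but never sign-flipped, which is the stability property the text argues for.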

4.3. Auxiliary Supervision and Pre-Training

Before policy learning, the perception module is pre-trained to establish spatial correspondence between egocentric image observations and physical object locations. This stage provides dense pixel-level supervision that would be difficult to derive from sparse task reward alone. During pre-training, the agent observes simulated rollouts generated by a random policy. For each timestep t, we collect the following:
$$\left\{\, I_t,\ o_t^{\text{state}},\ I_{t+1},\ o_{t+1}^{\text{state}},\ u_t,\ v_t,\ u_{t+1},\ v_{t+1} \,\right\}.$$
Here $I_t$ and $I_{t+1}$ are egocentric RGB images, and $o_t^{\text{state}}$ and $o_{t+1}^{\text{state}}$ are the corresponding proprioceptive observations. The vectors $u_t, v_t \in \mathbb{R}^{N_o}$ contain the projected pixel coordinates of each object at time $t$, obtained from the calibrated camera model $P_t = K [R_t \mid t_t]$.
All supervision labels are derived from the simulator state via camera projection. The encoder is optimized using three auxiliary losses targeting slot position, object visibility, and attention alignment.
  • Position regression: Each attention slot output $y_t^{(i)}$ predicts a pixel coordinate $(\hat{u}_t^{(i)}, \hat{v}_t^{(i)})$ through a small regression head. Because slots are exchangeable, we adopt a best-of-N matching scheme that aligns predicted coordinates with object labels via the minimal assignment:
    $$\mathcal{L}_{\text{pos}} = \min_{\sigma \in S_{N_s, N_o}} \frac{1}{N_o} \sum_{k=1}^{N_o} m_t^{(k)} \left\| \left( \hat{u}_t^{(\sigma(k))}, \hat{v}_t^{(\sigma(k))} \right) - \left( u_t^{(k)}, v_t^{(k)} \right) \right\|_2^2.$$
    Here $\sigma$ denotes a permutation (one-to-one assignment) from objects to slots, and $\sigma(k)$ selects the slot matched to object $k$ under that assignment. $N_s$ is the number of slots, and $N_o$ is the number of tracked objects (or keypoints) for which pixel labels are available. The set $S_{N_s, N_o}$ contains all assignments $\sigma: \{1, \ldots, N_o\} \to \{1, \ldots, N_s\}$. In our implementation, we set $N_s = N_o = 2$, so the minimization reduces to evaluating the two possible matchings (identity vs. swapped) and taking the lower error. The mask $m_t^{(k)} \in \{0, 1\}$ indicates whether object $k$ is visible in frame $t$; it excludes objects that are not visible, and the loss is normalized by the number of valid objects in the batch. This encourages each slot to specialize consistently toward one object without enforcing a fixed ordering.
  • Visibility prediction: A small visibility head takes the attention-derived features and outputs a probability $\hat{m}_t^{(k)} \in (0, 1)$ for every object $k$, indicating whether its projected pixel location lies on-screen. The corresponding supervision signal is the geometric validity mask $m_t^{(k)} \in \{0, 1\}$, derived directly from camera projection. We optimize a per-object binary cross-entropy (BCE) loss:
    $$\mathcal{L}_{\text{vis}} = -\frac{1}{N_o} \sum_{k=1}^{N_o} \left[ m_t^{(k)} \log \hat{m}_t^{(k)} + \left( 1 - m_t^{(k)} \right) \log\!\left( 1 - \hat{m}_t^{(k)} \right) \right].$$
This loss encourages the encoder to recognize when an object’s projected location is valid and to avoid allocating attention to regions that fall outside the field of view.
  • Attention alignment: Attention alignment supervises the attention mechanism to place probability mass at each object’s projected token location. Given the projected pixel location of object $k$, $\mathrm{idx}(u_t^{(k)}, v_t^{(k)})$ gives the corresponding token index on the $H_f \times W_f$ feature grid. For each object, the attention probability summed across all slots at that token index is:
    $$p_t^{(k)} = \sum_{i=1}^{N_s} A_t\!\left[ i,\ \mathrm{idx}(u_t^{(k)}, v_t^{(k)}) \right],$$
    and the alignment loss is defined as:
    $$\mathcal{L}_{\text{attn}} = -\frac{1}{N_o} \sum_{k=1}^{N_o} m_t^{(k)} \log p_t^{(k)}.$$
    In this formulation, object-level supervision is applied to the sum of slot-wise attention weights rather than a specific slot.
  • The total pre-training loss is:
    $$\mathcal{L}_{\text{pre}} = \lambda_{\text{pos}} \mathcal{L}_{\text{pos}} + \lambda_{\text{vis}} \mathcal{L}_{\text{vis}} + \lambda_{\text{attn}} \mathcal{L}_{\text{attn}}.$$
    During this stage, all components of the perception module, including the ResNet-18 backbone, slot attention, FiLM layers, and auxiliary prediction heads, are trained jointly. After pre-training, gradients to the backbone are disabled, and the encoder produces state-conditioned embeddings $\tilde{z}_t$ that are used as input to SAC.
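To make the best-of-N matching concrete, the following NumPy sketch evaluates the position loss for the paper’s $N_s = N_o = 2$ case. The coordinates are made-up values chosen so that the swapped matching wins:

```python
import numpy as np

# Best-of-N matching for Ns = No = 2: evaluate the identity and swapped
# slot-to-object assignments and keep the lower visibility-masked error.
pred = np.array([[40.0, 60.0],     # slot 0 predicted (u, v)
                 [10.0, 20.0]])    # slot 1 predicted (u, v)
target = np.array([[12.0, 18.0],   # object 0 projected (u, v)
                   [41.0, 59.0]])  # object 1 projected (u, v)
visible = np.array([1.0, 1.0])     # mask m_t^(k): both objects on-screen

def pos_loss(pred, target, visible):
    costs = []
    for sigma in ([0, 1], [1, 0]):                       # the two matchings
        err = ((pred[sigma] - target) ** 2).sum(axis=1)  # squared pixel error
        costs.append((visible * err).sum() / max(visible.sum(), 1.0))
    return min(costs)

print(pos_loss(pred, target, visible))  # 5.0 (swapped matching wins)
```

Here slot 1 sits near object 0 and slot 0 near object 1, so the swapped assignment yields error $5.0$ while the identity assignment yields over $2500$; the `min` selects the former, exactly as the loss prescribes.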

4.4. Integration with Reinforcement Learning

After pre-training, the fused representation z ˜ t is used as the input to the SAC actor and critic networks. During RL optimization, gradients from the critic losses propagate through the trainable encoder layers, allowing the visual representation to adapt to the control objective. At each timestep t, the agent receives a partial observation
$$o_t = \left( I_t,\ o_t^{\text{state}} \right),$$
where $I_t$ is the egocentric RGB image, and $o_t^{\text{state}}$ is the proprioceptive vector containing joint angles, velocities, and gripper pose. These are mapped by the pre-trained fusion encoder $f_\psi$ into a structured latent representation
$$\tilde{z}_t = f_\psi\!\left( I_t,\ o_t^{\text{state}},\ g_t \right),$$
where g t denotes a goal input when provided by the environment and is omitted otherwise. The encoder applies slot cross-attention to the egocentric image and modulates the resulting features through FiLM using the proprioceptive state.
This latent representation serves as the sole input to the SAC components. The policy is a stochastic actor $\pi_\theta(a_t \mid \tilde{z}_t)$ that outputs a Gaussian distribution over continuous actions $a_t \in \mathbb{R}^n$. Two critic networks $Q_{\phi_1}(\tilde{z}_t, a_t)$ and $Q_{\phi_2}(\tilde{z}_t, a_t)$ estimate soft Q-values, and a separate value network $V_\eta(\tilde{z}_t)$ provides the bootstrap target used for temporal-difference updates through a slowly updated target network $V_{\bar{\eta}}(\tilde{z}_t)$. Transitions $\{o_t, a_t, r_t, o_{t+1}, \text{done}_t\}$ sampled from the replay buffer $\mathcal{D}$ are used for the SAC updates. The reward function is defined in Section 4.5.
The critic target is computed as the soft Bellman backup:
$$y_t = r_t + \gamma\, (1 - \text{done}_t)\, V_{\bar{\eta}}(\tilde{z}_{t+1}),$$
where $\gamma \in [0, 1)$ is the discount factor, and $V_{\bar{\eta}}$ denotes a slowly updated target network. The critics minimize the temporal-difference loss
$$\mathcal{L}_Q(\phi_i) = \tfrac{1}{2}\, \mathbb{E}_{\mathcal{D}} \left[ \left( Q_{\phi_i}(\tilde{z}_t, a_t) - y_t \right)^2 \right], \qquad i \in \{1, 2\}.$$
For the value and policy objectives, actions are re-sampled from the current actor, $a_t \sim \pi_\theta(\cdot \mid \tilde{z}_t)$. The value network is then trained to regress toward the expected critic output under the current policy:
$$\mathcal{L}_V(\eta) = \tfrac{1}{2}\, \mathbb{E}_{\mathcal{D}} \left[ \left( V_\eta(\tilde{z}_t) - \left( \min_i Q_{\phi_i}(\tilde{z}_t, a_t) - \alpha \log \pi_\theta(a_t \mid \tilde{z}_t) \right) \right)^2 \right],$$
where α > 0 is the entropy temperature that controls the trade-off between reward maximization and policy stochasticity. The policy parameters are optimized by minimizing the entropy-regularized objective
$$\mathcal{L}_\pi(\theta) = \mathbb{E}_{\mathcal{D}} \left[ \alpha \log \pi_\theta(a_t \mid \tilde{z}_t) - \min_i Q_{\phi_i}(\tilde{z}_t, a_t) \right],$$
which encourages the actor to select actions with both high expected value and high entropy. Finally, the target parameters are updated via Polyak averaging:
$$\bar{\eta} \leftarrow \tau \eta + (1 - \tau)\, \bar{\eta},$$
with $\tau \in (0, 1)$ typically set to $0.005$. All components are optimized with the Adam optimizer using the learning rates shown in Table A1.
Algorithm 1 summarizes the overall training pipeline. At each iteration, a minibatch of transitions is sampled from the replay buffer and encoded into $\tilde{z}_t$ and $\tilde{z}_{t+1}$ by the fusion encoder. The soft Bellman backup (5) is computed using the target value network, and the twin critics are updated by minimizing the TD loss (6). The reparameterized actions $a_t \sim \pi_\theta(\cdot \mid \tilde{z}_t)$ are then used to update the value network (7) and the actor (8). Finally, the target value parameters are updated by Polyak averaging.
Algorithm 1 State-guided slot attention with SAC.
Require: Replay buffer $\mathcal{D}$, encoder $f_\psi$, actor $\pi_\theta$, critics $Q_{\phi_1}, Q_{\phi_2}$, value $V_\eta$, target value $V_{\bar{\eta}}$, batch size $B$, discount $\gamma$
  1: Initialize network parameters $\psi, \theta, \phi_1, \phi_2, \eta, \bar{\eta}$
  2: for each training iteration do
  3:     Sample batch $\{(o_t, a_t, r_t, o_{t+1}, \text{done}_t)\}_{i=1}^{B} \sim \mathcal{D}$
  4:     Encode $\tilde{z}_t = f_\psi(o_t)$ and $\tilde{z}_{t+1} = f_\psi(o_{t+1})$
  5:     Compute the critic target, Equation (5)
  6:     Update critics $Q_{\phi_1}, Q_{\phi_2}$ by minimizing Equation (6)
  7:     Backpropagate critic gradients into the trainable encoder parameters $\psi$
  8:     Sample reparameterized action $a_t \sim \pi_\theta(\cdot \mid \tilde{z}_t)$
  9:     Update value network $V_\eta$ by minimizing Equation (7) using $a_t$
10:     Update actor $\pi_\theta$ by minimizing Equation (8) using $a_t$
11:     Update target value network: Equation (9)
12: end for
The ResNet-18 backbone and 1 × 1 projection remain frozen throughout RL, while the slot attention, FiLM, and goal gate are updated only through critic gradients. Actor and value updates treat $\tilde{z}_t$ as fixed, so the visual representation is refined toward task-relevant features without destabilizing the policy.
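This gradient routing is typically implemented with a stop-gradient on the latent. A minimal PyTorch sketch under assumed stand-in modules: the linear layers below are placeholders for the actual fusion encoder, critic, and actor, not the paper's architecture.

```python
import torch

torch.manual_seed(0)
encoder = torch.nn.Linear(8, 4)   # stand-in for the slot/FiLM fusion encoder
critic = torch.nn.Linear(4, 1)    # stand-in for one critic head
actor = torch.nn.Linear(4, 2)     # stand-in for the actor

obs = torch.randn(16, 8)
z = encoder(obs)

# Critic loss backpropagates into the encoder parameters.
critic_loss = critic(z).pow(2).mean()
critic_loss.backward()
grad_after_critic = encoder.weight.grad.clone()

# Actor loss sees a detached latent: no gradient reaches the encoder.
actor_loss = actor(z.detach()).pow(2).mean()
actor_loss.backward()
# encoder.weight.grad is unchanged by the actor backward pass
```

The same `.detach()` pattern applies to the value-network update, so only the critic loss shapes the representation.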

4.5. Reward Function

In egocentric manipulation, the wrist camera moves with the arm, and the distance between the end-effector and the object changes continuously during interaction. Effective control therefore requires reward signals that guide large movements across the workspace while also supporting precise alignment near the target. To capture these different interaction phases, the reward uses a multiscale shaping design that combines exponential and inverse-distance terms. These complementary components produce reward sensitivity across both large movements in the workspace and small positional adjustments near the target. The formulation is implemented through a piecewise design for reaching and placement with gated components.
  • Reaching:
    $$ r_{\mathrm{reach}} = \begin{cases} \exp\left( -\alpha \, \lVert x_{\mathrm{obj}} - x_{\mathrm{TCP}} \rVert \right), & d > \delta_1 \\ \dfrac{1}{\lVert x_{\mathrm{obj}} - x_{\mathrm{TCP}} \rVert + \epsilon}, & \text{otherwise} \end{cases} $$
    where $d = \lVert x_{\mathrm{obj}} - x_{\mathrm{TCP}} \rVert$.
    The exponential term produces a smooth distance-based signal for global motion toward the object, while the inverse-distance term increases sensitivity when the end-effector is close to the object to support precise alignment.
  • Grasping: A binary reward marks the transition from reaching to transport upon successful grasp.
  • Placing:
    $$ r_{\mathrm{place}} = \begin{cases} \exp\left( -\alpha \, \lVert x_{\mathrm{obj}} - x_{\mathrm{goal}} \rVert \right), & d > \delta_2 \\ \dfrac{1}{\lVert x_{\mathrm{obj}} - x_{\mathrm{goal}} \rVert + \epsilon}, & \text{otherwise} \end{cases} $$
    where $d = \lVert x_{\mathrm{obj}} - x_{\mathrm{goal}} \rVert$.
  • Post-grasp guidance: Once the object is grasped, an additional shaping term encourages movement toward the placement objective during transport. The term is defined as
    $$ r_{\mathrm{pg}} = \exp(-d_{\mathrm{goal}}) \cdot \mathbb{1}[\mathrm{grasped}], $$
    where $d_{\mathrm{goal}} = \lVert x_{\mathrm{obj}} - x_{\mathrm{goal}} \rVert$ and the indicator $\mathbb{1}[\mathrm{grasped}]$ makes the term active only when the object is grasped. The exponential factor reduces the influence of this term as the object approaches the placement region so that the reward transitions smoothly to the placement objective.
  • Stability: Once placement has been achieved, this stability term discourages residual joint motion so that the robot settles into a stable configuration before termination. It is defined as
    $$ r_{\mathrm{static}} = 1 - \tanh\left( \varpi \cdot \lVert \dot{q}_{\mathrm{robot}} \rVert \right), $$
    where $\varpi$ controls the sensitivity of the stabilization term.
The switching thresholds $\delta_1 = \delta_2 = 0.025$ correspond to the environment-defined proximity threshold for the object, and $\epsilon = 10^{-5}$ is a small constant that avoids division by zero. The placement and stability components activate only when the object is grasped and placed, respectively, enforcing sequential phase ordering. The contribution of each component across training is analyzed in Section 6.3, with further ablations provided in Section 7.2.
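A minimal NumPy sketch of these shaping terms, assuming the parameter values from Table A1 (α = 10, ϖ = 5, δ₁ = δ₂ = 0.025, ε = 10⁻⁵); the grasp indicator is passed in as a boolean, and the binary grasp bonus and phase gating of the full environment reward are omitted here.

```python
import numpy as np

def reach_reward(x_obj, x_tcp, alpha=10.0, delta=0.025, eps=1e-5):
    """Piecewise reaching term: exponential shaping far from the object,
    inverse-distance shaping inside the proximity threshold."""
    d = np.linalg.norm(x_obj - x_tcp)
    return np.exp(-alpha * d) if d > delta else 1.0 / (d + eps)

def place_reward(x_obj, x_goal, alpha=10.0, delta=0.025, eps=1e-5):
    """Same multiscale shaping applied to the object-goal distance."""
    d = np.linalg.norm(x_obj - x_goal)
    return np.exp(-alpha * d) if d > delta else 1.0 / (d + eps)

def post_grasp_reward(x_obj, x_goal, grasped):
    """Transport shaping, active only while the object is grasped."""
    return np.exp(-np.linalg.norm(x_obj - x_goal)) * float(grasped)

def static_reward(q_dot, varpi=5.0):
    """Stability bonus discouraging residual joint motion after placement."""
    return 1.0 - np.tanh(varpi * np.linalg.norm(q_dot))
```

The piecewise switch is what produces the multiscale behavior: a bounded exponential signal across the workspace, and a sharply increasing inverse-distance signal once the end-effector enters the δ-neighborhood of the target.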

5. Experimental Setup

We evaluate the proposed method in a simulated robotic manipulation environment using a PickBanana task in ManiSkill [25]. The scene contains a seven-DoF robotic arm, a banana object, and a bowl positioned near a marked target region shown as a green point in Figure 2. In each episode, the agent must reach, grasp, and place the banana at the target location. The bowl serves as a movable object, introducing contact-rich interaction into the task. Table 1 summarizes the main environment specifications. The wrist-mounted camera follows the default pinhole perspective model in ManiSkill. Images were rendered at 128 × 128 resolution synchronously at a control frequency of 20 Hz. No sensor noise or lens distortion was modeled during training. At evaluation, the robustness of the learned representations was assessed under three types of visual perturbation, applied independently to the input images: random occlusion (15% of the image), additive Gaussian noise (σ = 0.05), and camera-like affine perturbation (3° yaw rotation, 3% translation, 1.05× scale).
Camera intrinsics and extrinsics were obtained directly from the ManiSkill simulator in each episode. The intrinsic matrix K defines the focal lengths (f_x, f_y) and principal point (c_x, c_y) at 128 × 128 resolution. The extrinsic matrix [R | t] (3 × 4) encodes the rigid transformation from world coordinates to the wrist camera frame, updated at every environment step to reflect the camera's pose as the arm moves. These parameters are used to project 3D object positions into pixel space for UV supervision only during perception pre-training. Complete camera specifications are provided in Table A4.
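The UV supervision targets follow the standard pinhole projection. A minimal sketch, assuming the convention in which [R | t] maps world points into a camera frame whose optical axis is +z; the intrinsics used in the usage note are illustrative, not the simulator's actual values.

```python
import numpy as np

def project_to_uv(x_world, K, Rt):
    """Project a 3D world point to pixel (u, v) via intrinsics K (3x3)
    and extrinsics [R|t] (3x4): x_cam = [R|t] x_h, (u, v) = (K x_cam)[:2] / z."""
    x_h = np.append(x_world, 1.0)   # homogeneous world coordinates
    x_cam = Rt @ x_h                # world -> camera frame
    uvw = K @ x_cam                 # camera frame -> image plane
    return uvw[:2] / uvw[2]         # perspective divide
```

With identity extrinsics and principal point (64, 64), a point on the optical axis projects exactly to the image center, which also gives a quick sanity check for the per-step extrinsics returned by the simulator.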

5.1. Implementation Details

All neural network components were implemented in PyTorch (v2.3.0, Python 3.9). The SAC agent was implemented following the original formulation by [23], which includes a stochastic actor network, two critic networks, and a value network. All SAC-based methods (state-only, CNN pixel, DrQ-v1 [41], and the proposed variants) share identical SAC hyperparameters, actor and critic architecture, and reward function. Only the visual encoder and input representation differ across methods. Three pixel-based baselines are included: a four-layer CNN with ReLU activations and linear projection; DrQ-v1 [41] following the original architecture with minor adaptations for the ManiSkill3 observation space and image resolution; and DrQ-v2 [12], following its original DDPG-based formulation and included as an additional reference baseline. Details are provided in Appendix A.5. The state-only baseline uses the proprioceptive state as input to the actor and critics without visual input. Hyperparameters are summarized in Table A1, and hardware specifications are provided in Table A2.
Visual observations are processed by a ResNet-18 backbone initialized with ImageNet-pre-trained weights. The backbone serves as a spatial feature extractor, providing low-level visual features over an 8 × 8 spatial grid for 128 × 128 inputs, and is therefore truncated after the third residual stage. The slot attention module uses N s = 2 slots, matching the number of task-relevant objects in the scene. FiLM scale and shift networks are zero-initialized, ensuring exact identity modulation at the start of RL training. The backbone is frozen during SAC optimization, while slot attention and FiLM layers remain trainable. Training is executed as a single run of 3000 epochs with 400 environment steps per epoch. During the initial data collection phase, the agent interacts with the environment for 100,000 steps to populate the replay buffer without policy optimization. The perception encoder is then pre-trained for 150,000 gradient steps using samples from this replay buffer. SAC optimization begins after pre-training and continues for the remaining epochs, with one gradient update per environment step. The encoder pre-training required approximately 100 min, and the full run required approximately 32.1 h of wall-clock time.

5.2. Evaluation Objectives and Metrics

Evaluation targets two aspects of the method. First, the fusion encoder should improve reinforcement learning performance from egocentric RGB compared to pixel-only and state-only baselines. Second, the slot-based representation should remain stable over time and specialize in task-relevant objects. Both are evaluated through policy-level and perception-level metrics.
  • Perception-level evaluation: Encoder quality is assessed during pre-training by reporting pixel-space position regression error between predicted UV coordinates and simulator-projected object locations, with visibility masking and swap-based matching, together with attention alignment NLL and visibility BCE loss. Furthermore, slot specialization is assessed qualitatively by visualizing attention maps across viewpoint changes and occlusion events.
  • Policy-level evaluation: Policy performance is measured by final return and success rate, defined as the fraction of episodes in which the banana is placed within the target region. With two slots matching two foreground objects, position and visibility losses directly reflect whether each slot consistently tracks its assigned object.

6. Results and Discussions

6.1. Perception-Level Results and Discussion

Figure 3 shows the pre-training curves for all three losses. The attention alignment loss decreases sharply early in training and approaches near-zero values, indicating that slot queries learn to consistently select object-relevant regions rather than diffuse background features.
The position regression loss drops rapidly within the first portion of training and remains low, confirming that the UV heads recover accurate pixel-space object locations. The visibility loss follows a similar downward trend, stabilizing at a low value and indicating that the encoder reliably distinguishes whether each object lies within the egocentric field of view.
Furthermore, Figure 4 shows that one slot persistently focuses on the banana while the other tracks the bowl across camera motion and contact-driven object displacement. When neither object is visible, attention concentrates on the end-effector, which remains within the field of view throughout the episode, a consistent agent-centric fallback in egocentric perception. The auxiliary heads remain consistent with these attention patterns. As shown in Figure 5a–d, when both objects are visible, the predicted UV markers lie close to the corresponding ground-truth projections, demonstrating accurate localization across varied spatial configurations and scales. Figure 5e shows a harder boundary case where objects lie near or outside the camera view, with ground-truth projections falling outside the valid region; the visibility mask appropriately down-weights their contribution.
To examine whether slot attention generalizes beyond simulation rendering, printed images of a banana and a bowl were held in front of a Samsung Galaxy Note 20 smartphone and live-streamed through the encoder in real time without fine-tuning. The smartphone camera differs substantially from the simulation camera in resolution, FOV, and lens distortion (Table A4), with additional MJPEG compression artifacts absent during training.
As shown in Figure 6, slot attention maps remained localized on the printed targets, and position and visibility predictions were consistent with the observed objects across consecutive frames. Furthermore, when the camera approached one object closely, attention shifted toward it proportionally, redistributing as the viewpoint re-centered. This observation does not involve a robotic platform or closed-loop control and is not intended as a quantitative sim-to-real benchmark. However, it confirms that the learned representation remains coherent under real visual noise and viewpoint variation beyond simulation rendering.

6.2. Policy-Level Results and Discussion

Figure 7 shows the return curves for all evaluated methods. Among the baselines, the state-only SAC exhibits a delayed but steady rise after the mid-training phase, stabilizing at moderate returns, indicating that proprioception alone supports partial task completion but saturates below methods that incorporate visual features. The DrQ-v1 baseline achieves modest returns, typically stabilizing in the 300 range, while the CNN pixel baseline fails to learn. Both results indicate that raw pixel encoders struggle to form stable representations under wrist-mounted camera motion and occlusion.
The proposed method without goal conditioning achieves the highest return, stabilizing near 3.7–3.9 k, and converges fastest, at around 1.1–1.2 k episodes. The proposed method with goal conditioning reaches the second-highest return, plateauing around 3.1–3.2 k. The lower return of the goal-conditioned variant reflects the added input complexity of goal conditioning, which slows early optimization while still producing a strong final policy.
Replacing FiLM with simple concatenation results in slower convergence and a lower final return, suggesting that channel-wise state modulation contributes beyond simple feature fusion. The variant without state (attention fusion output used alone) rises much later, peaks around 2.6–2.7 k, and then declines toward the end of training. This late collapse indicates that visual slots alone are insufficient to resolve the partial observability of egocentric views and that proprioceptive context is needed for stable control.
Figure 7 indicates that pixel-only baselines saturate at low returns, while state-guided slot fusion with FiLM conditioning achieves faster convergence and higher final performance. Figure 8 shows the full manipulation sequence: approach, grasp, transport, and placement, executed successfully. Furthermore, evaluating the final checkpoint over 50 deterministic rollouts results in a 92 % success rate, confirming that the return gains correspond to reliable task completion. Beyond task success, these results highlight practical aspects for robotic manipulation. Freezing the backbone after pre-training reduces computation during policy inference and makes the method easier to deploy on resource-limited platforms. The wrist-mounted camera provides an egocentric view that moves with the arm, so the policy observes objects from a consistent reference frame. Conditioning the representation on the robot’s proprioceptive state further aligns the policy input with the current arm configuration and supports stable control throughout the manipulation sequence.

6.3. Reward Component Analysis

The reward function defines three primary components: reach, post-grasp guidance, and place, each corresponding to a distinct stage of the pick-and-place task. To analyze how reward contributions evolved over training, the logged environment steps were divided into three equal windows: early (0–0.4 × 10⁶ steps), mid (0.4–0.8 × 10⁶ steps), and late (0.8–1.2 × 10⁶ steps). For each phase, the mean and standard deviation of each component's fractional contribution to the total reward were computed over all recorded environment steps.
As shown in Figure 9 and Table 2, the reach reward dominates early training (79.1 ± 13.4%), indicating that the agent initially focuses on approaching the object before acquiring grasping or placement behavior. The post-grasp guidance component remains near zero during the early and mid phases, increasing only after approximately 0.6 × 10⁶ steps (4.1 ± 5.0% in the late phase), consistent with the agent learning to maintain a stable grasp during object transport. The place reward shows a clear upward trend, increasing from 20.6 ± 13.4% in the early phase to 63.6 ± 25.9% in the late phase, indicating that successful object placement becomes the dominant reward signal as training progresses. The variance in the late phase reflects the step-level distribution of reward contributions across different task phases within each episode.
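The windowed analysis can be reproduced with a small helper; the sketch below assumes per-step component rewards are logged as equal-length NumPy arrays (the function and component names are illustrative).

```python
import numpy as np

def phase_fractions(components, edges):
    """Mean fractional contribution of each reward component per training window.
    components: dict of name -> per-step reward array (all the same length);
    edges: window boundaries in environment steps, e.g. [0, 400000, 800000, 1200000]."""
    total = np.sum(list(components.values()), axis=0)
    # Fraction of the total reward attributable to each component at every step.
    fracs = {k: v / np.maximum(total, 1e-8) for k, v in components.items()}
    # Average the per-step fractions inside each [lo, hi) window.
    return [
        {k: float(f[lo:hi].mean()) for k, f in fracs.items()}
        for lo, hi in zip(edges[:-1], edges[1:])
    ]
```

Averaging per-step fractions (rather than dividing window sums) matches a step-level analysis, which is also why the reported late-phase variance can be large even when placement dominates.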

6.4. Visuomotor Robustness and Efficiency Evaluation

To examine what information is encoded in the learned latent representation $\tilde{z}_t$, linear probes are trained on frozen encoder outputs collected from 10 rollout episodes across all ablations. Since the encoder is frozen and the probes are linear, strong probe performance indicates that the relevant signals are linearly decodable from $\tilde{z}_t$ and directly accessible to the policy. The same probes are applied identically across all ablations (FiLM vs. concatenation, with vs. without proprioceptive state, and image-only and state-only variants) on identical rollout frames to isolate the effect of the fusion mechanism.
  • Interaction alignment: Egocentric manipulation hinges on detecting when the robot has reached, contacted, or grasped an object. To test whether the latent representation separates interaction from non-interaction states, a linear classifier is trained to predict a binary contact label from $\tilde{z}_t$, and performance is reported as AUROC. As shown in Table 3, the full FiLM variant achieves the highest contact AUROC ($0.879 \pm 0.019$), indicating that the latent representation separates contact from non-contact states more cleanly than other variants. The state-only variant also achieves high AUROC, consistent with the fact that contact correlates with proprioceptive cues.
  • Progress and object-dynamics encoding: Contact detection alone does not capture how the end-effector approaches the object or how agent actions displace it. Two additional probes target these aspects directly. The one-step TCP-to-object distance change $\Delta d_t = d_{t+1} - d_t$ and the object displacement $\Delta \mathrm{obj}_t = \mathrm{obj}_{t+1} - \mathrm{obj}_t$ are regressed from $\tilde{z}_t$, with performance reported as $R^2$.
As shown in Table 3, all variants achieve strong $R^2$ on both probes, indicating that TCP-to-object distance change and object displacement are linearly decodable across all fusion variants. The FiLM variant maintains these high scores while also achieving the strongest contact AUROC, whereas the concatenation variant achieves comparable $R^2$ on the $\Delta$dist and $\Delta$obj probes but substantially lower contact AUROC. This indicates that concatenation preserves motion-related information but produces weaker interaction alignment than FiLM.
  • Egocentric robustness: Egocentric cameras introduce self-motion, occlusion, and sensor noise that can degrade learned representations. The trained contact probe is evaluated on perturbed latents $\tilde{z}_t^{\mathrm{occ}}$ and $\tilde{z}_t^{\mathrm{noise}}$, with robustness measured as AUROC drops: $\Delta \mathrm{AUROC}_{\mathrm{occ}} = \mathrm{AUROC}(\tilde{z}^{\mathrm{occ}}) - \mathrm{AUROC}(\tilde{z})$ and $\Delta \mathrm{AUROC}_{\mathrm{noise}} = \mathrm{AUROC}(\tilde{z}^{\mathrm{noise}}) - \mathrm{AUROC}(\tilde{z})$.
As shown in Table 3, the FiLM variant produces near-zero $\Delta \mathrm{AUROC}$ under both occlusion and noise, indicating that the contact signal remains stable under egocentric perturbations. The state-only variant produces $\Delta \mathrm{AUROC} = 0$ by construction, since it receives no image input. The image-only and concatenation variants show larger AUROC drops, indicating that pixel-based representations without proprioceptive state modulation are more sensitive to egocentric perturbations.
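The probe metrics themselves require no extra tooling; the following is a minimal sketch of rank-based AUROC (via the Mann–Whitney U statistic, assuming no tied scores) together with the robustness delta used in this evaluation.

```python
import numpy as np

def auroc(scores, labels):
    """AUROC computed from the Mann-Whitney U statistic: the probability
    that a random positive outranks a random negative (no tied scores)."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)  # ranks 1..n
    pos = labels.astype(bool)
    n_pos, n_neg = pos.sum(), (~pos).sum()
    u = ranks[pos].sum() - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)

def delta_auroc(scores_perturbed, scores_clean, labels):
    """Robustness drop: AUROC on perturbed latents minus AUROC on clean ones."""
    return auroc(scores_perturbed, labels) - auroc(scores_clean, labels)
```

A value of 1.0 means perfect separation of contact from non-contact frames, 0.5 is chance, and a negative delta quantifies how much a perturbation degrades the probe.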

7. Ablation Studies

7.1. FiLM Conditioning Variants

Three state-conditioning variants are evaluated against the proposed method to examine the role of FiLM modulation along two dimensions: where the state enters the pipeline and how it conditions the visual representation. The proposed method, FiLM with tanh, applies channel-wise multiplicative modulation to slot features as $(1 + \tanh(\gamma_t)) \, z_t + \beta_t$, where $\gamma_t$ and $\beta_t$ are predicted from the proprioceptive state. In this formulation, the state conditions the visual representation directly and multiplicatively at the slot feature level.
The first variant, FiLM linear+clamp, replaces the tanh scaling with a linear scale factor clamped to $[-1, 1]$. The conditioning mechanism and entry point remain identical to the proposed method. This variant isolates the contribution of smooth, bounded modulation from the FiLM mechanism itself.
The second variant, concat+projection, replaces FiLM with a linear projection applied to the concatenation of the state and slot features, projecting the result back to the original feature dimension. Here the state still enters early at the slot feature level but interacts with the visual features additively rather than multiplicatively. This tests whether the state must actively reshape the visual feature space channel-wise or whether simple additive mixing is sufficient.
The third variant, proposed with concat (no FiLM), appends the state to the aggregated slot features only at the policy input stage after all visual processing is complete. In this case, the state never influences the visual representation itself and is provided only as an additional input to the policy. This variant tests whether passive perception with state as a side input is sufficient without explicit visual feature conditioning.
As shown in Figure 10, the concat+projection variant fails to learn a meaningful policy, indicating that additive fusion of state and visual features at the slot level is insufficient. The late concat variant learns the task but converges to a substantially lower return than the proposed method, showing that providing the state only at the policy input without conditioning the visual features is also inadequate. In contrast, both FiLM variants successfully learn the task. The tanh-modulated FiLM converges faster and reaches a higher stable return than the clamp variant. These results indicate that proprioceptive state must condition visual features early and multiplicatively, consistent with the embodied perception principle underlying the proposed method.
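The difference between the two bounded-scale formulations is small in code; the following is a minimal NumPy sketch (feature shapes illustrative). With the zero-initialized γ and β described in Section 5.1, both reduce to the identity at the start of RL training.

```python
import numpy as np

def film_tanh(z, gamma, beta):
    """Proposed: channel-wise modulation (1 + tanh(gamma)) * z + beta,
    with the multiplicative scale smoothly confined to (0, 2)."""
    return (1.0 + np.tanh(gamma)) * z + beta

def film_linear_clamp(z, gamma, beta):
    """Variant: linear scale hard-clamped to [-1, 1] instead of tanh,
    giving the same [0, 2] scale range but with non-smooth saturation."""
    return (1.0 + np.clip(gamma, -1.0, 1.0)) * z + beta
```

Because both scales stay in [0, 2], neither can flip the sign of a slot feature; the tanh form additionally keeps the modulation differentiable everywhere, which is one plausible reason for its faster convergence in Figure 10.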

7.2. Reward Function Ablations

To evaluate the contribution and robustness of the reward design, we perform ablation experiments examining both the structural components and the sensitivity to key parameters.
  • Stability sensitivity: The scaling coefficient $\varpi$ controls the sharpness of the velocity penalty in the stability term. We vary $\varpi \in \{1, 5, 10\}$ to test the sensitivity of the bounded velocity penalty. As shown in Figure 11, all three values eventually learn the task, confirming that the method is robust to this parameter. However, $\varpi = 5$ achieves the fastest convergence and the highest stable return; $\varpi = 1$ produces the slowest and most unstable learning, while $\varpi = 10$ delays convergence relative to the proposed value. These results confirm that $\varpi = 5$ provides the most effective balance between penalizing residual motion and preserving the fine control needed during placement.
  • Stability term removal: We remove the stabilization component to evaluate whether stable placement behavior emerges without explicitly penalizing residual joint motion.
Figure 12 compares the proposed method against a variant with the stability term removed entirely. Removing the stability term results in faster initial convergence but a lower final return, with the agent saturating around 3700 compared to 3900 for the proposed method. Both variants follow a similar learning trajectory until approximately episode 1250, after which the proposed method pulls ahead and stabilizes at a higher return.
  • Reward-scale robustness: The distance shaping parameter $\alpha \in \{5, 10, 15\}$ is ablated to examine the robustness of the multiscale reward shaping formulation.
The scaling coefficient α controls the rate of exponential decay in the reaching, post-grasp, and placement reward terms. As shown in Figure 13, all three values eventually learn the task, indicating that the method is robust to this parameter. However, α = 10 achieves the fastest convergence and the highest stable return. Reducing α to 5 slows convergence and introduces instability during the learning transition, while increasing α to 15 delays the onset of learning.

7.3. Surplus Slot Analysis

In the proposed method, the number of slots was set equal to the number of task-relevant objects ($N_o = 2$) in the main experiments. This ablation examines what happens when surplus slots are introduced ($N_s = 4$ with $N_o = 2$). During pre-training, only two slots receive pixel-space supervision targets corresponding to the task-relevant objects; the remaining slots have no explicit supervision and are free to attend to any image region. This setting tests whether surplus unsupervised slots destabilize slot binding. Figure 14 and Figure 15 show representative attention maps from the two slots that produce the strongest activation responses. The remaining slots exhibit weaker, diffuse responses and are not selected by the position prediction head. Additionally, object localization remains highly accurate.

7.4. End-to-End Training Without Pre-Training

The main method initializes the visual encoder through a supervised pre-training stage before RL begins. This ablation evaluates the contribution of pre-training by training the full encoder end-to-end with SAC from random initialization, with no pre-training stage. The ResNet backbone, slot attention, and FiLM layers are all updated jointly through the critic loss from the start of training. This isolates the contribution of the pre-training stage and investigates whether the learned visual representation it provides is necessary for task performance, or whether critic gradients alone are sufficient to learn effective slot-based visual features from scratch.
Figure 16 shows the learning curve for an end-to-end variant. The return increases, indicating that the agent is learning. However, the policy never reaches successful task completion, and the return saturates below the performance achieved by the proposed approach. This behavior arises because the visual encoder must simultaneously learn object localization, stable feature binding, and control-relevant representations from the reinforcement learning objective alone. The RL signal is task-oriented and provides only indirect supervision for pixel-level representation learning, resulting in high-variance gradients that are weakly coupled to object localization and spatial reasoning. Pre-training the encoder therefore improves training efficiency by providing a stable visual representation before policy optimization begins.

8. Conclusions

This work introduced a state-guided multi-slot attention fusion architecture for egocentric robotic manipulation and integrated it with Soft Actor–Critic for end-to-end visuomotor control. In the PickBanana manipulation task, the proposed approach achieved the fastest learning and the highest final return among all compared methods. Ablations showed that the fusion framework remains strong with or without explicit goal input, FiLM-based state modulation is critical for sample-efficient and stable learning compared to naive concatenation, and proprioceptive conditioning is necessary for reliable control in partially observed egocentric views. Qualitative rollouts further confirmed that the learned representation supports coherent, temporally structured manipulation behaviors.
Overall, the findings demonstrate that slot-based representations, when anchored to the agent's state, provide an effective and practical route to egocentric manipulation policies in simulation. This establishes a foundation for future work on broader task suites and transfer to real-robot egocentric settings, where occlusion and viewpoint changes are unavoidable. The method is limited, however, by its reliance on simulator supervision during encoder pre-training, and tasks that require more complex interactions have not been considered. Future work could replace oracle simulator supervision with perception-only signals to enable real-robot transfer and extend the evaluation to more complex interactions.

Author Contributions

Conceptualization, S.W.B. and J.-H.H.; methodology, S.W.B.; software, S.W.B.; validation, S.W.B. and J.-H.H.; formal analysis, S.W.B.; investigation, J.-H.H.; resources, J.-H.H.; data curation, S.W.B.; writing—original draft preparation, S.W.B.; writing—review and editing, J.-H.H.; visualization, S.W.B.; supervision, J.-H.H.; project administration, J.-H.H.; funding acquisition, J.-H.H. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (RS-2024-00407295).

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Appendix A.1. SAC Hyperparameters

Table A1. Training hyperparameters.
Parameter | Value | Description
SAC
α | 0.0001 | Actor network learning rate
β | 0.001 | Critic and value learning rate
γ | 0.99 | Discount factor
τ | 0.005 | Target network soft update
Reward Scale | 10 | Adjusts reward magnitude
Hidden Layers | 256, 256 | Neural network layer sizes
Batch Size | 128 | Training batch size
Buffer Size | 1 M | Maximum replay buffer size
Encoder Pre-training
Learning Rate | 0.0001 | Encoder learning rate
Feature Dim | 42 | Slot and FiLM feature dimension
Attention Heads | 3 | Multihead cross-attention heads
Pre-training Steps | 150,000 | Gradient steps for encoder pre-training
RL Encoder Fine-tuning
Fusion layers lr | 0.0001 | Learning rate for slot attention and FiLM
Goal gate lr | 0.00003 | Learning rate for goal gating logits
Optimizer | Adam | Both pre-training and RL fine-tuning
RL reward
α | 10 | Scales obj–TCP distance
ϖ | 5 | Adjusts stability term
δ₁, δ₂ | 0.025 | Proximity threshold (environment set)

Appendix A.2. Computational Requirement

Table A2. Computational setup (primary setup).
Component | Details
Operating System | Ubuntu 22.04.2 (jammy)
GPU Model | NVIDIA RTX A6000
CPU | AMD EPYC 7313P (16-core)
CUDA Version | 12.6

Appendix A.3. Computational Characteristics

Table A3 summarizes the parameter count, inference latency, effective inference rate, and peak GPU memory consumption of the full architecture. The model contains 2.91 M parameters in total, with 2.83 M in the fusion encoder and 0.08 M in the actor network. All measurements were obtained on an NVIDIA RTX A6000 (48 GB) GPU with batch size one, with inference latency averaged over 100 forward passes following 10 warm-up iterations.
Table A3. Computational cost of the proposed method measured on an NVIDIA RTX A6000 (48 GB) GPU.
Metric | Value
Total parameters | 2.91 M
Mean inference latency | 2.57 ms (±0.09 ms)
Effective inference rate | 388 Hz
Control period (20 Hz) | 50 ms
Peak GPU memory | 148.8 MB

Appendix A.4. Wrist and Mobile Camera Specifications

Table A4. Comparison between the simulation wrist camera and mobile device camera for egocentric viewpoint approximation (Samsung Galaxy Note 20, main wide-angle lens).
 | Simulation Wrist Camera | Samsung Galaxy Note 20
FOV | 90° | ∼79° (main wide-angle lens at 1080p video mode)
Resolution | 128 × 128 at 20 fps | 1920 × 1080 at 25 fps (streamed via IP Webcam app, downsampled to 128 × 128 for inference)
Model | Pinhole (no distortion during training) | Real lens with distortion
Near/Far | 0.01 m / 100 m | N/A
Mount | Wrist-mounted on camera_link (end-effector link) with identity relative pose | Handheld to approximate egocentric wrist viewpoint

Appendix A.5. Additional Baseline Results

DrQ-v2 [12] was evaluated in the same egocentric manipulation setting to verify that the observed improvements are not specific to the SAC-based training framework. DrQ-v2 combines data augmentation with a deterministic actor–critic algorithm (DDPG), introducing a difference in the underlying RL algorithm relative to the SAC-based baselines in the main experiments.
As shown in Figure A1, DrQ-v2 achieves lower returns throughout training, consistent with the DrQ-v1 result in the main experiments, and a 0% success rate. Together, these results suggest that data augmentation alone is insufficient for egocentric manipulation. Both methods process visual observations passively, without access to the robot's proprioceptive state, and cannot resolve the ambiguity between object appearance and arm configuration introduced by the dynamic egocentric viewpoint. This supports the conclusion that proprioceptive conditioning of slot-based visual features, rather than the augmentation strategy or policy optimization algorithm, is the key factor driving task performance.
Figure A1. Training return across baselines and architectural variants under the same reward scaling. All methods use 400 environment steps per epoch. All runs were trained for 3000 epochs except DrQ-v2, which was trained for 2000; therefore, only the first 2000 epochs are shown for comparison.

References

  1. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.A.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef]
  2. Levine, S.; Finn, C.; Darrell, T.; Abbeel, P. End-to-End Training of Deep Visuomotor Policies. J. Mach. Learn. Res. 2016, 17, 1334–1373. [Google Scholar]
  3. Pinto, L.; Gupta, A.K. Supersizing self-supervision: Learning to grasp from 50K tries and 700 robot hours. In 2016 IEEE International Conference on Robotics and Automation (ICRA); IEEE: Piscataway, NJ, USA, 2016; pp. 3406–3413. [Google Scholar]
  4. Kalashnikov, D.; Irpan, A.; Pastor, P.; Ibarz, J.; Herzog, A.; Jang, E.; Quillen, D.; Holly, E.; Kalakrishnan, M.; Vanhoucke, V.; et al. QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation. arXiv 2018, arXiv:1806.10293. [Google Scholar]
  5. Ebert, F.; Finn, C.; Dasari, S.; Xie, A.; Lee, A.X.; Levine, S. Visual Foresight: Model-Based Deep Reinforcement Learning for Vision-Based Robotic Control. arXiv 2018, arXiv:1812.00568. [Google Scholar]
  6. Guo, H.; Song, M.; Ding, Z.; Yi, C.; Jiang, F. Vision-Based Efficient Robotic Manipulation with a Dual-Streaming Compact Convolutional Transformer. Sensors 2023, 23, 515. [Google Scholar] [CrossRef]
  7. Jangir, R.; Hansen, N.; Ghosal, S.; Jain, M.; Wang, X. Look Closer: Bridging Egocentric and Third-Person Views with Transformers for Robotic Manipulation. IEEE Robot. Autom. Lett. 2022, 7, 3046–3053. [Google Scholar] [CrossRef]
  8. Zhu, H.; Yu, J.; Gupta, A.; Shah, D.; Hartikainen, K.; Singh, A.; Kumar, V.; Levine, S. The Ingredients of Real-World Robotic Reinforcement Learning. arXiv 2020, arXiv:2004.12570. [Google Scholar] [CrossRef]
  9. Grauman, K.; Westbury, A.; Torresani, L.; Kitani, K.; Malik, J.; Afouras, T.; Ashutosh, K.; Baiyya, V.; Bansal, S.; Boote, B.; et al. Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: Piscataway, NJ, USA, 2024; pp. 19383–19400. [Google Scholar]
  10. Bandini, A.; Zariffa, J. Analysis of the Hands in Egocentric Vision: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 6846–6866. [Google Scholar] [CrossRef]
  11. Li, X.; Qiu, H.; Wang, L.; Zhang, H.; Qi, C.; Han, L.; Xiong, H.; Li, H. Challenges and Trends in Egocentric Vision: A Survey. arXiv 2025, arXiv:2503.15275. [Google Scholar] [CrossRef]
  12. Yarats, D.; Fergus, R.; Lazaric, A.; Pinto, L. Mastering Visual Continuous Control: Improved Data-Augmented Reinforcement Learning. arXiv 2021, arXiv:2107.09645. [Google Scholar] [CrossRef]
  13. Laskin, M.; Srinivas, A.; Abbeel, P. Curl: Contrastive unsupervised representations for reinforcement learning. In Proceedings of the International Conference on Machine Learning, Virtual, 13–18 July 2020; pp. 5639–5650. [Google Scholar]
  14. Chen, L.; Lu, K.; Rajeswaran, A.; Lee, K.; Grover, A.; Laskin, M.; Abbeel, P.; Srinivas, A.; Mordatch, I. Decision Transformer: Reinforcement Learning via Sequence Modeling. In Proceedings of the Neural Information Processing Systems, Virtual, 6–14 December 2021. [Google Scholar]
  15. Hafner, D.; Lillicrap, T.P.; Fischer, I.S.; Villegas, R.; Ha, D.R.; Lee, H.; Davidson, J. Learning Latent Dynamics for Planning from Pixels. arXiv 2018, arXiv:1811.04551. [Google Scholar]
  16. Hafner, D.; Lillicrap, T.P.; Ba, J.; Norouzi, M. Dream to Control: Learning Behaviors by Latent Imagination. arXiv 2019, arXiv:1912.01603. [Google Scholar]
  17. Hafner, D.; Pasukonis, J.; Ba, J.; Lillicrap, T.P. Mastering Diverse Domains through World Models. arXiv 2023, arXiv:2301.04104. [Google Scholar]
  18. Zambaldi, V.F.; Raposo, D.; Santoro, A.; Bapst, V.; Li, Y.; Babuschkin, I.; Tuyls, K.; Reichert, D.P.; Lillicrap, T.P.; Lockhart, E.; et al. Relational Deep Reinforcement Learning. arXiv 2018, arXiv:1806.01830. [Google Scholar] [CrossRef]
  19. Land, M.F.; Hayhoe, M.M. In what ways do eye movements contribute to everyday activities? Vis. Res. 2001, 41, 3559–3565. [Google Scholar] [CrossRef]
  20. Prablanc, C.; Martin, O. Automatic control during hand reaching at undetected two-dimensional target displacements. J. Neurophysiol. 1992, 67, 455–469. [Google Scholar] [CrossRef]
  21. Locatello, F.; Weissenborn, D.; Unterthiner, T.; Mahendran, A.; Heigold, G.; Uszkoreit, J.; Dosovitskiy, A.; Kipf, T. Object-Centric Learning with Slot Attention. arXiv 2020, arXiv:2006.15055. [Google Scholar] [CrossRef]
  22. Singh, G.; Wu, Y.F.; Ahn, S. Simple Unsupervised Object-Centric Learning for Complex and Naturalistic Videos. arXiv 2022, arXiv:2205.14065. [Google Scholar] [CrossRef]
  23. Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. arXiv 2018, arXiv:1801.01290. [Google Scholar]
  24. Perez, E.; Strub, F.; De Vries, H.; Dumoulin, V.; Courville, A. FiLM: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar]
  25. Tao, S.; Xiang, F.; Shukla, A.; Qin, Y.; Hinrichsen, X.; Yuan, X.; Bao, C.; Lin, X.; Liu, Y.; Chan, T.-k.; et al. ManiSkill3: GPU Parallelized Robotics Simulation and Rendering for Generalizable Embodied AI. arXiv 2024, arXiv:2410.00425. [Google Scholar] [CrossRef]
  26. Laskin, M.; Lee, K.; Stooke, A.; Pinto, L.; Abbeel, P.; Srinivas, A. Reinforcement Learning with Augmented Data. arXiv 2020, arXiv:2004.14990. [Google Scholar] [CrossRef]
  27. Lee, A.X.; Nagabandi, A.; Abbeel, P.; Levine, S. Stochastic Latent Actor-Critic: Deep Reinforcement Learning with a Latent Variable Model. arXiv 2019, arXiv:1907.00953. [Google Scholar]
  28. Tsai, R.Y. A versatile camera calibration technique for high-accuracy 3D machine vision metrology using off-the-shelf TV cameras and lenses. IEEE J. Robot. Autom. 1987, 3, 323–344. [Google Scholar] [CrossRef]
  29. Hutchinson, S.A.; Hager, G.; Corke, P. A tutorial on visual servo control. IEEE Trans. Robot. Autom. 1996, 12, 651–670. [Google Scholar] [CrossRef]
  30. Chaumette, F.; Hutchinson, S.A. Visual servo control. I. Basic approaches. IEEE Robot. Autom. Mag. 2006, 13, 82–90. [Google Scholar] [CrossRef]
  31. Zeng, A.; Florence, P.R.; Tompson, J.; Welker, S.; Chien, J.M.; Attarian, M.; Armstrong, T.; Krasin, I.; Duong, D.; Sindhwani, V.; et al. Transporter Networks: Rearranging the Visual World for Robotic Manipulation. In Proceedings of the Conference on Robot Learning, Virtual, 16–18 November 2020. [Google Scholar]
  32. Brohan, A.; Brown, N.; Carbajal, J.; Chebotar, Y.; Dabis, J.; Finn, C.; Gopalakrishnan, K.; Hausman, K.; Herzog, A.; Hsu, J.; et al. RT-1: Robotics Transformer for Real-World Control at Scale. arXiv 2022, arXiv:2212.06817. [Google Scholar]
  33. Reed, S.; Zolna, K.; Parisotto, E.; Colmenarejo, S.G.; Novikov, A.; Barth-Maron, G.; Giménez, M.; Sulsky, Y.; Kay, J.; Springenberg, J.T.; et al. A Generalist Agent. arXiv 2022, arXiv:2205.06175. [Google Scholar] [CrossRef]
  34. Shridhar, M.; Manuelli, L.; Fox, D. Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation. arXiv 2022, arXiv:2209.05451. [Google Scholar]
  35. Yu, H.F.; Altahhan, A. Hierarchical Learning for Closed-Loop Robotic Manipulation in Cluttered Scenes via Depth Vision, Reinforcement Learning, and Behaviour Cloning. Electronics 2025, 14, 3074. [Google Scholar] [CrossRef]
  36. Burgess, C.P.; Matthey, L.; Watters, N.; Kabra, R.; Higgins, I.; Botvinick, M.M.; Lerchner, A. MONet: Unsupervised Scene Decomposition and Representation. arXiv 2019, arXiv:1901.11390. [Google Scholar] [CrossRef]
  37. Greff, K.; Kaufman, R.L.; Kabra, R.; Watters, N.; Burgess, C.P.; Zoran, D.; Matthey, L.; Botvinick, M.M.; Lerchner, A. Multi-Object Representation Learning with Iterative Variational Inference. arXiv 2019, arXiv:1903.00450. [Google Scholar]
  38. Jiang, J.; Janghorbani, S.; de Melo, G.; Ahn, S. SCALOR: Generative World Models with Scalable Object Representations. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  39. Watters, N.; Zoran, D.; Weber, T.; Battaglia, P.W.; Pascanu, R.; Tacchetti, A. Visual Interaction Networks: Learning a Physics Simulator from Video. In Proceedings of the Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  40. Kipf, T.; van der Pol, E.; Welling, M. Contrastive Learning of Structured World Models. arXiv 2019, arXiv:1911.12247. [Google Scholar]
  41. Kostrikov, I.; Yarats, D.; Fergus, R. Image Augmentation Is All You Need: Regularizing Deep Reinforcement Learning from Pixels. arXiv 2020, arXiv:2004.13649. [Google Scholar]
Figure 1. Overview of the proposed proprioception-modulated architecture. Egocentric RGB input is encoded via ResNet and tokenized. Learnable slot queries attend to the tokens using a cross-attention mechanism. The resulting slot features are modulated by the robot states via FiLM and shared across all SAC components for policy learning. During reinforcement learning, the beige background modules, including FiLM conditioning, state projection, cross-attention, and slot heads, remain trainable and receive gradients from the SAC critic, while the blue background modules are frozen.
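The slot cross-attention step in Figure 1 amounts to scaled dot-product attention with learnable queries. A self-contained numpy sketch, with illustrative dimensions and random arrays standing in for the learned parameters and ResNet tokens:

```python
import numpy as np

# Sketch: learnable slot queries attend over image tokens.
rng = np.random.default_rng(0)
num_slots, num_tokens, dim = 2, 64, 32

queries = rng.standard_normal((num_slots, dim))   # learnable slot queries
tokens = rng.standard_normal((num_tokens, dim))   # tokenized ResNet features

logits = queries @ tokens.T / np.sqrt(dim)        # (num_slots, num_tokens)
attn = np.exp(logits - logits.max(axis=1, keepdims=True))
attn /= attn.sum(axis=1, keepdims=True)           # softmax over tokens per slot

slot_features = attn @ tokens                     # attention-weighted token sum
print(slot_features.shape)  # (2, 32)
```

Each row of `attn` is the per-slot heatmap visualized in the attention figures.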
Figure 2. PickBanana egocentric manipulation setup in ManiSkill. A seven-DoF robotic arm operates above a tabletop with a banana and a bowl placed near the target region (green dot). The task objective is to grasp and place the banana at the target location. The bowl is a non-goal movable object present at the target area.
Figure 3. Pre-training loss components for the egocentric slot-based encoder. (a) Swap-invariant pixel-space UV regression error (pos), (b) attention alignment loss (attnNLL), and (c) visibility prediction BCE (vis loss) over pre-training steps on PickBanana. All losses decrease rapidly and stabilize, indicating accurate pixel localization, attention centered on projected object tokens, and robust visibility handling in the wrist camera view.
Figure 4. Slot specialization from egocentric attention. Random frames illustrate the original observation (orig) and the attention heatmaps for Slot 0 and Slot 1. Across diverse viewpoints and object configurations, the two slots form a consistent decomposition in which one slot attends to the banana and the other to the bowl, with rare identity permutations expected from permutation-invariant slot learning without explicit slot–object binding. When both objects are fully occluded, attention reverts to the end-effector region as an agent-centric fallback cue.
Figure 5. Qualitative evaluation of the auxiliary pixel heads after pre-training. Predicted object UV locations (blue “x” for banana, green “x” for bowl) are overlaid with simulator-projected ground-truth pixels (orange “o” for banana, red “o” for bowl) on egocentric frames. Samples (a–d) show accurate localization when objects are visible, while sample (e) illustrates a difficult occlusion/out-of-view case handled via visibility/validity masking.
Figure 6. Egocentric qualitative check of slot attention. A phone-mounted camera on a human arm mimics the wrist viewpoint, observing printed banana and bowl targets. Each pair of panels shows Slot-0 and Slot-1 attention heatmaps. Across the shown samples, the two slots remain spatially localized and track the banana and bowl under viewpoint changes, with occasional permutation (slot swapping), reflecting exchangeable slot identities rather than loss of tracking.
Figure 7. Training return over 3000 episodes for all baselines and ablations under the same reward scaling. The proposed multi-slot attention agent consistently outperforms pixel-only and state-only alternatives, and the ablations isolate which design choices drive this gap.
Figure 8. Manipulation behavior (left to right): (a) reaching toward the banana; (b) grasping; (c) lifting, transporting, and pushing the bowl aside; (d) placing at the target position; (e) holding at the target position.
Figure 9. Evolution of normalized reward components during training. The x-axis shows environment steps, and the y-axis shows normalized component magnitude in the range [0, 1]. Each curve corresponds to a reward component (r_reach, r_postgrasp, r_place) logged during training and smoothed using exponential moving averaging.
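The exponential moving averaging used to smooth the curves in Figure 9 can be sketched as follows; the smoothing factor `alpha` is illustrative, since the paper does not report the value used.

```python
import numpy as np

# Standard exponential moving average over a logged scalar sequence.
def ema(x: np.ndarray, alpha: float = 0.1) -> np.ndarray:
    out = np.empty_like(x, dtype=float)
    out[0] = x[0]
    for t in range(1, len(x)):
        out[t] = alpha * x[t] + (1 - alpha) * out[t - 1]
    return out

print(ema(np.array([0.0, 1.0, 1.0]), alpha=0.5))
```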
Figure 10. Training curves comparing FiLM with tanh modulation (proposed), FiLM with linear+clamp, concatenation with projection, and concatenation without FiLM.
Figure 11. Training return across 2000 episodes for different velocity penalty scaling terms. The proposed setting (StaticTerm = 5) achieves faster learning and higher stable return compared to StaticTerm = 1 and StaticTerm = 10.
Figure 12. Training return across episodes comparing the proposed method with the static reward term and the variant where the static term is removed.
Figure 13. Training return across 2000 episodes for different reward scaling factors. The proposed variant (RewardScale:10) achieves faster performance improvement and a higher final return compared to RewardScale:5 and RewardScale:15.
Figure 14. Slot attention visualization when N_s = 4 and the environment contains two objects. For clarity, we visualize the two slots that exhibit the strongest localized attention responses.
Figure 15. Position prediction under the surplus slot configuration (N_s = 4). Predicted and ground-truth keypoints for the banana and bowl are overlaid across different viewpoints.
Figure 16. Training return for end-to-end learning, where the visual encoder and policy are optimized jointly without representation pre-training. The agent improves its return but never reaches successful task completion.
Table 1. Environment and observation specifications for the PickBanana task.
Component | Specification
Robot model | 7-DOF Franka Emika Panda
Camera setup | Egocentric RGB camera mounted on the end-effector (Panda wrist)
Observation vector o_t^state | 35-D proprioceptive state (joint positions, velocities, gripper pose)
Action space A | 7-D continuous action space
Objects | Banana, target location, and a bowl
Simulator | SAPIEN engine via the ManiSkill framework (panda_wristcam agent)
Table 2. Percentage contribution of reward components across training phases (mean ± standard deviation over environment steps).
Phase | Reach | Post-Grasp | Place
Early | 79.1 ± 13.4 | 0.3 ± 1.1 | 20.6 ± 13.4
Mid | 72.9 ± 18.2 | 3.5 ± 4.1 | 23.6 ± 18.3
Late | 31.9 ± 24.4 | 4.5 ± 5.0 | 63.6 ± 25.9
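Percentage contributions of this kind can be recovered from logged per-step component magnitudes by row-normalization; the sample values below are made up for illustration, not the paper's logged data.

```python
import numpy as np

# Each row holds one step's [r_reach, r_postgrasp, r_place] magnitudes.
components = np.array([
    [0.80, 0.01, 0.20],   # illustrative early-phase step
    [0.30, 0.05, 0.65],   # illustrative late-phase step
])

# Normalize each row so the components sum to 100%.
pct = 100.0 * components / components.sum(axis=1, keepdims=True)
print(pct.round(1))
```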
Table 3. Representation probing on PickBanana latents (mean ± std over three seeds). Contact AUROC measures interaction alignment. Δdist R² measures progress encoding. ΔObj R² is averaged over x, y, z object displacement probes. Occ/Noise ΔAUROC reports the change in Contact AUROC under occlusion and noise (closer to 0 indicates higher robustness).
Variant | Contact AUROC ↑ | Δdist R² ↑ | ΔObj R² (avg) ↑ | Occ ΔAUROC →0 | Noise ΔAUROC →0
FiLM (full) | 0.879 ± 0.019 | 0.841 ± 0.011 | 0.552 ± 0.029 | 0.002 ± 0.026 | 0.001 ± 0.006
State-only | 0.852 ± 0.009 | 0.813 ± 0.008 | 0.570 ± 0.025 | 0.000 ± 0.000 | 0.000 ± 0.000
Image-only | 0.638 ± 0.142 | 0.821 ± 0.022 | 0.529 ± 0.024 | 0.038 ± 0.101 | 0.020 ± 0.101
Concat (no FiLM) | 0.531 ± 0.206 | 0.846 ± 0.014 | 0.577 ± 0.035 | 0.058 ± 0.131 | +0.042 ± 0.088
Note: ↑ indicates higher values are better; →0 indicates values closer to zero are better. Bold values indicate the best results.
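Contact AUROC in Table 3 is a standard rank-based metric; the sketch below is a minimal tie-free numpy implementation of how such a probe score can be computed (an assumed implementation, not the authors' code).

```python
import numpy as np

# Rank-based AUROC (Mann-Whitney formulation); assumes no tied scores.
def auroc(scores: np.ndarray, labels: np.ndarray) -> float:
    order = np.argsort(scores)
    ranks = np.empty_like(order, dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)   # ascending ranks, 1-based
    pos = labels == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

scores = np.array([0.9, 0.8, 0.3, 0.1])   # probe outputs for contact
labels = np.array([1, 1, 0, 0])           # ground-truth contact labels
print(auroc(scores, labels))  # 1.0
```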
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Beyene, S.W.; Han, J.-H. Multi-Slot Attention with State Guidance for Egocentric Robotic Manipulation. Electronics 2026, 15, 1365. https://doi.org/10.3390/electronics15071365

