Article

COMPAS: Compose Actions and Slots in Object-Centric World Models

1 Moscow Independent Research Institute of Artificial Intelligence, Moscow 105064, Russia
2 Federal Research Center “Computer Science and Control”, Russian Academy of Sciences, Moscow 119333, Russia
3 Cognitive AI Systems Lab, Moscow 123317, Russia
* Author to whom correspondence should be addressed.
Mach. Learn. Knowl. Extr. 2026, 8(5), 117; https://doi.org/10.3390/make8050117
Submission received: 18 March 2026 / Revised: 14 April 2026 / Accepted: 27 April 2026 / Published: 29 April 2026
(This article belongs to the Section Learning)

Abstract

In this paper, we propose a novel approach, COMPAS (COMPose Actions and Slots), which leverages the strengths of state-of-the-art object-centric approaches for modeling the dynamics of an environment. Our method encodes the environment’s state into symbol-like, object-centric representations, known as slots, where each slot corresponds to an individual object. This approach offers a structured and interpretable way to model complex environments by combining slots with action representations for accurate next-state prediction. The primary contribution of our work is an efficient world model with a dynamics predictor capable of predicting accurate trajectories in action-dependent environments. Additionally, our slot extractor module enhances the predictive capabilities by extracting deterministic slots that remain consistent both within a single trajectory and across episodes. Unlike slots sampled from a trainable distribution, deterministic slots are generated from a single trainable parameter together with slot positional embeddings. This design improves the consistency across episodes, which in turn leads to more accurate dynamics prediction. We present a comprehensive evaluation of our approach in various environments, demonstrating that our proposed method outperforms competing models in environments with discrete and continuous action spaces.

1. Introduction

The ability to generalize compositionally is the key to learning faster, solving new problems, and understanding new concepts with limited experience [1]. The difficulty in compositional generalization is caused by the so-called binding problem [2]—the inability of modern artificial neural networks to dynamically and flexibly bind information distributed over the network, which arises in learning on unstructured input data.
A possible solution to this problem could be the use of symbol-like representations. Such representations can be slot representations [3], where, instead of a single latent embedding, the input data are encoded by a set of such embeddings (slots). Slots compete with each other to describe a portion of the input data. Such representations have been successfully used for object-centric tasks such as set property prediction and object detection [3] in an image, learning visual dynamics from video [4], image generation [5], and unsupervised object-centric representation learning for real-world data [6].
World models are essential in embodied machine learning tasks such as robotics, autonomous vehicles, game AI, and video synthesis [7,8,9]. They help agents to understand, predict, and interact with dynamic environments. However, modeling high-dimensional, continuous, and time-varying data remains difficult. Traditional models represent the environment holistically, without distinguishing individual objects [10,11], which limits their ability to capture object relationships [12] and evolving dynamics [13]. These models often struggle to generalize and scale to complex, multi-object scenes. In contrast, object-centric world models offer interpretable, factored representations that improve generalization, scalability, and data efficiency by representing scenes in terms of individual objects and their properties.
Previous object-centric world models typically address only part of the overall problem. On the one hand, many slot-based extractors [14,15,16,17] struggle to maintain episodic and long-horizon consistency, since the same object may be assigned to different slots across time or across episodes. On the other hand, existing dynamics models [4,18] often do not provide an efficient action-conditioned transition mechanism for accurate prediction in action-dependent environments. Our main goal in this work is to address both limitations by learning consistent object-centric slot representations and combining them with an efficient transition model for future-state prediction.
In this paper, we propose a novel approach, COMPAS (COMPose Actions and Slots), to encode these dynamics, focusing on object-centric world models that integrate actions and slots. The main contributions of our work are as follows:
  • We introduce a video slot extractor with a deterministic slot initialization procedure that enforces slot consistency both within trajectories and across episodes. Consistent slot acquisition is illustrated in Figure 1, with additional visualizations provided in Appendix C.
  • We apply a temporal consistency block that enforces trajectory consistency in a parallel manner, significantly speeding up inference and training compared to recurrent approaches.
  • We utilize stacked slot attention blocks, enabling scalability and more efficient inference and training compared to the original iterative approach.
  • We present an efficient transition model that incorporates causal slot information, slot interactions, and actions to predict next states effectively.

2. Related Works

2.1. Object-Centric Learning

Many modern object-centric representation models are based on the slot attention module [3]. It implements an iterative attention mechanism, based on soft k-means clustering, for autoregressive slot refinement. Recent improvements of this method are based on better optimization [19], learnable slot initialization [20], slot structure augmentation [21], and improving the underlying clustering algorithm [22] or simplifying it [23]. However, these methods use simple convolutional neural network (CNN) encoders and decoders, which result in worse image reconstructions. The authors of SLATE [5] proposed to use a discrete latent space, which is extracted from the image by dVAE [24]. The computed slots are then passed through a transformer that predicts the latent token. This token is used in the dVAE decoder for image reconstruction. This results in better-quality reconstructions than in other models.
Classical object-centric models encounter issues with slot permutations. Specifically, when the same image is passed through the model multiple times, the same object may be allocated to a different slot each time. To address this problem, the SAVI [15,16] and STEVE [14] models propose the use of a predictor model that predicts the initialization of slots for use in slot attention for the subsequent frame. This predictor can be implemented as either a multi-layer perceptron (MLP) or a transformer.
Recent progress in object-centric learning has been closely linked to the adoption of pre-trained image transformers. Notably, DINOSAUR [6] leverages pre-trained DINO transformers [25]. The primary training objective in this paradigm is to reconstruct patch feature maps from slots using an MLP as a decoder. An extension of this method, VideoSAUR [17], adapts DINOSAUR for video data. However, achieving slot consistency within a trajectory through predictor models is challenging due to overfitting and vanishing gradients caused by the recurrent nature of training. Additionally, the iterative processing of slot attention models limits their inference speed and scalability.

2.2. Object-Centric World Models

Previous object-centric world models have used other algorithms besides slot attention for object state extraction. SQAIR [26] uses a special detection model to detect objects in the scene and track their trajectories through RNNs. SCALOR [27] and SILOT [28] are able to scale the number of objects in SQAIR due to a parallel inference mechanism. STOVE [29] introduces a graph neural network (GNN) [30] for dynamics prediction and per-object interactions.
One of the most notable world models is the Generative Structured World Model (G-SWM) [31]. It treats foreground objects and the background separately by encoding them into two different latent vectors. For next-state prediction, it uses two separate RNNs—one for the foreground latent vector and the other for the background. One of the limitations of this model is that it does not consider actions taken in the environment when predicting future trajectories.
Another important approach is Contrastive Learning of Structured World Models (C-SWM) [32]. It uses a contrastive loss function at the slot level, instead of the basic image reconstruction loss. The model predicts the trajectory using a GNN. A notable improvement upon C-SWM is negative sampling [33], which improves the loss function by selecting negative samples from either different time steps in the same episode or the same time step in different episodes. Another improvement is represented by two types of action attention [34]. Soft attention uses simple self-attention with the single head of a transformer [35]. Hard attention calculates the expectation of all possible assignments to objects, takes the index of the object with the highest probability, and maps the action to this object. The quality of predictions in models utilizing contrastive loss is significantly influenced by the quality of negative samples [33]. The necessity of data deduplication imposes additional constraints on the dataset sampling process, complicating it further in complex, real-world environments. However, since our model employs only the mean squared error (MSE) loss for training, it does not impose strict requirements on the datasets.
Slotformer [4] uses either STEVE or SAVI to extract slots from a sequence of images. The slots are then passed through a transformer to predict future frames of the video. This approach has achieved better results in dynamics prediction than previous models. A further improvement is OCVP [18], which decouples slot interaction and causal dynamics into two attention blocks. However, the main limitation of this method is that, unlike our approach, the model cannot work efficiently in action-dependent environments. Since the taken action can significantly change the environment or agent state, predicting the future based only on previous states may prove to be challenging.

3. Materials and Methods

The COMPAS world model consists of two interconnected components (as shown in Figure 2): the slot extractor and the transition model. The purpose of the slot extractor is to generate object-centric slot representations that remain consistent both within a trajectory and across episodes. Consistency ensures that each slot corresponds to a specific object in the scene, maintaining its association without permuting with other slots over time or across episodes. The transition model predicts the next time step based on a context window of length T containing previous slot representations and the actions taken. The formal problem statement of the COMPAS learning task is presented in Appendix D.
Figure 2. The method starts by inputting a trajectory of T observations into a frozen DINO encoder to extract features. Trainable tokens are duplicated and combined with positional encodings to produce deterministic slot initializations $S_t^{init}$ (Figure 3a). These, along with the features, are fed into a slot extractor with temporal sinusoidal encoding. The extractor has two components: a shared key slot attention block that maps slots to feature patches and a temporal consistency block to maintain coherence across time. After $L_{extr}$ layers, slot representations $S_t$ are produced for each step. A transition model then predicts future states using discrete $a_t^{desc}$ or continuous $a_t^{cont}$ actions, processed into vectors. Slots and actions, embedded with temporal encodings, are passed through the predictor block (Figure 3c). After $L_{dyn}$ layers, the model outputs next-step slots $S_{T+1}$.

3.1. Slot Representation Extraction

Contrary to iterative approaches used in the classical slot attention algorithm (or its variations), we utilize sequentially stacked blocks. This design enables increased parallelization, scalability, and faster inference and training. The overall architecture of the slot extractor module is presented in Figure 2.
First, we initialize the slots by duplicating a trainable token of size D a total of T × N times, resulting in a slot initialization $\hat{S}_t^{init}$. Here, T is the length of the starting time window, D is the size of a single slot, and N is the number of slots. We then add a trainable slot position embedding $p_s$, which encodes the slot position of each token, and a sinusoidal time position embedding $p_t$ to the corresponding slots, which gives the slot initialization at time step t:
$$S_t^{init} = \hat{S}_t^{init} + p_s + p_t \tag{1}$$
The slot’s positional encoding remains constant across time steps but differs for each slot. Similarly, the time positional encoding is the same for all slots within a single time step but varies between time steps. This deterministic initialization, combined with a trainable token and slot position embeddings, ensures slot consistency across episodes.
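The initialization in Equation (1) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors’ implementation: the random arrays stand in for the trainable token and the trainable slot position embeddings, and the function names `sinusoidal_embedding` and `init_slots` are hypothetical.

```python
import numpy as np


def sinusoidal_embedding(num_positions, dim):
    """Sinusoidal position embedding of shape (num_positions, dim); dim must be even."""
    positions = np.arange(num_positions)[:, None]                   # (T, 1)
    freqs = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)   # (dim / 2,)
    emb = np.zeros((num_positions, dim))
    emb[:, 0::2] = np.sin(positions * freqs)
    emb[:, 1::2] = np.cos(positions * freqs)
    return emb


def init_slots(token, slot_pos, num_steps):
    """Build S_t^init for T steps from one token (D,) and slot embeddings (N, D)."""
    num_slots, dim = slot_pos.shape
    base = np.broadcast_to(token, (num_steps, num_slots, dim))      # duplicated token
    time_pos = sinusoidal_embedding(num_steps, dim)[:, None, :]     # (T, 1, D)
    return base + slot_pos[None, :, :] + time_pos                   # (T, N, D)


rng = np.random.default_rng(0)
D, N, T = 16, 4, 3
token = rng.normal(size=D)            # stands in for the trainable token
slot_pos = rng.normal(size=(N, D))    # stands in for the trainable p_s
slots_init = init_slots(token, slot_pos, T)
```

Because nothing is sampled, calling `init_slots` twice with the same parameters returns identical slots, which is the property the deterministic initialization relies on.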
After initialization, we feed $S_t^{init}$ into the slot extraction block. To generalize the notation, let $S_t^{l}$ represent the slots from the l-th layer of the slot extractor block, where $S_t^{0} = S_t^{init}$. This block extracts slot representations from encoder features $F_t \in \mathbb{R}^{M \times J}$, where M is the number of patches, and J denotes the encoder feature dimensions. The encoder features $F_t$ are extracted from input image frames using a pre-trained DINO [25] encoder. We utilize a modified version of the shared key slot attention [23]:
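$$
\begin{aligned}
Q &= \mathrm{Linear}(S_t^{l-1}), \\
K &= \mathrm{Linear}(F_t), \\
W &= \mathrm{Softmax}\left(\frac{K Q^{T}}{\tau},\ \mathrm{axis} = \mathrm{slots}\right), \\
R &= W^{T} K, \\
\tilde{S}_t^{l} &= \mathrm{MLP}(R) + S_t^{l-1}
\end{aligned}
\tag{2}
$$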
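A single stacked shared key slot attention layer can be sketched as follows. Several details are assumptions made only for illustration: the linear layers have no bias, the MLP is a two-layer ReLU network, the temperature is set to the square root of D, and the residual connection is added after the MLP. The name `shared_key_slot_attention` and the `params` dictionary are hypothetical.

```python
import numpy as np


def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)    # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def shared_key_slot_attention(slots, feats, params, tau):
    """One stacked layer: slots (N, D), feats (M, J) -> new slots (N, D), weights (M, N)."""
    q = slots @ params["w_q"]                   # queries from slots, (N, D)
    k = feats @ params["w_k"]                   # keys from patch features, (M, D)
    w = softmax(k @ q.T / tau, axis=1)          # slots compete for each patch, (M, N)
    r = w.T @ k                                 # keys double as values, (N, D)
    hidden = np.maximum(r @ params["w_1"], 0)   # two-layer ReLU MLP (assumed form)
    return hidden @ params["w_2"] + slots, w    # residual placement is an assumption


rng = np.random.default_rng(0)
N, M, J, D = 4, 10, 6, 8
params = {
    "w_q": rng.normal(size=(D, D)) * 0.1,
    "w_k": rng.normal(size=(J, D)) * 0.1,
    "w_1": rng.normal(size=(D, 2 * D)) * 0.1,
    "w_2": rng.normal(size=(2 * D, D)) * 0.1,
}
slots = rng.normal(size=(N, D))
feats = rng.normal(size=(M, J))
new_slots, weights = shared_key_slot_attention(slots, feats, params, tau=np.sqrt(D))
```

Taking the softmax over the slot axis means that the attention weights of every patch sum to one across slots, which is what makes the slots compete for patches. Stacking L such layers, each with its own parameters, replaces the iterative refinement loop of the original slot attention.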
Object-centric models for video, such as SAVI or VideoSAUR, implement slot trajectory consistency through a recurrent predictor model, which predicts the slot initialization for the next time step based on the slots from the previous time step. However, this approach may be less efficient and is prone to vanishing gradients. To address these issues and improve the training and inference speed, we introduce a temporal consistency block. This block processes slots in parallel and is implemented as a transformer encoder layer with a mask $M_{temp}$. The mask causally restricts attention to the same slot at previous time steps and blocks attention to all other slots. In other words, the i-th slot $\tilde{s}_{t,i}^{l} \in \tilde{S}_t^{l}$ can only attend to the same slot $\tilde{s}_{m,i}^{l}$ from a previous time step m ($m < t$). The visualization of the mask is shown in Figure 3b.
By feeding the sequence of extracted slots from the l-th layer in a time window of size T, $\{\tilde{S}_t^{l}\}_{t=1}^{T}$, into the temporal consistency block, we obtain the final slots:
$$\{S_t^{l}\}_{t=1}^{T} = \mathrm{TransfEncLayer}\left(\{\tilde{S}_t^{l}\}_{t=1}^{T},\ \mathrm{mask} = M_{temp}\right) \tag{3}$$
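The structure of the temporal mask can be illustrated with a small boolean matrix over the flattened sequence of T × N slot tokens. The sketch assumes that tokens are ordered time-major (index t · N + i) and that True marks an allowed query-key pair. Letting a slot also attend to itself is an assumption made here so that the first time step has at least one key; the strict variant is available through the hypothetical `include_self` flag.

```python
import numpy as np


def temporal_mask(num_steps, num_slots, include_self=True):
    """Boolean (T*N, T*N) matrix; entry [q, k] is True if query token q may attend to key k."""
    idx = np.arange(num_steps * num_slots)
    step = idx // num_slots                       # time step of each token
    slot = idx % num_slots                        # slot index of each token
    same_slot = slot[:, None] == slot[None, :]    # only the same slot is visible
    if include_self:
        causal = step[None, :] <= step[:, None]   # current and earlier steps
    else:
        causal = step[None, :] < step[:, None]    # strictly earlier steps only
    return same_slot & causal


mask = temporal_mask(num_steps=3, num_slots=2)
```

With T = 3 and N = 2, the token for slot 0 at the last step (index 4) may attend only to indices 0, 2, and 4. Conventions differ between frameworks: for example, `torch.nn.MultiheadAttention` documents True entries of a boolean `attn_mask` as positions that are not allowed to attend, so this matrix would need to be inverted there.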
Assuming that the slot extractor model comprises L layers, sequentially implementing these operations results in the final slots $\{S_t\}_{t=1}^{T} = \{S_t^{L}\}_{t=1}^{T}$.
Since the context length of the model is constrained to a fixed time window T, which is typically small, challenges arise regarding the scalability of the model for longer videos. To address this issue, the slot extractor module is trained in an autoregressive manner. The initialization of the next slots, $S_{T+1}^{init}$, is performed as described in Equation (1). Subsequently, the slots are extracted using the procedure detailed in Equation (2).
The extracted slots, $\tilde{S}_{T+1}^{l}$, are then appended to the shifted time window of previously extracted slots, $\{S_t\}_{t=2}^{T}$, resulting in the sequence $S_2, S_3, \dots, \tilde{S}_{T+1}^{l}$. These slots are then fed into the temporal consistency block described in Equation (3). By sequentially processing them through all L layers, the final slots, $S_{T+1}$, are obtained. This procedure can be repeated for any desired horizon beyond the initial context length T.
The training approach follows the methodology outlined in DINOSAUR [6]. After extracting the slots $S_t$ across the entire training trajectory of length $T + Z$, where Z represents the prediction horizon for autoregressive extraction, these slots are processed through a shared MLP decoder to generate feature reconstructions $\hat{F}_t$. The training objective is to minimize the mean squared error (MSE) loss:
$$\mathcal{L}_{extr} = \frac{1}{T + Z} \sum_{t=1}^{T+Z} \left\| F_t - \hat{F}_t \right\|^2$$
The overall time complexity of a single forward pass of the extractor model is $O(T \cdot L \cdot K \cdot N \cdot D)$, where T is the initial time window of slots, L is the number of extractor layers, K is the number of frame feature patches, N is the number of slots, and D is the slot dimension.

3.2. Transition Model

The transition model is inspired by OCVP. During the early phases of training of the slot extractor, slots often lack useful information, which can push the transition model away from the optimal solution or cause it to become stuck in a local minimum. To address these challenges and accelerate the training process, the transition model is trained separately from the slot extractor on pre-computed slots.
At the start, slots $S_t \in \mathbb{R}^{N \times D}$, $t = 0, \dots, T$, within a time window of size T are projected into the transition model’s dimension. Since the transition model operates with the same dimension D as the slots, we have
$$S_{prep,t} = \mathrm{Linear}(S_t) + p_t$$
The transition model supports both continuous and discrete actions. Continuous actions $a_t^{cont} \in \mathbb{R}^{B}$ are projected into the same dimension as the transition model:
$$\hat{a}_t = \mathrm{Linear}(a_t^{cont}), \qquad \hat{a}_t \in \mathbb{R}^{D}$$
For discrete actions, we use trainable embeddings for each possible action $a_t^{desc,i} \in A_t^{desc} = \{a_t^{desc,1}, a_t^{desc,2}, \dots, a_t^{desc,J}\}$:
$$\hat{a}_t = \mathrm{EmbeddingsDictionary}(a_t^{desc,i})$$
To incorporate temporal information, we add sinusoidal time embeddings $p_t$ to each slot and action at time step t:
$$s_{prep,t}^{i} = s_t^{i} + p_t, \quad s_{prep,t}^{i} \in S_{prep,t}, \qquad a_t = \hat{a}_t + p_t$$
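The two action encoders and the shared time embedding can be sketched together. The random matrices below stand in for the trainable projection and the trainable embeddings dictionary, the linear layer is written without a bias for brevity, and the names `time_embedding`, `encode_continuous`, and `encode_discrete` are hypothetical.

```python
import numpy as np


def time_embedding(t, dim):
    """Sinusoidal embedding p_t for a single integer time step t; dim must be even."""
    freqs = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)
    emb = np.zeros(dim)
    emb[0::2] = np.sin(t * freqs)
    emb[1::2] = np.cos(t * freqs)
    return emb


rng = np.random.default_rng(0)
D, B, J = 8, 3, 4                            # model dim, continuous action size, number of discrete actions
w_act = rng.normal(size=(B, D))              # stands in for the trainable Linear layer (bias omitted)
embedding_table = rng.normal(size=(J, D))    # stands in for the trainable embeddings dictionary


def encode_continuous(action, t):
    """Project a continuous action of shape (B,) to D dimensions and add p_t."""
    return action @ w_act + time_embedding(t, D)


def encode_discrete(action_index, t):
    """Look up the embedding of an integer action index and add p_t."""
    return embedding_table[action_index] + time_embedding(t, D)


a_cont = encode_continuous(np.array([0.5, -1.0, 0.25]), t=2)
a_disc = encode_discrete(3, t=2)
```

Both encoders return a vector of size D, so the rest of the transition model does not need to know which action type the environment uses.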
Subsequently, actions and slots are passed into an action attention block, relation attention block, and temporal attention in parallel to the original OCVP layer. The architecture of the predictor block is presented in Figure 3c.
Action attention is implemented as a simple multi-head cross-attention. In this layer, actions $a_t$ serve as keys and values, while slots $S_t$ act as queries.
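The roles of queries, keys, and values in action attention can be sketched with a single attention head. The multi-head version would split D into several heads; a single head is used here only to keep the example short. No attention mask is applied, since the text does not state whether one is used, and the function name `action_cross_attention` is hypothetical.

```python
import numpy as np


def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)    # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def action_cross_attention(slots, actions, w_q, w_k, w_v):
    """Slots (T, N, D) are queries; actions (T, D) provide keys and values."""
    num_steps, num_slots, dim = slots.shape
    q = slots.reshape(num_steps * num_slots, dim) @ w_q    # (T*N, D)
    k = actions @ w_k                                      # (T, D)
    v = actions @ w_v                                      # (T, D)
    attn = softmax(q @ k.T / np.sqrt(dim), axis=-1)        # (T*N, T), each row sums to 1
    out = attn @ v                                         # action information per slot
    return out.reshape(num_steps, num_slots, dim), attn


rng = np.random.default_rng(0)
T, N, D = 4, 3, 8
w_q, w_k, w_v = (rng.normal(size=(D, D)) * 0.1 for _ in range(3))
slots = rng.normal(size=(T, N, D))
actions = rng.normal(size=(T, D))
updated, attn = action_cross_attention(slots, actions, w_q, w_k, w_v)
```

Each slot receives a weighted mixture of the action values, so slots that are more relevant to an action can attend to it more strongly than unrelated slots.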
After projecting the features from the final time step into slot dimensions, we obtain the predicted features for the next time step, denoted as $S_{T+1}$. By shifting the input slot time window and appending $S_{T+1}$ as the last element, along with obtaining $\hat{a}_{T+1}$, we can autoregressively predict future states for any horizon H.
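The sliding-window rollout can be sketched independently of the predictor itself. In the example below, `predict_next` is a hypothetical callable that maps a window of slots and actions to the next slots, and `toy_predictor` is a stand-in used only to make the arithmetic easy to follow.

```python
import numpy as np


def rollout(slot_window, action_window, future_actions, predict_next, horizon):
    """Predict `horizon` future slot sets by repeatedly shifting the context window."""
    slots = list(slot_window)        # T arrays of shape (N, D)
    actions = list(action_window)    # T arrays of shape (D,)
    predictions = []
    for step in range(horizon):
        next_slots = predict_next(np.stack(slots), np.stack(actions))
        predictions.append(next_slots)
        if step < horizon - 1:
            # Drop the oldest entry and append the newest prediction and action.
            slots = slots[1:] + [next_slots]
            actions = actions[1:] + [future_actions[step]]
    return np.stack(predictions)     # (H, N, D)


def toy_predictor(slots, actions):
    """Stand-in dynamics: last slots shifted by the last action."""
    return slots[-1] + actions[-1]


slot_window = np.zeros((2, 1, 1))                 # T = 2, N = 1, D = 1
action_window = np.array([[1.0], [1.0]])
future_actions = np.array([[2.0], [3.0]])         # actions for the next H - 1 steps
predicted = rollout(slot_window, action_window, future_actions, toy_predictor, horizon=3)
```

With the toy predictor, the three predicted values are 1, 3, and 6, since each step adds the most recent action to the most recent slots. The rollout needs the actions for all future steps except the last prediction, which is why `future_actions` holds H − 1 entries.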
The training objective for the COMPAS dynamics model is defined as the $L_2$ reconstruction loss between the ground truth slots $s_{t,i} \in S_t$ and the model’s predictions $\hat{s}_{t,i} \in \hat{S}_t$:
$$\mathcal{L}_{dyn} = \frac{1}{H \cdot N} \sum_{t=T+1}^{H} \sum_{i=1}^{N} \left\| s_{t,i} - \hat{s}_{t,i} \right\|^2$$
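The objective amounts to a squared distance per slot, averaged over the predicted steps and the slots. A minimal sketch, assuming that the ground truth and predicted slots for the predicted steps are stacked into arrays of shape (H, N, D); the function name `dynamics_loss` is hypothetical.

```python
import numpy as np


def dynamics_loss(true_slots, pred_slots):
    """Squared L2 distance per slot, averaged over H predicted steps and N slots."""
    sq_dist = np.sum((true_slots - pred_slots) ** 2, axis=-1)    # (H, N)
    num_steps, num_slots = sq_dist.shape
    return sq_dist.sum() / (num_steps * num_slots)


true_slots = np.zeros((2, 3, 4))    # H = 2, N = 3, D = 4
pred_slots = np.ones((2, 3, 4))
loss = dynamics_loss(true_slots, pred_slots)
```

In this toy case every slot differs by one in each of its four dimensions, so each squared distance is 4 and the averaged loss is also 4.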
The single-forward-pass time complexity of the transition model is $O(L \cdot D \cdot T \cdot N \cdot (N + 2 \cdot T))$.

4. Results

4.1. Environments and Datasets

To evaluate the effectiveness of our proposed model, we perform experiments using trajectories sampled from three environments: ViZDoom [36], Causal World [37], and RoboSuite [38]. By conducting experiments in the selected environments, we aim to demonstrate the versatility and scalability of our model. A detailed description of the datasets is presented in Appendix B.
We primarily focus on comparing the predictive accuracy of our model against that of recent state-of-the-art (SOTA) methods, such as Slotformer and OCVP. Additionally, we include comparisons with the GNN transition model from CSWM, despite it being outdated and relying on a lower-quality slot extraction procedure.
Additionally, we evaluate the COMPAS extractor model in terms of the quality of object detection and slot consistency performance against the VideoSAUR [17] model across 20 and 100 steps.

4.2. Metrics

Other frameworks, such as CSWM [34], use ranking metrics like HITS@1 and MRR for dynamics prediction evaluation. However, this approach is not entirely representative, as these metrics depend on the scaling of the actual slot values and can yield high scores even if the predictions are inaccurate. To address this, we evaluate the models using pre-computed slots $S_t$ and decode them back into observations $O_t$ with sizes $H \times W$. If a model can accurately predict the next slots $\hat{S}_t$, the mean squared error (MSE) with the ground truth should be lower compared to competing models. Similarly, the decoder MSE should also be lower, as better dynamics predictions lead to improved reconstructions $\hat{O}_t$. We calculate the metric at a certain time step t in order to evaluate the quality of predictions at certain horizons.
$$\mathrm{MSE}_t = \frac{1}{N} \left\| S_t - \hat{S}_t \right\|^2, \qquad \mathrm{MSE}_t^{decoder} = \frac{1}{3 (H \times W)} \left\| O_t - \hat{O}_t \right\|^2$$
To measure consistency, we use the Identity F1 (IDF1) score:
$$\mathrm{IDF1} = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}},$$
where TP is the number of correctly matched object identities, FP is the number of predicted identities that do not correspond to any ground truth identity, and FN is the number of ground truth identities not recovered by the tracker.
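As a quick check of the formula, the score can be computed directly from the three counts. Returning 0 when all counts are zero is a convention assumed here, not something stated in the text, and the function name `idf1` is hypothetical.

```python
def idf1(tp, fp, fn):
    """Identity F1 from true positive, false positive, and false negative identity counts."""
    denominator = 2 * tp + fp + fn
    if denominator == 0:
        return 0.0    # assumed convention for the empty case
    return 2 * tp / denominator


score = idf1(tp=8, fp=1, fn=3)
```

With 8 correctly matched identities, 1 spurious prediction, and 3 missed identities, the score is 16 / 20 = 0.8.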
To count identity switches, we first build a pairwise IoU between each predicted slot mask and each ground truth mask. We then apply the Hungarian algorithm to find the optimal matching. Thus, we detect slot permutations and treat them as an identity switch.
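The matching step can be sketched as follows. To keep the example dependency-free, the optimal assignment is found by exhaustive search over permutations, which returns the same optimum as the Hungarian algorithm but is only practical for a handful of slots; a polynomial-time solver such as SciPy’s `linear_sum_assignment` would be the usual choice for larger problems. The sketch assumes equally many predicted and ground truth masks and a reference assignment in which slot i tracks object i; all function names are hypothetical.

```python
import itertools

import numpy as np


def pairwise_iou(pred_masks, gt_masks):
    """IoU between every predicted mask (N, H, W) and every ground truth mask (G, H, W)."""
    p = pred_masks.reshape(len(pred_masks), -1).astype(bool)
    g = gt_masks.reshape(len(gt_masks), -1).astype(bool)
    inter = (p[:, None, :] & g[None, :, :]).sum(axis=-1)
    union = (p[:, None, :] | g[None, :, :]).sum(axis=-1)
    return inter / np.maximum(union, 1)    # empty-vs-empty pairs get IoU 0


def best_matching(iou):
    """Exhaustive optimal assignment for a square IoU matrix (small N only)."""
    n = iou.shape[0]
    return list(max(itertools.permutations(range(n)),
                    key=lambda perm: sum(iou[i, perm[i]] for i in range(n))))


def count_identity_switches(matching):
    """Number of slots whose matched identity differs from their own index."""
    return sum(1 for slot, identity in enumerate(matching) if slot != identity)


gt = np.zeros((2, 2, 2), dtype=bool)
gt[0, :, 0] = True     # object 0 covers the left column
gt[1, :, 1] = True     # object 1 covers the right column
pred = gt[::-1]        # the two slots have swapped objects
iou = pairwise_iou(pred, gt)
matching = best_matching(iou)
switches = count_identity_switches(matching)
```

Here the two slots have exchanged objects, so the optimal matching is the swapped permutation and both slots are counted as identity switches.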
To measure trajectory consistency, we treat the predicted mask from time $t - 1$ as the ground truth for time t.
For episodic consistency, we fix masks as the ground truth from one reference episode. For every other episode of the same object, we compute the IDF1 between its mask and the ground truth mask from the reference episode.
For the ARI, FG-ARI, and mBO metrics, we follow the setup from [17].

4.3. Training and Evaluation Setup

All transition models are trained and evaluated on pre-computed slots extracted by the COMPAS slot extractor. The slot extractor was trained for 150 epochs with a batch size of 128 for each environment.
More detailed descriptions are available in Appendix A and Appendix B. All configurations and code required for the experiments will be made available.

4.4. Experiments

We introduce OCVP models with simple action conditioning. OCVP (slot act.) uses the processed action vector $\hat{a}_t$ as an additional slot. OCVP (add act.) adds $\hat{a}_t$ to each slot at time step t.
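The two conditioning variants differ only in how the action vector enters the slot tensor, which a short sketch makes concrete. It assumes slots of shape (T, N, D) and processed action vectors of shape (T, D); the function names are hypothetical.

```python
import numpy as np


def condition_slot_act(slots, action_vecs):
    """OCVP (slot act.): append the action vector as an extra slot, (T, N, D) -> (T, N + 1, D)."""
    return np.concatenate([slots, action_vecs[:, None, :]], axis=1)


def condition_add_act(slots, action_vecs):
    """OCVP (add act.): add the action vector to every slot of the same time step."""
    return slots + action_vecs[:, None, :]


slots = np.zeros((2, 3, 4))                               # T = 2, N = 3, D = 4
action_vecs = np.arange(8, dtype=float).reshape(2, 4)
as_extra_slot = condition_slot_act(slots, action_vecs)    # (2, 4, 4)
added = condition_add_act(slots, action_vecs)             # (2, 3, 4)
```

The first variant lets the predictor decide through attention which slots use the action, at the cost of one extra token per step. The second variant leaves the sequence length unchanged but injects the same action signal into every slot.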
As highlighted in Table 1, COMPAS demonstrated superior predictive accuracy in the Causal World environment with the Reach task. This finding underscores the robustness of our approach in handling complex, physically simulated environments.
Furthermore, COMPAS’s performance increased significantly in the more challenging Push task (Table 2). In contrast, non-action-dependent models, such as Slotformer and OCVP, exhibited substantial performance degradation under these conditions. This comparison highlights the resilience and adaptability of COMPAS in diverse and complex tasks.
On the RoboSuite dataset, the GNN baseline achieved better results than the other baseline models, likely due to the lower object density and reduced interaction complexity in this environment. Nevertheless, COMPAS still achieved the best overall performance on this dataset, outperforming all baseline methods (Table 3).
On the ViZDoom dataset, COMPAS performed consistently well, achieving better results compared to the alternative methods. The results are presented in Table 4.
Moreover, COMPAS demonstrated better object detection quality and slot assignment consistency than the VideoSAUR model for both short (20 steps) and long (100 steps) trajectories in the Causal World (Push) environment. These results are presented in Table 5.
The ablation study of the COMPAS extractor, presented in Table 6, demonstrates the effectiveness of the proposed architectural components. While the temporal block has a smaller impact on the segmentation quality compared to temporal encoding, it significantly enhances the model’s episodic temporal consistency.
Similarly, the ablation study of the dynamics model, shown in Table 7, highlights the importance of the action-processing component in improving the prediction quality.
An ablation study of separated and end-to-end training for the COMPAS dynamics model is conducted in the Causal World (Push) environment. In the separated setup, the extractor and dynamics model are trained separately, whereas, in the end-to-end setup, they are trained jointly using different optimizers. Table 8 shows that end-to-end training performs slightly worse. A possible reason is that the extractor learns less effective representations under joint optimization, which negatively affects the dynamics model.
We also include an ablation study on the effect of the temporal context size. As shown in Table 9 and Table 10, reducing the context size leads to a slight drop in performance, indicating that a shorter temporal window provides less information for accurate representation learning and dynamics prediction. In contrast, increasing the context size beyond 4 does not yield noticeable improvements, as the results remain nearly unchanged. This suggests that a moderate context size is already sufficient to capture the relevant temporal information, while a larger context provides little additional benefit.
We also include experiments demonstrating the use of our world model for decision-making tasks—specifically, planning with the DINO-WM [39] baseline—in Appendix D.
Overall, COMPAS proves to be a more robust and adaptable model than the object-centric baselines, particularly in synthetic environments and tasks with both discrete and continuous action spaces.

5. Discussion

The main limitation of our model is the introduction of a new hyperparameter, the context length, which can impact the consistency of the slots. More generally, the model’s performance is sensitive to hyperparameters such as the number of slots and the context time window, both of which can significantly affect the quality of slot representations and dynamics predictions. Furthermore, handling real-world data remains a significant challenge, as object-centric models inherently struggle to process such data efficiently. Future research could address these issues and improve the model’s robustness and adaptability.

6. Conclusions

Our model demonstrates superior consistency over multiple prediction steps compared to analogous models in the field. This confirms the model’s ability to handle different prediction horizons.
Looking ahead, a promising direction is to explore neuro-symbolic perspectives in object-centric approaches. By combining neural networks with symbolic reasoning, models can integrate structured knowledge with learned data, enhancing interpretability, robustness, and generalization. Another key direction is developing algorithms that efficiently leverage object information to drive agent behavior. This requires the careful handling of composable states, as merging multiple object states into a single representation can obscure critical information and reintroduce the binding problem.
In conclusion, our work represents an important step forward in object-centric world models. By integrating action information directly into the model and using an autoregressive transformer model for prediction, we have demonstrated a novel way to encode and predict the dynamics of complex environments. We expect that our work will inspire further research and advances in this exciting area of machine learning.

Author Contributions

Conceptualization, V.V. and L.U.; methodology, V.V.; software, V.V.; validation, V.V., L.U. and V.F.; formal analysis, L.U.; writing—original draft preparation, V.V., L.U. and V.F.; writing—review and editing, V.V., V.F., L.U., A.K. and A.P.; visualization, V.V.; supervision, A.P. and A.K. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Ministry of Science and Higher Education of the Russian Federation under Project 075-15-2024-544.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All data were obtained from publicly available, open-source simulators. Data collection is described in Appendix B.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AI    Artificial Intelligence
ARI    Adjusted Rand Index
CNN    Convolutional Neural Network
COMPAS    Compose Actions and Slots
CSWM    Contrastive Structured World Model
DINO    Self-Distillation with No Labels
FG    Foreground
GNN    Graph Neural Network
HITS    Human Importance-Aware Transition Score
MLP    Multi-Layer Perceptron
MRR    Mean Reciprocal Rank
MSE    Mean Squared Error
OCVP    Object-Centric Video Prediction
RGB    Red Green Blue (color representation)
RL    Reinforcement Learning
RNN    Recurrent Neural Network
SAVI    Slot Attention for Video
SCALOR    Scene-Centric Autoencoder for Object Representations
SILOT    Sequential Inference for Latent Object Tracking
SLATE    Slot Attention-Based Transformer Encoder
SOTA    State-of-the-Art
SQAIR    Sequential Attend Infer Repeat
STEVE    Slot-Based Transformer for Video
STOVE    Structured Object-Aware Video Prediction
SWM    Structured World Model
VideoSAUR    Video Object-Centric Representation Learning Model

Appendix A. Architecture and Computational Details

Table A1, Table A2, Table A3, Table A4, Table A5, Table A6 and Table A7 describe the architecture details and hyperparameters for our experiments.
Table A1. Hyperparameters of COMPAS slot extractor used for Causal World experiments.
Module    Parameter                                  Value
-         Image size                                 224
-         Number of train episodes collected         10,000
-         Number of validation episodes collected    1000
-         Number of steps per episode                100
-         Number of training epochs                  150
-         Batch size                                 128
-         Video length                               8
DINO      Patch size                                 8
DINO      Model size                                 S
COMPAS    Slot dim                                   128
COMPAS    Number of slots                            8
COMPAS    Number of layers                           4
COMPAS    Action type                                Continuous
COMPAS    Action size                                9
Table A2. Hyperparameters of COMPAS slot extractor used for ViZDoom experiments.
Module    Parameter                                  Value
-         Image size                                 224
-         Number of train episodes collected         10,000
-         Number of validation episodes collected    1000
-         Number of steps per episode                100
-         Number of training epochs                  150
-         Batch size                                 128
-         Video length                               8
DINO      Patch size                                 8
DINO      Model size                                 S
COMPAS    Slot dim                                   128
COMPAS    Number of slots                            8
COMPAS    Number of layers                           4
COMPAS    Action type                                Discrete
COMPAS    Number of action embeddings                4
Table A3. Hyperparameters of COMPAS slot extractor used for RoboSuite experiments.

| Module | Parameter | Value |
|---|---|---|
| | Image size | 224 |
| | Number of train episodes collected | 5000 |
| | Number of validation episodes collected | 1000 |
| | Number of steps per episode | 100 |
| | Number of training epochs | 150 |
| | Batch size | 128 |
| | Video length | 8 |
| DINO | Patch size | 8 |
| DINO | Model size | S |
| COMPAS | Slot dim | 128 |
| COMPAS | Number of slots | 4 |
| COMPAS | Number of layers | 4 |
| COMPAS | Action type | Continuous |
| COMPAS | Action size | 4 |
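The slot extractor settings in Tables A1 to A3 differ only in a few fields, which a small configuration object makes easy to see. The sketch below is illustrative: the class `ExtractorConfig`, its field names, and the dictionary `CONFIGS` are hypothetical and not part of any released code; only the values come from the tables. The helper `patches_per_image` additionally assumes non-overlapping square patches and no class token.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ExtractorConfig:
    """Hypothetical container; values are taken from Tables A1 to A3."""
    environment: str
    train_episodes: int
    num_slots: int
    action_type: str                              # "continuous" or "discrete"
    action_size: Optional[int] = None             # continuous action dimension
    num_action_embeddings: Optional[int] = None   # number of discrete actions
    # Settings shared by all three environments.
    image_size: int = 224
    val_episodes: int = 1000
    steps_per_episode: int = 100
    epochs: int = 150
    batch_size: int = 128
    video_length: int = 8
    dino_patch_size: int = 8
    dino_model_size: str = "S"
    slot_dim: int = 128
    num_layers: int = 4

    def patches_per_image(self) -> int:
        """Patch count, assuming non-overlapping patches and no class token."""
        return (self.image_size // self.dino_patch_size) ** 2

    def train_frames(self) -> int:
        """Total number of training frames collected."""
        return self.train_episodes * self.steps_per_episode


CONFIGS = {
    "causal_world": ExtractorConfig("Causal World", 10_000, 8, "continuous", action_size=9),
    "vizdoom": ExtractorConfig("ViZDoom", 10_000, 8, "discrete", num_action_embeddings=4),
    "robosuite": ExtractorConfig("RoboSuite", 5_000, 4, "continuous", action_size=4),
}
```

Under these assumptions, a 224 × 224 image with patch size 8 yields 28 × 28 = 784 patches, and the RoboSuite training set contains 5000 × 100 = 500,000 frames.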
Table A4. COMPAS transition model hyperparameters.

| Environment | Model Dim | Memory Length | Num. Layers | Heads | Dropout | Training Epochs | Batch Size | Horizon |
|---|---|---|---|---|---|---|---|---|
| ViZDoom | 128 | 4 | 8 | 4 | 0.1 | 100 | 64 | 4 |
| Causal World | 128 | 4 | 8 | 4 | 0.1 | 100 | 64 | 4 |
| RoboSuite | 128 | 4 | 8 | 4 | 0.1 | 100 | 64 | 4 |
Table A5. Slotformer transition model hyperparameters.

| Environment | Model Dim | Memory Length | Num. Layers | Heads | Dropout | Training Epochs | Batch Size | Horizon |
|---|---|---|---|---|---|---|---|---|
| ViZDoom | 128 | 4 | 8 | 4 | 0.1 | 100 | 64 | 4 |
| Causal World | 128 | 4 | 8 | 4 | 0.1 | 100 | 64 | 4 |
| RoboSuite | 128 | 4 | 8 | 4 | 0.1 | 100 | 64 | 4 |
Table A6. OCVP transition model hyperparameters.

| Environment | Model Dim | Memory Length | Num. Layers | Heads | Dropout | Training Epochs | Batch Size | Horizon |
|---|---|---|---|---|---|---|---|---|
| ViZDoom | 128 | 4 | 8 | 4 | 0.1 | 100 | 64 | 4 |
| Causal World | 128 | 4 | 8 | 4 | 0.1 | 100 | 64 | 4 |
| RoboSuite | 128 | 4 | 8 | 4 | 0.1 | 100 | 64 | 4 |
Table A7. GNN transition model hyperparameters.

| Environment | Model Dim | Memory Length | Num. Layers | Heads | Dropout | Training Epochs | Batch Size | Horizon |
|---|---|---|---|---|---|---|---|---|
| ViZDoom | 128 | 4 | 8 | 4 | 0.1 | 100 | 64 | 4 |
| Causal World | 128 | 4 | 8 | 4 | 0.1 | 100 | 64 | 4 |
| RoboSuite | 128 | 4 | 8 | 4 | 0.1 | 100 | 64 | 4 |
All experiments were conducted in a SLURM-managed environment. Each run was performed on a single node with the hardware configuration detailed in Table A8. A GPU served as the primary computational device for all experiments.
Table A8. Hardware configuration for SLURM node.

| Component | Value |
|---|---|
| GPU | NVIDIA A100 80 GB |
| CPU | Intel Xeon Gold 6248R |
| Number of cores | 4 |
| RAM | 1024 GB |

Appendix B. Datasets

Appendix B.1. ViZDoom

ViZDoom is a first-person-perspective environment built upon the classic Doom video game engine. It offers a visually detailed and dynamic setting for experiments, combining elements such as partial observability and complex enemy behavior. We use the Defend the Line task within ViZDoom to evaluate our model’s ability to operate in a discrete action space effectively. In this task, the agent is placed in a rectangular arena, starting at the center of one of the longer walls. On the opposite wall, three melee-only monsters and three shooting monsters are spawned. Enemies can be killed with a single shot. After being defeated, each monster respawns after a delay with increased durability, posing a progressively greater challenge. The discrete action space for this task consists of four actions: no action, turn left, turn right, and shoot.
We collected the data using a uniform random policy, consisting of 10,000 training episodes, each with 100 time steps, and 1000 evaluation episodes of the same length. The observations consist of RGB images with a resolution of 224 × 224.
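The data collection procedure described above can be sketched as follows. This is a minimal illustration, not the authors' collection script: `dummy_step` is a hypothetical stand-in for a real environment step, the action names are labels chosen here for the four actions listed above, and the demo collects only 10 episodes instead of 10,000.

```python
import random
from typing import Callable, List, Tuple

# Labels for the four discrete actions of the Defend the Line task.
ACTIONS = ["no_action", "turn_left", "turn_right", "shoot"]

Episode = List[Tuple[str, int]]  # (observation placeholder, action index)


def collect_episode(step_fn: Callable[[int], str], num_steps: int,
                    rng: random.Random) -> Episode:
    """Roll out one episode with actions drawn uniformly at random."""
    episode = []
    for _ in range(num_steps):
        action = rng.randrange(len(ACTIONS))  # uniform over the 4 actions
        observation = step_fn(action)
        episode.append((observation, action))
    return episode


def dummy_step(action: int) -> str:
    """Hypothetical environment step; returns a placeholder observation."""
    return f"frame_after_{ACTIONS[action]}"


rng = random.Random(0)
# The paper uses 10,000 training episodes of 100 steps; 10 episodes here.
dataset = [collect_episode(dummy_step, num_steps=100, rng=rng) for _ in range(10)]
```

In a real setup, `dummy_step` would be replaced by a call into the simulator that returns a 224 × 224 RGB frame.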

Appendix B.2. Causal World

At the other end of the complexity spectrum, we use the Causal World environment, which presents a considerably more challenging scenario. Causal World is a physically simulated environment in which three robotic arms interact with objects. It provides a continuous action space and requires the model to understand complex physical interactions and the impacts of nuanced actions on the state of the world. In the Causal World environment, we consider two tasks, both with an action space of size 9. In the Push task, the goal of the agent is to push one block towards a goal position with a specific orientation. In the Reach task, the agent must reach the target block, which is one of four blocks in the scene, with a manipulator. For the Reach and Push tasks, we followed the same data collection scheme as for the ViZDoom environment. Observations in Causal World are rendered as images of size 224 × 224.

Appendix B.3. RoboSuite

We additionally experiment with a more difficult robotic environment, RoboSuite, which uses a Franka Panda robotic arm for manipulation. We choose the Block Lifting task, in which a robotic arm must pick up a red cube placed on the tabletop in front of it and lift it above a certain height. The cube location and the robot’s initial state are randomized at the start of each episode. The continuous action space for this task has size 4. The dataset was split into a training set containing 5000 trajectories and a validation set containing 1000 trajectories. The data were collected using a uniform random policy. In the RoboSuite environment, we captured images of size 256 × 256 from the front-view camera and resized them to a resolution of 224 × 224.

Appendix C. Additional Slot Mapping Visualization

We provide additional visualizations from different environments for the slot extractor model in Figure A1 and Figure A2.
Figure A1. Episode-consistent slot visualizations from RoboSuite environment, extracted by COMPAS slot extractor. First column: ground truth image; subsequent columns: slot visualizations.
Figure A2. Episode-consistent slot visualizations from ViZDoom environment, extracted by COMPAS slot extractor. First column: ground truth image; subsequent columns: slot visualizations.

Appendix D. Model-Based Planning

In this section, we present the results of planning using the COMPAS world model, trained in an offline setting. We adopt the same planning setup as for DINO-WM [39] and compare our approach directly with theirs in our experiments.
We evaluate on two environments: the Gym Fetch Reach task, where a robotic manipulator must reach a target position, and the Point Maze environment—a 3D labyrinth where a red ball must navigate to a pre-defined green target position.
We report the success rate over 80 runs for each environment, using the same Point Maze configuration as for DINO-WM. For the Fetch Reach task, COMPAS uses the RoboSuite architecture. For the Point Maze, we use the same configuration as DINO-WM, with the only difference being that COMPAS uses five slots. The results are summarized in Table A9.
Table A9. Performance comparison of COMPAS and DINO-WM across Fetch Reach and Point Maze tasks. Mean ± std for 3 seeds.

| Model | Fetch Reach | Point Maze |
|---|---|---|
| COMPAS | 1.00 | 0.69 |
| DINO-WM | 1.00 | 0.98 |
Our model performs noticeably worse than DINO-WM in the Point Maze environment (0.69 versus 0.98). We suspect that this is due to the reintroduction of the binding problem: since planning requires combining all slots to compute the planning loss, this process may inadvertently merge distinct object representations, leading to information mixing.
Another factor affecting planning performance is the absence of proprioceptive state encoding in our model. Since it is trained solely on pixel observations, it may lack crucial internal state information needed for precise control.
Nonetheless, the experiment shows that our model can still capture meaningful environmental representations that are useful for planning.
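The remark about combining all slots in the planning loss can be illustrated with a toy sketch. It assumes, as one plausible reading, that the planning cost is a mean squared error over the concatenation of all slot vectors. The function `planning_cost` and the slot values are hypothetical and are not taken from the paper or from DINO-WM.

```python
from typing import List

Slots = List[List[float]]  # N_slot vectors, each of dimension D


def planning_cost(predicted: Slots, goal: Slots) -> float:
    """Mean squared error over all slots flattened into one vector (assumed form)."""
    flat_pred = [v for slot in predicted for v in slot]
    flat_goal = [v for slot in goal for v in slot]
    return sum((p - g) ** 2 for p, g in zip(flat_pred, flat_goal)) / len(flat_pred)


# Toy example with two slots of dimension 2: slot 0 stands for the agent,
# slot 1 for an unrelated object. All values are made up for illustration.
goal = [[1.0, 1.0], [0.0, 0.0]]
agent_at_goal = [[1.0, 1.0], [2.0, 2.0]]   # agent slot matches, other slot is off
agent_off_goal = [[0.0, 0.0], [0.0, 0.0]]  # agent slot is off, other slot matches

cost_at_goal = planning_cost(agent_at_goal, goal)    # (0 + 0 + 4 + 4) / 4
cost_off_goal = planning_cost(agent_off_goal, goal)  # (1 + 1 + 0 + 0) / 4
```

In this toy case, the candidate in which the agent slot matches the goal receives the higher cost, because the error of the unrelated slot dominates the flattened objective. This is only one way in which mixing slot information could mislead a planner.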

Appendix E. Problem Statement

We work with a dataset of trajectories. Let $x_1, \dots, x_t, \dots$ denote a sequence of images constituting a trajectory. We extract patch-encoded features from the images using a pre-trained frozen DINO encoder: $\hat{F}_t = \mathrm{DINO}(x_t) \in \mathbb{R}^{K \times J}$, where $K$ is the number of patches in a single image and $J$ is the size of each patch feature. Let $N_{\mathrm{slot}}$ denote the chosen number of slots. For time steps $1 \le t \le T$ within the context window of size $T$, the slots are initialized deterministically: $z_t^i = \mu_i$ for $1 \le i \le N_{\mathrm{slot}}$, where $\mu_i$ denotes learnable parameters (Equation (1) in the main text). Let $f_\theta$ denote a slot attention module with parameters $\theta$ implementing the neural network layers described in Equations (2) and (3) in the main text. The slots for time steps within the context window, $1 \le t \le T$, are obtained using the initialized values:
$$f_\theta(\hat{F}_t, z_t^1, \dots, z_t^{N_{\mathrm{slot}}}) \to s_t^1, \dots, s_t^{N_{\mathrm{slot}}}.$$
The slots for time steps from the autoregressive horizon, $T < t \le T + Z$, are calculated using the values of slots from the previous time step:
$$f_\theta(\hat{F}_t, s_{t-1}^1, \dots, s_{t-1}^{N_{\mathrm{slot}}}) \to s_t^1, \dots, s_t^{N_{\mathrm{slot}}}.$$
Let $g_\eta$ denote a decoder network with parameters $\eta$. We use it to reconstruct the ground truth DINO features $\hat{F}_t$ from the slot representations:
$$g_\eta(s_t^1, \dots, s_t^{N_{\mathrm{slot}}}) \to \tilde{F}_t.$$
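The two initialization regimes (learnable parameters inside the context window, previous slots on the autoregressive horizon) can be sketched in a few lines. This is an illustrative control-flow sketch only: `toy_slot_attention` is a hypothetical placeholder for $f_\theta$ and bears no resemblance to a real slot attention module, and all names are introduced here.

```python
from typing import Callable, List

Vector = List[float]
Slots = List[Vector]


def toy_slot_attention(features: Vector, init_slots: Slots) -> Slots:
    """Placeholder for f_theta: shifts each initial slot by the mean feature value."""
    mean_feature = sum(features) / len(features)
    return [[value + mean_feature for value in slot] for slot in init_slots]


def extract_slots(features_per_step: List[Vector], mu: Slots, context_size: int,
                  f: Callable[[Vector, Slots], Slots] = toy_slot_attention) -> List[Slots]:
    """Return the slots s_t for every time step t = 1, ..., T + Z."""
    all_slots: List[Slots] = []
    for t, features in enumerate(features_per_step, start=1):
        if t <= context_size:
            init = mu             # z_t^i = mu_i inside the context window
        else:
            init = all_slots[-1]  # s_{t-1}^i on the autoregressive horizon
        all_slots.append(f(features, init))
    return all_slots


mu = [[0.0], [1.0]]                    # two slots of dimension 1
features = [[1.0], [2.0], [3.0]]       # T = 2 context steps, Z = 1 horizon step
slots = extract_slots(features, mu, context_size=2)
```

The first two steps start from the same `mu`, while the third step starts from the slots of the second step.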
We assume the multivariate Gaussian statistical model $P(F_t \mid x_t; \{\mu_i\}, \theta, \eta)$ with an identity covariance matrix $\Sigma$ for the DINO features:
$$P(F_t \mid x_t; \{\mu_i\}, \theta, \eta) = \frac{1}{(\det \Sigma)^{0.5} (2\pi)^{0.5 K J}} \exp\left(-\frac{1}{2} (F_t - \hat{F}_t)^{\top} \Sigma^{-1} (F_t - \hat{F}_t)\right).$$
We optimize the slot initialization parameters $\{\mu_i\}$ and the parameters of the neural networks $f_\theta$ and $g_\eta$ by maximizing the likelihood of the observed data, assuming independence between time steps. This gives us an expression that is proportional to the MSE objective $\mathcal{L}_{\mathrm{extr}}$ for slot extraction from the main text:
$$\begin{aligned}
\{\mu_i^*\}, \theta^*, \eta^* &= \arg\max P(\tilde{F}_1 \mid x_1; \{\mu_i\}, \theta, \eta) \cdots P(\tilde{F}_{T+Z} \mid x_{T+Z}; \{\mu_i\}, \theta, \eta) \\
&= \arg\max \sum_{t=1}^{T+Z} \log P(\tilde{F}_t \mid x_t; \{\mu_i\}, \theta, \eta) \\
&= \arg\max \left( -\frac{T+Z}{2} K J \log 2\pi - \sum_{t=1}^{T+Z} \frac{\lVert \tilde{F}_t(\{\mu_i\}, \theta, \eta) - \hat{F}_t \rVert^2}{2} \right) \\
&= \arg\min \sum_{t=1}^{T+Z} \lVert \tilde{F}_t(\{\mu_i\}, \theta, \eta) - \hat{F}_t \rVert^2.
\end{aligned}$$
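The step from the Gaussian likelihood to the MSE objective can be checked numerically. The sketch below is illustrative only: with an identity covariance matrix, the log-likelihood is a constant minus half the squared error, so the candidate that maximizes the likelihood is also the one that minimizes the squared error. The function names and the toy vectors are hypothetical.

```python
import math
from typing import Dict, List


def squared_error(value: List[float], mean: List[float]) -> float:
    """Squared Euclidean distance between two flattened feature vectors."""
    return sum((v - m) ** 2 for v, m in zip(value, mean))


def gaussian_log_likelihood(value: List[float], mean: List[float]) -> float:
    """log N(value; mean, I): the determinant term vanishes because det(I) = 1."""
    n = len(value)
    return -0.5 * n * math.log(2 * math.pi) - 0.5 * squared_error(value, mean)


# Stand-in for the DINO features of one frame (flattened, made-up values).
target = [0.2, -1.0, 0.5]
# Stand-ins for reconstructions produced by three parameter settings.
candidates: Dict[str, List[float]] = {
    "a": [0.0, -1.0, 0.5],
    "b": [0.2, -0.9, 0.4],
    "c": [1.0, 1.0, 1.0],
}

best_by_likelihood = max(
    candidates, key=lambda k: gaussian_log_likelihood(candidates[k], target))
best_by_mse = min(candidates, key=lambda k: squared_error(candidates[k], target))
```

Both criteria select the same candidate, since the two objectives differ only by a sign, a factor of one half, and an additive constant.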
When we train the transition model, we work with a dataset of slots extracted from the trajectory images. Let $\hat{S}_1 = (\hat{s}_1^1, \dots, \hat{s}_1^{N_{\mathrm{slot}}}), \dots, \hat{S}_t = (\hat{s}_t^1, \dots, \hat{s}_t^{N_{\mathrm{slot}}}), \dots$ denote a sequence of slot representations extracted from the sequence of images $x_1, \dots, x_t, \dots$ constituting a trajectory. Let $a_1, \dots, a_t, \dots$ denote a sequence of actions from the same trajectory. We project actions into the internal dimension of size $D$ of the transition model by employing a learnable function $p_\nu(a_t) \to \hat{a}_t \in \mathbb{R}^D$. In the case of a continuous action space, $p_\nu$ is a neural network with parameters $\nu$ (Equation (5) in the main text), while, in the case of a discrete action space, $p_\nu$ is a set of learnable embeddings for each possible action (Equation (6) in the main text).
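The two forms of the action projection can be sketched as follows. This is a simplified illustration under stated assumptions: the continuous case uses a single linear layer as a stand-in for the network $p_\nu$ (the actual architecture is given by Equation (5) in the main text and is not reproduced here), the random matrices stand in for learnable parameters, and all function names are hypothetical.

```python
import random
from typing import List

MODEL_DIM = 128  # internal dimension D of the transition model (Table A4)


def make_matrix(rows: int, cols: int, rng: random.Random) -> List[List[float]]:
    """Random matrix standing in for learnable parameters."""
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


def project_continuous(action: List[float], weights: List[List[float]],
                       bias: List[float]) -> List[float]:
    """Single linear layer a -> W a + b (simplified stand-in for p_nu)."""
    return [sum(w * a for w, a in zip(row, action)) + b
            for row, b in zip(weights, bias)]


def project_discrete(action_index: int, embeddings: List[List[float]]) -> List[float]:
    """Embedding lookup: one learnable vector per discrete action."""
    return embeddings[action_index]


rng = random.Random(0)

# Continuous case, e.g. Causal World with action size 9.
weights = make_matrix(MODEL_DIM, 9, rng)
bias = [0.0] * MODEL_DIM
a_hat_continuous = project_continuous([0.1] * 9, weights, bias)

# Discrete case, e.g. ViZDoom with 4 actions.
embeddings = make_matrix(4, MODEL_DIM, rng)
a_hat_discrete = project_discrete(2, embeddings)
```

In both cases the result is a vector of size $D$, so the transition model can treat projected actions uniformly regardless of the action space.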
Let $h_\phi$ denote the transition model with parameters $\phi$ (Equations (4) and (7) and Figure 2 in the main text). It takes a sequence of size $T$ of slots and projected actions as input and predicts the values of slots for the next time step $T + 1$:
$$h_\phi(\hat{S}_1, \dots, \hat{S}_T, \hat{a}_1, \dots, \hat{a}_T) \to \tilde{S}_{T+1} = (\tilde{s}_{T+1}^1, \dots, \tilde{s}_{T+1}^{N_{\mathrm{slot}}}).$$
For time steps within the prediction horizon, $T + 1 < t \le T + L$, we predict slots in an autoregressive manner, feeding the previously predicted slots back into the input window of size $T$. For instance, for time step $t = T + 2$, we obtain
$$h_\phi(\hat{S}_2, \dots, \hat{S}_T, \tilde{S}_{T+1}, \hat{a}_2, \dots, \hat{a}_{T+1}) \to \tilde{S}_{T+2}.$$
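The autoregressive prediction can be sketched as a loop over the horizon. The sketch assumes a sliding input window holding the last $T$ slot sets and their projected actions, which is one reading of "a sequence of size $T$"; `toy_transition` is a hypothetical stand-in for $h_\phi$, and actions are scalars here purely to keep the example short.

```python
from typing import Callable, List

Slots = List[float]  # flattened slot set for one time step (toy representation)


def toy_transition(window_slots: List[Slots], window_actions: List[float]) -> Slots:
    """Placeholder for h_phi: adds the latest action to the latest slots."""
    return [value + window_actions[-1] for value in window_slots[-1]]


def rollout(context_slots: List[Slots], actions: List[float], horizon: int,
            h: Callable[[List[Slots], List[float]], Slots] = toy_transition) -> List[Slots]:
    """Predict `horizon` future slot sets from T context slot sets.

    `actions` must hold T + horizon - 1 projected actions, one per input step.
    """
    T = len(context_slots)
    history = list(context_slots)
    predictions: List[Slots] = []
    for step in range(horizon):
        window_slots = history[-T:]               # last T slot sets
        window_actions = actions[step:step + T]   # actions aligned with them
        next_slots = h(window_slots, window_actions)
        history.append(next_slots)                # feed the prediction back in
        predictions.append(next_slots)
    return predictions


context = [[0.0], [1.0]]       # T = 2 extracted slot sets
actions = [0.5, 1.0, 2.0]      # T + L - 1 = 3 projected actions for L = 2
predicted = rollout(context, actions, horizon=2)
```

The second prediction uses the first prediction as the most recent entry of its window, which is where errors can accumulate over long horizons.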
We assume the multivariate Gaussian statistical model
$$P(S_{T+1} \mid \hat{S}_1, \dots, \hat{S}_T, \hat{a}_1, \dots, \hat{a}_T; \nu, \phi)$$
with an identity covariance matrix $\Sigma$ for the slot representations of the next step:
$$P(S_{T+1} \mid \hat{S}_1, \dots, \hat{S}_T, \hat{a}_1, \dots, \hat{a}_T; \nu, \phi) = \frac{1}{(\det \Sigma)^{0.5} (2\pi)^{0.5 D N_{\mathrm{slot}}}} \exp\left(-\frac{1}{2} (S_{T+1} - \hat{S}_{T+1})^{\top} \Sigma^{-1} (S_{T+1} - \hat{S}_{T+1})\right).$$
We optimize the parameters of $p_\nu$ and $h_\phi$ by maximizing the likelihood of the observed data, assuming independence between time steps within the prediction horizon. This gives us an expression that is proportional to the MSE objective $\mathcal{L}_s$ (Equation (8) in the main text) for the transition model:
$$\begin{aligned}
\nu^*, \phi^* &= \arg\max P(\tilde{S}_{T+1} \mid \hat{S}_1, \dots, \hat{S}_T, \hat{a}_1, \dots, \hat{a}_T; \nu, \phi) \cdots P(\tilde{S}_{T+L} \mid \hat{S}_L, \dots, \tilde{S}_{T+L-1}, \hat{a}_L, \dots, \hat{a}_{T+L-1}; \nu, \phi) \\
&= \arg\max \sum_{t=T+1}^{T+L} \log P(\tilde{S}_t \mid \hat{S}_{t-T}, \dots, \tilde{S}_{t-1}, \hat{a}_{t-T}, \dots, \hat{a}_{t-1}; \nu, \phi) \\
&= \arg\max \left( -\frac{L D N_{\mathrm{slot}}}{2} \log 2\pi - \sum_{t=T+1}^{T+L} \frac{\lVert \tilde{S}_t(\nu, \phi) - \hat{S}_t \rVert^2}{2} \right) \\
&= \arg\min \sum_{t=T+1}^{T+L} \lVert \tilde{S}_t(\nu, \phi) - \hat{S}_t \rVert^2.
\end{aligned}$$

References

  1. Lin, B.; Bouneffouf, D.; Rish, I. A Survey on Compositional Generalization in Applications. arXiv 2023, arXiv:2302.01067.
  2. Greff, K.; Van Steenkiste, S.; Schmidhuber, J. On the binding problem in artificial neural networks. arXiv 2020, arXiv:2012.05208.
  3. Locatello, F.; Weissenborn, D.; Unterthiner, T.; Mahendran, A.; Heigold, G.; Uszkoreit, J.; Dosovitskiy, A.; Kipf, T. Object-Centric Learning with Slot Attention. arXiv 2020.
  4. Wu, Z.; Dvornik, N.; Greff, K.; Kipf, T.; Garg, A. SlotFormer: Unsupervised Visual Dynamics Simulation with Object-Centric Models. arXiv 2023, arXiv:2210.05861.
  5. Singh, G.; Deng, F.; Ahn, S. Illiterate DALL-E Learns to Compose. In Proceedings of the 10th International Conference on Learning Representations, ICLR 2022, Virtual, 25–29 April 2022.
  6. Seitzer, M.; Horn, M.; Zadaianchuk, A.; Zietlow, D.; Xiao, T.; Simon-Gabriel, C.J.; He, T.; Zhang, Z.; Schölkopf, B.; Brox, T.; et al. Bridging the Gap to Real-World Object-Centric Learning. In Proceedings of the Eleventh International Conference on Learning Representations; OpenReview: Amherst, MA, USA, 2023.
  7. Wu, P.; Escontrela, A.; Hafner, D.; Abbeel, P.; Goldberg, K. Daydreamer: World models for physical robot learning. In Proceedings of the Conference on Robot Learning; PMLR: New York, NY, USA, 2023; pp. 2226–2240.
  8. Burgard, W.; Hebert, M.; Bennewitz, M. World modeling. In Springer Handbook of Robotics; Springer International Publishing: Cham, Switzerland, 2016; pp. 1135–1152.
  9. Ha, D.; Schmidhuber, J. World models. arXiv 2018, arXiv:1803.10122.
  10. Hafner, D.; Lillicrap, T.P.; Norouzi, M.; Ba, J. Mastering Atari with Discrete World Models. arXiv 2020, arXiv:2010.02193.
  11. Micheli, V.; Alonso, E.; Fleuret, F. Transformers are Sample-Efficient World Models. In Proceedings of the Eleventh International Conference on Learning Representations, ICLR, Kigali, Rwanda, 1–5 May 2023.
  12. Santoro, A.; Raposo, D.; Barrett, D.G.; Malinowski, M.; Pascanu, R.; Battaglia, P.; Lillicrap, T. A simple neural network module for relational reasoning. In Proceedings of the Advances in Neural Information Processing Systems; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: New York, NY, USA, 2017; Volume 30.
  13. Oprea, S.; Martinez-Gonzalez, P.; Garcia-Garcia, A.; Castro-Vargas, J.A.; Orts-Escolano, S.; Garcia-Rodriguez, J.; Argyros, A. A Review on Deep Learning Techniques for Video Prediction. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 2806–2826.
  14. Singh, G.; Wu, Y.F.; Ahn, S. Simple Unsupervised Object-Centric Learning for Complex and Naturalistic Videos. In Proceedings of the Advances in Neural Information Processing Systems; Oh, A.H., Agarwal, A., Belgrave, D., Cho, K., Eds.; Curran Associates, Inc.: New York, NY, USA, 2022.
  15. Kipf, T.; Elsayed, G.F.; Mahendran, A.; Stone, A.; Sabour, S.; Heigold, G.; Jonschkowski, R.; Dosovitskiy, A.; Greff, K. Conditional Object-Centric Learning from Video. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual, 25–29 April 2022.
  16. Elsayed, G.F.; Mahendran, A.; van Steenkiste, S.; Greff, K.; Mozer, M.C.; Kipf, T. SAVi++: Towards end-to-end object-centric learning from real-world videos. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS); Curran Associates, Inc.: New York, NY, USA, 2022.
  17. Zadaianchuk, A.; Seitzer, M.; Martius, G. Object-Centric Learning for Real-World Videos by Predicting Temporal Feature Similarities. In Proceedings of the Thirty-Seventh Conference on Neural Information Processing Systems (NeurIPS 2023); Curran Associates, Inc.: New York, NY, USA, 2023.
  18. Villar-Corrales, A.; Wahdan, I.; Behnke, S. Object-Centric Video Prediction via Decoupling of Object Dynamics and Interactions. In Proceedings of the International Conference on Image Processing (ICIP), Kuala Lumpur, Malaysia, 8–11 October 2023; IEEE: New York, NY, USA, 2023.
  19. Chang, M.; Griffiths, T.; Levine, S. Object Representations as Fixed Points: Training Iterative Refinement Algorithms with Implicit Differentiation. In Proceedings of the Advances in Neural Information Processing Systems; Curran Associates, Inc.: New York, NY, USA, 2022; Volume 35.
  20. Jia, B.; Liu, Y.; Huang, S. Improving Object-centric Learning with Query Optimization. In Proceedings of the Eleventh International Conference on Learning Representations, ICLR, Kigali, Rwanda, 1–5 May 2023.
  21. Singh, G.; Kim, Y.; Ahn, S. Neural Systematic Binder. In Proceedings of the Eleventh International Conference on Learning Representations, ICLR, Kigali, Rwanda, 1–5 May 2023.
  22. Kirilenko, D.; Vorobyov, V.; Kovalev, A.; Panov, A. Object-Centric Learning with Slot Mixture Module. In Proceedings of the Twelfth International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024.
  23. Patil, V.; Radler, A.; Klotz, D.; Hochreiter, S. Simplified priors for Object-Centric Learning. arXiv 2024, arXiv:2410.00728.
  24. Van Den Oord, A.; Vinyals, O.; Kavukcuoglu, K. Neural discrete representation learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017.
  25. Caron, M.; Touvron, H.; Misra, I.; Jégou, H.; Mairal, J.; Bojanowski, P.; Joulin, A. Emerging Properties in Self-Supervised Vision Transformers. In Proceedings of the International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; IEEE: New York, NY, USA, 2021.
  26. Kosiorek, A.; Kim, H.; Teh, Y.W.; Posner, I. Sequential attend, infer, repeat: Generative modelling of moving objects. In Proceedings of the Advances in Neural Information Processing Systems; Curran Associates, Inc.: New York, NY, USA, 2018; pp. 8606–8616.
  27. Jiang, J.; Janghorbani, S.; de Melo, G.; Ahn, S. SCALOR: Generative World Models with Scalable Object Representations. In Proceedings of the 8th International Conference on Learning Representations, ICLR, Addis Ababa, Ethiopia, 26–30 April 2020.
  28. Crawford, E.; Pineau, J. Exploiting Spatial Invariance for Scalable Unsupervised Object Tracking. In Proceedings of the AAAI Conference on Artificial Intelligence; AAAI Press: Washington, DC, USA, 2020; pp. 3684–3692.
  29. Kossen, J.; Stelzner, K.; Hussing, M.; Voelcker, C.; Kersting, K. Structured Object-Aware Physics Prediction for Video Modeling and Planning. arXiv 2019, arXiv:1910.02425.
  30. Scarselli, F.; Gori, M.; Tsoi, A.C.; Hagenbuchner, M.; Monfardini, G. The Graph Neural Network Model. IEEE Trans. Neural Netw. 2009, 20, 61–80.
  31. Lin, Z.; Wu, Y.F.; Peri, S.; Fu, B.; Jiang, J.; Ahn, S. Improving generative imagination in object-centric world models. In Proceedings of the International Conference on Machine Learning; PMLR: New York, NY, USA, 2020; pp. 6140–6149.
  32. Kipf, T.N.; van der Pol, E.; Welling, M. Contrastive Learning of Structured World Models. In Proceedings of the 8th International Conference on Learning Representations, ICLR, Addis Ababa, Ethiopia, 26–30 April 2020.
  33. Biza, O.; van der Pol, E.; Kipf, T. The Impact of Negative Sampling on Contrastive Structured World Models. In Proceedings of the ICML Workshop: Self-Supervised Learning for Reasoning and Perception, Online, 24 July 2021.
  34. Biza, O.; Platt, R.; van de Meent, J.W.; Wong, L.L.S.; Kipf, T. Binding Actions to Objects in World Models. In Proceedings of the ICLR 2022 Workshop on Objects, Structure and Causality, Online, 25–29 April 2022.
  35. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NeurIPS; Curran Associates, Inc.: New York, NY, USA, 2017.
  36. Wydmuch, M.; Kempka, M.; Jaśkowski, W. ViZDoom Competitions: Playing Doom from Pixels. IEEE Trans. Games 2019, 11, 248–259.
  37. Ahmed, O.; Träuble, F.; Goyal, A.; Neitz, A.; Bengio, Y.; Schölkopf, B.; Wüthrich, M.; Bauer, S. Causalworld: A robotic manipulation benchmark for causal structure and transfer learning. arXiv 2020, arXiv:2010.04296.
  38. Zhu, Y.; Wong, J.; Mandlekar, A.; Martín-Martín, R.; Joshi, A.; Nasiriany, S.; Zhu, Y. robosuite: A Modular Simulation Framework and Benchmark for Robot Learning. arXiv 2022, arXiv:2009.12293.
  39. Zhou, G.; Pan, H.; LeCun, Y.; Pinto, L. DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning. arXiv 2024, arXiv:2411.04983.
Figure 1. Episode-consistent slot visualizations extracted by COMPAS slot extractor. First column: ground truth image; subsequent columns: slot visualizations.
Figure 3. COMPAS components: (a) slot initialization architecture; (b) temporal attention mask; (c) COMPAS dynamics predictor block.
Table 1. Comparative results for the Causal World (Reach) environment. Mean ± std for 3 seeds. All values are multiplied by 100 for better visual presentation.

| Model | MSE (10 steps) | MSE decoder (10 steps) | MSE (30 steps) | MSE decoder (30 steps) | MSE (60 steps) | MSE decoder (60 steps) |
|---|---|---|---|---|---|---|
| Slotformer | 6.48 ± 0.02 | 15.35 ± 0.03 | 7.42 ± 0.12 | 17.95 ± 0.10 | 8.22 ± 0.19 | 19.30 ± 0.19 |
| OCVP | 6.32 ± 0.03 | 15.10 ± 0.04 | 7.28 ± 0.11 | 17.20 ± 0.13 | 8.08 ± 0.18 | 19.25 ± 0.17 |
| OCVP (add act.) | 4.66 ± 0.03 | 11.72 ± 0.03 | 5.88 ± 0.09 | 13.58 ± 0.10 | 6.94 ± 0.17 | 15.66 ± 0.17 |
| OCVP (slot act.) | 4.26 ± 0.02 | 10.86 ± 0.02 | 5.46 ± 0.08 | 12.84 ± 0.09 | 6.62 ± 0.16 | 14.92 ± 0.16 |
| GNN | 5.18 ± 0.03 | 14.48 ± 0.03 | 6.75 ± 0.10 | 16.18 ± 0.12 | 7.88 ± 0.18 | 18.42 ± 0.18 |
| COMPAS (ours) | 3.18 ± 0.01 | 4.38 ± 0.01 | 4.23 ± 0.07 | 6.82 ± 0.07 | 5.12 ± 0.15 | 10.22 ± 0.15 |
Table 2. Comparative results for the Causal World (Push) environment. Mean ± std for 3 seeds. All values are multiplied by 100 for better visual presentation.

| Model | MSE (10 steps) | MSE decoder (10 steps) | MSE (30 steps) | MSE decoder (30 steps) | MSE (60 steps) | MSE decoder (60 steps) |
|---|---|---|---|---|---|---|
| Slotformer | 6.36 ± 0.02 | 15.33 ± 0.04 | 7.38 ± 0.10 | 17.94 ± 0.12 | 8.07 ± 0.18 | 19.08 ± 0.20 |
| OCVP | 8.65 ± 0.03 | 16.11 ± 0.05 | 9.40 ± 0.12 | 18.98 ± 0.15 | 10.04 ± 0.19 | 21.00 ± 0.18 |
| OCVP (add act.) | 4.72 ± 0.03 | 12.84 ± 0.03 | 5.98 ± 0.10 | 14.76 ± 0.11 | 7.02 ± 0.17 | 16.92 ± 0.18 |
| OCVP (slot act.) | 4.38 ± 0.02 | 11.96 ± 0.03 | 5.62 ± 0.09 | 13.88 ± 0.10 | 6.76 ± 0.16 | 15.84 ± 0.17 |
| GNN | 5.20 ± 0.04 | 14.50 ± 0.03 | 6.80 ± 0.11 | 16.20 ± 0.14 | 7.90 ± 0.17 | 18.40 ± 0.19 |
| COMPAS (ours) | 3.15 ± 0.01 | 6.40 ± 0.01 | 4.25 ± 0.08 | 8.80 ± 0.08 | 5.10 ± 0.16 | 10.20 ± 0.16 |
Table 3. Comparative results for the RoboSuite environment. Mean ± std for 3 seeds. All values are multiplied by 100 for better visual presentation.

| Model | MSE (10 steps) | MSE decoder (10 steps) | MSE (30 steps) | MSE decoder (30 steps) | MSE (60 steps) | MSE decoder (60 steps) |
|---|---|---|---|---|---|---|
| Slotformer | 1.70 ± 0.03 | 9.83 ± 0.08 | 1.99 ± 0.11 | 11.52 ± 0.14 | 2.65 ± 0.18 | 12.01 ± 0.20 |
| OCVP | 1.82 ± 0.05 | 10.55 ± 0.07 | 2.20 ± 0.13 | 12.19 ± 0.15 | 2.58 ± 0.19 | 12.94 ± 0.18 |
| GNN | 1.00 ± 0.02 | 9.46 ± 0.04 | 1.49 ± 0.12 | 12.08 ± 0.13 | 1.89 ± 0.17 | 12.97 ± 0.19 |
| COMPAS (ours) | 0.50 ± 0.01 | 8.08 ± 0.03 | 0.86 ± 0.11 | 10.53 ± 0.12 | 1.25 ± 0.16 | 11.93 ± 0.17 |
Table 4. Comparative results for the ViZDoom (Defend the Line) environment. Mean ± std for 3 seeds. All values are multiplied by 100 for better visual presentation.

| Model | MSE (10 steps) | MSE decoder (10 steps) | MSE (30 steps) | MSE decoder (30 steps) | MSE (60 steps) | MSE decoder (60 steps) |
|---|---|---|---|---|---|---|
| Slotformer | 14.58 ± 0.05 | 6.51 ± 0.02 | 20.22 ± 0.12 | 14.94 ± 0.10 | 27.39 ± 0.18 | 30.31 ± 0.20 |
| OCVP | 14.10 ± 0.03 | 6.08 ± 0.03 | 20.41 ± 0.10 | 14.89 ± 0.12 | 26.86 ± 0.20 | 31.04 ± 0.18 |
| GNN | 8.75 ± 0.04 | 5.20 ± 0.03 | 15.60 ± 0.11 | 12.40 ± 0.11 | 23.10 ± 0.17 | 32.50 ± 0.19 |
| COMPAS (ours) | 4.31 ± 0.01 | 2.85 ± 0.01 | 9.43 ± 0.08 | 10.27 ± 0.08 | 21.23 ± 0.16 | 29.17 ± 0.16 |
Table 5. Performance comparison of COMPAS extractor and Videosaur model over 20 and 100 steps in Causal World (Push) environment. Mean ± std for 3 seeds.

| Model | ARI | FG-ARI | mBO | Trajectory IDF1 | Episodic IDF1 |
|---|---|---|---|---|---|
| COMPAS (100 steps) | 0.43 ± 0.14 | 0.61 ± 0.05 | 0.79 ± 0.03 | 1.00 ± 0.00 | 0.78 ± 0.05 |
| Videosaur (100 steps) | 0.19 ± 0.03 | 0.20 ± 0.02 | 0.19 ± 0.02 | 0.21 ± 0.04 | 0.11 ± 0.02 |
| COMPAS (20 steps) | 0.51 ± 0.06 | 0.69 ± 0.03 | 0.82 ± 0.01 | 1.00 ± 0.00 | 0.82 ± 0.04 |
| Videosaur (20 steps) | 0.44 ± 0.05 | 0.51 ± 0.06 | 0.76 ± 0.02 | 0.74 ± 0.04 | 0.12 ± 0.01 |
Table 6. Ablation analysis of COMPAS extractor model. Mean ± std for 3 seeds.

| Model | ARI | FG-ARI | mBO | Trajectory IDF1 | Episodic IDF1 |
|---|---|---|---|---|---|
| COMPAS | 0.43 ± 0.14 | 0.61 ± 0.05 | 0.79 ± 0.03 | 1.00 ± 0.00 | 0.78 ± 0.05 |
| No temporal block | 0.42 ± 0.12 | 0.60 ± 0.04 | 0.75 ± 0.02 | 0.89 ± 0.04 | 0.75 ± 0.04 |
| No temporal encoding | 0.18 ± 0.05 | 0.21 ± 0.01 | 0.21 ± 0.03 | 0.41 ± 0.05 | 0.34 ± 0.05 |
| No temporal block and encoding | 0.16 ± 0.04 | 0.18 ± 0.04 | 0.19 ± 0.02 | 0.32 ± 0.05 | 0.28 ± 0.03 |
| No slot encoding | 0.00 ± 0.00 | 0.00 ± 0.00 | 0.18 ± 0.01 | 0.21 ± 0.05 | 0.19 ± 0.05 |
Table 7. Ablation analysis of COMPAS dynamics model. Mean ± std for 3 seeds. All values are multiplied by 100 for better visual presentation.

| Model | MSE (10 steps) | MSE decoder (10 steps) | MSE (30 steps) | MSE decoder (30 steps) | MSE (60 steps) | MSE decoder (60 steps) |
|---|---|---|---|---|---|---|
| COMPAS | 3.15 ± 0.01 | 6.40 ± 0.01 | 4.25 ± 0.08 | 8.80 ± 0.08 | 5.10 ± 0.16 | 10.20 ± 0.16 |
| No interaction block | 4.83 ± 0.02 | 12.17 ± 0.03 | 5.41 ± 0.07 | 13.71 ± 0.09 | 5.96 ± 0.12 | 15.45 ± 0.14 |
| No temporal block | 4.84 ± 0.03 | 12.02 ± 0.02 | 5.35 ± 0.06 | 13.50 ± 0.10 | 5.82 ± 0.11 | 15.25 ± 0.13 |
| No action block | 8.65 ± 0.03 | 16.11 ± 0.05 | 9.40 ± 0.12 | 18.98 ± 0.15 | 10.04 ± 0.19 | 21.00 ± 0.18 |
Table 8. Ablation analysis of separated and end-to-end training for the COMPAS dynamics model in the Causal World (Push) environment. Mean ± std for 3 seeds. All values are multiplied by 100 for better visual presentation.

| Model | MSE (10 steps) | MSE decoder (10 steps) | MSE (30 steps) | MSE decoder (30 steps) | MSE (60 steps) | MSE decoder (60 steps) |
|---|---|---|---|---|---|---|
| COMPAS (separated) | 3.15 ± 0.01 | 6.40 ± 0.01 | 4.25 ± 0.08 | 8.80 ± 0.08 | 5.10 ± 0.16 | 10.20 ± 0.16 |
| COMPAS (end-to-end) | 3.48 ± 0.02 | 6.92 ± 0.02 | 4.76 ± 0.09 | 9.54 ± 0.09 | 5.84 ± 0.17 | 11.18 ± 0.17 |
Table 9. Ablation analysis of context size for the COMPAS extractor model in the Causal World (Push) environment. Mean ± std for 3 seeds.

| Context Size | ARI | FG-ARI | mBO | Trajectory IDF1 | Episodic IDF1 |
|---|---|---|---|---|---|
| 3 | 0.40 ± 0.13 | 0.58 ± 0.05 | 0.76 ± 0.03 | 1.00 ± 0.00 | 0.75 ± 0.05 |
| 4 | 0.43 ± 0.14 | 0.61 ± 0.05 | 0.79 ± 0.03 | 1.00 ± 0.00 | 0.78 ± 0.05 |
| 10 | 0.44 ± 0.13 | 0.60 ± 0.04 | 0.80 ± 0.02 | 1.00 ± 0.00 | 0.79 ± 0.04 |
Table 10. Ablation analysis of context size for the COMPAS dynamics model in the Causal World (Push) environment. Mean ± std for 3 seeds. All values are multiplied by 100 for better visual presentation.

| Context Size | MSE (10 steps) | MSE decoder (10 steps) | MSE (30 steps) | MSE decoder (30 steps) | MSE (60 steps) | MSE decoder (60 steps) |
|---|---|---|---|---|---|---|
| 3 | 3.48 ± 0.02 | 6.92 ± 0.02 | 4.76 ± 0.09 | 9.54 ± 0.09 | 5.84 ± 0.17 | 11.18 ± 0.17 |
| 4 | 3.15 ± 0.01 | 6.40 ± 0.01 | 4.25 ± 0.08 | 8.80 ± 0.08 | 5.10 ± 0.16 | 10.20 ± 0.16 |
| 10 | 3.16 ± 0.01 | 6.42 ± 0.01 | 4.24 ± 0.08 | 8.82 ± 0.08 | 5.11 ± 0.16 | 10.19 ± 0.16 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Vorobyov, V.; Ugadiarov, L.; Frolov, V.; Kovalev, A.; Panov, A. COMPAS: Compose Actions and Slots in Object-Centric World Models. Mach. Learn. Knowl. Extr. 2026, 8, 117. https://doi.org/10.3390/make8050117
