COMPAS: Compose Actions and Slots in Object-Centric World Models

Vorobyov, Vitaliy; Ugadiarov, Leonid; Frolov, Vladimir; Kovalev, Alexey; Panov, Aleksandr

doi:10.3390/make8050117


by Vitaliy Vorobyov ^1,2, Leonid Ugadiarov ^1,3, Vladimir Frolov ^1, Alexey Kovalev ^1,3 and Aleksandr Panov ^2,3,*

^1 Moscow Independent Research Institute of Artificial Intelligence, Moscow 105064, Russia
^2 Federal Research Center “Computer Science and Control”, Russian Academy of Sciences, Moscow 119333, Russia
^3 Cognitive AI Systems Lab, Moscow 123317, Russia
^* Author to whom correspondence should be addressed.

Mach. Learn. Knowl. Extr. 2026, 8(5), 117; https://doi.org/10.3390/make8050117

Submission received: 18 March 2026 / Revised: 14 April 2026 / Accepted: 27 April 2026 / Published: 29 April 2026

(This article belongs to the Section Learning)


Abstract

In this paper, we propose a novel approach, COMPAS (COMPose Actions and Slots), which leverages the strengths of state-of-the-art object-centric approaches for modeling the dynamics of an environment. Our method encodes the environment’s state into symbol-like, object-centric representations, known as slots, where each slot corresponds to an individual object. This approach offers a structured and interpretable way to model complex environments by combining slots with action representations for accurate next-state prediction. The primary contribution of our work is an efficient world model with a dynamics predictor capable of predicting accurate trajectories in action-dependent environments. Additionally, our slot extractor module enhances the predictive capabilities by extracting deterministic slots that remain consistent both within a single trajectory and across episodes. Unlike slots sampled from a trainable distribution, deterministic slots are generated from a single trainable parameter together with slot positional embeddings. This design improves the consistency across episodes, which in turn leads to more accurate dynamics prediction. We present a comprehensive evaluation of our approach in various environments, demonstrating that our proposed method outperforms competing models in environments with discrete and continuous action spaces.

Keywords: object-centric world models; object-centric learning; world models

1. Introduction

The ability to generalize compositionally is the key to learning faster, solving new problems, and understanding new concepts with limited experience [1]. The difficulty in compositional generalization is caused by the so-called binding problem [2]—the inability of modern artificial neural networks to dynamically and flexibly bind information distributed over the network, which arises in learning on unstructured input data.

A possible solution to this problem could be the use of symbol-like representations. Such representations can be slot representations [3], where, instead of a single latent embedding, the input data are encoded by a set of such embeddings (slots). Slots compete with each other to describe a portion of the input data. Such representations have been successfully used for object-centric tasks such as set property prediction and object detection [3] in an image, learning visual dynamics from video [4], image generation [5], and unsupervised object-centric representation learning for real-world data [6].

World models are essential in embodied machine learning tasks such as robotics, autonomous vehicles, game AI, and video synthesis [7,8,9]. They help agents to understand, predict, and interact with dynamic environments. However, modeling high-dimensional, continuous, and time-varying data remains difficult. Traditional models represent the environment holistically, without distinguishing individual objects [10,11], which limits their ability to capture object relationships [12] and evolving dynamics [13]. These models often struggle to generalize and scale to complex, multi-object scenes. In contrast, object-centric world models offer interpretable, factored representations that improve generalization, scalability, and data efficiency by representing scenes in terms of individual objects and their properties.

Previous object-centric world models typically address only part of the overall problem. On the one hand, many slot-based extractors [14,15,16,17] struggle to maintain episodic and long-horizon consistency, since the same object may be assigned to different slots across time or across episodes. On the other hand, existing dynamics models [4,18] often do not provide an efficient action-conditioned transition mechanism for accurate prediction in action-dependent environments. Our main goal in this work is to address both limitations by learning consistent object-centric slot representations and combining them with an efficient transition model for future-state prediction.

In this paper, we propose a novel approach, COMPAS (COMPose Actions and Slots), to encode these dynamics, focusing on object-centric world models that integrate actions and slots. The main contributions of our work are as follows:

We introduce a video slot extractor with a deterministic slot initialization procedure that enforces slot consistency both within trajectories and across episodes. Consistent slot acquisition is illustrated in Figure 1, with additional visualizations provided in Appendix C.
We apply a temporal consistency block that enforces trajectory consistency in a parallel manner, significantly speeding up inference and training compared to recurrent approaches.
We utilize stacked slot attention blocks, enabling scalability and more efficient inference and training compared to the original iterative approach.
We present an efficient transition model that incorporates causal slot information, slot interactions, and actions to predict next states effectively.

2. Related Works

2.1. Object-Centric Learning

Many modern object-centric representation models are based on the slot attention module [3]. It implements an iterative attention mechanism, based on soft k-means clustering, for autoregressive slot refinement. Recent improvements of this method are based on better optimization [19], learnable slot initialization [20], slot structure augmentation [21], and improving the underlying clustering algorithm [22] or simplifying it [23]. However, these methods use simple convolutional neural network (CNN) encoders and decoders, which result in worse image reconstructions. The authors of SLATE [5] proposed to use a discrete latent space, which is extracted from the image by dVAE [24]. The computed slots are then passed through a transformer that predicts the latent token. This token is used in the dVAE decoder for image reconstruction. This results in better-quality reconstructions than in other models.

Classical object-centric models encounter issues with slot permutations. Specifically, when the same image is passed through the model multiple times, the same object may be allocated to a different slot each time. To address this problem, the SAVI [15,16] and STEVE [14] models propose the use of a predictor model that predicts the initialization of slots for use in slot attention for the subsequent frame. This predictor can be implemented as either a multi-layer perceptron (MLP) or a transformer.

Recent progress in object-centric learning has been closely linked to the adoption of pre-trained image transformers. Notably, DINOSAUR [6] leverages pre-trained DINO transformers [25]. The primary training objective in this paradigm is to reconstruct patch feature maps from slots using an MLP as a decoder. An extension of this method, VideoSAUR [17], adapts DINOSAUR to video data. However, achieving slot consistency within a trajectory through predictor models is challenging due to overfitting and vanishing gradients caused by the recurrent nature of training. Additionally, the iterative processing of slot attention models limits their inference speed and scalability.

2.2. Object-Centric World Models

Previous object-centric world models have used other algorithms besides slot attention for object state extraction. SQAIR [26] uses a special detection model to detect objects in the scene and track their trajectories through RNNs. SCALOR [27] and SILOT [28] are able to scale the number of objects in SQAIR due to a parallel inference mechanism. STOVE [29] introduces a graph neural network (GNN) [30] for dynamics prediction and per-object interactions.

One of the most notable world models is the Generative Structured World Model (G-SWM) [31]. It treats foreground objects and the background separately by encoding them into two different latent vectors. For next-state prediction, it uses two separate RNNs—one for the foreground latent vector and the other for the background. One of the limitations of this model is that it does not consider actions taken in the environment when predicting future trajectories.

Another important approach is Contrastive Learning of Structured World Models (C-SWM) [32]. It uses a contrastive loss function at the slot level, instead of the basic image reconstruction loss. The model predicts the trajectory using a GNN. A notable improvement upon C-SWM is negative sampling [33], which improves the loss function by selecting negative samples from either different time steps in the same episode or the same time step in different episodes. Another improvement is represented by two types of action attention [34]. Soft attention uses simple self-attention with the single head of a transformer [35]. Hard attention calculates the expectation of all possible assignments to objects, takes the index of the object with the highest probability, and maps the action to this object. The quality of predictions in models utilizing contrastive loss is significantly influenced by the quality of negative samples [33]. The necessity of data deduplication imposes additional constraints on the dataset sampling process, complicating it further in complex, real-world environments. However, since our model employs only the mean squared error (MSE) loss for training, it does not impose strict requirements on the datasets.

Slotformer [4] uses either STEVE or SAVI to extract slots from a sequence of images. The slots are then passed through a transformer to predict future frames of the video. This approach has achieved better results in dynamics prediction than previous models. A further improvement is OCVP [18], which decouples slot interaction and causal dynamics into two attention blocks. However, the main limitation of this method is that, unlike our approach, the model cannot work efficiently in action-dependent environments. Since the taken action can significantly change the environment or agent state, predicting the future based only on previous states may prove to be challenging.

3. Materials and Methods

The COMPAS world model consists of two interconnected components (as shown in Figure 2): the slot extractor and the transition model. The purpose of the slot extractor is to generate object-centric slot representations that remain consistent both within a trajectory and across episodes. Consistency ensures that each slot corresponds to a specific object in the scene, maintaining its association without permuting with other slots over time or across episodes. The transition model predicts the next time step based on a context window of length T containing previous slot representations and the actions taken. The formal problem statement of the COMPAS learning task is presented in Appendix D.

Figure 2. The method starts by inputting a trajectory of T observations into a frozen DINO encoder to extract features. Trainable tokens are duplicated and combined with positional encodings to produce deterministic slot initializations $S_{t}^{init}$ (Figure 3a). These, along with the features, are fed into a slot extractor with temporal sinusoidal encoding. The extractor has two components: a shared key slot attention block that maps slots to feature patches and a temporal consistency block to maintain coherence across time. After $L_{extr}$ layers, slot representations $S_{t}$ are produced for each step. A transition model then predicts future states using discrete $a_{t}^{desc}$ or continuous $a_{t}^{cont}$ actions, processed into vectors. Slots and actions, embedded with temporal encodings, are passed through the predictor block (Figure 3c). After $L_{dyn}$ layers, the model outputs next-step slots $S_{T+1}$.

3.1. Slot Representation Extraction

Contrary to iterative approaches used in the classical slot attention algorithm (or its variations), we utilize sequentially stacked blocks. This design enables increased parallelization, scalability, and faster inference and training. The overall architecture of the slot extractor module is presented in Figure 2.

First, we initialize the slots by duplicating a trainable token of size D a total of $T \times N$ times, resulting in a slot initialization $\hat{S}_{t}^{init}$. Here, T is the length of the starting time window, D is the size of a single slot, and N is the number of slots. By adding a trainable slot position embedding $p_{s}$, which encodes each token with its slot position, and sinusoidal time position embeddings $p_{t}$ to the corresponding slots, we obtain the slot initialization at time step t:

$$S_{t}^{init} = \hat{S}_{t}^{init} + p_{s} + p_{t} \tag{1}$$

The slot’s positional encoding remains constant across time steps but differs for each slot. Similarly, the time positional encoding is the same for all slots within a single time step but varies between time steps. This deterministic initialization, combined with a trainable token and slot position embeddings, ensures slot consistency across episodes.
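
As an illustration of Equation (1), the following minimal sketch builds the initialization from a single token, per-slot position embeddings, and a sinusoidal time embedding. The helper names (`sinusoidal_embedding`, `init_slots`), the toy values, and the particular sinusoidal form are assumptions made for this example and are not taken from the COMPAS implementation.

```python
import math

def sinusoidal_embedding(position, dim):
    """Sinusoidal embedding of an integer position (assumed standard form)."""
    emb = []
    for k in range(dim):
        freq = 1.0 / (10000 ** (2 * (k // 2) / dim))
        angle = position * freq
        emb.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return emb

def init_slots(token, slot_pos_emb, num_steps):
    """Return nested lists of shape (T, N, D) with S_init[t][i] = token + p_s[i] + p_t[t]."""
    dim = len(token)
    slots = []
    for t in range(num_steps):
        p_t = sinusoidal_embedding(t, dim)  # shared by all slots at step t
        slots.append([
            [token[d] + p_s[d] + p_t[d] for d in range(dim)]
            for p_s in slot_pos_emb          # one embedding per slot, constant over time
        ])
    return slots

# Toy values: D = 4, N = 3 slots, T = 2 time steps.
token = [0.1, -0.2, 0.3, 0.0]
slot_pos_emb = [[0.0, 0.0, 0.0, 0.0],
                [1.0, 0.0, 0.0, 0.0],
                [0.0, 1.0, 0.0, 0.0]]
S_init = init_slots(token, slot_pos_emb, num_steps=2)
```

Because nothing is sampled, calling `init_slots` twice with the same parameters returns identical initializations, which is the property the text relies on for consistency across episodes.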

After initialization, we feed $S_{t}^{init}$ into the slot extraction block. To generalize the notation, let $S_{t}^{l}$ represent the slots from the l-th layer of the slot extractor block, where $S_{t}^{0} = S_{t}^{init}$. This block extracts slot representations from encoder features $F_{t} \in \mathbb{R}^{M \times J}$, where M is the number of patches, and J denotes the encoder feature dimension. The encoder features $F_{t}$ are extracted from input image frames using a pre-trained DINO [25] encoder. We utilize a modified version of the shared key slot attention [23]:

$$\begin{aligned} Q &= \mathrm{Linear}(S_{t}^{l-1}) \\ K &= \mathrm{Linear}(F_{t}) \\ W &= \mathrm{Softmax}\left(\frac{K Q^{T}}{\tau}, \mathrm{axis} = \mathrm{slots}\right) \\ R &= W^{T} K \\ \tilde{S}_{t}^{l} &= \mathrm{MLP}(R) + S_{t}^{l-1} \end{aligned} \tag{2}$$
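
The update in Equation (2) can be sketched for a single frame in plain Python. This is a toy illustration only: the weight matrices `w_q` and `w_k`, the `mlp` callable, and the identity choices in the usage lines are placeholders introduced here, not the trained COMPAS modules.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    z = sum(exps)
    return [e / z for e in exps]

def shared_key_slot_attention(slots, feats, w_q, w_k, mlp, tau):
    """One layer of Equation (2) for one frame.

    slots: (N, D), feats: (M, J), w_q: (D, D), w_k: (J, D).
    Returns the updated slots (N, D) and the attention map W (M, N).
    """
    q = matmul(slots, w_q)                                   # Q = Linear(S)
    k = matmul(feats, w_k)                                   # K = Linear(F)
    logits = matmul(k, transpose(q))                         # K Q^T, shape (M, N)
    w = [softmax([v / tau for v in row]) for row in logits]  # normalise over slots
    r = matmul(transpose(w), k)                              # R = W^T K, keys reused as values
    updated = mlp(r)
    new_slots = [[u + s for u, s in zip(u_row, s_row)]       # residual connection
                 for u_row, s_row in zip(updated, slots)]
    return new_slots, w

# Toy usage: N = 2 slots, M = 3 patches, D = J = 2, identity projections and MLP.
slots = [[1.0, 0.0], [0.0, 1.0]]
feats = [[5.0, 0.0], [0.0, 5.0], [5.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
new_slots, w = shared_key_slot_attention(slots, feats, identity, identity, lambda r: r, tau=1.0)
```

Since the softmax runs over the slot axis, each patch distributes a total weight of one across the slots, so slots compete for patches rather than patches competing for slots.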

Object-centric models for video, such as SAVI or VideoSAUR, implement slot trajectory consistency through a recurrent predictor model, which predicts the slot initialization for the next time step based on the slots from the previous time step. However, this recurrent approach may be slower and is prone to vanishing gradients. To address these issues and improve the training and inference speed, we introduce a temporal consistency block. This block processes slots in parallel and is implemented as a transformer encoder layer with a mask $M_{temp}$. The mask restricts attention causally to the same slot at previous time steps, while ignoring other slots. In other words, the i-th slot $\tilde{s}_{t,i}^{l} \in \tilde{S}_{t}^{l}$ can only attend to the same slot $\tilde{s}_{m,i}^{l}$ from a previous time step m ($m < t$). A visualization of the mask is shown in Figure 3b.
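
A mask of this kind can be sketched as a boolean matrix over the flattened sequence of $T \times N$ slot tokens. The function name and the flattening order (token index = t·N + i) are assumptions for this example. The sketch also lets each token attend to itself (m ≤ t) so that the first time step has at least one valid key; the text above only states m < t, so this is an assumption as well.

```python
def temporal_mask(num_steps, num_slots):
    """Boolean mask of shape (T*N, T*N); True means the query may attend to the key.

    Tokens are assumed to be flattened as index = t * num_slots + i.
    """
    size = num_steps * num_slots
    mask = [[False] * size for _ in range(size)]
    for q in range(size):
        t_q, i_q = divmod(q, num_slots)
        for k in range(size):
            t_k, i_k = divmod(k, num_slots)
            # Same slot index, and the key is not from a later time step.
            mask[q][k] = (i_q == i_k) and (t_k <= t_q)
    return mask

mask = temporal_mask(num_steps=3, num_slots=2)
allowed = sum(sum(row) for row in mask)  # number of permitted query-key pairs
```

With T = 3 and N = 2, each slot contributes T(T + 1)/2 = 6 permitted pairs, so 12 of the 36 entries are enabled.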

By feeding the sequence of extracted slots from the l-th layer in a time window of size T, $\{\tilde{S}_{t}^{l}\}_{t=1}^{T}$, into the temporal consistency block, we obtain the final slots:

$$\{S_{t}^{l}\}_{t=1}^{T} = \mathrm{TransfEncLayer}\left(\{\tilde{S}_{t}^{l}\}_{t=1}^{T}, \mathrm{mask} = M_{temp}\right) \tag{3}$$

Assuming that the slot extractor model comprises L layers, sequentially implementing these operations results in the final slots $\{S_{t}\}_{t=1}^{T} = \{S_{t}^{L}\}_{t=1}^{T}$.

Since the context length of the model is constrained to a fixed time window T, which is typically small, challenges arise regarding the scalability of the model for longer videos. To address this issue, the slot extractor module is trained in an autoregressive manner. The initialization of the next slots, $S_{T+1}^{init}$, is performed as described in Equation (1). Subsequently, the slots are extracted using the procedure detailed in Equation (2).

The extracted slots, $\tilde{S}_{T+1}^{l}$, are then appended to the shifted time window of previously extracted slots, $\{S_{t}\}_{t=2}^{T}$, resulting in the sequence $\{S_{2}, S_{3}, \dots, \tilde{S}_{T+1}^{l}\}$. These slots are then fed into the temporal consistency block described in Equation (3). By sequentially processing them through all L layers, the final slots, $S_{T+1}$, are obtained. This procedure can be repeated for any desired horizon beyond the initial context length T.

The training approach follows the methodology outlined in DINOSAUR [6]. After extracting the slots $S_{t}$ across the entire training trajectory of length $T + Z$, where Z represents the prediction horizon for autoregressive extraction, these slots are processed through a shared MLP decoder to generate feature reconstructions $\hat{F}_{t}$. The training objective is to minimize the mean squared error (MSE) loss:

$$\mathcal{L}_{extr} = \frac{1}{T + Z} \sum_{t=1}^{T+Z} \| F_{t} - \hat{F}_{t} \|^{2}.$$

The overall time complexity of a single forward pass of the extractor model is $O(T \cdot L \cdot M \cdot N \cdot D)$, where T is the initial time window of slots, L is the number of extractor layers, M is the number of frame feature patches, N is the number of slots, and D is the slot dimension.

3.2. Transition Model

The transition model is inspired by OCVP. During the early phases of training of the slot extractor, slots often lack useful information, which can push the transition model away from the optimal solution or cause it to become stuck in a local minimum. To address these challenges and accelerate the training process, the transition model is trained separately from the slot extractor on pre-computed slots.

At the start, slots $S_{t} \in \mathbb{R}^{N \times D}, t = 1, \dots, T$ within a time window of size T are passed through a linear projection. The transition model operates with the same dimension D as the slots, so the projection preserves the slot size:

$$S_{prep,t} = \mathrm{Linear}(S_{t}) + p_{t} \tag{4}$$

The transition model supports both continuous and discrete actions. Continuous actions $a_{t}^{cont} \in \mathbb{R}^{B}$ are projected into the same dimension as the transition model:

$$\hat{a}_{t} = \mathrm{Linear}(a_{t}^{cont}) \in \mathbb{R}^{D} \tag{5}$$

For discrete actions, we use trainable embeddings for each possible action $a_{t}^{desc,i} \in A_{t}^{desc} = \{a_{t}^{desc,1}, a_{t}^{desc,2}, \dots, a_{t}^{desc,J}\}$:

$$\hat{a}_{t} = \mathrm{EmbeddingsDictionary}(a_{t}^{desc,i}) \tag{6}$$

To incorporate temporal information, we add sinusoidal time embeddings $p_{t}$ to each slot and action at time step t:

$$\begin{aligned} s_{prep,t}^{i} &= s_{t}^{i} + p_{t}, \quad s_{prep,t}^{i} \in S_{prep,t} \\ a_{t} &= \hat{a}_{t} + p_{t} \end{aligned} \tag{7}$$

Subsequently, actions and slots are passed through an action attention block, a relation attention block, and a temporal attention block, which are applied in parallel, as in the original OCVP layer. The architecture of the predictor block is presented in Figure 3c.

Action attention is implemented as a simple multi-head cross-attention. In this layer, actions $a_{t}$ serve as keys and values, while slots $S_{t}$ act as queries.
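
The role assignment in action attention (slots as queries, actions as keys and values) can be illustrated with a single-head sketch. The function below omits the learned projections and the multiple heads of the actual layer, and its name and toy inputs are assumptions made for this example.

```python
import math

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention without learned projections.

    queries: (N, D) slot vectors; keys and values: (A, D) embedded actions.
    Returns the attended outputs (N, D) and the attention weights (N, A).
    """
    scale = math.sqrt(len(queries[0]))
    outputs, weights = [], []
    for q in queries:
        logits = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        m = max(logits)
        exps = [math.exp(v - m) for v in logits]
        z = sum(exps)
        w = [e / z for e in exps]                      # softmax over the actions
        weights.append(w)
        outputs.append([sum(wi * v[d] for wi, v in zip(w, values))
                        for d in range(len(values[0]))])
    return outputs, weights

slots_t = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # N = 3 slot queries
actions = [[0.5, -0.5]]                          # a single embedded action
attended, weights = cross_attention(slots_t, actions, actions)
```

With a single action as the only key, the softmax weight is 1 for every slot, so each slot receives the same action value; the slots only weight actions differently when several action tokens are available as keys.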

After projecting the features from the final time step into slot dimensions, we obtain the predicted features for the next time step, denoted as $S_{T+1}^{'}$. By shifting the input slot time window and appending $S_{T+1}^{'}$ as the last element, along with obtaining $\hat{a}_{T+1}$, we can autoregressively predict future states for any horizon H.
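
The sliding-window rollout described above can be sketched as a short loop. Here `predict_next` stands in for the trained transition model and `toy_predictor` is a deliberately trivial placeholder; both names, and the convention that the caller supplies all actions up front, are assumptions for this example.

```python
def rollout(predict_next, slots, actions, horizon):
    """Autoregressively predict `horizon` future slot sets.

    slots:   list of T context slot sets (S_1 .. S_T).
    actions: list of T + horizon - 1 actions (a_1 .. a_{T+horizon-1}).
    predict_next(slot_window, action_window) returns the next slot set.
    """
    context = len(slots)
    if len(actions) < context + horizon - 1:
        raise ValueError("not enough actions for the requested horizon")
    window = list(slots)
    predictions = []
    for h in range(horizon):
        action_window = actions[h:h + context]   # actions aligned with the window
        nxt = predict_next(window, action_window)
        predictions.append(nxt)
        window = window[1:] + [nxt]              # shift the window, append the prediction
    return predictions

def toy_predictor(slot_window, action_window):
    """Placeholder dynamics: last slot set plus the last action (scalar)."""
    return [x + action_window[-1] for x in slot_window[-1]]

preds = rollout(toy_predictor, slots=[[0.0], [1.0]], actions=[1.0, 1.0, 2.0, 3.0], horizon=3)
```

Each predicted slot set is fed back as input, so prediction errors can accumulate over the horizon; this is why the evaluation reports errors at specific time steps rather than a single aggregate.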

The training objective for the COMPAS dynamics model is defined as the $L_{2}$ reconstruction loss between the ground truth slots $s_{t,i} \in S_{t}$ and the model’s predictions $s_{t,i}^{'} \in S_{t}^{'}$ over the H predicted steps:

$$\mathcal{L}_{dyn} = \frac{1}{H \cdot N} \sum_{t=T+1}^{T+H} \sum_{i=1}^{N} \| s_{t,i}^{'} - s_{t,i} \|^{2} \tag{8}$$
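
For concreteness, Equation (8) can be evaluated on nested lists as follows. The function name and the toy tensors are assumptions for this example; the inputs hold only the H predicted steps, so the time index of the sum is implicit in the list order.

```python
def dynamics_loss(predicted, target):
    """Squared L2 error summed over slot dimensions, averaged over H steps and N slots.

    predicted, target: nested lists of shape (H, N, D).
    """
    horizon = len(predicted)
    num_slots = len(predicted[0])
    total = 0.0
    for pred_step, true_step in zip(predicted, target):
        for pred_slot, true_slot in zip(pred_step, true_step):
            total += sum((p - s) ** 2 for p, s in zip(pred_slot, true_slot))
    return total / (horizon * num_slots)

# Toy check: H = 1, N = 2, D = 2.
predicted = [[[1.0, 0.0], [0.0, 0.0]]]
target = [[[0.0, 0.0], [0.0, 2.0]]]
loss = dynamics_loss(predicted, target)  # (1 + 4) / (1 * 2) = 2.5
```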

The single-forward-pass time complexity of the transition model is $O(L \cdot D \cdot T \cdot N \cdot (N + 2 \cdot T))$.
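
To give a rough sense of scale, the sketch below plugs the Causal World hyperparameters from Appendix A into the two complexity expressions of Section 3. The resulting numbers are products of the symbolic factors, not measured FLOPs. Treating the "Video length" of 8 as the extractor window T and the "Memory length" of 4 as the transition window T is an assumption made for this example, as are the function names.

```python
def extractor_cost(T, L, num_patches, N, D):
    """Product of the factors in the extractor complexity expression."""
    return T * L * num_patches * N * D

def transition_cost(L, D, T, N):
    """Product of the factors in the transition-model complexity expression."""
    return L * D * T * N * (N + 2 * T)

# 224 x 224 images with DINO patch size 8 give (224 / 8)^2 = 784 patches.
num_patches = (224 // 8) ** 2

extractor = extractor_cost(T=8, L=4, num_patches=num_patches, N=8, D=128)
transition = transition_cost(L=8, D=128, T=4, N=8)
print(extractor, transition)  # 25690112 524288
```

Under these assumptions, the extractor term is roughly 50 times larger than the transition term, mainly because of the number of feature patches per frame.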

4. Results

4.1. Environments and Datasets

To evaluate the effectiveness of our proposed model, we perform experiments using trajectories sampled from three environments: ViZDoom [36], Causal World [37], and RoboSuite [38]. By conducting experiments in the selected environments, we aim to demonstrate the versatility and scalability of our model. A detailed description of the datasets is presented in Appendix B.

We primarily focus on comparing the predictive accuracy of our model against that of recent state-of-the-art (SOTA) methods, such as Slotformer and OCVP. Additionally, we include comparisons with the GNN transition model from CSWM, despite it being outdated and relying on a lower-quality slot extraction procedure.

Additionally, we evaluate the COMPAS extractor model in terms of the quality of object detection and slot consistency performance against the VideoSAUR [17] model across 20 and 100 steps.

4.2. Metrics

Other frameworks, such as CSWM [34], use ranking metrics like HITS@1 and MRR for dynamics prediction evaluation. However, this approach is not entirely representative, as these metrics depend on the scaling of the actual slot values and can yield high scores even if the predictions are inaccurate. To address this, we evaluate the models using pre-computed slots $S_{t}$ and decode them back into observations $O_{t}$ with sizes $H \times W$. If a model can accurately predict the next slots $\hat{S}_{t}$, the mean squared error (MSE) with the ground truth should be lower compared to competing models. Similarly, the decoder MSE should also be lower, as better dynamics predictions lead to improved reconstructions $\hat{O}_{t}$. We calculate the metric at a certain time step t in order to evaluate the quality of predictions at certain horizons.

$$\begin{aligned} \mathrm{MSE}_{t} &= \frac{1}{N} \| S_{t} - \hat{S}_{t} \|^{2} \\ \mathrm{MSE}_{t}^{decoder} &= \frac{1}{3 (H \times W)} \| O_{t} - \hat{O}_{t} \|^{2} \end{aligned} \tag{9}$$

To measure consistency, we use the Identity F1 (IDF1) score:

$$\mathrm{IDF1} = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}, \tag{10}$$

where TP is the number of correctly matched object identities, FP is the number of predicted identities that do not correspond to any ground truth identity, and FN is the number of ground truth identities not recovered by the tracker.

To count identity switches, we first build a pairwise IoU matrix between each predicted slot mask and each ground truth mask. We then apply the Hungarian algorithm to find the optimal matching. A change in this matching indicates a slot permutation, which we treat as an identity switch.

To measure trajectory consistency, we treat the predicted mask from time $t - 1$ as the ground truth for time t.

For episodic consistency, we fix masks as the ground truth from one reference episode. For every other episode of the same object, we compute the IDF1 between its mask and the ground truth mask from the reference episode.
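
The matching and scoring steps can be sketched on toy masks as follows. For brevity, the example replaces the Hungarian algorithm with an exhaustive search over permutations, which finds the same optimum but is only practical for a handful of slots. Masks are represented as sets of pixel indices, and all function names are assumptions for this example.

```python
from itertools import permutations

def iou(mask_a, mask_b):
    """Intersection over union of two masks given as sets of pixel indices."""
    union = len(mask_a | mask_b)
    return len(mask_a & mask_b) / union if union else 0.0

def best_matching(pred_masks, gt_masks):
    """Return a tuple m where m[g] is the predicted slot matched to ground truth g.

    Exhaustive stand-in for the Hungarian algorithm; assumes equal mask counts.
    """
    n = len(gt_masks)
    return max(permutations(range(n)),
               key=lambda perm: sum(iou(pred_masks[perm[g]], gt_masks[g]) for g in range(n)))

def idf1(tp, fp, fn):
    """IDF1 = 2 TP / (2 TP + FP + FN), as in Equation (10)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

gt = [{0, 1}, {2, 3}]
frame_1 = [{0, 1}, {2, 3}]   # slot 0 covers object 0, slot 1 covers object 1
frame_2 = [{2, 3}, {0, 1}]   # the two slots have swapped objects

match_1 = best_matching(frame_1, gt)
match_2 = best_matching(frame_2, gt)
switches = sum(a != b for a, b in zip(match_1, match_2))  # objects whose slot changed
score = idf1(tp=8, fp=1, fn=1)
```

In this toy case both objects change slots between the two frames, so two identity switches are counted.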

For the ARI, FG-ARI, and mBO metrics, we follow the setup from [17].

4.3. Training and Evaluation Setup

All transition models are trained and evaluated on pre-computed slots extracted by the COMPAS slot extractor. The slot extractor was trained for 150 epochs with a batch size of 128 for each environment.

More detailed descriptions are available in Appendix A and Appendix B. All configurations and code required for the experiments will be made available.

4.4. Experiments

We introduce OCVP models with simple action conditioning. OCVP (slot act.) uses the processed action vector $\hat{a}_{t}$ as an additional slot. OCVP (add act.) adds $\hat{a}_{t}$ to each slot at time step t.
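
The two baseline conditioning schemes differ only in how the action vector enters the slot set, which the following sketch makes explicit. The function names and toy values are assumptions for this example and do not come from the OCVP codebase.

```python
def condition_slot_act(slots, action):
    """OCVP (slot act.): append the action vector as one extra slot, giving N + 1 tokens."""
    return [list(s) for s in slots] + [list(action)]

def condition_add_act(slots, action):
    """OCVP (add act.): add the action vector to every slot, keeping N tokens."""
    return [[s + a for s, a in zip(slot, action)] for slot in slots]

slots_t = [[1.0, 2.0], [3.0, 4.0]]   # N = 2 slots, D = 2
action_t = [0.5, 0.5]                # processed action vector of size D

as_extra_slot = condition_slot_act(slots_t, action_t)
added = condition_add_act(slots_t, action_t)
```

Both variants require the action vector to have the slot dimension D; the first changes the sequence length seen by the predictor, while the second leaves it unchanged.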

As highlighted in Table 1, COMPAS demonstrated superior predictive accuracy in the Causal World environment with the Reach task. This finding underscores the robustness of our approach in handling complex, physically simulated environments.

Furthermore, COMPAS’s performance increased significantly in the more challenging Push task (Table 2). In contrast, non-action-dependent models, such as Slotformer and OCVP, exhibited substantial performance degradation under these conditions. This comparison highlights the resilience and adaptability of COMPAS in diverse and complex tasks.

On the RoboSuite dataset, the GNN baseline achieved better results than the other baseline models, likely due to the lower object density and reduced interaction complexity in this environment. Nevertheless, COMPAS still achieved the best overall performance on this dataset, outperforming all baseline methods (Table 3).

On the ViZDoom dataset, COMPAS performed consistently well, achieving better results compared to the alternative methods. The results are presented in Table 4.

Moreover, COMPAS demonstrated better object detection quality and slot assignment consistency than the VideoSAUR model for both short (20 steps) and long (100 steps) trajectories in the Causal World (Push) environment. These results are presented in Table 5.

The ablation study of the COMPAS extractor, presented in Table 6, demonstrates the effectiveness of the proposed architectural components. While the temporal block has a smaller impact on the segmentation quality compared to temporal encoding, it significantly enhances the model’s episodic temporal consistency.

Similarly, the ablation study of the dynamics model, shown in Table 7, highlights the importance of the action-processing component in improving the prediction quality.

An ablation study of separated and end-to-end training for the COMPAS dynamics model is conducted in the Causal World (Push) environment. In the separated setup, the extractor and dynamics model are trained separately, whereas, in the end-to-end setup, they are trained jointly using different optimizers. Table 8 shows that end-to-end training performs slightly worse. A possible reason is that the extractor learns less effective representations under joint optimization, which negatively affects the dynamics model.

We also include an ablation study on the effect of the temporal context size. As shown in Table 9 and Table 10, reducing the context size leads to a slight drop in performance, indicating that a shorter temporal window provides less information for accurate representation learning and dynamics prediction. In contrast, increasing the context size beyond 4 does not yield noticeable improvements, as the results remain nearly unchanged. This suggests that a moderate context size is already sufficient to capture the relevant temporal information, while a larger context provides little additional benefit.

We also include experiments demonstrating the use of our world model for decision-making tasks—specifically, planning with the DINO-WM [39] baseline—in Appendix D.

Overall, COMPAS proves to be a more robust and adaptable model, particularly in synthetic environments and tasks with both discrete and continuous action spaces, further proving its advantages over object-centric baseline methods.

5. Discussion

The main limitation of our model is the introduction of a new hyperparameter, the context length, which can impact the consistency of the slots. Additionally, the model’s performance is influenced by hyperparameters such as the number of slots and the context time window, both of which can significantly affect the quality of slot representations and dynamics predictions. Furthermore, handling real-world data poses a significant challenge, as object-centric models inherently struggle to process such data efficiently. Future research could address these issues and improve the model’s robustness and adaptability.

6. Conclusions

Our model demonstrates superior consistency over multiple prediction steps compared to analogous models in the field. This confirms the model’s ability to handle different prediction horizons.

Looking ahead, a promising direction is to explore neuro-symbolic perspectives in object-centric approaches. By combining neural networks with symbolic reasoning, models can integrate structured knowledge with learned data, enhancing interpretability, robustness, and generalization. Another key direction is developing algorithms that efficiently leverage object information to drive agent behavior. This requires the careful handling of composable states, as merging multiple object states into a single representation can obscure critical information and reintroduce the binding problem.

In conclusion, our work represents an important step forward in object-centric world models. By integrating action information directly into the model and using an autoregressive transformer model for prediction, we have demonstrated a novel way to encode and predict the dynamics of complex environments. We expect that our work will inspire further research and advances in this exciting area of machine learning.

Author Contributions

Conceptualization, V.V. and L.U.; methodology, V.V.; software, V.V.; validation, V.V., L.U. and V.F.; formal analysis, L.U.; writing—original draft preparation, V.V., L.U. and V.F.; writing—review and editing, V.V., V.F., L.U., A.K. and A.P.; visualization, V.V.; supervision, A.P. and A.K. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Ministry of Science and Higher Education of the Russian Federation under Project 075-15-2024-544.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All data were obtained from publicly available and open-source simulators. Data collection is described in Appendix B.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

AI	Artificial Intelligence
ARI	Adjusted Rand Index
CNN	Convolutional Neural Network
COMPAS	Compose Actions and Slots
CSWM	Contrastive Structured World Model
DINO	Self-Distillation with No Labels
FG	Foreground
GNN	Graph Neural Network
HITS	Hits at Rank k (ranking metric, e.g., HITS@1)
MLP	Multi-Layer Perceptron
MRR	Mean Reciprocal Rank
MSE	Mean Squared Error
OCVP	Object-Centric Video Prediction
RGB	Red Green Blue (color representation)
RL	Reinforcement Learning
RNN	Recurrent Neural Network
SAVI	Slot Attention for Video
SCALOR	Scene-Centric Autoencoder for Object Representations
SILOT	Sequential Inference for Latent Object Tracking
SLATE	Slot Attention-Based Transformer Encoder
SOTA	State-of-the-Art
SQAIR	Sequential Attend Infer Repeat
STEVE	Slot-Based Transformer for Video
STOVE	Structured Object-Aware Video Prediction
SWM	Structured World Model
VIDEOSAUR	Video Object-Centric Representation Learning Model

Appendix A. Architecture and Computational Details

Tables A1–A7 describe the architecture details and hyperparameters for our experiments.

Table A1. Hyperparameters of COMPAS slot extractor used for Causal World experiments.

Module	Parameter	Value
	Image size	224
	Number of train episodes collected	10,000
	Number of validation episodes collected	1000
	Number of steps per episode	100
	Number of training epochs	150
	Batch size	128
	Video length	8
DINO	Patch size	8
DINO	Model size	S
COMPAS	Slot dim	128
COMPAS	Number of slots	8
COMPAS	Number of layers	4
COMPAS	Action type	Continuous
COMPAS	Action size	9

Table A2. Hyperparameters of COMPAS slot extractor used for ViZDoom experiments.

Module	Parameter	Value
	Image size	224
	Number of train episodes collected	10,000
	Number of validation episodes collected	1000
	Number of steps per episode	100
	Number of training epochs	150
	Batch size	128
	Video length	8
DINO	Patch size	8
DINO	Model size	S
COMPAS	Slot dim	128
COMPAS	Number of slots	8
COMPAS	Number of layers	4
COMPAS	Action type	Discrete
COMPAS	Number of action embeddings	4

Table A3. Hyperparameters of COMPAS slot extractor used for RoboSuite experiments.

Module	Parameter	Value
	Image size	224
	Number of train episodes collected	5000
	Number of validation episodes collected	1000
	Number of steps per episode	100
	Number of training epochs	150
	Batch size	128
	Video length	8
DINO	Patch size	8
DINO	Model size	S
COMPAS	Slot dim	128
COMPAS	Number of slots	4
COMPAS	Number of layers	4
COMPAS	Action type	Continuous
COMPAS	Action size	4
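As a quick sanity check on the feature shapes implied by Tables A1–A3, the following sketch computes the number of patches for a 224 × 224 image with patch size 8. The feature width of 384 is an assumption based on the standard ViT-S configuration and is not stated in the tables; the function name is hypothetical.

```python
def dino_feature_shape(image_size: int, patch_size: int, embed_dim: int) -> tuple[int, int]:
    """Return (K, J): the number of patches and the per-patch feature width."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side, embed_dim


# Image and patch sizes come from Tables A1-A3.
# embed_dim=384 is an assumption (standard ViT-S width), not a value from the tables.
K, J = dino_feature_shape(image_size=224, patch_size=8, embed_dim=384)
print(K, J)  # 784 384
```

Under these assumptions, each frame yields a 784 × 384 feature matrix that the slot extractor compresses into at most eight 128-dimensional slots.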

Table A4. COMPAS transition model hyperparameters.

Environment	Model Dim	Memory Length	Num. Layers	Heads	Dropout	Training Epochs	Batch Size	Horizon
ViZDoom	128	4	8	4	0.1	100	64	4
Causal World	128	4	8	4	0.1	100	64	4
RoboSuite	128	4	8	4	0.1	100	64	4

Table A5. Slotformer transition model hyperparameters.

Environment	Model Dim	Memory Length	Num. Layers	Heads	Dropout	Training Epochs	Batch Size	Horizon
ViZDoom	128	4	8	4	0.1	100	64	4
Causal World	128	4	8	4	0.1	100	64	4
RoboSuite	128	4	8	4	0.1	100	64	4

Table A6. OCVP transition model hyperparameters.

Environment	Model Dim	Memory Length	Num. Layers	Heads	Dropout	Training Epochs	Batch Size	Horizon
ViZDoom	128	4	8	4	0.1	100	64	4
Causal World	128	4	8	4	0.1	100	64	4
RoboSuite	128	4	8	4	0.1	100	64	4

Table A7. GNN transition model hyperparameters.

Environment	Model Dim	Memory Length	Num. Layers	Heads	Dropout	Training Epochs	Batch Size	Horizon
ViZDoom	128	4	8	4	0.1	100	64	4
Causal World	128	4	8	4	0.1	100	64	4
RoboSuite	128	4	8	4	0.1	100	64	4

All experiments were conducted in a SLURM-managed environment. Each run was performed on a single node with the hardware configuration detailed in Table A8. A GPU served as the primary computational device for all experiments.

Table A8. Hardware configuration for SLURM node.

GPU	NVIDIA A100 80 GB
CPU	Intel Xeon Gold 6248R
Number of cores	4
RAM	1024 GB

Appendix B. Datasets

Appendix B.1. ViZDoom

ViZDoom is a first-person-perspective environment built upon the classic Doom video game engine. It offers a visually detailed and dynamic setting for experiments, combining elements such as partial observability and complex enemy behavior. We use the Defend the Line task within ViZDoom to evaluate our model’s ability to operate in a discrete action space effectively. In this task, the agent is placed in a rectangular arena, starting at the center of one of the longer walls. On the opposite wall, three melee-only monsters and three shooting monsters are spawned. Enemies can be killed with a single shot. After being defeated, each monster respawns after a delay with increased durability, posing a progressively greater challenge. The discrete action space for this task consists of four actions: no action, turn left, turn right, and shoot.

Using a uniform random policy, we collected 10,000 training episodes of 100 time steps each and 1000 evaluation episodes of the same length. The observations consist of RGB images with a resolution of 224 × 224.
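The data collection procedure can be sketched as follows. The environment class is a hypothetical stand-in with a `reset`/`step` interface that returns random frames; it is not the ViZDoom API. Only the episode length, the number of actions, and the image resolution come from the text.

```python
import numpy as np

NUM_ACTIONS = 4          # no action, turn left, turn right, shoot
STEPS_PER_EPISODE = 100
IMAGE_SIZE = 224


class DummyEnv:
    """Hypothetical stand-in for the simulator; returns random RGB frames."""

    def __init__(self, rng):
        self.rng = rng

    def _frame(self):
        shape = (IMAGE_SIZE, IMAGE_SIZE, 3)
        return self.rng.integers(0, 256, size=shape, dtype=np.uint8)

    def reset(self):
        return self._frame()

    def step(self, action):
        return self._frame()


def collect_episode(env, rng, num_steps=STEPS_PER_EPISODE):
    """Roll out a uniform random policy, recording one frame and one action per step."""
    frames, actions = [], []
    obs = env.reset()
    for _ in range(num_steps):
        action = int(rng.integers(NUM_ACTIONS))  # uniform over the discrete actions
        frames.append(obs)
        actions.append(action)
        obs = env.step(action)
    return np.stack(frames), np.array(actions)


rng = np.random.default_rng(0)
frames, actions = collect_episode(DummyEnv(rng), rng)
print(frames.shape, actions.shape)  # (100, 224, 224, 3) (100,)
```

Repeating `collect_episode` for the stated number of training and evaluation episodes would produce a dataset of the size described above.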

Appendix B.2. Causal World

At the other end of the complexity spectrum, we use the Causal World environment, which presents a more challenging scenario. Causal World is a physically simulated environment in which three robotic arms interact with objects. It provides a continuous action space and requires the model to capture complex physical interactions and the effects of fine-grained actions on the state of the world. We consider two tasks, both with an action space of size 9. In the Push task, the agent must push one block to a goal position with a specific orientation. In the Reach task, the agent must reach a target block, one of four blocks in the scene, with a manipulator. For both tasks, we followed the same data collection scheme as for the ViZDoom environment. Observations in Causal World are rendered as images of size 224 × 224.

Appendix B.3. RoboSuite

We additionally experiment with a more difficult robotic environment, RoboSuite, which uses a Franka Panda robotic arm for manipulation. We choose the Block Lifting task, in which a robotic arm must pick up a red cube placed on the tabletop in front of it and lift it above a certain height. The cube location and the robot’s initial state are randomized at the start of each episode. The continuous action space for this task has size 4. The dataset was split into a training set containing 5000 trajectories and a validation set containing 1000 trajectories. The data were collected using a uniform random policy. In the RoboSuite environment, we captured images of size 256 × 256 from the front-view camera and resized them to a 224 × 224 resolution.

Appendix C. Additional Slot Mapping Visualization

We provide additional visualizations from different environments for the slot extractor model in Figure A1 and Figure A2.

Figure A1. Episode-consistent slot visualizations from RoboSuite environment, extracted by COMPAS slot extractor. First column: ground truth image; subsequent columns: slot visualizations.

Figure A2. Episode-consistent slot visualizations from ViZDoom environment, extracted by COMPAS slot extractor. First column: ground truth image; subsequent columns: slot visualizations.

Appendix D. Model-Based Planning

In this section, we present the results of planning using the COMPAS world model, trained in an offline setting. We adopt the same planning setup as for DINO-WM [39] and compare our approach directly with theirs in our experiments.

We evaluate on two environments: the Gym Fetch Reach task, where a robotic manipulator must reach a target position, and the Point Maze environment—a 3D labyrinth where a red ball must navigate to a pre-defined green target position.

We report the success rate over 80 runs for each environment, using the same Point Maze configuration as for DINO-WM. For the Fetch Reach task, COMPAS uses the RoboSuite architecture. For the Point Maze, we use the same configuration as DINO-WM, with the only difference being that COMPAS uses five slots. The results are summarized in Table A9.

Table A9. Performance comparison of COMPAS and DINO-WM across Fetch Reach and Point Maze tasks. Mean ± std for 3 seeds.

Model	Fetch Reach	Point Maze
COMPAS	1.00	0.69
DINO-WM	1.00	0.98

Our model performs worse than DINO-WM in the Point Maze environment, and we suspect that this is due to the reintroduction of the binding problem. Since planning requires combining all slots to compute the planning loss, this process may inadvertently merge distinct object representations, leading to information mixing.

Another factor affecting planning performance is the absence of proprioceptive state encoding in our model. Since it is trained solely on pixel observations, it may lack internal state information needed for precise control.

Nonetheless, the experiment shows that our model can still capture meaningful environmental representations that are useful for planning.
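To make the point about combining slots concrete, the sketch below computes a planning cost as the MSE over all slots jointly and minimizes it with a cross-entropy-method planner. Everything here is illustrative: the linear `toy_dynamics`, the planner hyperparameters, and the choice of the cross-entropy method itself are assumptions rather than details taken from the text.

```python
import numpy as np


def toy_dynamics(slots, action, weight):
    """Hypothetical stand-in for a learned world model: shifts every slot."""
    return slots + action @ weight  # (N, D) + (D,), broadcast over slots


def planning_cost(slots, goal_slots):
    """MSE over all slots jointly; slot identity plays no role in the cost."""
    return float(np.mean((slots - goal_slots) ** 2))


def simulate(start, actions, weight):
    """Apply a sequence of actions to the starting slots."""
    slots = start
    for action in actions:
        slots = toy_dynamics(slots, action, weight)
    return slots


def cem_plan(start, goal, weight, horizon=5, pop=256, elites=16, iters=15, seed=0):
    """Cross-entropy method over action sequences (assumed planner)."""
    rng = np.random.default_rng(seed)
    action_dim = weight.shape[0]
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(iters):
        noise = rng.standard_normal((pop, horizon, action_dim))
        candidates = mean + std * noise
        costs = np.array([planning_cost(simulate(start, seq, weight), goal)
                          for seq in candidates])
        elite = candidates[np.argsort(costs)[:elites]]  # keep the lowest-cost plans
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean


rng = np.random.default_rng(1)
weight = 0.1 * rng.standard_normal((2, 8))   # illustrative: action_dim=2, slot_dim=8
start = rng.standard_normal((5, 8))          # five slots
goal = simulate(start, np.tile([0.8, -0.6], (5, 1)), weight)  # reachable goal

plan = cem_plan(start, goal, weight)
initial_cost = planning_cost(start, goal)
final_cost = planning_cost(simulate(start, plan, weight), goal)
print(initial_cost, final_cost)
```

Because `planning_cost` averages over every slot, an error in one object's slot can be traded against errors in the others, which is one way the information mixing described above could arise.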

Appendix E. Problem Statement

We work with a dataset of trajectories. Let $x_{1}, \dots, x_{t}, \dots$ denote a sequence of images constituting a trajectory. We extract patch-encoded features from the images using a pre-trained frozen DINO encoder, $\hat{F}_{t} = \mathrm{DINO}(x_{t}) \in \mathbb{R}^{K \times J}$, where $K$ is the number of patches in a single image and $J$ is the size of each patch feature vector. Let $N_{slot}$ denote the chosen number of slots. For time steps $1 \leq t \leq T$ within the context window of size $T$, the slots are initialized deterministically: $z_{t}^{i} = μ_{i}$ for $1 \leq i \leq N_{slot}$, where $μ_{i}$ denotes learnable parameters (Equation (1) in the main text). Let $f_{θ}$ denote a slot attention module with parameters $θ$ implementing the neural network layers described in Equations (2) and (3) in the main text. The slots for time steps within the context window $1 \leq t \leq T$ are obtained using the initialized values:

$$f_{θ}(\hat{F}_{t}, z_{t}^{1}, \dots, z_{t}^{N_{slot}}) \to s_{t}^{1}, \dots, s_{t}^{N_{slot}}.$$

The slots for time steps from the autoregressive horizon $T < t \leq T + Z$ are calculated using the values of slots from the previous time step:

$$f_{θ}(\hat{F}_{t}, s_{t-1}^{1}, \dots, s_{t-1}^{N_{slot}}) \to s_{t}^{1}, \dots, s_{t}^{N_{slot}}.$$
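The two initialization regimes can be sketched as follows. `toy_slot_attention` is a heavily simplified stand-in for the slot attention module (a softmax-over-slots attention step with a weighted-mean update, without the learned projections and recurrent update of a full module). The sizes, the random features, and the use of the same width for features and slots are illustrative assumptions.

```python
import numpy as np


def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def toy_slot_attention(features, slots, iters=3):
    """Simplified stand-in for f_theta. features: (K, D), slots: (N, D)."""
    dim = slots.shape[1]
    for _ in range(iters):
        logits = features @ slots.T / np.sqrt(dim)        # (K, N)
        attn = softmax(logits, axis=1)                     # slots compete for each patch
        weights = attn / (attn.sum(axis=0, keepdims=True) + 1e-8)
        slots = weights.T @ features                       # (N, D) weighted mean
    return slots


def extract_slots(feature_seq, mu, context_len):
    """Deterministic init inside the context window, previous slots after it."""
    outputs = []
    for t, features in enumerate(feature_seq, start=1):
        init = mu if t <= context_len else outputs[-1]
        outputs.append(toy_slot_attention(features, init))
    return np.stack(outputs)


rng = np.random.default_rng(0)
T, Z, K, D, N = 4, 2, 16, 8, 3                  # illustrative sizes only
mu = rng.standard_normal((N, D))                # learnable in the real model
feature_seq = rng.standard_normal((T + Z, K, D))
slots = extract_slots(feature_seq, mu, context_len=T)
print(slots.shape)  # (6, 3, 8)
```

Because every step in the context window starts from the same `mu`, slot `i` starts from the same state in every frame and every episode, which is the property the deterministic initialization is designed to provide.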

Let $g_{η}$ denote a decoder network with parameters $η$. We use it to reconstruct the ground truth DINO features $\hat{F}_{t}$ from the slot representations:

$$g_{η}(s_{t}^{1}, \dots, s_{t}^{N_{slot}}) \to \tilde{F}_{t}.$$

We assume the multivariate Gaussian statistical model $P(F_{t} \mid x_{t}; \{μ_{i}\}, θ, η)$ with an identity covariance matrix $Σ$ for the DINO features:

$$P(F_{t} \mid x_{t}; \{μ_{i}\}, θ, η) = \frac{1}{(\det Σ)^{0.5} (2π)^{0.5 K J}} \exp\left(-\frac{1}{2} (F_{t} - \hat{F}_{t})^{T} Σ^{-1} (F_{t} - \hat{F}_{t})\right).$$

We optimize the slot initialization parameters $\{μ_{i}\}$ and the parameters of the neural networks $f_{θ}$ and $g_{η}$ by maximizing the likelihood of the observed data, assuming independence between time steps. This gives us an expression that is proportional to the MSE objective $L_{extr}$ for slot extraction from the main text:

$$\begin{aligned} \{μ_{i}^{*}\}, θ^{*}, η^{*} &= \arg\max P(\tilde{F}_{1} \mid x_{1}; \{μ_{i}\}, θ, η) \cdots P(\tilde{F}_{T+Z} \mid x_{T+Z}; \{μ_{i}\}, θ, η) \\ &= \arg\max \sum_{t=1}^{T+Z} \log P(\tilde{F}_{t} \mid x_{t}; \{μ_{i}\}, θ, η) \\ &= \arg\max \left[ -\frac{(T+Z) K J}{2} \log 2π - \sum_{t=1}^{T+Z} \frac{\| \tilde{F}_{t}(\{μ_{i}\}, θ, η) - \hat{F}_{t} \|^{2}}{2} \right] \\ &= \arg\min \sum_{t=1}^{T+Z} \| \tilde{F}_{t}(\{μ_{i}\}, θ, η) - \hat{F}_{t} \|^{2}. \end{aligned}$$
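The equivalence between the Gaussian log-likelihood with identity covariance and the summed squared error can be checked numerically. In the sketch below, the array sizes and the random data are illustrative assumptions; `target` plays the role of the DINO features and `recon` the role of the decoder output.

```python
import numpy as np


def gaussian_log_likelihood(value, mean):
    """Log density of N(mean, I) evaluated at value (inputs are flattened)."""
    diff = (value - mean).ravel()
    return -0.5 * diff.size * np.log(2.0 * np.pi) - 0.5 * float(diff @ diff)


rng = np.random.default_rng(0)
steps, K, J = 6, 16, 8                          # illustrative: T + Z = 6 time steps
target = rng.standard_normal((steps, K, J))     # stands in for the DINO features
recon = target + 0.1 * rng.standard_normal((steps, K, J))  # stands in for the decoder output

# Sum of per-step log-likelihoods under the independence assumption.
log_lik = sum(gaussian_log_likelihood(recon[t], target[t]) for t in range(steps))

# Closed form: a constant minus half the summed squared error.
sse = float(np.sum((recon - target) ** 2))
const = -0.5 * steps * K * J * np.log(2.0 * np.pi)
print(np.isclose(log_lik, const - 0.5 * sse))  # True
```

Since `const` does not depend on the parameters, maximizing `log_lik` and minimizing `sse` select the same parameters.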

When we train the transition model, we work with a dataset of slots extracted from the trajectory images. Let $\hat{S}_{1} = (\hat{s}_{1}^{1}, \dots, \hat{s}_{1}^{N_{slot}}), \dots, \hat{S}_{t} = (\hat{s}_{t}^{1}, \dots, \hat{s}_{t}^{N_{slot}}), \dots$ denote a sequence of slot representations extracted from the sequence of images $x_{1}, \dots, x_{t}, \dots$ constituting a trajectory. Let $a_{1}, \dots, a_{t}, \dots$ denote a sequence of actions from the same trajectory. We project actions into the internal dimension of size $D$ of the transition model by employing a learnable function $p_{ν}(a_{t}) \to \hat{a}_{t} \in \mathbb{R}^{D}$. In the case of a continuous action space, $p_{ν}$ is a neural network with parameters $ν$ (Equation (5) in the main text), while, in the case of a discrete action space, $p_{ν}$ is a set of learnable embeddings, one for each possible action (Equation (6) in the main text).
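The two forms of the action projection can be sketched as follows. The two-layer MLP with a ReLU is an assumption (the text only says that the continuous case uses a neural network), the class names are hypothetical, and the weights are random rather than trained. The action sizes and model dimension follow Tables A1, A2 and A4.

```python
import numpy as np


class ContinuousActionProjector:
    """Small MLP mapping actions to R^D; depth and ReLU are assumptions."""

    def __init__(self, action_size, model_dim, rng):
        self.w1 = 0.1 * rng.standard_normal((action_size, model_dim))
        self.b1 = np.zeros(model_dim)
        self.w2 = 0.1 * rng.standard_normal((model_dim, model_dim))
        self.b2 = np.zeros(model_dim)

    def __call__(self, actions):
        # actions: (B, action_size) array of floats
        hidden = np.maximum(actions @ self.w1 + self.b1, 0.0)
        return hidden @ self.w2 + self.b2


class DiscreteActionProjector:
    """Embedding table with one learnable vector per discrete action."""

    def __init__(self, num_actions, model_dim, rng):
        self.table = 0.1 * rng.standard_normal((num_actions, model_dim))

    def __call__(self, actions):
        # actions: (B,) integer action ids
        return self.table[np.asarray(actions)]


rng = np.random.default_rng(0)
D = 128
cont = ContinuousActionProjector(action_size=9, model_dim=D, rng=rng)  # Causal World
disc = DiscreteActionProjector(num_actions=4, model_dim=D, rng=rng)    # ViZDoom

a_cont = cont(rng.uniform(-1.0, 1.0, size=(5, 9)))
a_disc = disc([0, 3, 1, 1, 2])
print(a_cont.shape, a_disc.shape)  # (5, 128) (5, 128)
```

Either projector yields vectors of the same width as the slots, so the transition model can treat projected actions and slots uniformly.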

Let $h_{ϕ}$ denote the transition model with parameters $ϕ$ (Equations (4) and (7) and Figure 2 in the main text). It takes a sequence of size $T$ of slots and projected actions as input and predicts the values of slots for the next time step $T + 1$:

$$h_{ϕ}(\hat{S}_{1}, \dots, \hat{S}_{T}, \hat{a}_{1}, \dots, \hat{a}_{T}) \to \tilde{S}_{T+1} = (\tilde{s}_{T+1}^{1}, \dots, \tilde{s}_{T+1}^{N_{slot}}).$$

For time steps within the prediction horizon, $T + 1 < t \leq T + L$, we predict slots in an autoregressive manner: the input window slides forward by one step, so it still contains $T$ entries, and earlier predictions replace ground truth slots. For instance, for time step $t = T + 2$, we obtain

$$h_{ϕ}(\hat{S}_{2}, \dots, \hat{S}_{T}, \tilde{S}_{T+1}, \hat{a}_{2}, \dots, \hat{a}_{T+1}) \to \tilde{S}_{T+2}.$$
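The autoregressive prediction can be sketched as a loop over a sliding window. The sketch assumes that the model always receives the most recent $T$ slot sets and the $T$ actions aligned with them; `toy_transition` is a hypothetical stand-in for the transition model, and the sizes and random inputs are illustrative.

```python
import numpy as np


def toy_transition(slot_window, action_window):
    """Hypothetical stand-in for h_phi: last slots plus the mean projected action."""
    return slot_window[-1] + action_window.mean(axis=0)  # (N, D) + (D,)


def rollout(context_slots, actions, horizon):
    """Predict `horizon` future slot sets.

    context_slots: (T, N, D) ground truth slots for the context window.
    actions: (T + horizon - 1, D) projected actions.
    """
    T = context_slots.shape[0]
    history = list(context_slots)
    predictions = []
    for step in range(horizon):
        slot_window = np.stack(history[-T:])        # most recent T slot sets
        action_window = actions[step:step + T]      # the T actions aligned with them
        next_slots = toy_transition(slot_window, action_window)
        predictions.append(next_slots)
        history.append(next_slots)                  # feed the prediction back in
    return np.stack(predictions)


rng = np.random.default_rng(0)
T, L, N, D = 4, 4, 3, 8                              # illustrative sizes only
context = rng.standard_normal((T, N, D))
actions = 0.1 * rng.standard_normal((T + L - 1, D))
targets = rng.standard_normal((L, N, D))             # stands in for extracted slots

preds = rollout(context, actions, horizon=L)
loss = float(np.sum((preds - targets) ** 2))         # summed squared error over the horizon
print(preds.shape)  # (4, 3, 8)
```

After the first step, the window contains predicted slots, so prediction errors can compound over the horizon.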

We assume the multivariate Gaussian statistical model $P(S_{T+1} \mid S_{1}, \dots, S_{T}, \hat{a}_{1}, \dots, \hat{a}_{T}; ν, ϕ)$ with an identity covariance matrix $Σ$ for the slot representations of the next step:

$$P(S_{T+1} \mid S_{1}, \dots, S_{T}, \hat{a}_{1}, \dots, \hat{a}_{T}; ν, ϕ) = \frac{1}{(\det Σ)^{0.5} (2π)^{0.5 D N_{slot}}} \exp\left(-\frac{1}{2} (S_{T+1} - \hat{S}_{T+1})^{T} Σ^{-1} (S_{T+1} - \hat{S}_{T+1})\right).$$

We optimize the parameters of $p_{ν}$ and $h_{ϕ}$ by maximizing the likelihood of the observed data, assuming independence between time steps within the prediction horizon. This gives us an expression that is proportional to the MSE objective $L_{s}$ (Equation (8) in the main text) for slot prediction:

$$\begin{aligned} ν^{*}, ϕ^{*} &= \arg\max P(\tilde{S}_{T+1} \mid \hat{S}_{1}, \dots, \hat{S}_{T}, \hat{a}_{1}, \dots, \hat{a}_{T}; ν, ϕ) \cdots P(\tilde{S}_{T+L} \mid \hat{S}_{L}, \dots, \tilde{S}_{T+L-1}, \hat{a}_{L}, \dots, \hat{a}_{T+L-1}; ν, ϕ) \\ &= \arg\max \sum_{t=T+1}^{T+L} \log P(\tilde{S}_{t} \mid \hat{S}_{t-T}, \dots, \tilde{S}_{t-1}, \hat{a}_{t-T}, \dots, \hat{a}_{t-1}; ν, ϕ) \\ &= \arg\max \left[ -\frac{L D N_{slot}}{2} \log 2π - \sum_{t=T+1}^{T+L} \frac{\| \tilde{S}_{t}(ν, ϕ) - \hat{S}_{t} \|^{2}}{2} \right] \\ &= \arg\min \sum_{t=T+1}^{T+L} \| \tilde{S}_{t}(ν, ϕ) - \hat{S}_{t} \|^{2}. \end{aligned}$$

References

1. Lin, B.; Bouneffouf, D.; Rish, I. A Survey on Compositional Generalization in Applications. arXiv 2023, arXiv:2302.01067.
2. Greff, K.; Van Steenkiste, S.; Schmidhuber, J. On the binding problem in artificial neural networks. arXiv 2020, arXiv:2012.05208.
3. Locatello, F.; Weissenborn, D.; Unterthiner, T.; Mahendran, A.; Heigold, G.; Uszkoreit, J.; Dosovitskiy, A.; Kipf, T. Object-Centric Learning with Slot Attention. arXiv 2020.
4. Wu, Z.; Dvornik, N.; Greff, K.; Kipf, T.; Garg, A. SlotFormer: Unsupervised Visual Dynamics Simulation with Object-Centric Models. arXiv 2023, arXiv:2210.05861.
5. Singh, G.; Deng, F.; Ahn, S. Illiterate DALL-E Learns to Compose. In Proceedings of the 10th International Conference on Learning Representations, ICLR 2022, Virtual, 25–29 April 2022.
6. Seitzer, M.; Horn, M.; Zadaianchuk, A.; Zietlow, D.; Xiao, T.; Simon-Gabriel, C.J.; He, T.; Zhang, Z.; Schölkopf, B.; Brox, T.; et al. Bridging the Gap to Real-World Object-Centric Learning. In Proceedings of the Eleventh International Conference on Learning Representations; OpenReview: Amherst, MA, USA, 2023.
7. Wu, P.; Escontrela, A.; Hafner, D.; Abbeel, P.; Goldberg, K. Daydreamer: World models for physical robot learning. In Proceedings of the Conference on Robot Learning; PMLR: New York, NY, USA, 2023; pp. 2226–2240.
8. Burgard, W.; Hebert, M.; Bennewitz, M. World modeling. In Springer Handbook of Robotics; Springer International Publishing: Cham, Switzerland, 2016; pp. 1135–1152.
9. Ha, D.; Schmidhuber, J. World models. arXiv 2018, arXiv:1803.10122.
10. Hafner, D.; Lillicrap, T.P.; Norouzi, M.; Ba, J. Mastering Atari with Discrete World Models. arXiv 2020, arXiv:2010.02193.
11. Micheli, V.; Alonso, E.; Fleuret, F. Transformers are Sample-Efficient World Models. In Proceedings of the Eleventh International Conference on Learning Representations, ICLR, Kigali, Rwanda, 1–5 May 2023.
12. Santoro, A.; Raposo, D.; Barrett, D.G.; Malinowski, M.; Pascanu, R.; Battaglia, P.; Lillicrap, T. A simple neural network module for relational reasoning. In Proceedings of the Advances in Neural Information Processing Systems; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: New York, NY, USA, 2017; Volume 30.
13. Oprea, S.; Martinez-Gonzalez, P.; Garcia-Garcia, A.; Castro-Vargas, J.A.; Orts-Escolano, S.; Garcia-Rodriguez, J.; Argyros, A. A Review on Deep Learning Techniques for Video Prediction. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 2806–2826.
14. Singh, G.; Wu, Y.F.; Ahn, S. Simple Unsupervised Object-Centric Learning for Complex and Naturalistic Videos. In Proceedings of the Advances in Neural Information Processing Systems; Oh, A.H., Agarwal, A., Belgrave, D., Cho, K., Eds.; Curran Associates, Inc.: New York, NY, USA, 2022.
15. Kipf, T.; Elsayed, G.F.; Mahendran, A.; Stone, A.; Sabour, S.; Heigold, G.; Jonschkowski, R.; Dosovitskiy, A.; Greff, K. Conditional Object-Centric Learning from Video. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual, 25–29 April 2022.
16. Elsayed, G.F.; Mahendran, A.; van Steenkiste, S.; Greff, K.; Mozer, M.C.; Kipf, T. SAVi++: Towards end-to-end object-centric learning from real-world videos. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS); Curran Associates, Inc.: New York, NY, USA, 2022.
17. Zadaianchuk, A.; Seitzer, M.; Martius, G. Object-Centric Learning for Real-World Videos by Predicting Temporal Feature Similarities. In Proceedings of the Thirty-Seventh Conference on Neural Information Processing Systems (NeurIPS 2023); Curran Associates, Inc.: New York, NY, USA, 2023.
18. Villar-Corrales, A.; Wahdan, I.; Behnke, S. Object-Centric Video Prediction via Decoupling of Object Dynamics and Interactions. In Proceedings of the International Conference on Image Processing (ICIP), Kuala Lumpur, Malaysia, 8–11 October 2023; IEEE: New York, NY, USA, 2023.
19. Chang, M.; Griffiths, T.; Levine, S. Object Representations as Fixed Points: Training Iterative Refinement Algorithms with Implicit Differentiation. In Proceedings of the Advances in Neural Information Processing Systems; Curran Associates, Inc.: New York, NY, USA, 2022; Volume 35.
20. Jia, B.; Liu, Y.; Huang, S. Improving Object-centric Learning with Query Optimization. In Proceedings of the Eleventh International Conference on Learning Representations, ICLR, Kigali, Rwanda, 1–5 May 2023.
21. Singh, G.; Kim, Y.; Ahn, S. Neural Systematic Binder. In Proceedings of the Eleventh International Conference on Learning Representations, ICLR, Kigali, Rwanda, 1–5 May 2023.
22. Kirilenko, D.; Vorobyov, V.; Kovalev, A.; Panov, A. Object-Centric Learning with Slot Mixture Module. In Proceedings of the Twelfth International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024.
23. Patil, V.; Radler, A.; Klotz, D.; Hochreiter, S. Simplified priors for Object-Centric Learning. arXiv 2024, arXiv:2410.00728.
24. Van Den Oord, A.; Vinyals, O.; Kavukcuoglu, K. Neural discrete representation learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017.
25. Caron, M.; Touvron, H.; Misra, I.; Jégou, H.; Mairal, J.; Bojanowski, P.; Joulin, A. Emerging Properties in Self-Supervised Vision Transformers. In Proceedings of the International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; IEEE: New York, NY, USA, 2021.
26. Kosiorek, A.; Kim, H.; Teh, Y.W.; Posner, I. Sequential attend, infer, repeat: Generative modelling of moving objects. In Proceedings of the Advances in Neural Information Processing Systems; Curran Associates, Inc.: New York, NY, USA, 2018; pp. 8606–8616.
27. Jiang, J.; Janghorbani, S.; de Melo, G.; Ahn, S. SCALOR: Generative World Models with Scalable Object Representations. In Proceedings of the 8th International Conference on Learning Representations, ICLR, Addis Ababa, Ethiopia, 26–30 April 2020.
28. Crawford, E.; Pineau, J. Exploiting Spatial Invariance for Scalable Unsupervised Object Tracking. In Proceedings of the AAAI Conference on Artificial Intelligence; AAAI Press: Washington, DC, USA, 2020; pp. 3684–3692.
29. Kossen, J.; Stelzner, K.; Hussing, M.; Voelcker, C.; Kersting, K. Structured Object-Aware Physics Prediction for Video Modeling and Planning. arXiv 2019, arXiv:1910.02425.
30. Scarselli, F.; Gori, M.; Tsoi, A.C.; Hagenbuchner, M.; Monfardini, G. The Graph Neural Network Model. IEEE Trans. Neural Netw. 2009, 20, 61–80.
31. Lin, Z.; Wu, Y.F.; Peri, S.; Fu, B.; Jiang, J.; Ahn, S. Improving generative imagination in object-centric world models. In Proceedings of the International Conference on Machine Learning; PMLR: New York, NY, USA, 2020; pp. 6140–6149.
32. Kipf, T.N.; van der Pol, E.; Welling, M. Contrastive Learning of Structured World Models. In Proceedings of the 8th International Conference on Learning Representations, ICLR, Addis Ababa, Ethiopia, 26–30 April 2020.
33. Biza, O.; van der Pol, E.; Kipf, T. The Impact of Negative Sampling on Contrastive Structured World Models. In Proceedings of the ICML Workshop: Self-Supervised Learning for Reasoning and Perception, Online, 24 July 2021.
34. Biza, O.; Platt, R.; van de Meent, J.W.; Wong, L.L.S.; Kipf, T. Binding Actions to Objects in World Models. In Proceedings of the ICLR 2022 Workshop on Objects, Structure and Causality, Online, 25–29 April 2022.
35. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NeurIPS; Curran Associates, Inc.: New York, NY, USA, 2017.
36. Wydmuch, M.; Kempka, M.; Jaśkowski, W. ViZDoom Competitions: Playing Doom from Pixels. IEEE Trans. Games 2019, 11, 248–259.
37. Ahmed, O.; Träuble, F.; Goyal, A.; Neitz, A.; Bengio, Y.; Schölkopf, B.; Wüthrich, M.; Bauer, S. Causalworld: A robotic manipulation benchmark for causal structure and transfer learning. arXiv 2020, arXiv:2010.04296.
38. Zhu, Y.; Wong, J.; Mandlekar, A.; Martín-Martín, R.; Joshi, A.; Nasiriany, S.; Zhu, Y. robosuite: A Modular Simulation Framework and Benchmark for Robot Learning. arXiv 2022, arXiv:2009.12293.
39. Zhou, G.; Pan, H.; LeCun, Y.; Pinto, L. DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning. arXiv 2024, arXiv:2411.04983.

Figure 1. Episode-consistent slot visualizations extracted by COMPAS slot extractor. First column: ground truth image; subsequent columns: slot visualizations.

Figure 3. COMPAS components: (a) slot initialization architecture; (b) temporal attention mask; (c) COMPAS dynamics predictor block.

Table 1. Comparative results for the Causal World (Reach) environment. Mean ± std for 3 seeds. All values are multiplied by 100 for better visual presentation.

	10 Steps		30 Steps		60 Steps
Model	$MSE$	${MSE}^{decoder}$	$MSE$	${MSE}^{decoder}$	$MSE$	${MSE}^{decoder}$
Slotformer	$6.48 \pm 0.02$	$15.35 \pm 0.03$	$7.42 \pm 0.12$	$17.95 \pm 0.10$	$8.22 \pm 0.19$	$19.30 \pm 0.19$
OCVP	$6.32 \pm 0.03$	$15.10 \pm 0.04$	$7.28 \pm 0.11$	$17.20 \pm 0.13$	$8.08 \pm 0.18$	$19.25 \pm 0.17$
OCVP (add act.)	$4.66 \pm 0.03$	$11.72 \pm 0.03$	$5.88 \pm 0.09$	$13.58 \pm 0.10$	$6.94 \pm 0.17$	$15.66 \pm 0.17$
OCVP (slot act.)	$4.26 \pm 0.02$	$10.86 \pm 0.02$	$5.46 \pm 0.08$	$12.84 \pm 0.09$	$6.62 \pm 0.16$	$14.92 \pm 0.16$
GNN	$5.18 \pm 0.03$	$14.48 \pm 0.03$	$6.75 \pm 0.10$	$16.18 \pm 0.12$	$7.88 \pm 0.18$	$18.42 \pm 0.18$
COMPAS (ours)	3.18 ± 0.01	4.38 ± 0.01	4.23 ± 0.07	6.82 ± 0.07	5.12 ± 0.15	10.22 ± 0.15

Table 2. Comparative results for the Causal World (Push) environment. Mean ± std for 3 seeds. All values are multiplied by 100 for better visual presentation.

	10 Steps		30 Steps		60 Steps
Model	$MSE$	${MSE}^{decoder}$	$MSE$	${MSE}^{decoder}$	$MSE$	${MSE}^{decoder}$
Slotformer	$6.36 \pm 0.02$	$15.33 \pm 0.04$	$7.38 \pm 0.10$	$17.94 \pm 0.12$	$8.07 \pm 0.18$	$19.08 \pm 0.20$
OCVP	$8.65 \pm 0.03$	$16.11 \pm 0.05$	$9.40 \pm 0.12$	$18.98 \pm 0.15$	$10.04 \pm 0.19$	$21.00 \pm 0.18$
OCVP (add act.)	$4.72 \pm 0.03$	$12.84 \pm 0.03$	$5.98 \pm 0.10$	$14.76 \pm 0.11$	$7.02 \pm 0.17$	$16.92 \pm 0.18$
OCVP (slot act.)	$4.38 \pm 0.02$	$11.96 \pm 0.03$	$5.62 \pm 0.09$	$13.88 \pm 0.10$	$6.76 \pm 0.16$	$15.84 \pm 0.17$
GNN	$5.20 \pm 0.04$	$14.50 \pm 0.03$	$6.80 \pm 0.11$	$16.20 \pm 0.14$	$7.90 \pm 0.17$	$18.40 \pm 0.19$
COMPAS (ours)	3.15 ± 0.01	6.40 ± 0.01	4.25 ± 0.08	8.80 ± 0.08	5.10 ± 0.16	10.20 ± 0.16

Table 3. Comparative results for the RoboSuite environment. Mean ± std for 3 seeds. All values are multiplied by 100 for better visual presentation.

	10 Steps		30 Steps		60 Steps
Model	$MSE$	${MSE}^{decoder}$	$MSE$	${MSE}^{decoder}$	$MSE$	${MSE}^{decoder}$
Slotformer	$1.70 \pm 0.03$	$9.83 \pm 0.08$	$1.99 \pm 0.11$	$11.52 \pm 0.14$	$2.65 \pm 0.18$	$12.01 \pm 0.20$
OCVP	$1.82 \pm 0.05$	$10.55 \pm 0.07$	$2.20 \pm 0.13$	$12.19 \pm 0.15$	$2.58 \pm 0.19$	$12.94 \pm 0.18$
GNN	$1.00 \pm 0.02$	$9.46 \pm 0.04$	$1.49 \pm 0.12$	$12.08 \pm 0.13$	$1.89 \pm 0.17$	$12.97 \pm 0.19$
COMPAS (ours)	0.50 ± 0.01	8.08 ± 0.03	0.86 ± 0.11	10.53 ± 0.12	1.25 ± 0.16	11.93 ± 0.17

Table 4. Comparative results for the ViZDoom (Defend the Line) environment. Mean ± std for 3 seeds. All values are multiplied by 100 for better visual presentation.

	10 Steps		30 Steps		60 Steps
Model	$MSE$	${MSE}^{decoder}$	$MSE$	${MSE}^{decoder}$	$MSE$	${MSE}^{decoder}$
Slotformer	$14.58 \pm 0.05$	$6.51 \pm 0.02$	$20.22 \pm 0.12$	$14.94 \pm 0.10$	$27.39 \pm 0.18$	$30.31 \pm 0.20$
OCVP	$14.10 \pm 0.03$	$6.08 \pm 0.03$	$20.41 \pm 0.10$	$14.89 \pm 0.12$	$26.86 \pm 0.20$	$31.04 \pm 0.18$
GNN	$8.75 \pm 0.04$	$5.20 \pm 0.03$	$15.60 \pm 0.11$	$12.40 \pm 0.11$	$23.10 \pm 0.17$	$32.50 \pm 0.19$
COMPAS (ours)	4.31 ± 0.01	2.85 ± 0.01	9.43 ± 0.08	10.27 ± 0.08	21.23 ± 0.16	29.17 ± 0.16

Table 5. Performance comparison of COMPAS extractor and Videosaur model over 20 and 100 steps in Causal World (Push) environment. Mean ± std for 3 seeds.

Model	ARI	FG-ARI	mbo	Trajectory IDF1	Episodic IDF1
COMPAS (100 steps)	$0.43 \pm 0.14$	$0.61 \pm 0.05$	$0.79 \pm 0.03$	$1.00 \pm 0.00$	$0.78 \pm 0.05$
Videosaur (100 steps)	$0.19 \pm 0.03$	$0.20 \pm 0.02$	$0.19 \pm 0.02$	$0.21 \pm 0.04$	$0.11 \pm 0.02$
COMPAS (20 steps)	$0.51 \pm 0.06$	$0.69 \pm 0.03$	$0.82 \pm 0.01$	$1.00 \pm 0.00$	$0.82 \pm 0.04$
Videosaur (20 steps)	$0.44 \pm 0.05$	$0.51 \pm 0.06$	$0.76 \pm 0.02$	$0.74 \pm 0.04$	$0.12 \pm 0.01$

Table 6. Ablation analysis of COMPAS extractor model. Mean ± std for 3 seeds.

Model	ARI	FG-ARI	mbo	Trajectory IDF1	Episodic IDF1
COMPAS	$\underline{0.43 \pm 0.14}$	$\underline{0.61 \pm 0.05}$	$\underline{0.79 \pm 0.03}$	$1.00 \pm 0.00$	$\underline{0.78 \pm 0.05}$
No temporal block	$\underline{0.42 \pm 0.12}$	$\underline{0.60 \pm 0.04}$	$\underline{0.75 \pm 0.02}$	$0.89 \pm 0.04$	$\underline{0.75 \pm 0.04}$
No temporal encoding	$0.18 \pm 0.05$	$0.21 \pm 0.01$	$0.21 \pm 0.03$	$0.41 \pm 0.05$	$0.34 \pm 0.05$
No temporal block and encoding	$0.16 \pm 0.04$	$0.18 \pm 0.04$	$0.19 \pm 0.02$	$0.32 \pm 0.05$	$0.28 \pm 0.03$
No slot encoding	$0.00 \pm 0.00$	$0.00 \pm 0.00$	$0.18 \pm 0.01$	$0.21 \pm 0.05$	$0.19 \pm 0.05$

Table 7. Ablation analysis of COMPAS dynamics model. Mean ± std for 3 seeds. All values are multiplied by 100 for better visual presentation.

	10 Steps		30 Steps		60 Steps
Model	$MSE$	${MSE}^{decoder}$	$MSE$	${MSE}^{decoder}$	$MSE$	${MSE}^{decoder}$
COMPAS	$3.15 \pm 0.01$	$6.40 \pm 0.01$	$4.25 \pm 0.08$	$8.80 \pm 0.08$	$5.10 \pm 0.16$	$10.20 \pm 0.16$
No interaction block	$4.83 \pm 0.02$	$12.17 \pm 0.03$	$5.41 \pm 0.07$	$13.71 \pm 0.09$	$5.96 \pm 0.12$	$15.45 \pm 0.14$
No temporal block	$4.84 \pm 0.03$	$12.02 \pm 0.02$	$5.35 \pm 0.06$	$13.50 \pm 0.10$	$5.82 \pm 0.11$	$15.25 \pm 0.13$
No action block	$8.65 \pm 0.03$	$16.11 \pm 0.05$	$9.40 \pm 0.12$	$18.98 \pm 0.15$	$10.04 \pm 0.19$	$21.00 \pm 0.18$

Table 8. Ablation analysis of separated and end-to-end training for the COMPAS dynamics model in the Causal World (Push) environment. Mean ± std for 3 seeds. All values are multiplied by 100 for better visual presentation.

	10 Steps		30 Steps		60 Steps
Model	$MSE$	${MSE}^{decoder}$	$MSE$	${MSE}^{decoder}$	$MSE$	${MSE}^{decoder}$
COMPAS (separated)	3.15 ± 0.01	6.40 ± 0.01	4.25 ± 0.08	8.80 ± 0.08	5.10 ± 0.16	10.20 ± 0.16
COMPAS (end-to-end)	$3.48 \pm 0.02$	$6.92 \pm 0.02$	$4.76 \pm 0.09$	$9.54 \pm 0.09$	$5.84 \pm 0.17$	$11.18 \pm 0.17$

Table 9. Ablation analysis of context size for the COMPAS extractor model in the Causal World (Push) environment. Mean ± std for 3 seeds.

Context Size	ARI	FG-ARI	mbo	Trajectory IDF1	Episodic IDF1
3	$0.40 \pm 0.13$	$0.58 \pm 0.05$	$0.76 \pm 0.03$	$1.00 \pm 0.00$	$0.75 \pm 0.05$
4	$0.43 \pm 0.14$	$0.61 \pm 0.05$	$0.79 \pm 0.03$	$1.00 \pm 0.00$	$0.78 \pm 0.05$
10	$0.44 \pm 0.13$	$0.60 \pm 0.04$	$0.80 \pm 0.02$	$1.00 \pm 0.00$	$0.79 \pm 0.04$

Table 10. Ablation analysis of context size for the COMPAS dynamics model in the Causal World (Push) environment. Mean ± std for 3 seeds. All values are multiplied by 100 for better visual presentation.

	10 Steps		30 Steps		60 Steps
Context Size	$MSE$	${MSE}^{decoder}$	$MSE$	${MSE}^{decoder}$	$MSE$	${MSE}^{decoder}$
3	$3.48 \pm 0.02$	$6.92 \pm 0.02$	$4.76 \pm 0.09$	$9.54 \pm 0.09$	$5.84 \pm 0.17$	$11.18 \pm 0.17$
4	$\underline{3.15 \pm 0.01}$	$\underline{6.40 \pm 0.01}$	$\underline{4.25 \pm 0.08}$	$\underline{8.80 \pm 0.08}$	$\underline{5.10 \pm 0.16}$	$\underline{10.20 \pm 0.16}$
10	$\underline{3.16 \pm 0.01}$	$\underline{6.42 \pm 0.01}$	$\underline{4.24 \pm 0.08}$	$\underline{8.82 \pm 0.08}$	$\underline{5.11 \pm 0.16}$	$\underline{10.19 \pm 0.16}$

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

