Figure 2. The method starts by inputting a trajectory of $T$ observations into a frozen DINO encoder to extract features. Trainable tokens are duplicated and combined with positional encodings to produce deterministic slot initializations (Figure 3a). These, along with the features, are fed into a slot extractor with temporal sinusoidal encoding. The extractor has two components: a shared key slot attention block that maps slots to feature patches and a temporal consistency block that maintains coherence across time. After $L$ layers, slot representations are produced for each step. A transition model then predicts future states using discrete or continuous actions, which are processed into vectors. Slots and actions, embedded with temporal encodings, are passed through the predictor block (Figure 3c). After the final predictor layer, the model outputs the next-step slots.
3.1. Slot Representation Extraction
In contrast to the iterative approaches used in the classical slot attention algorithm (and its variations), we use sequentially stacked blocks. This design enables greater parallelization, better scalability, and faster training and inference. The overall architecture of the slot extractor module is presented in Figure 2.
First, we initialize the slots by duplicating a trainable token of size $D$ a total of $T \times N$ times, resulting in a slot initialization $S^{\text{init}} \in \mathbb{R}^{T \times N \times D}$. Here, $T$ is the length of the starting time window, $D$ is the size of a single slot, and $N$ is the number of slots. By adding a trainable slot position embedding $P^{\text{slot}}$, which encodes each token with its slot position, and sinusoidal time position embeddings $P^{\text{time}}_t$ to the corresponding slots, we obtain the slot initialization at time step $t$:

$$S^{0}_{t} = S^{\text{init}}_{t} + P^{\text{slot}} + P^{\text{time}}_{t}. \tag{1}$$
The slot’s positional encoding remains constant across time steps but differs for each slot. Similarly, the time positional encoding is the same for all slots within a single time step but varies between time steps. This deterministic initialization, combined with a trainable token and slot position embeddings, ensures slot consistency across episodes.
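The initialization described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `sinusoidal_embedding` and `init_slots` are hypothetical, the interleaved sin/cos layout follows the common Transformer convention, and fixed toy vectors stand in for the trainable token and the trainable slot position embeddings.

```python
import math

def sinusoidal_embedding(position, dim):
    """Sinusoidal encoding of one integer position (dim is assumed even)."""
    emb = []
    for k in range(0, dim, 2):
        freq = 1.0 / (10000 ** (k / dim))
        emb.append(math.sin(position * freq))
        emb.append(math.cos(position * freq))
    return emb

def init_slots(token, slot_pos_emb, T):
    """Return a T x N x D nested list: token + slot position + time embedding."""
    N, D = len(slot_pos_emb), len(token)
    slots = []
    for t in range(T):
        time_emb = sinusoidal_embedding(t, D)  # shared by all slots at step t
        slots.append([
            [token[d] + slot_pos_emb[n][d] + time_emb[d] for d in range(D)]
            for n in range(N)                  # slot embedding is shared across time
        ])
    return slots

# Hypothetical toy sizes: T = 3 time steps, N = 2 slots, D = 4 dimensions.
token = [0.1, 0.2, 0.3, 0.4]           # stands in for the trainable token
slot_pos_emb = [[0.0] * 4, [1.0] * 4]  # stands in for trainable slot embeddings
S0 = init_slots(token, slot_pos_emb, T=3)
```

Because nothing is sampled, calling `init_slots` twice returns identical values, which is the property the text relies on for slot consistency across episodes.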
After initialization, we feed $S^{0}$ into the slot extraction block. To generalize the notation, let $S^{l}$ represent the slots from the $l$-th layer of the slot extractor block, with $S^{0}$ denoting the initialization. This block extracts slot representations from the encoder features $F_t \in \mathbb{R}^{M \times J}$, where $M$ is the number of patches and $J$ denotes the encoder feature dimension. We utilize a modified version of the shared key slot attention [23]. The encoder features $F_t$ are extracted from the input image frames $o_t$ using a pre-trained DINO [25] encoder:

$$F_t = \mathrm{DINO}(o_t), \qquad \tilde{S}^{l}_{t} = \mathrm{SlotAttention}\big(S^{l-1}_{t}, F_t\big). \tag{2}$$
Object-centric models for video, such as SAVi or VideoSAUR, implement slot trajectory consistency through a recurrent predictor model, which predicts the slot initialization for the next time step based on the slots from the previous time step. However, this recurrent approach may yield lower performance and is prone to gradient decay. To address these issues and improve training and inference speed, we introduce a temporal consistency block. This block processes slots in parallel and is implemented as a transformer encoder layer with an attention mask. The mask restricts attention causally to the same slot at previous time steps, while ignoring all other slots. In other words, the $i$-th slot $s^{i}_{t}$ can only attend to the same slot $s^{i}_{m}$ from a previous time step $m$ ($m \le t$). The visualization of the mask matrix is shown in Figure 3b.
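The masking rule can be made concrete with a short sketch. The function name `temporal_consistency_mask`, the time-major flattening of the $T \times N$ slot tokens, and the choice to let a slot attend to itself at the current step are assumptions made for illustration; the paper's exact layout is the one shown in Figure 3b.

```python
def temporal_consistency_mask(T, N):
    """Boolean (T*N) x (T*N) mask; True means the query may attend to the key.

    Tokens are flattened time-major: index = t * N + i for slot i at step t.
    """
    size = T * N
    mask = [[False] * size for _ in range(size)]
    for t in range(T):
        for i in range(N):
            query = t * N + i
            for m in range(t + 1):             # causal: only steps m <= t
                mask[query][m * N + i] = True  # same slot index i only
    return mask

mask = temporal_consistency_mask(T=3, N=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Each row has at most $T$ allowed entries instead of $T \cdot N$, so every slot only sees its own history, which is what lets all slots and time steps be processed in one parallel pass.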
By feeding the sequence of extracted slots from the $l$-th layer in a time window of size $T$, $\tilde{S}^{l}_{1:T}$, into the temporal consistency block, we obtain the slots of the $l$-th layer:

$$S^{l}_{1:T} = \mathrm{TemporalConsistency}\big(\tilde{S}^{l}_{1:T}\big). \tag{3}$$

Assuming that the slot extractor model comprises $L$ layers, sequentially applying these operations results in the final slots $S^{L}_{1:T}$.
Since the context length of the model is constrained to a fixed time window $T$, which is typically small, scaling the model to longer videos is challenging. To address this issue, the slot extractor module is trained in an autoregressive manner. The initialization of the next slots, $S^{0}_{T+1}$, is performed as described in Equation (1). Subsequently, the slots are extracted using the procedure detailed in Equation (2). The extracted slots, $\tilde{S}^{l}_{T+1}$, are then appended to the shifted time window of previously extracted slots, $\tilde{S}^{l}_{2:T}$, resulting in the sequence $\tilde{S}^{l}_{2:T+1}$. These slots are then fed into the temporal consistency block described in Equation (3). By sequentially processing them through all $L$ layers, the final slots, $S^{L}_{T+1}$, are obtained. This procedure can be repeated for any desired horizon beyond the initial context length $T$.
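The shift-and-append procedure can be illustrated with a small sketch. It is deliberately simplified: only a single extractor layer is shown, and `slot_attention` and `temporal_block` are hypothetical callables standing in for the trained blocks, replaced here by toy scalar functions so the data flow is easy to follow.

```python
from collections import deque

def extract_long_sequence(features, T, slot_attention, temporal_block):
    """Slide a context window of T steps over a longer feature sequence.

    slot_attention: features of one frame -> slots of that frame.
    temporal_block: list of per-step slots -> list of refined per-step slots.
    Both are hypothetical stand-ins; a single layer is shown.
    """
    # Initial window: all T steps are processed together.
    window = deque((slot_attention(f) for f in features[:T]), maxlen=T)
    final_slots = list(temporal_block(list(window)))
    # Beyond the window: extract one new step, shift the window, refine.
    for f in features[T:]:
        window.append(slot_attention(f))  # maxlen drops the oldest step
        final_slots.append(temporal_block(list(window))[-1])
    return final_slots

# Toy stand-ins on scalars: "attention" scales, "temporal block" is a causal mean.
toy_attention = lambda f: 10.0 * f
toy_temporal = lambda seq: [sum(seq[:k + 1]) / (k + 1) for k in range(len(seq))]
result = extract_long_sequence([1, 2, 3, 4, 5], 3, toy_attention, toy_temporal)
```

Only the newest step is kept from each shifted window, so the cost per additional frame stays bounded by the window size $T$ regardless of the total video length.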
The training approach follows the methodology outlined in DINOSAUR [6]. After extracting the slots $S^{L}_{1:T+Z}$ across the entire training trajectory of length $T + Z$, where $Z$ represents the prediction horizon for autoregressive extraction, these slots are processed through a shared MLP decoder to generate feature reconstructions $\hat{F}_{1:T+Z}$. The training objective is to minimize the mean squared error (MSE) loss between the reconstructions and the encoder features:

$$\mathcal{L}_{\text{rec}} = \mathrm{MSE}\big(\hat{F}_{1:T+Z}, F_{1:T+Z}\big).$$
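As a minimal sketch of this objective, the function below averages the squared error over time steps, patches, and feature dimensions. The name `mse_loss` and the nested-list layout (steps, then patches, then feature dimensions) are assumptions for illustration; the decoder that produces the reconstructions is not modeled.

```python
def mse_loss(reconstructions, targets):
    """Mean squared error over all steps, patches, and feature dimensions."""
    total, count = 0.0, 0
    for rec_step, tgt_step in zip(reconstructions, targets):  # time steps
        for rec_patch, tgt_patch in zip(rec_step, tgt_step):  # patches
            for r, y in zip(rec_patch, tgt_patch):            # feature dims
                total += (r - y) ** 2
                count += 1
    return total / count

# Toy example: 1 step, 2 patches, 2 feature dimensions.
encoder_features = [[[1.0, 2.0], [3.0, 4.0]]]
reconstructed = [[[1.0, 2.0], [3.0, 6.0]]]
loss = mse_loss(reconstructed, encoder_features)
```

Only one entry differs (by 2), so the summed squared error is 4 over 4 entries.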
The overall time complexity of a single forward pass of the extractor model is , where T is the initial time window of slots, L is the number of extractor layers, K is the number of frame feature patches, N is the number of slots, and D is the slot dimension.
3.2. Transition Model
The transition model is inspired by OCVP. During the early phases of training of the slot extractor, slots often lack useful information, which can push the transition model away from the optimal solution or cause it to become stuck in a local minimum. To address these challenges and accelerate the training process, the transition model is trained separately from the slot extractor on pre-computed slots.
At the start, the slots $S_{1:T}$ within a time window of size $T$ are projected into the transition model's dimension. Since the transition model operates with the same dimension $D$ as the slots, this projection leaves the slot dimensionality unchanged.
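The transition model supports both continuous and discrete actions. Continuous actions $a_t$ are projected into the same dimension as the transition model:

$$\hat{a}_t = \mathrm{Proj}(a_t) \in \mathbb{R}^{D}.$$

For discrete actions, we use a trainable embedding for each possible action $a_t$:

$$\hat{a}_t = \mathrm{Embedding}(a_t) \in \mathbb{R}^{D}.$$

To incorporate temporal information, we add sinusoidal time embeddings $P^{\text{time}}_t$ to each slot and action at time step $t$:

$$s^{i}_{t} \leftarrow s^{i}_{t} + P^{\text{time}}_{t}, \qquad \hat{a}_t \leftarrow \hat{a}_t + P^{\text{time}}_{t}.$$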
Subsequently, actions and slots are passed into an action attention block, a relation attention block, and a temporal attention block in parallel to the original OCVP layer. The architecture of the predictor block is presented in Figure 3c.
Action attention is implemented as a simple multi-head cross-attention. In this layer, actions serve as keys and values, while slots act as queries.
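The query/key/value roles can be sketched as follows. This is a single-head simplification under stated assumptions: the learned query, key, and value projections and the multi-head split are omitted (treated as identity), and the function names are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def action_cross_attention(slots, actions):
    """Single-head cross-attention: slots are queries, actions are keys and values.

    Learned projections are omitted (identity) to keep the sketch short.
    """
    d = len(slots[0])
    out = []
    for q in slots:
        # Scaled dot-product score of this slot against every action token.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in actions]
        weights = softmax(scores)
        # Weighted sum of action values.
        out.append([sum(w * v[j] for w, v in zip(weights, actions)) for j in range(d)])
    return out

slots = [[1.0, 0.0], [0.0, 1.0]]
actions = [[0.5, -0.5]]  # a single action token, for illustration
updated = action_cross_attention(slots, actions)
```

The output has one row per slot, so each slot receives its own action-conditioned update while the number of slots is unchanged.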
After projecting the features from the final time step into the slot dimension, we obtain the predicted features for the next time step, denoted as $\hat{S}_{T+1}$. By shifting the input slot time window and appending $\hat{S}_{T+1}$ as the last element, along with obtaining the next action $a_{T+1}$, we can autoregressively predict future states for any horizon $H$.
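The rollout loop might look like the sketch below. The `predictor` callable is a hypothetical stand-in for the trained transition model, and the alignment of actions to steps (one action per context step, so $T + H - 1$ actions are needed for horizon $H$) is an assumption made for illustration.

```python
def predict_rollout(context_slots, actions, H, predictor):
    """Autoregressively predict H future slot sets.

    context_slots: list of T per-step slot sets (the initial window).
    actions: list of T + H - 1 actions, where actions[k] belongs to step k.
    predictor: hypothetical callable (slot_window, action_window) -> next slots.
    """
    T = len(context_slots)
    slot_window = list(context_slots)
    predictions = []
    for h in range(H):
        action_window = actions[h:h + T]  # actions aligned with the window
        next_slots = predictor(slot_window, action_window)
        predictions.append(next_slots)
        slot_window = slot_window[1:] + [next_slots]  # shift and append
    return predictions

# Toy predictor on scalars: next state = last state + last action.
toy_predictor = lambda slot_window, action_window: slot_window[-1] + action_window[-1]
future = predict_rollout([1, 2, 3], [0, 0, 1, 1, 1], H=3, predictor=toy_predictor)
```

After the first step, the window contains the model's own predictions, so errors can accumulate over long horizons.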
The training objective for the COMPAS dynamics model is defined as the reconstruction loss between the ground truth slots $S$ and the model's predictions $\hat{S}$:
The single-forward-pass time complexity of the transition model is