Learning to Predict 3D Rotational Dynamics from Images of a Rigid Body with Unknown Mass Distribution

In many real-world settings, image observations of freely rotating 3D rigid bodies may be available when low-dimensional measurements are not. However, the high dimensionality of image data precludes the use of classical estimation techniques to learn the dynamics. The usefulness of standard deep learning methods is also limited, because an image of a rigid body reveals nothing about the distribution of mass inside the body, which, together with the initial angular velocity, is what determines how the body will rotate. We present a physics-based neural network model to estimate and predict 3D rotational dynamics from image sequences. We achieve this using a multi-stage prediction pipeline that maps individual images to a latent representation homeomorphic to SO(3), computes angular velocities from latent pairs, and predicts future latent states using the Hamiltonian equations of motion. We demonstrate the efficacy of our approach on new rigid-body datasets consisting of sequences of synthetic images of rotating objects, including cubes, prisms, and satellites, with unknown uniform and non-uniform mass distributions. Our model outperforms competing baselines on our datasets, producing better qualitative predictions and reducing the error observed for the state-of-the-art Hamiltonian Generative Network by a factor of 2.


Introduction
The study and control of a range of systems can benefit from the means to predict the rotational dynamics of 3D rigid bodies that are only observed through images. A compelling example is the navigation and control of space robotic systems that interact with resident space objects (RSOs). RSOs are natural or designed freely rotating rigid bodies that orbit a planet or moon. Space robotic system missions that involve interaction with RSOs include collecting samples from an asteroid [1], servicing a malfunctioning satellite [2], and removing active space debris [3]. A challenge is that space robotic systems may have limited information on the mass distribution of RSOs. However, they do typically have onboard cameras to observe sequences of RSO movement. Thus, learning to predict the dynamics of RSOs from onboard images can make a difference for mission success.
Whether a freely rotating 3D rigid body tumbles unstably or spins stably depends on the distribution of mass inside the body and the body's initial angular velocity (compare Figure 1a and Figure 1b). This means that to predict the body's rotational dynamics, it is not enough to know the external geometry of the body. That would be insufficient, for instance, to predict the different behavior of two bodies with the same external geometry and different internal mass distribution. Even if the bodies start at the same initial angular velocity, one body could tumble or wobble while the other spins stably (compare Figure 1b and Figure 1d). Figure 1 shows four simulations of a freely rotating rigid body that illustrate the role of mass distribution and initial velocity. The distribution of mass determines $J \in \mathbb{R}^{3 \times 3}$, where $J$ is the moment-of-inertia matrix for a 3D rigid body expressed with respect to the body-fixed frame, i.e., an orthonormal reference frame $B = \{i, j, k\}$ fixed to the body with origin at the body's center of mass (see Appendix A.1 for details). Figure 1a-c all have the same moment-of-inertia matrix $J = J_1$, which corresponds to that of a rectangular prism with uniform mass distribution (see Table A1 in Appendix A.2). Steady spin about the longest and shortest principal axes is stable, and spin about the intermediate principal axis is unstable (see Appendix A.1). So, if the initial angular velocity is near the unstable solution, the body tumbles (Figure 1a), whereas if it is near the stable axis, the body spins (Figure 1b). This is independent of the external geometry, which explains why the satellite in Figure 1c spins identically to the prism in Figure 1b. In Figure 1d, mass is non-uniformly distributed such that $J = J_3$ (see Table A1 in Appendix A.2), and the same initial velocity as in Figure 1b is no longer close to a stable solution, which explains why the prism wobbles.
Predicting 3D rigid-body rotational dynamics is possible if the body's mass distribution can be learned from observations of the body in motion. This is easier if the observations consist of low-dimensional data, e.g., measurements of the body's angular velocity and the rotation matrix that defines the body's orientation. It is much more challenging, however, if the only available measurements consist of images of the body in motion, as in the case of remote observations of a satellite, asteroid, or space debris.
We address the challenge of learning and predicting 3D rotational dynamics from image sequences of a rigid body with unknown mass distribution and unknown initial angular velocity. To do so, we design a neural network model that leverages the Hamiltonian structure associated with 3D rigid-body dynamics. We show how our approach outperforms applicable methods from the existing literature.
Deep learning has proven to be an effective tool to learn dynamics from images. Previous work [4-6] has made significant progress in using physics-based priors to learn dynamics from images of 2D rigid bodies, such as a pendulum. Learning dynamics of 3D rigid-body motion has also been explored with various types of input data [7-9]. We believe our method is the first to use the Hamiltonian formalism to learn 3D rigid-body rotational dynamics from images.
In this work, we introduce a model, with architecture depicted in Figure 2, that (1) learns 3D rigid-body rotational dynamics from images, (2) predicts future image sequences in time, and (3) generates a low-dimensional, interpretable representation of the latent state. During training, our model encodes a sequence of images (input) to a sequence of latent orientations (Figure 2a). The sequence of orientations is processed by two pathways.
In one, the sequence is decoded to a sequence of images, which are used to compute the auto-encoding reconstruction loss (Figure 2c). In the other, the first element of the sequence is processed by the dynamics pipeline. The resulting sequence is decoded to a sequence of images, which are used to compute the dynamics-based reconstruction loss (Figure 2d). During inference, our model encodes a pair of images (input) to a single latent orientation (Figure 2b). This latent orientation is processed by the dynamics pipeline and decoding pipeline, resulting in a predicted image sequence (Figure 2d). Our model incorporates the Hamiltonian formulation of the dynamics as an inductive bias to facilitate learning the moment-of-inertia matrix, $J_\varphi$, and an auto-encoding map between images and the special orthogonal group SO(3). SO(3) represents the space of all 3D rotations: the orientation of the rigid body at time $t$ is described by the rotation matrix $R(t) \in SO(3)$ that maps points on the body from body-frame coordinates to inertial-frame coordinates at time $t$.
The efficacy of our approach is demonstrated through long-term image prediction on synthetic datasets. Due to the scarcity of appropriate datasets, we have created publicly available synthetic datasets of rotating objects (e.g., cubes, prisms, and satellites) suitable for evaluation of our model, as well as for other tasks on 3D rigid-body rotation such as pose estimation.

Related Work
A growing body of work incorporates Hamiltonian and Lagrangian formalisms to improve the accuracy and interpretability of learned representations in neural network-based dynamical systems forecasting [10-12]. Greydanus et al. [10] predict symplectic gradients of a Hamiltonian system using a Hamiltonian parameterized by a neural network. They show that the Hamiltonian neural network (HNN) predicts the evolution of conservative systems better than a baseline black-box model. Chen et al. [11] improve the long-term prediction performance of [10] by minimizing the mean-squared error (MSE) between ground-truth and predicted state trajectories rather than one-step symplectic gradients. Cranmer et al. [12] propose parameterization of the system Lagrangian by a neural network, arguing that momentum coordinates may be difficult to compute in some settings. Each of the aforementioned models learns from sequences of phase-space measurements; our model learns from images.
The authors of [4-6] leverage Hamiltonian and Lagrangian neural networks to learn the dynamics of 2D rigid bodies (e.g., the planar pendulum) from image sequences. Zhong and Leonard [4] introduce a coordinate-aware variational autoencoder (VAE) [13] with a latent Lagrangian neural network (LNN), which learns the underlying dynamics and facilitates control. Allen-Blanchette et al. [6] use a latent LNN in an auto-encoding neural network to learn dynamics without control or prior knowledge of the configuration-space structure. Toth et al. [5] use a latent HNN in a VAE to learn dynamics without control, prior knowledge of the configuration-space structure, or its dimension. Similarly to Toth et al. [5], we use a latent HNN to learn dynamics. Distinctly, however, we consider 3D rigid-body dynamics and incorporate prior knowledge of the configuration-space structure to ensure interpretability of the learned representations.
Others have considered the problem of learning 3D rigid-body dynamics [7-9]. Byravan and Fox [7] use point-cloud data and action vectors (forces) as inputs to a black-box neural network to predict the resulting SE(3) transformation matrix, which represents the motion of objects within the input scene. The special Euclidean group $SE(3) = \{(R, r) \mid R \in SO(3),\ r \in \mathbb{R}^3\}$ represents the space of all 3D rotations and translations: the orientation and position of the rigid body at time $t$ is described by the rotation matrix and vector pair $(R(t), r(t)) \in SE(3)$ that maps points on the body from body-frame coordinates to inertial-frame coordinates at time $t$. Peretroukhin et al. [8] create a novel symmetric matrix representation of SO(3) and incorporate it into a neural network to perform orientation prediction on synthetic point-cloud data and images. Duong and Atanasov [9] use low-dimensional measurement data (i.e., the rotation matrix and angular momenta) to learn rigid-body dynamics on SO(3) and SE(3) for control.
The combination of deep learning with physics-based priors allows models to learn dynamics from high-dimensional data such as images [4-6]. However, as far as we know, our method is the first to use the Hamiltonian formalism to learn 3D rigid-body rotational dynamics from images.

Background
The $S^2 \times S^2$ parameterization of the 3D rotation group SO(3) is a surjective and differentiable mapping with a continuous right inverse [14]. Define the n-sphere $S^n := \{x \in \mathbb{R}^{n+1} : \|x\| = 1\}$. The parameterization maps a pair of unit vectors $(u, v) \in S^2 \times S^2$ to the rotation matrix with columns $(w_1, w_2, w_3)$, where

$$w_1 = u, \qquad w_2 = v - (w_1 \cdot v)\, w_1, \qquad w_3 = w_1 \times w_2,$$

and the $w_i$ are renormalized to have unit norm.
Intuitively, this mapping constructs an orthonormal frame from the unit vectors $u$ and $v$ by Gram-Schmidt orthogonalization. The right inverse of the parameterization is given by $(w_1, w_2, w_3) \mapsto (w_1, w_2)$. Other parameterizations of SO(3), such as the exponential map ($\mathfrak{so}(3) \to SO(3)$) and the quaternion map ($S^3 \to SO(3)$), do not have continuous inverses and are therefore more difficult to use in deep manifold regression [14-17].
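To make the construction concrete, here is a minimal NumPy sketch of the parameterization and its right inverse; the function names are ours, not from the paper.

```python
import numpy as np

def s2s2_to_so3(u: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Map (u, v) in S^2 x S^2 to a rotation matrix by Gram-Schmidt."""
    w1 = u / np.linalg.norm(u)
    w2 = v - np.dot(w1, v) * w1            # remove the component of v along w1
    w2 = w2 / np.linalg.norm(w2)
    w3 = np.cross(w1, w2)                  # completes a right-handed orthonormal frame
    return np.stack([w1, w2, w3], axis=1)  # columns (w1, w2, w3); det = +1

def so3_to_s2s2(R: np.ndarray):
    """Continuous right inverse: (w1, w2, w3) -> (w1, w2)."""
    return R[:, 0], R[:, 1]
```

Because the right inverse simply drops $w_3$, composing the two maps recovers the original rotation exactly, which is part of what makes this parameterization well suited to regression.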

3D Rotating Rigid-Body Kinematics
The orientation of a rotating 3D rigid body $R(t) \in SO(3)$ changing over time $t$ can be computed from the body angular velocity $\Omega(t) \in \mathbb{R}^3$, i.e., the angular velocity of the body expressed with respect to the body frame $B$, at time $t \geq 0$ using the kinematic equations given by the time rate of change of $R(t)$ shown in Equation (A3). For computational purposes, 3D rigid-body rotational kinematics are commonly expressed in terms of the quaternion representation $q(t) \in S^3$ of the rigid-body orientation $R(t)$. The kinematics (A3), written in terms of quaternions [18] (scalar-last convention), are

$$\dot{q} = \frac{1}{2} \begin{bmatrix} -\Omega_\times & \Omega \\ -\Omega^\top & 0 \end{bmatrix} q, \tag{1}$$

where $\Omega_\times$ is the $3 \times 3$ skew-symmetric matrix defined by $\Omega_\times y = \Omega \times y$ for $y \in \mathbb{R}^3$.
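The following sketch shows how Equation (1) can be evaluated and integrated numerically; it assumes a scalar-last quaternion layout and holds $\Omega$ fixed over each step for illustration, which may differ from our implementation.

```python
import numpy as np

def quat_dot(q: np.ndarray, omega: np.ndarray) -> np.ndarray:
    """Right-hand side of Equation (1) for a scalar-last quaternion q."""
    wx, wy, wz = omega
    Omega_bar = np.array([
        [0.0,  wz, -wy,  wx],
        [-wz, 0.0,  wx,  wy],
        [ wy, -wx, 0.0,  wz],
        [-wx, -wy, -wz, 0.0],
    ])
    return 0.5 * Omega_bar @ q

def quat_step(q: np.ndarray, omega: np.ndarray, dt: float) -> np.ndarray:
    """One classical Runge-Kutta step followed by the renormalization
    that keeps q on S^3."""
    k1 = quat_dot(q, omega)
    k2 = quat_dot(q + 0.5 * dt * k1, omega)
    k3 = quat_dot(q + 0.5 * dt * k2, omega)
    k4 = quat_dot(q + dt * k3, omega)
    q = q + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
    return q / np.linalg.norm(q)
```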

3D Rigid-Body Dynamics in Hamiltonian Form
The canonical Hamiltonian formulation derives the equations of motion for a mechanical system using only the symplectic form and a Hamiltonian function, which maps the state of the system to its total (kinetic plus potential) energy [19]. This formulation has been used by several authors to learn unknown dynamics: the Hamiltonian structure (canonical symplectic form) is used as a physics prior, and the unknown dynamics are uncovered by learning the Hamiltonian [5,10,20-22]. Consider a system with configuration space $\mathbb{R}^n$ and a choice of $n$ generalized coordinates that represent configuration. Let $z(t) \in \mathbb{R}^{2n}$ represent the vector of $n$ generalized coordinates and their $n$ conjugate momenta at time $t$. Define the Hamiltonian function $H : \mathbb{R}^{2n} \to \mathbb{R}$ such that $H(z)$ is the sum of the kinetic plus potential energy. Then, the equations of motion [19,23] derive as

$$\dot{z} = \Lambda_{\mathrm{can}} \nabla_z H(z), \qquad \Lambda_{\mathrm{can}} = \begin{bmatrix} 0_n & I_n \\ -I_n & 0_n \end{bmatrix}, \tag{2}$$

where $0_n \in \mathbb{R}^{n \times n}$ is the matrix of all zeros, $I_n \in \mathbb{R}^{n \times n}$ is the identity matrix, and $\Lambda_{\mathrm{can}}$ is the matrix representation of the canonical symplectic form.
The Hamiltonian equations of motion for a freely rotating 3D rigid body evolve on the six-dimensional space $T^*SO(3)$, the co-tangent bundle of SO(3). However, because of rotational symmetry in the dynamics, i.e., the invariance of the dynamics of a freely rotating rigid body to the choice of inertial frame, the Hamiltonian formulation of the dynamics can be reduced using the Lie-Poisson Reduction Theorem [24] to the space $\mathbb{R}^3 \cong \mathfrak{so}^*(3)$, the Lie co-algebra of SO(3). These reduced Hamiltonian dynamics are equivalent to (A2), where the body angular momentum is $\Pi(t) = J\Omega(t) \in \mathfrak{so}^*(3)$ for $t \geq 0$. The invariance can be seen by observing that the rotation matrix $R(t)$, which describes the orientation of the body at time $t$, does not appear in (A2). $R(t)$ is calculated from the solution of (A2) using (A3).
The reduced Hamiltonian $h : \mathfrak{so}^*(3) \to \mathbb{R}$ for the freely rotating 3D rigid body (kinetic energy) is

$$h(\Pi) = \frac{1}{2}\, \Pi \cdot J^{-1} \Pi. \tag{3}$$

The reduced Hamiltonian formulation [24] is

$$\dot{\Pi} = \Lambda_{\mathfrak{so}^*(3)}(\Pi)\, \nabla_\Pi h(\Pi), \qquad \Lambda_{\mathfrak{so}^*(3)}(\Pi) = \Pi_\times, \tag{4}$$

which can be seen to be equivalent to (A2), since $\nabla_\Pi h(\Pi) = J^{-1}\Pi = \Omega$ and thus $\dot{\Pi} = \Pi \times \Omega$. Equation (4), called the Lie-Poisson equation, generalizes the canonical Hamiltonian formulation. The generalization allows for different symplectic forms, i.e., $\Lambda_{\mathfrak{so}^*(3)}$ instead of $\Lambda_{\mathrm{can}}$ in this case, each of which depends only on the latent space and its symmetry. Our physics prior is the generalized symplectic form, and learning the unknown dynamics means learning the reduced Hamiltonian. This generalizes the existing literature, where dynamics of canonical Hamiltonian systems are learned with the canonical symplectic form as the physics prior [5,10-12]. Using the generalized Hamiltonian formulation allows extension of the approach to a much larger class of systems than those described by Hamilton's canonical equations, including rotating and translating 3D rigid bodies, rigid bodies in a gravitational field, multi-body systems, and more.
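For the free rigid body, Equation (4) reduces to Euler's equations, $\dot{\Pi} = \Pi \times J^{-1}\Pi$. The following is a minimal sketch with an illustrative (not the paper's) inertia matrix; initializing the momentum near the intermediate principal axis reproduces the tumbling behavior of Figure 1a.

```python
import numpy as np

def lie_poisson_rhs(Pi: np.ndarray, J_inv: np.ndarray) -> np.ndarray:
    """Equation (4) for the free rigid body: Pi_dot = Pi x (J^{-1} Pi)."""
    return np.cross(Pi, J_inv @ Pi)

J_inv = np.linalg.inv(np.diag([3.0, 2.0, 1.0]))  # example diagonal inertia
Pi = np.array([0.5, 50.0, 0.5])                  # near the unstable intermediate axis
dt = 1e-3
for _ in range(100):                             # classical RK4 momentum rollout
    k1 = lie_poisson_rhs(Pi, J_inv)
    k2 = lie_poisson_rhs(Pi + 0.5 * dt * k1, J_inv)
    k3 = lie_poisson_rhs(Pi + 0.5 * dt * k2, J_inv)
    k4 = lie_poisson_rhs(Pi + dt * k3, J_inv)
    Pi = Pi + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
```

Along the exact flow, the kinetic energy $h(\Pi)$ and the momentum magnitude $\|\Pi\|$ are conserved, which provides a useful sanity check on the integration.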

Materials and Methods
In this section, we outline our approach for learning and predicting rigid-body dynamics from image sequences. The multi-stage prediction pipeline maps individual images to an SO(3) latent space, where angular velocities are computed from latent pairs. Future latent states are computed using the generalized Hamiltonian equations of motion (4) and a learned representation of the reduced Hamiltonian (3). Finally, the predicted latent representations are mapped to images, giving a predicted image sequence.

Notation
$N$ denotes the number of image sequences in the dataset, and $T + 1$ is the length of each image sequence. Image sequences are written $x^k = \{x_0^k, \ldots, x_T^k\}$, sequences of latent rotation matrices are written $R^k = \{R_0^k, \ldots, R_T^k\}$ with $R_i^k \in SO(3)$, and quaternion latent sequences are written $q^k = \{q_0^k, \ldots, q_T^k\}$ with $q_i^k \in S^3$. Each element $y_i^k$ represents the quantity $y$ at time step $t = i$ for sequence $k$ from the dataset, where $k \in \{1, \ldots, N\}$. Quantities generated with the learned dynamics are denoted with a hat (e.g., $\hat{q}$).

Embedding Images to an SO(3) Latent Space
In the first stage of our prediction pipeline, we embed image observations of a freely rotating rigid body to a low-dimensional latent representation to facilitate computation of the dynamics. The latent representation is constrained to have the same SO(3) structure as the configuration space of the rigid body, making learned representations interpretable and compatible with the equations of motion. Our embedding network $\Phi$ is given by the composition of functions $\Phi := f \circ \pi \circ E_\phi : I \to SO(3)$. The convolutional encoding neural network $E_\phi : I \to \mathbb{R}^6$ parameterized by $\phi$ maps image observations from image space $I$ to a vector $z \in \mathbb{R}^6$. The projection operator $\pi : \mathbb{R}^6 \to S^2 \times S^2$ decomposes the vector $z$ into the vectors $u, v \in \mathbb{R}^3$ and normalizes them, i.e., $\pi(z) = (u/\|u\|, v/\|v\|)$. Finally, the function $f : S^2 \times S^2 \to SO(3)$ maps the normalized vectors $u$ and $v$ to the configuration space using the surjective and differentiable $S^2 \times S^2$ parameterization of SO(3) (see Section 3.1).
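A PyTorch sketch of the embedding $\Phi = f \circ \pi \circ E_\phi$ is given below; the convolutional architecture shown is illustrative, not our exact design.

```python
import torch
import torch.nn as nn

class SO3Encoder(nn.Module):
    """Sketch of Phi = f . pi . E_phi for 28x28 RGB images; layer sizes
    are illustrative, not the paper's exact architecture."""
    def __init__(self):
        super().__init__()
        self.e_phi = nn.Sequential(                # E_phi : image -> z in R^6
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 7 * 7, 6),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.e_phi(x)                          # (B, 6)
        u, v = z[:, :3], z[:, 3:]                  # pi : split and normalize
        u = u / u.norm(dim=1, keepdim=True)
        v = v / v.norm(dim=1, keepdim=True)
        w1 = u                                     # f : batched Gram-Schmidt
        w2 = v - (w1 * v).sum(dim=1, keepdim=True) * w1
        w2 = w2 / w2.norm(dim=1, keepdim=True)
        w3 = torch.cross(w1, w2, dim=1)
        return torch.stack([w1, w2, w3], dim=2)    # (B, 3, 3), elements of SO(3)
```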

Computing Dynamics in the Latent Space
In the second stage of our prediction pipeline, we compute the dynamics of the freely rotating rigid body using a Hamiltonian with a learned moment-of-inertia tensor, $J_\varphi$. The moment-of-inertia tensor $J_\varphi$ is parameterized by the vectors $\varphi_1, \varphi_2 \in \mathbb{R}^3$, representing the diagonal and off-diagonal components of the matrix, and is computed using the Cholesky decomposition [25].
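One way to realize this parameterization is sketched below: placing $\varphi_1$ on the diagonal of a lower-triangular factor $L$ (through a positivity transform, here a softplus, which is our assumption) and $\varphi_2$ below the diagonal guarantees that $J_\varphi = LL^\top$ is symmetric positive definite.

```python
import torch
import torch.nn.functional as F

def inertia_from_cholesky(phi1: torch.Tensor, phi2: torch.Tensor) -> torch.Tensor:
    """Build J = L L^T from phi1 (diagonal) and phi2 (off-diagonal), both in R^3.
    The softplus keeps the diagonal of L positive so that J is invertible."""
    L = torch.zeros(3, 3)
    L[[0, 1, 2], [0, 1, 2]] = F.softplus(phi1)  # positive diagonal entries
    L[[1, 2, 2], [0, 0, 1]] = phi2              # strictly lower-triangular entries
    return L @ L.T
```

Since $J_\varphi$ is positive definite by construction, $J_\varphi^{-1}$ always exists and the learned kinetic energy $\frac{1}{2}\Pi \cdot J_\varphi^{-1}\Pi$ is positive, as required of a physical rigid body.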
To compute the dynamics, we first construct an initial condition $(R_0^k, \Pi_0^k)$. We perform this in two steps. First, we compute the angular velocity $\Omega_0^k$ by Algorithm 1. Then, we compute the angular momentum as the matrix product of the learned moment-of-inertia matrix and the angular velocity, i.e., $\Pi_0^k = J_\varphi \Omega_0^k$. Future latent angular momenta $\{\hat{\Pi}_i^k\}_{i=1}^T$ are computed using the Lie-Poisson Equation (4) and the reduced Hamiltonian formed using the learned moment-of-inertia $J_\varphi$. We integrate the Lie-Poisson equations forward in time using a Runge-Kutta (RK45) numerical solver [26].
Algorithm 1: An algorithm to calculate the body angular velocity given two sequential orientation matrices and the time step between them. Data: $R_t$, $R_{t+1}$, $\Delta t$. Result: $\Omega_t$.

Future latent orientations $\{\hat{R}_i^k\}_{i=1}^T$ are computed in two steps. First, we compute the sequence of quaternions $\{\hat{q}_i^k\}_{i=1}^T$ by Equation (1), using the quaternion representation $q_0^k$ of the initial rotation $R_0^k$ and the initial angular velocity $\Omega_0^k$. We integrate Equation (1) forward in time using an RK45 solver with a normalization step [18] that ensures elements of the resulting sequence are valid quaternions. Then, we transform the sequence of quaternions $\{\hat{q}_i^k\}_{i=1}^T$ to a sequence of rotations $\{\hat{R}_i^k\}_{i=1}^T$ using a modified Shepperd's algorithm [27].
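A sketch of the computation in Algorithm 1 follows, assuming the body-frame kinematics $R_{t+1} \approx R_t \exp(\Omega_\times \Delta t)$; our implementation may use a closed-form rotation logarithm instead of the general-purpose one shown here.

```python
import numpy as np
from scipy.linalg import logm

def body_angular_velocity(R_t: np.ndarray, R_t1: np.ndarray, dt: float) -> np.ndarray:
    """Recover the body angular velocity from two sequential orientations."""
    Omega_x = np.real(logm(R_t.T @ R_t1)) / dt   # skew-symmetric matrix logarithm
    # "vee" map: read the vector Omega off its skew-symmetric matrix Omega_x
    return np.array([Omega_x[2, 1], Omega_x[0, 2], Omega_x[1, 0]])
```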

Decoding SO(3) Latent States to Images
In the final stage of our prediction pipeline, we decode the sequence of SO(3) latent states produced by the dynamics pipeline to a sequence of images (see Figure 2d). Our decoding network $\Psi$ is given as the composition of the right inverse of the parameterization, $(w_1, w_2, w_3) \mapsto (w_1, w_2)$ (see Section 3.1), and the convolutional decoding network $D_\psi : \mathbb{R}^6 \to I$, parameterized by $\psi$, which maps a vector $z = (u, v)$, where $(u, v) \in S^2 \times S^2$, to the image space $I$.

Training Methodology
In this section, we describe the loss functions used to optimize our model: the auto-encoding reconstruction loss ($\mathcal{L}_{\mathrm{ae}}$), the dynamics-based reconstruction loss ($\mathcal{L}_{\mathrm{dyn}}$), the latent orientation loss ($\mathcal{L}_{\mathrm{latent},R}$), and the latent momentum loss ($\mathcal{L}_{\mathrm{latent},\Pi}$). $\mathcal{L}_{\mathrm{ae}}$ ensures the embedding to SO(3) is sufficiently expressive to represent the entire image dataset, and $\mathcal{L}_{\mathrm{dyn}}$ ensures correspondence between the input image sequences and the image sequences produced by the learned dynamics. The latent loss functions, $\mathcal{L}_{\mathrm{latent},R}$ and $\mathcal{L}_{\mathrm{latent},\Pi}$, ensure consistency between the latent states produced by the encoding pipeline and those produced by the dynamics pipeline.
For notational convenience, we denote the encoding pipeline $E : I \to S^3$ and the decoding pipeline $D : S^3 \to I$. Quantities computed in the encoding pipeline use the subscript ae (e.g., $R_{\mathrm{ae},i}^k$), while those computed in the dynamics pipeline use the subscript dyn (e.g., $R_{\mathrm{dyn},i}^k$).

Reconstruction Losses
The auto-encoding reconstruction loss is the mean squared error (MSE) between the ground-truth image sequence and the reconstructed image sequence without dynamics:

$$\mathcal{L}_{\mathrm{ae}} = \frac{1}{N(T+1)} \sum_{k=1}^{N} \sum_{i=0}^{T} \left\| x_i^k - \hat{x}_{\mathrm{ae},i}^k \right\|_2^2.$$

The dynamics-based reconstruction loss is the MSE between the ground-truth image sequence and the image sequence produced by the dynamics pipeline:

$$\mathcal{L}_{\mathrm{dyn}} = \frac{1}{N(T+1)} \sum_{k=1}^{N} \sum_{i=0}^{T} \left\| x_i^k - \hat{x}_{\mathrm{dyn},i}^k \right\|_2^2.$$
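In code, both losses are plain pixel-wise MSEs; a minimal sketch, with hypothetical tensor shapes:

```python
import torch

def reconstruction_losses(x, x_ae, x_dyn):
    """x: ground-truth images; x_ae: auto-encoded reconstructions; x_dyn:
    dynamics-pipeline reconstructions. All of shape (N, T + 1, C, H, W)."""
    loss_ae = torch.mean((x - x_ae) ** 2)
    loss_dyn = torch.mean((x - x_dyn) ** 2)
    return loss_ae, loss_dyn
```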

Latent Losses
We define $\mathcal{L}_{\mathrm{latent},R}$ as the SO(3) distance [19] between the $3 \times 3$ identity matrix and the right-difference of the orientations produced in the encoding pipeline and the orientations produced in the dynamics pipeline:

$$\mathcal{L}_{\mathrm{latent},R} = \frac{1}{NT} \sum_{k=1}^{N} \sum_{i=1}^{T} d_{SO(3)}\!\left( I_3,\ R_{\mathrm{ae},i}^k \, (\hat{R}_{\mathrm{dyn},i}^k)^\top \right).$$

We define $\mathcal{L}_{\mathrm{latent},\Pi}$ as the MSE between the angular momenta estimated in the encoding pipeline and the angular momenta computed in the dynamics pipeline (see Figure 2):

$$\mathcal{L}_{\mathrm{latent},\Pi} = \frac{1}{NT} \sum_{k=1}^{N} \sum_{i=1}^{T} \left\| \Pi_{\mathrm{ae},i}^k - \hat{\Pi}_{\mathrm{dyn},i}^k \right\|_2^2.$$

The hyperparameters we use to train our model are given in Table A2 in Appendix A.3. We train our model for 500 epochs on a single NVIDIA A100 SXM4 GPU. Training takes approximately 12 hours, and inference takes approximately 300 milliseconds.
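A sketch of the two latent losses defined above is given below; the Frobenius distance to the identity is used as one concrete choice of SO(3) distance, which may differ from the metric of [19].

```python
import torch

def latent_losses(R_ae, R_dyn, Pi_ae, Pi_dyn):
    """R_*: (N, T, 3, 3) orientation sequences; Pi_*: (N, T, 3) momenta."""
    I3 = torch.eye(3, device=R_ae.device)
    right_diff = R_ae @ R_dyn.transpose(-1, -2)   # identity when pipelines agree
    loss_R = torch.mean((I3 - right_diff) ** 2)   # Frobenius distance to identity
    loss_Pi = torch.mean((Pi_ae - Pi_dyn) ** 2)   # MSE between momenta
    return loss_R, loss_Pi
```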

3D Rotating Rigid-Body Datasets
To evaluate our model, we introduce six synthetic datasets of freely rotating objects. Previous efforts in learning dynamics from images [4-6,10] consider only 2D planar systems (e.g., the simple pendulum, Acrobot, and cart-pole); existing datasets of freely rotating rigid bodies in 3D, such as SPEED [28,29], SPEED+ [30], and URSO [31], contain random image-pose pairs rather than the sequential pairs needed for video prediction and dynamics extraction. Our datasets showcase the rich dynamical behaviors of 3D rotational dynamics through images and can be used for 3D dynamics learning tasks. Specifically, we introduce the following datasets (see Table A1 in Appendix A.2 for moment-of-inertia matrices):
• Uniform mass density cube: a multi-colored cube of uniform mass density;
• Uniform mass density prism: a multi-colored rectangular prism with uniform mass density;
• Non-uniform mass density cube: a multi-colored cube with non-uniform mass density;
• Non-uniform mass density prism: a multi-colored prism with non-uniform mass density;
• Uniform density synthetic satellites: renderings of the CALIPSO and CloudSat satellites with uniform mass density.

For each dataset, $N = 1000$ trajectories are created. Each trajectory consists of an initial condition $x_0 = (R_0, \Pi_0)$ that is integrated forward in time using a Python-based Runge-Kutta solver for $T = 100$ time steps with spacing $\Delta t = 10^{-3}$. Initial conditions are chosen such that $(R_0, \Pi_0) \sim \mathrm{Uniform}(SO(3) \times S^2)$, with $\Pi_0$ scaled to have $\|\Pi_0\|_2 = 50$.
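As a sketch of the initial-condition sampling described above (function names are ours), using SciPy's uniform rotation sampler:

```python
import numpy as np
from scipy.spatial.transform import Rotation

def sample_initial_condition(rng: np.random.Generator):
    """Draw (R_0, Pi_0) ~ Uniform(SO(3) x S^2), rescaled so ||Pi_0|| = 50."""
    R0 = Rotation.random(random_state=rng).as_matrix()  # uniform on SO(3)
    Pi0 = rng.standard_normal(3)                        # isotropic Gaussian ...
    Pi0 = 50.0 * Pi0 / np.linalg.norm(Pi0)              # ... normalized: uniform on S^2
    return R0, Pi0
```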
The orientations q from the integrated trajectories are passed to Blender [32] to render images of 28 × 28 pixels (as shown in Figure 1).
The synthetic image datasets are generated using Blender [32] with ideal and fixed lighting conditions. Models trained on these datasets may exhibit sensitivity to variations in lighting conditions and may not generalize to real data.

Results
Figures 3 and 4 show the model's performance on the datasets for both short- and long-term predictions. Figure 3 shows that the model is capable of predicting at least five times longer into the future than the time horizon used at training time. Figure 4 shows that the model is capable of predicting the future with images of more complex geometries and surface properties, i.e., those of the CALIPSO and CloudSat satellites, at least ten times longer than the time horizon used at training time. The model's performance on the datasets is indicative of its capabilities to predict dynamics and map them to image space. The uniform mass density cube and prism datasets are used to demonstrate the baseline capabilities of our approach for image prediction. The non-uniform mass density datasets validate the model's capability to predict a mass distribution that may not be visible from the exterior, e.g., for an asteroid or space debris, or as part of failure diagnostics on a satellite where there may be broken or shifted internal components. The satellite datasets are used to validate the model's capability to handle bodies with less regular and more realistic external geometries.
We compare the performance of our model to three baseline models: (1) the Long Short-Term Memory (LSTM) network, (2) the Neural ODE [33] network, and (3) the Hamiltonian Generative Network (HGN) [5]. Recurrent neural networks like the LSTM baseline provide a discrete dynamics model. A Neural ODE can be combined with a multi-layer perceptron to predict continuous dynamics. HGN is a generative model with a Hamiltonian inductive bias. Architecture and training details for each baseline are given in Appendix A.4. The prediction performances of our model and the baselines are shown in Table 1. Our model has the lowest MSE on the majority of our datasets, with good prediction performance on all of them. Our model outperforms the state-of-the-art HGN model, reducing the expected MSE by nearly half on all datasets. Overall, our model outperforms the baseline models on the majority of the datasets with a more interpretable latent space, continuous dynamics, and fewer model parameters. In Appendix A.5, we present the results of ablation studies and provide discussion. We find that the latent losses improve performance. However, the model may be overconstrained when trained with both the dynamics-based and auto-encoding reconstruction losses.

Summary
In this work, we have presented the first physics-informed deep learning framework for predicting image sequences of 3D rotating rigid bodies by embedding the images as measurements in the configuration space SO(3) and propagating the Hamiltonian dynamics forward in time. We have evaluated our approach on new datasets of freely rotating 3D bodies with different inertial properties and have demonstrated the ability to perform long-term image predictions. We outperform the LSTM, Neural ODE, and Hamiltonian Generative Network (HGN) baselines on our datasets, producing better qualitative predictions and reducing the error observed for the state-of-the-art HGN by a factor of 2.

Conclusions
By enforcing the representation of the latent space to be SO(3), this work provides the advantage of interpretability over black-box physics-informed approaches. The added interpretability of our approach is a step towards placing additional trust in sophisticated deep learning models. This work also provides a natural path to investigating how to incorporate classical model-based control directly on trajectories in the latent space and evaluate its effect.

Future Work
Although our approach so far has been limited to embedding RGB images of rotating rigid bodies with configuration spaces in SO(3), there are natural extensions to a wider variety of problems. For instance, this framework can be extended to embed different high-dimensional sensor measurements, such as point clouds, by modifying the feature extraction layers of the autoencoder. The latent space can be chosen to reflect generic rigid bodies in SE(3) or systems in more complicated spaces, such as an n-jointed robotic arm on a restricted subspace of $\prod_{i=1}^{n} SO(3)$. Another possible extension includes multibody systems, i.e., systems with rigid and flexible body dynamics, which would have applications to systems such as spacecraft with flexible solar panels and aircraft with flexible wings.
Author Contributions: J.J.M. and C.A.-B. are the lead authors of this manuscript. Both assisted in conceptualization, methodology, investigation, writing, and editing of the manuscript. J.J.M. also contributed analyses and led software development associated with the model and assisted in software development for data generation. C.A.-B. also contributed compute resources, assisted in software development, and served in a supervisory role. N.Z. contributed to methodology, investigation, and writing and editing of the manuscript, and led data curation and software development for data generation. E.D. provided resources, served in a supervisory role, and assisted in writing and editing of the manuscript. N.E.L. was the principal investigator, contributed to conceptualization, methodology, investigation, writing, and editing of the manuscript, served in a supervisory role, and acquired funding. All authors have read and agreed to the published version of the manuscript.
The qualitative performance for the Neural ODE-baseline is given in Figure A2, and the quantitative performance is given in Table 1. The total number of parameters in the network is 11,406. The Neural ODE-baseline has poorer performance than our proposed approach on all evaluated datasets. HGN [5] uses a combination of a variational auto-encoding (VAE) neural network and Hamiltonian dynamics to perform video prediction. When testing HGN as a baseline, we use the implementation provided by Balsells Rodas et al. [37]. We train HGN on our datasets using the hyperparameters, loss function, and integrator described in Toth et al. [5]. The qualitative performance for the HGN-baseline is given in Figure A3, and the quantitative performance is given in Table 1.

Figure 1 .
Figure 1. Simulations illustrating how mass distribution and initial angular velocity determine behavior. (a) Tumbling prism: uniform mass distribution ($J = J_1$) and initial angular velocity near an unstable solution. (b) Spinning prism: $J = J_1$ and initial angular velocity near a stable solution. (c) Spinning CALIPSO satellite: $J = J_1$ and same initial angular velocity as (b). (d) Wobbling prism: non-uniform mass distribution ($J = J_3$) and same initial velocity as (b).

Figure 2 .
Figure 2. A schematic of the model's forward pass at training time and inference time. (a) Encoding pipeline at training; (b) encoding pipeline at inference; (c) decoding for auto-encoding reconstruction; and (d) dynamics prediction and decoding for dynamics-based reconstruction.

Figure 3 .
Figure 3. Predicted sequences for uniform and non-uniform mass density cube and prism datasets given by our model. The figure shows predicted images at time steps τ = 0 to 5 and τ = 45 to 50.

Figure 4 .
Figure 4. Predicted sequences for the CALIPSO satellite (top) and CloudSat satellite (bottom) with uniform mass densities given by our model. The figure shows predicted images at every 10th time step from τ = 0 to 90.

Figure A4 .
Figure A4. Evaluation of the image prediction performance of an ablated version of our model trained with only the dynamics-based reconstruction loss ($\mathcal{L}_{\mathrm{dyn}}$) from Section 4.5.1. The ablated model has poorer performance than the proposed approach over all datasets. The prediction performance worsens earlier than the proposed model's, as shown in Figure 3.

Figure A5 .
Figure A5. Evaluation of the image prediction performance of an ablated version of our model trained with only the auto-encoding reconstruction and dynamics-based reconstruction losses ($\mathcal{L}_{\mathrm{ae}}$ and $\mathcal{L}_{\mathrm{dyn}}$) from Section 4.5.1. The ablated model has poorer performance than the proposed approach over all datasets, even failing to predict after ∼30 time steps for the non-uniform cube dataset.

Figure A6 .
Figure A6. Evaluation of the image prediction performance of an ablated version of our model trained with only the dynamics-based reconstruction loss and latent losses ($\mathcal{L}_{\mathrm{dyn}}$, $\mathcal{L}_{\mathrm{latent},R}$, and $\mathcal{L}_{\mathrm{latent},\Pi}$) from Section 4.5.1. The ablated model has better performance than the proposed approach on the majority of the datasets, implying that the proposed model may be overconstrained.

Table A3 .
Average pixel MSE over a 30-step unroll on the train and test data on four datasets for our ablation study.