Unsupervised and Generic Short-Term Anticipation of Human Body Motions

Various neural network based methods are capable of anticipating human body motions from data for a short period of time. What these methods lack are the interpretability and explainability of the network and its results. We propose to use Dynamic Mode Decomposition with delays to represent and anticipate human body motions. Exploring the influence of the number of delays on the reconstruction and prediction of various motion classes, we show that the anticipation errors in our results are comparable to or even better for very short anticipation times (<0.4 s) than a recurrent neural network based method. We perceive our method as a first step towards the interpretability of the results by representing human body motions as linear combinations of previous states and delays. In addition, compared to the neural network based methods large training times are not needed. Actually, our methods do not even regress to any other motions than the one to be anticipated and hence it is of a generic nature.


Introduction
Various kinds of neural network architectures are the main technical basis for the current state of the art for anticipation of human body motions from data [1][2][3][4][5][6][7][8][9].However, as is the case in many other application domains, there is a fundamental lack of interpretability of the neural networks.In these approaches the two main conceptual ingredients of human motion prediction are also intermixed: Whereas for point (a) mechanistic models might be hard to obtain, for (b) models as dynamical systems partially reflecting bio-physical knowledge are possible in principle.In this paper, we will focus on point (b).Instead of suggesting another neural network based anticipation architecture, we will try to separate several possible constituents: • Can recently developed so called equation free modeling techniques [10][11][12][13][14][15] already explain and predict motions in a short time horizon?
For this purpose the view of the motion time series as dynamical systems is used.• What is the role of incorporating delay inputs?Many neural network architectures incorporate delays [16][17][18], recurrent connections [2,19] or temporal convolutions [5,20,21], but the contribution of the delays or the memory cannot be separated from the overall network architecture.
In general, dynamical systems are usually described via a set of differential equations.For many systems a variety of appropriate data in form of observables are available.However, if the process is complex the recovery of arXiv:1912.06688v1[cs.LG] 13 Dec 2019 the underlying differential equation from data is a challenging task [22].Instead, the set of m observables sampled at time steps n is used for the investigation of the considered process.For the identification of temporal structures, the Fourier theory is usually utilized.Therefore, a Fourier analysis on the observables is performed, to extract amplitude and frequencies leading to a decomposition into trigonometric series.This approach has some drawbacks for human motion capture data as these phenomena not exclusively consist of periodic components.Hence, the decomposition will be distorted.An algorithm that take this point into account is Dynamic Mode Decomposition (DMD).
Introduced in the first version by Schmid and Sesterhenn in 2008 [11], DMD is a data-driven and-in its original form-"equation free" algorithm extracting spatio-temporal patterns in the form of so-called DMD modes, DMD eigenvalues, and DMD amplitudes from the given observables.However, whereas Fourier-decomposition is obliged to use a fixed set of frequencies (i.e.eigenvalues), with DMD we detect and can use only the important ones and therefore adapt better to the process considered.
Since its introduction there has been many improvements and modification to this method.One of these improvements is the combination of DMD with delay coordinates.In this paper, we will investigate the use of DMD with delays (DMDd) in the context of short term anticipation of human motion from data.Due to the structure of motion capture data (more time steps than observables) the use of delays optimizes the application of DMD.Therefore, a more precise extraction of spatio-temporal patterns is achieved which leads an more accurate short-term prediction.
Thus, the paper is structured as follow: In Section 2 we describe the DMD algorithm as well as Taken's Theorem and DMDd.In Section 3 we explain our experiments with DMDd and give some examples of our results.In Section 4 we discuss future possibilities given by our method.

Theoretical background
In this section, we clarify the background of Dynamic Mode Decomposition.For the application on motion capture data we assume a vector-valued time series x 1 , x 2 , . . ., x n ∈ R m , where each snapshot consists of marker positions (in 3D) or joint angles of a skeleton to a certain time step.Before we formulate the algorithm in more details, we briefly highlight the basic concept of DMD: In a first step, the data were used to determine frequencies, the so-called DMD eigenvalues.These are defined by the non-zero eigenvalues of a solution to the following minimization problem: ( Then, the data were fitted to the previously computed frequencies (this process is similar to a discrete Fourier transformation or a trigonometric interpolation).However, in many application areas the number of observables is considerably larger than the number of snapshots, i.e. m > n.Therefore, this approach leads to a sufficient number of frequencies and it can be proven that the reconstruction is error-free [23].For motion capture data, however, the converse is true, i.e. m < n.Hence, in most cases we do not have enough frequencies for an adequate reconstruction, which even results in a bad anticipation as well.
We approach this issue by manipulating the data in a preprocessing step, i.e. before applying EXDMD.To this end, the theory of delays justified by Takens' Theorem is consulted, which is described in Subsection 2.2.Applying this technique leads to Dynamic Mode Decomposition with delay (DMDd) [10].The exact procedure is explained in Subsection 2.3.

Exact Dynamic Mode Decomposition
EXDMD is the most modern variant of DMD that is applied directly on raw data.It was published in 2014 by Tu et al. [12].However, we have chosen the algorithmic formulation by Krake et al. [23], which differs in the computation of DMD amplitudes.Algorithm 1 shows an adjusted version of the algorithm.Since we mainly focus on anticipation, we are not interested in the reconstruction of the first snapshot and therefore some steps are skipped.
After defining the snapshot matrices X and Y, which are related by one time-shift, a reduced singular value decomposition of X is performed in line 2.These components are used to determine the low-dimensional matrix S that owns the dynamic relevant information in form of (DMD) eigenvalues λ j .Therefore, only the non-zero eigenvalues are used to compute the so-called DMD modes ϑ j in line 7. Finally, the DMD amplitudes are calculated via a = Λ −1 Θ + x 2 , where the second initial snapshot x 2 is used.
Given the DMD modes, DMD eigenvalues and DMD amplitudes we can both reconstruct the original snapshot matrix and make predictions for future states.But as mentioned before a good reconstruction might not be possible depending on the matrix dimensions.However if all conditions are met we can an exact reconstruction.

Delay vectors and Takens' Theorem
Most real world dynamical systems are only partially observable, i.e. we can observe only a low-dimensional projection of a dynamical system acting on a high dimensional state space.This means that from a certain observed snapshot of a dynamical system it is even in principle not possible to reconstruct the full current state of the dynamical system.Fortunately, the information contained in observations made at several different time steps can be combined to reconstruct, at least in principle, the complete current state, and (under certain technical assumptions) the dynamics on these delay vectors is diffeomorphic to the true dynamics on the hidden state space.This delay embedding theorem is also known as Takens' theorem, first proved by Floris Takens in 1981 [24].This result has led to a branch of dynamical systems theory now referred to as "embedology" [25].
Here we give a brief sketch of the delay embedding theorem for discrete-time dynamical systems.Let the state space of the dynamical system be a k-dimensional manifold M. The dynamics is defined by a smooth map and the observations are generated by a twice-differentiable map y : M → R (the observation function), projecting the full state of the dynamical system to a scalar observable.From a time series of observed values we can build m-dimensional delay vectors: The delay vectors are elements of R m and by mapping a delay vector to its successor we get a mapping The delay embedding theorem now implies that the evolution of points y m (n) in the reconstruction space R m driven by ρ follows (i.e., is diffeomorphic to) the unknown dynamics in the original state space M driven by φ when m = 2k + 1.Here 2k + 1 is a maximal value, faithful reconstruction could already occur for delay vectors of lower dimension.Thus long enough delay vectors represent the full hidden state of the observed dynamical system, meaning that the prediction of the next observed value based on a long enough history of past observations becomes possible.
For our purposes, we take the delay embedding theorem as an indication that adding delay dimensions to the observed state vector can improve the anticipation quality of a DMD model.

Dynamic Mode Decomposition with delays (DMDd)
Our motion capture data has the following form: Each state x i at time step 1 i n, is a vector of length m.To augment this matrix with d delays we use a window of size m × (n − d), with 1 < n − d < n, to move along the motion data.This window starts at the first frame of X and makes a copy of the first n − d frames of the data, before taking a step of one frame along X.We continue with this process until the window reaches the end of the motion data.The copied data are then stacked one above the other resulting in a matrix X with n − d columns and (d + 1)m rows: Depending on how we choose d, the problem where our data has more columns than rows is no longer given.Applying the DMD algorithm to X provides us with a good representation of the data and a good short-term future prediction is also possible, as will be detailed in Section 3.

Results
We tested DMDd on the Human3.6Mdataset [26], which consists of different kinds of actions like walking, sitting and eating.These actions are performed by different actors.For our experiments we choose the motion sequences performed by actor number 5 (to have comparable results to the literature, as this actor was used for testing in the neural network based approaches, whereas the motions of the other actors were used for training).The data we use is sampled at 50 Hz and contains absolute coordinates for each joint.For each experiment we divide each action sequence into several sub-sequences of 100 or 150 frames length.The first 50 (1 sec) or 100 frames (2 sec) are taken as input for our methods and we compute a prediction for the next 5 frames (0.1 sec), 10 frames (0.2 sec) and 20 frames (0.4 sec).To measure the distance between the ground truth GT and our prediction P we use two different distance measures.The first measure we use is the mean squared error (MSE): K is the number of motion sequences taken form the same action class and hence the number of predictions made for this action.Both GT and P consist of m observables and p frames.The second distance measure we use is the Kullback-Leibler divergence as it was used in [8].

Comparison with neural network based methods
Before exploring the dependency of out methods with respect to various parameters we compare the setting of having the information of 1 sec of motions as inputs (50 frames) using DMD with 80 delays with a RNN baseline as the one used in [27].We use the mean squared error (MSE) as well as the Kullback-Leibler divergence as error measures for anticipation times of 0.1 sec, 0.2 sec, and 0.4 sec.
The results given in Tables 1 and 2 indicate that our method shows better results for 0.1 sec, 0.2 sec and for most motion classes even for 0.4 s, although they are not only unsupervised but even no knowledge about any other motion is taken into account!Interestingly, the error of the RNN slightly decreases with the anticipation times.This counter-intuitive behavior of the RNN approach might be explained by the fact that the anticipations yielded by the RNN baseline in general shows small jumps at the beginning of the anticipation period [28].

Reconstruction and anticipation of motions using DMD and DMD with delays
Adding time delays already improves the reconstructibility of motions.In Table 3 we show the average reconstruction errors of motion clips of 2 sec length (100 frames) for the different motion classes.Already adding 10 time delays yields a dramatic improvement.After adding 60 delays the reconstruction error drops to less than 10 −5 for all motion classes.
The results of the anticipation errors for 0.4 sec (20 frames) of anticipation using 2 sec (100 frames) as context length is given in Table 4.The anticipation errors for DMD without delays is large (>10 10 for all motion classes and is not reproduced in the Table .In contrast to the reconstruction case, in which the error monotonically decreases with adding additional delays, the anticipation errors have minima at a certain number of delays (ranging between 40 and 90 for the different motion classes.

Using different input lengths of motions to be anticipated
We compare the previously used setting of having the information of 1 sec of motions as inputs (50 frames) to the one with 2 sec of motions as inputs (100 frames), and 4 sec of motions as inputs (200 frames).In Fig. 1 we show the MSE for the anticipation of a trained RNN with 1 sec of motions as inputs, the DMDd with 1 sec of motions as

Reconstruction and anticipation of inertial measurements of root and end-effectors
For assessing short term anticipation on the basis of sparse accelerometer data attached to the end effectors and the hip we used also the marker position data of the Human 3.6M database to have a large collection of motions and "ground truth data".As has already been shown in [29] the use the second time derivatives of marker position data yields reliable estimates for tests using data of accelerometers.
In Figure 2 the results of the anticipation error of just the marker of the right hand is given.The anticipation error of performing the DMDd80 on the time series of just this one marker is given as M1F1.The error of this one  3.Comparison of reconstructions errors using Dynamic Mode Decomposition (DMD) and Dynamic Mode Decomposition with delays for various numbers of delays (10, 20, 30, 40, 50, and 60).The error measure is the mean squared error on the pose sequences of length 2 sec on the different motion classes of the Human 3.6M dataset for actor #5.marker but using DMDd80 on the end effectors and the root is given as M5F1; the one performing DMDd80 on all 17 markers is given as M17F1.Using second derivatives as simulation of accelerometer sensor data are given similarly as A1F1, A5F1, and A17F1.Whereas the addition of a "spatial context" of other markers than the one measured for anticipation in the DMDd computation has little effect for 0.1 sec of anticipation time, there is a considerable effect for 0.2 sec of anticipation time, and a huge effect for 0.4 sec of anticipation time: For the simulated accelorometers the average MSE of the right hand marker with a value of 5.53E+05 was about four orders of magnitude larger when performing the DMDd only on its time series compared to the one when using the spatial context of the 4 additional ones (at left hand, left and right foot, and root) with a value of 1.40E+01!Considering more than 4 additional markers had little additional effect.

Conclusion and Future Work
In contrast to some special classes of human motions, on which the direct application of DMD to the observables of human motion data can be suitable for a good reconstruction of the data [13,30], this direct applications of DMD to the observables of the motions contained in the Human 3.6M dataset do not yield good reconstructions, nor suitable short-term anticipations.
Inspired by Takens' theorem, which emphasizes the usefulness of delays in reconstructing partially observable, high dimensional dynamics, we have extended DMD with delay vectors of different length and evaluated the impact on short-term anticipation using a large real world human motion data base and comparing the performance to a state of the art RNN model.The results show that delays can drastically improve reconstruction and also anticipation performance, often by several orders of magnitude, and in many cases lead to better anticipation performance than the RNN model (for anticipation times less than 0.4 sec).This is especially remarkable, as our methods do not even regress to any other motions than the one to be anticipated.Moreover, DMD effectively solves a convex optimization problem and thus is much faster to evaluate than training RNNs.Additionally, solutions of convex optimization problems are globally optimal, a guarantee which is absent for trained RNNs.
As already mentioned in the introduction the presented work was primarily concerned with modelling the influence of the previous motion on the motion anticipation.For modeling the intent of persons other methods are required, and neural network based methods might be the ones of choice.Coming up with a hybrid DMD and neural network based method for mid-term (or even long-term) motion anticipation will be the topic of future research.As a final remark, we mention that linear methods like DMD can foster the interpretability of results by representing the evolution of motion as a linear combination of "factors", where factors can be previous states, delays, or nonlinear features computed from the previous states or delays.This could prove to be especially useful when machine learning driven systems enter more and more critical application areas, involving aspects of security, safety, privacy, ethics, and politics.To address these concerns, and for many application areas involving anticipation of human motions these concerns play a central role, transparency, explainability, and interpretability become more and more important criteria for the certification of machine learning driven systems.For a comprehensive review of the current literature addressing these rising concerns about safety and trustworthiness in machine learning see [31].
(a) Modelling the intent of the persons.(b) Modelling the influence of the previous motion.
Comparison of anticipation error using a RNN and Dynamic Mode Decomposition with 80 delays for various anticipation times.The error measure is the Kullback-Leibler divergence for anticipation times of 0.1 sec (5 frames), 0.2 sec (10 frames), and 0.4 sec (20 frames) on the different motion classes of the Human 3.6M dataset for actor #5.inputs, DMDd with 2 sec of motions as inputs, and DMDd with 4 sec of motions as input (for anticipation times of 0.1 sec, 0.2 sec, and 0.40 sec).

Figure 1 .
Figure 1.Comparison of anticipation errors for anticipations of 0.1 sec (5 frames), 0.2 sec (10 frames), and 0.4 sec (20 frames).The result of a trained RNN using inputs of 1 sec, and DMDd80 on inputs of 1 sec, 2 sec, and 4 sec.The error measure is the average of the mean squared error on the pose sequences on the different motion classes of the Human 3.6M dataset for actor #5.

Figure 2 .
Figure 2. Left: Single marker anticipation without and with spatial context.The anticipation errors of 0.1 sec (5 frames), 0.2 sec (10 frames), and 0.4 sec (20 frames) are given for the anticipation of marker of the right hand joint using no spatial context (M1F1), all end effector markers and the root position as spatial context (M5F1), and all 17 markers as spatial context (M17F1).The error measure is the average of the mean squared error on the marker sequences of the right hand joint on the different motion classes of the Human 3.6M dataset for actor #5.Right: Same as left, but for second derivatives of marker positions simulating accelerometer sensors.Notice that the error is given in logarithmic scale.

Table 4 .
Comparison of anticipation errors for 0.4 sec (20 frames) using Dynamic Mode Decomposition with delays for various numbers of delays (10, 20, 40, 50, 60, 70, 80 and 90).The error measure is the mean squared error on the pose sequences of length 2 sec on the different motion classes of the Human 3.6M dataset for actor #5.