
Gap Reconstruction in Optical Motion Capture Sequences Using Neural Networks

Department of Graphics, Computer Vision and Digital Systems, Faculty of Automatic Control, Electronics and Computer Science, Silesian University of Technology, Akademicka 16, 44-100 Gliwice, Poland
Polish-Japanese Academy of Information Technology, Koszykowa 86, 02-008 Warsaw, Poland
Author to whom correspondence should be addressed.
Sensors 2021, 21(18), 6115;
Received: 30 July 2021 / Revised: 7 September 2021 / Accepted: 8 September 2021 / Published: 12 September 2021
(This article belongs to the Special Issue Intelligent Sensors for Human Motion Analysis)


Optical motion capture is a mature contemporary technique for the acquisition of motion data; however, it is not error-free. Due to technical limitations and occlusions of markers, gaps might occur in such recordings. The article reviews various neural network architectures applied to the gap-filling problem in motion capture sequences within the FBM framework, which provides a representation of the body kinematic structure. The results are compared with interpolation and matrix completion methods. We found that, for longer sequences, simple linear feedforward neural networks can outperform the other, more sophisticated architectures, but these outcomes might be affected by the small amount of data available for training. We were also able to identify that the acceleration and monotonicity of the input sequence are the parameters that have a notable impact on the obtained results.

1. Introduction

In recent years, motion capture (mocap) [1,2] has become a mature technology that plays an important role in many application areas. Its main application is in computer graphics, where it is used in gaming and movie FX for the generation of realistic-looking character animation. Other prominent application areas are biomechanics [3], sports [4], medical sciences (involving biomechanical [5] and other branches, e.g., neurology [6]), and rehabilitation [7].
Optical motion capture (OMC) relies on the visual tracking and triangulation of active or retro-reflective passive markers. Assuming a rigid body model, successive positions of markers (trajectories) are used in further stages of processing to drive an associated skeleton, which is used as a key model for the animation of human-like or animal characters.
OMC is commonly considered the most reliable mocap technology; it is sometimes called the ‘gold standard’, as it outperforms the other mocap technologies. However, the process of acquiring marker locations is not error-free. Noise, which is inherent in any measurement system, has been studied in numerous works [8,9], which suggest that it is not just a simple additive Gaussian process. The noise types present in OMC systems were identified in [10]; these are red, pink, white, blue-violet, and Markov–Gaussian-correlated noises. However, they are not a big issue for mocap operators, since they have rather low amplitudes and can be filtered out quite efficiently. The most troublesome errors come from marker observation issues. They occur due to marker occlusion or the marker leaving the scene, and result in a lack of recorded data: gaps that are typically represented as not-a-number (NaN) values.
The presence of gaps is common in everyday practice and requires painstaking visual trajectory examination and manual trajectory editing by operators. This can be assisted by software support for trajectory reconstruction.
In this work, we propose a marker-wise approach that addresses the trajectory reconstruction problem. We analyze the usability of various neural network architectures applied to regression tasks. The regression/prediction exploits inter-marker correlations between markers placed on the same body parts. Therefore, we employed a functional body mesh structure (FBM) [11] as a framework to model the kinematic structure of the subject. It can be calculated ad hoc for any articulated subject or rigid object, so we do not need a skeleton model.
The article is organized as follows: in Section 2, we disclose the background for the article, i.e., the mocap pipeline with its sources of distortion and former works on the distortions in optical mocap systems; Section 3 describes the proposed method, with its rationale and design considerations, and the experiment plan. In Section 4, we provide the results together with their discussion and interpretation. Section 5 summarizes the article.

2. Background

2.1. Optical Motion Capture Pipeline

Optical motion capture systems track markers, usually passive retro-reflective spheres, in near-infrared (NIR) images. The basic pipeline is shown in Figure 1. The markers are observed by several geometrically calibrated NIR cameras. The visible wavelengths are cut off; hence, the images contain just white dots, which are matched between the views and triangulated, so the outcome of the early stage of mocap is a time series containing the Cartesian coordinates of all markers. An actor and/or object wears a sufficient number of markers to represent body segments; the marker layout usually follows a predefined standard. The body segments are represented by a predefined mesh, which identifies the body segments and is a marker-wise representation of the body structure. Finally, the mocap recording takes the form of a skeleton angle time series, which represents the mocap sequence as orientations (angles) in joints and a single Cartesian coordinate for the body root (usually the pelvis).

2.2. Functional Body Mesh

Functional body mesh (FBM) is the authors’ original contribution that forms a framework for marker-wise mocap data processing and also incorporates the kinematic structure of the represented object. The FBM structure is not given in advance, but it can be inferred from representative motions of the articulated object [11]. For human actors it resembles standard meshes, but it can be applied to virtually any vertebrate. It assumes the body is divided into rigid segments (submeshes), which are organized into a tree structure. The model represents the hierarchy of the subject’s kinematic structure, reflecting bonds between body segments, where every segment is a local rigid body model, usually based on an underlying bone.
The rigid segments maintain the distance between the markers and, additionally, for each child segment, one representative marker is assumed within the parent one, which is also assumed to maintain a constant distance from the child markers. The typical FBM for the human actor is shown in Figure 2b as a tree. The segments and constituent markers are located in nodes, whereas the parent marker is denoted on the parent–child edge.

2.3. Previous Works

Gap filling is a classical problem frequently addressed in research on mocap technologies. It was addressed in numerous works, which proposed various approaches. The existing methods can be divided into three main groups: skeleton-based, marker-wise, and coordinate-based.
A classical skeleton-based method was proposed by Herda et al. [12]; they estimate skeleton motion and regenerate markers on the body envelope. Aristidou and Lasenby [13] proposed another method based on a similar concept, where the skeleton is a source of constraints in the inverse kinematics (IK) estimation of marker location. Also, Perepichka et al. [14] combined the IK of a skeleton model with a deep NN to detect erroneously located markers and to place them on a probable trajectory. All the aforementioned approaches require either a predefined skeleton or inferring the skeleton as the entry step of the algorithm.
The skeleton-free methods consider information from markers only, usually treating the whole sequence as a single multivariable (matrix), thus losing the kinematic structure of the represented actor. They rely on various concepts, starting from simple interpolation methods [15,16,17]. The proposal by Liu and McMillan [18] employed ‘local’ (neighboring markers) low-dimensional least squares models combined with PCA for missing marker reconstruction. A significant group of gap reconstruction proposals is based on low-rank matrix completion methods. They employ various mathematical tools (e.g., matrix factorization with SVD) for the missing data completion, relying on inter-marker correlations. Among others, these methods are described in [19,20]. Another, somewhat related approach is a fusion of several regression and interpolation methods, which was proposed in [21].
Predicting marker (or joint) positions is another concept that is the basis of gap-filling techniques. One such concept is a predictive model by Piazza et al. [22], which decomposes the motion into linear and circular components and finds momentary predictors by curve fitting. More sophisticated dynamical models based on the Kalman filter (KF) are commonly applied. Wu and Boulanger [23] proposed a KF with velocity constraints; however, this achieved moderate success due to drift. A KF with an expectation-maximization algorithm was also used in two related approaches by Li et al.: DynaMMo [24] and BoLeRO [25] (the latter is actually DynaMMo with bone length constraints). Another approach was proposed by Burke and Lasenby [26], who applied dimensionality reduction by PCA and then Kalman smoothing for the reconstruction of missing markers.
Another group of methods is dictionary-based. These algorithms recover the trajectories using a dictionary created from previously recorded sequences. They produce satisfactory outcomes as long as the specific motion is in the database. They are represented by the works of Wang et al. [27], Aristidou et al. [28], and Zhang and van de Panne [29].
Finally, neural networks are another group of methods used in marker trajectory reconstruction. The task can be described as a sequence-to-sequence regression problem, and the applicability of NNs to regression has been recognized since the early 1990s in the work of Hornik [30]; hence, NNs seem to be a natural choice for the task. Surprisingly, however, they became popular quite late. In the work of Fragkiadaki et al. [31], an encoder–recurrent-decoder (ERD) was proposed, employing long short-term memory (LSTM) as a recurrent layer. A similar approach (ERD) was proposed by Harvey et al. [32] for in-between motion generation on the basis of a small number of keyframes. Mall et al. [33] modified the ERD and proposed an encoder–bidirectional-filter (EBF) based on the bidirectional LSTM (BILSTM). In the work of Kucherenko et al. [34], a classical two-layer LSTM and a window-based feed-forward NN (FFNN) were employed. A variant of ResNet was applied by Holden [35] to reconstruct marker positions from noisy data as a trajectory reconstruction task. A set of extensions to the plain LSTM was proposed by Ji et al. [36]; they introduced attention (a weighting mechanism) and LS-derived spatial constraints, which resulted in an improvement in performance. Convolutional auto-encoders were proposed by Kaufmann et al. [37].

3. Materials and Methods

3.1. Proposed Regression Approach

The proposed approach involves employing various neural network architectures for the regression task. These are FFNN and three variants of contemporary recurrent neural networks: gated recurrent unit (GRU), long short-term memory (LSTM), and bidirectional LSTM (BILSTM). In our proposal, these methods predict the trajectories of lost markers on the basis of a local dataset, i.e., the trajectories of neighboring markers.
The way we use the NNs differs from the scenario that is typically employed in machine learning. We do not feed the NNs with a massive amount of training sequences in advance to form a predictive model. Instead, we consider each sequence separately and try to reconstruct the gaps in an individual motion trajectory on the basis of its own data only. This makes sense as long as the marker motion is correlated and most of the sequence is correct and representative enough. The same holds for the other common regression methods, starting with least squares. Therefore, the testing data are the whole ‘lost’ segment (gap), whereas the training data are the remaining part of the trajectory. Depending on the gap sizes and the sequence length used in the experiment, the testing part ranges from 0.6% (for short gaps in long sequences) to 57.1% (for long gaps in short sequences) of the sequence.
The selection of such a non-typical approach requires justification. It is likely that NN models trained in a conventional way, using a massive dataset of mocap sequences, would be able to generalize enough to adjust to different body sizes and motions. However, such models would be tightly coupled with the marker configuration, not to mention other kinds of subjects, such as animals. The other issue is obtaining such a large amount of data. Despite our direct access to the lab resources, this is still quite a cumbersome task, and we believe our own recordings might not be enough, especially as the resources available online from various other labs are hardly usable, since they employ different marker setups.
The forecasting of time series is a typical problem addressed by RNNs [38]. Usually, numerous training and testing sequences allow for a prediction of the future states of the modelled system (e.g., power consumption or remaining useful life of devices). A situation closer to ours, where RNNs are also applied, is forecasting time series for problems lacking massive training data (e.g., COVID-19 [39]). An analysis of LSTM architectures for similar cases is presented in [40]. However, in these works, the forecast of future values is based on the past values. What makes our case a bit different is the fact that we usually have to predict values in the middle of a sequence, so both the past and future values are available.

3.1.1. Feed Forward Neural Network

FFNN is the simplest neural network architecture. In this architecture, the information flows in one direction, as its structure forms a directed acyclic graph. The neurons are modeled in the nodes with activation functions (usually sigmoid) applied to the weighted sum of inputs. These networks are typically organized into layers, where the output from the previous layer becomes an input to the successive one. This network architecture is employed for regression and classification tasks, either alone or as the final stage in a larger structure (such as modern deep NNs). The architecture of the NN that we employed is shown in Figure 3. The basic equation (output) of a single, k-th artificial neuron is given as:
$$y_k(x) = f\Big(\sum_j w_{kj}\, x_j + b\Big),$$
where $x_j$ is the j-th input, $w_{kj}$ is the j-th input weight, $b$ is a bias value, and $f$ is the transfer (activation) function. The transfer function depends on the layer purpose: typically a sigmoid for hidden layers, and threshold, linear, or softmax functions for final layers (depending on whether the problem is regression or classification), or others.
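The neuron equation above can be illustrated with a minimal Python sketch. It is not the authors' Matlab implementation; the helper names (`neuron`, `layer`, `linear`) and all numeric values are made up for illustration, and a separate bias per neuron is assumed.

```python
import math

def neuron(x, w, b, f=math.tanh):
    """Output of one artificial neuron: f(sum_j w[j] * x[j] + b)."""
    return f(sum(w_j * x_j for w_j, x_j in zip(w, x)) + b)

def layer(x, weights, biases, f=math.tanh):
    """Fully connected layer: weights[k] holds the input weights of the k-th neuron."""
    return [neuron(x, w_k, b_k, f) for w_k, b_k in zip(weights, biases)]

def linear(a):
    """Identity transfer function, as used by a linear neuron."""
    return a

# Toy values, chosen only for illustration (not taken from the article).
x = [0.5, -1.0, 2.0]
weights = [[0.1, 0.2, 0.3], [-0.3, 0.0, 0.1]]
biases = [0.0, 0.1]
hidden_lin = layer(x, weights, biases, f=linear)   # linear hidden units
hidden_tanh = layer(x, weights, biases)            # sigmoidal (tanh) hidden units
```

Swapping the transfer function is the only difference between a linear and a sigmoidal hidden layer in this sketch.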

3.1.2. Recurrent Neural Networks

Recurrent neural networks (RNNs) are architectures that employ cycles in the NN structure; this allows the network to consider the current input value while preserving the previous inputs and internal states in memory (and the future ones in a bidirectional architecture). Such an approach allows the NN to deal with timed processes and to recognize process dynamics, not just static values; it applies to tasks such as signal prediction or the recognition of sequences. Regarding applicability, aside from the classic problem dichotomy (classification and regression), RNNs require one more task distinction. One must decide whether the task is a sequence-to-one or a sequence-to-sequence problem, i.e., whether the network has to return a single result for the whole sequence or a single result for each data tuple in the sequence. The prediction/regression task is a sequence-to-sequence problem, as demonstrated with RNNs in Figure 4 in different variants: folded and unfolded, uni- and bi-directional.
At present, two types of neuron are predominantly applied in RNNs: long short-term memory (LSTM) and gated recurrent unit (GRU), of which the former is also applied in a bidirectional variant (BILSTM). They evolved from a plain RNN called ‘vanilla’, and they prevent vanishing gradient problems when back-propagating errors in the learning process. Their detailed designs are unfolded in Figure 5. These cell types rely on the input information and information from previous time steps, and those previous states are represented in various ways. GRU passes an output (hidden signal h) between the steps, whereas LSTM passes both h and an internal cell state C. These values are interpreted as memory: h as short term, and C as long term. Their activation function is a typical sigmoidal one, modeled with a hyperbolic tangent (tanh), but there are additional elements present in the cell. The contributing components, such as input or previous values, are subject to ‘gating’: their share is controlled by the Hadamard product (element-wise product, denoted as ⊙ or ⊗ in the diagram) with the 0–1 sigmoid function $\sigma(x) = \frac{1}{1 + e^{-x}}$. The individual σ values are obtained from the weighted input and state values.
In more detail, in LSTM, we pass two variables, $h$ and $C$, and have three gates: forget, input, and output. They govern how much of the respective contribution passes to further processing. The forget gate ($f_t$) decides how much of the past cell internal state ($C_{t-1}$) is to be kept; the input gate ($i_t$) controls how much of the new contribution $\tilde{C}_t$ caused by the input ($x_t$) is taken into the current cell state ($C_t$). Finally, the output gate ($o_t$) controls what part of the activation based on the cell internal state ($C_t$) is taken as the cell output ($h_t$). The equations are as follows:
$$f_t = \sigma(W_f \cdot [x_t, h_{t-1}] + b_f),$$
$$i_t = \sigma(W_i \cdot [x_t, h_{t-1}] + b_i),$$
$$\tilde{C}_t = \tanh(W_c \cdot [x_t, h_{t-1}] + b_c),$$
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t,$$
$$o_t = \sigma(W_o \cdot [x_t, h_{t-1}] + b_o),$$
$$h_t = o_t \odot \tanh(C_t).$$
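As an illustration of the LSTM equations, the following NumPy sketch performs a single time step. It is a textbook formulation rather than the Matlab toolbox code used in the experiments; the dictionary layout of `W` and `b`, the random weights, and the hidden size of 8 are assumptions made only for this example (the 30 inputs follow the exemplary case mentioned later in the text).

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step; W and b are dicts keyed by 'f', 'i', 'c', 'o'."""
    z = np.concatenate([x_t, h_prev])            # [x_t, h_{t-1}]
    f_t = sigmoid(W["f"] @ z + b["f"])           # forget gate
    i_t = sigmoid(W["i"] @ z + b["i"])           # input gate
    c_tilde = np.tanh(W["c"] @ z + b["c"])       # new candidate contribution
    c_t = f_t * c_prev + i_t * c_tilde           # updated cell state
    o_t = sigmoid(W["o"] @ z + b["o"])           # output gate
    h_t = o_t * np.tanh(c_t)                     # cell output
    return h_t, c_t

rng = np.random.default_rng(0)
n_in, n_hid = 30, 8   # hidden size is a hypothetical choice, not taken from the article
W = {k: rng.normal(scale=0.1, size=(n_hid, n_in + n_hid)) for k in "fico"}
b = {k: np.zeros(n_hid) for k in "fico"}
h, c = np.zeros(n_hid), np.zeros(n_hid)
for x_t in rng.normal(size=(5, n_in)):           # five random input frames
    h, c = lstm_step(x_t, h, c, W, b)
```

With all weights set to zero, every gate equals 0.5, so the cell state is simply halved at each step, which is a convenient sanity check.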
The detailed schematic of GRU is a bit simpler. Only one signal, the hidden (layer output) value $h$, is passed between steps. There are two gates present: the reset gate ($r_t$), which controls how much of the past output ($h_{t-1}$) contributes to the overall cell activation, and the update gate ($u_t$), which controls how much of the current activation ($\tilde{h}_t$) contributes to the final cell output. The above are described by the following equations:
$$u_t = \sigma(W_u \cdot [x_t, h_{t-1}] + b_u),$$
$$r_t = \sigma(W_r \cdot [x_t, h_{t-1}] + b_r),$$
$$\tilde{h}_t = \tanh(W_h \cdot [x_t, r_t \odot h_{t-1}] + b_h),$$
$$h_t = (1 - u_t) \odot h_{t-1} + u_t \odot \tilde{h}_t.$$
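A single GRU step can be sketched in the same manner. Again, this is a generic NumPy formulation with hypothetical weight containers and sizes, not the toolbox implementation used by the authors.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, W, b):
    """One GRU time step; W and b are dicts keyed by 'u', 'r', 'h'."""
    z = np.concatenate([x_t, h_prev])                # [x_t, h_{t-1}]
    u_t = sigmoid(W["u"] @ z + b["u"])               # update gate
    r_t = sigmoid(W["r"] @ z + b["r"])               # reset gate
    z_reset = np.concatenate([x_t, r_t * h_prev])    # [x_t, r_t * h_{t-1}]
    h_tilde = np.tanh(W["h"] @ z_reset + b["h"])     # current activation
    return (1.0 - u_t) * h_prev + u_t * h_tilde

rng = np.random.default_rng(0)
n_in, n_hid = 30, 8   # hidden size is a hypothetical choice, not taken from the article
W = {k: rng.normal(scale=0.1, size=(n_hid, n_in + n_hid)) for k in "urh"}
b = {k: np.zeros(n_hid) for k in "urh"}
h = np.zeros(n_hid)
for x_t in rng.normal(size=(5, n_in)):               # five random input frames
    h = gru_step(x_t, h, W, b)
```

Compared with the LSTM step, only one state vector is carried over and three (instead of four) weight matrices are needed.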

3.1.3. Employed Reconstruction Methods

We compared the performance of five NN architectures: two variants of FFNN and three RNN-FC networks based on GRU, LSTM, and BILSTM; the outline of the latter is depicted in Figure 6. The detailed structures and hyperparameters of the NNs were established empirically, since there are no strict rules or guidelines. Usually, this requires simulations with parameters sweeping the domain of feasible numbers of layers and neurons [41]. We followed this approach and reviewed the performance of the NNs using the test data.
  • FFNN_lin, with 1 hidden fully connected (FC) layer containing 8 linear neurons;
  • FFNN_tanh, with 1 hidden FC layer containing 8 sigmoidal neurons;
  • LSTM followed by 1 FC layer containing 8 sigmoidal neurons;
  • GRU followed by 1 FC layer containing 8 sigmoidal neurons;
  • BILSTM followed by 1 FC layer containing 8 sigmoidal neurons.
The output is a three-valued vector (x, y, z) containing the reconstructed marker coordinates.

3.1.4. Implementation Details

The training process was performed using 600 epochs, with the SGDM solver running on the GPU. It involved the whole input sequence with gaps excluded. There was a single instance of sequence in the batch. The sequence parts containing gaps were used as the test data; the remainder was used for training—therefore, the relative size of test part varies between 0.6% and 57.1%. The other parameters are:
  • Initial Learn Rate: 0.01;
  • Learn Rate Drop Factor: 0.9;
  • Learn Rate Drop Period: 10;
  • Gradient Threshold: 0.7;
  • Momentum: 0.8.
We also applied z-score normalization for the input and target data.
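To make the role of the listed solver settings more tangible, here is a plain-Python sketch of one SGDM update. It assumes a step-wise (piecewise) learning-rate decay and gradient clipping by the L2 norm, which are common interpretations of such settings but are not confirmed by the text; the function names are hypothetical.

```python
import math

# Values quoted from the parameter list in the text.
INITIAL_LR, DROP_FACTOR, DROP_PERIOD = 0.01, 0.9, 10
GRAD_THRESHOLD, MOMENTUM = 0.7, 0.8

def learn_rate(epoch):
    """Step-wise decay (assumed schedule); epochs are counted from 0."""
    return INITIAL_LR * DROP_FACTOR ** (epoch // DROP_PERIOD)

def clip_by_norm(grad, threshold=GRAD_THRESHOLD):
    """Rescale the gradient if its L2 norm exceeds the threshold (assumed clipping mode)."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= threshold:
        return list(grad)
    return [g * threshold / norm for g in grad]

def sgdm_step(params, velocity, grad, epoch):
    """One stochastic-gradient-descent-with-momentum update."""
    grad = clip_by_norm(grad)
    lr = learn_rate(epoch)
    velocity = [MOMENTUM * v - lr * g for v, g in zip(velocity, grad)]
    params = [p + v for p, v in zip(params, velocity)]
    return params, velocity

# Toy usage with a single parameter and a made-up gradient value.
params, velocity = sgdm_step([1.0], [0.0], [0.5], epoch=0)
```

Under this assumed schedule, the learning rate is reduced by 10% after every 10 epochs over the 600 epochs of training.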
Additionally, for comparison, we used a pool of other methods, which should provide good results for short-term gaps. These are the interpolations: linear, spline, modified Akima (makima), and piecewise cubic Hermite interpolating polynomial (pchip), as well as the low-rank matrix completion method (mSVD). All but the linear interpolation are variants of piecewise cubic Hermite polynomial interpolation, which differ in the details of how they compute the interpolant slopes. Spline is a generic method, whereas pchip tries to preserve shape, and makima avoids overshooting. In contrast, mSVD [42] is an iterative method that decomposes the motion capture data with SVD, neglects the least significant part of the basis-transformed signal, and reconstructs the original data, replacing the missing values with the reconstructed ones. The procedure finishes when convergence is reached. We implemented the algorithm as outlined in [24].
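The iterative SVD-based completion described above can be sketched roughly as follows. This is a generic truncated-SVD imputation written in NumPy, not the implementation from [24] or [42]; the function name `msvd_fill`, the retained rank, the tolerance, the column-mean initialization, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def msvd_fill(data, rank=3, tol=1e-9, max_iter=1000):
    """Fill NaN entries of a (frames x coordinates) matrix by iterated truncated SVD."""
    missing = np.isnan(data)
    if not missing.any():
        return data.copy()
    # Initial guess: replace the missing samples with column means.
    filled = np.where(missing, np.nanmean(data, axis=0), data)
    for _ in range(max_iter):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        s[rank:] = 0.0                              # neglect the least significant components
        approx = (U * s) @ Vt                       # low-rank reconstruction
        change = np.max(np.abs(approx[missing] - filled[missing]))
        filled[missing] = approx[missing]           # observed samples stay untouched
        if change < tol:                            # convergence reached
            break
    return filled

# Synthetic rank-3 "trajectories" with a 20-frame gap in one coordinate.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 200)
basis = np.column_stack([np.ones_like(t), np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)])
full = basis @ rng.normal(size=(3, 9))
gappy = full.copy()
gappy[50:70, 2] = np.nan
filled = msvd_fill(gappy, rank=3)
```

On real recordings the data are only approximately low-rank, so the retained rank becomes a tuning parameter and the reconstruction is not exact.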
The implementation of methods and experiments was carried out in Matlab 2021a using its implementations of numerical methods and deep learning toolbox.

3.2. Input Data Preparation

When constructing the predictor for a certain marker, we took the locations of all its sibling markers and of a single parent marker, as they are organized within the FBM structure. For the j-th marker ($X_j = [x_j, y_j, z_j]$), we consider the parent ($X_p$) and sibling markers ($X_{s_1}, \ldots, X_{s_L}$). To form an input vector, we take two values of each: one for the current moment and one with a one-sample lag. Other variants, with more lags or with values raised to higher powers, were considered, but after preliminary tests, we neglected them since they did not improve performance.
Each input vector T, for the moment n, is quite long and is assembled from the parts given below:
$$T(n,:) = \big[\underbrace{x_p(n), y_p(n), z_p(n), x_p(n-1), y_p(n-1), z_p(n-1)}_{\text{current and former values of parent marker } p},\ \underbrace{x_{s_1}(n), y_{s_1}(n), z_{s_1}(n), x_{s_1}(n-1), y_{s_1}(n-1), z_{s_1}(n-1)}_{\text{current and former values of first sibling } s_1},\ \ldots,\ \underbrace{x_{s_L}(n), y_{s_L}(n), z_{s_L}(n), x_{s_L}(n-1), y_{s_L}(n-1), z_{s_L}(n-1)}_{\text{current and former values of last sibling } s_L}\big].$$
Finally, the input and output data are z-score standardized—zero centered and standard deviation scaled to 1, since such a step notably improves the final results.
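The assembly of the input vectors and the z-score standardization can be sketched as below. The array layout (frames in rows, x, y, z in columns), the function names, and the decision to drop the first frame, which has no predecessor, are assumptions of this example; the random arrays merely stand in for real marker trajectories.

```python
import numpy as np

def build_inputs(parent, siblings):
    """Stack current and one-frame-lagged coordinates of the parent and sibling markers.

    parent: (N, 3) array; siblings: list of (N, 3) arrays.
    Returns an (N - 1, 6 * (1 + L)) matrix; row n - 1 describes frame n.
    """
    blocks = []
    for marker in [parent, *siblings]:
        blocks.append(marker[1:])     # x(n), y(n), z(n)
        blocks.append(marker[:-1])    # x(n-1), y(n-1), z(n-1)
    return np.hstack(blocks)

def zscore(a):
    """Zero-center each column and scale it to unit standard deviation."""
    mean, std = a.mean(axis=0), a.std(axis=0)
    return (a - mean) / std, mean, std

rng = np.random.default_rng(0)
parent = rng.normal(size=(100, 3))                        # placeholder trajectory
siblings = [rng.normal(size=(100, 3)) for _ in range(4)]  # four placeholder siblings
T = build_inputs(parent, siblings)
T_std, T_mean, T_scale = zscore(T)
```

With one parent and four siblings, each input vector has 30 values, which matches the exemplary case discussed with the parameter counts. The stored mean and scale are needed to map the network output back to the original units.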

3.3. Test Dataset

For testing purposes, we used a dataset (Table 1) acquired for professional purposes in the motion-capture laboratory. The ground truth sequences were obtained at the PJAIT human motion laboratory using the industrial-grade Vicon MX system. The system capture volume was 9 m × 5 m × 3 m. To minimize the impact of external interference, such as infrared interference from sunlight or vibrations, all windows were permanently darkened and cameras were mounted on scaffolding instead of tripods. The system was equipped with 30 NIR cameras manufactured by Vicon: MX-T40, Bonita10, and Vantage V5, with 10 units of each kind.
During the recording, we employed a standard animation pipeline, where data were obtained with Vicon Blade software using a 53-marker setup. The trajectories were acquired at 100 Hz and, by default, they were processed in a standard, industrial-quality way, which includes manual data reviewing, cleaning and denoising, so they can be considered distortion-free.
Several parameters of the test sequences are also presented in Table 2. We selected these parameters as one could consider them to potentially describe prediction difficulty. They are varied and based on different concepts, such as information theory, statistics, kinematics, and dynamics, but all characterize the variability in the mocap signal. They are usually reported as the average value per marker, except for the standard deviation (std dev), which is reported per coordinate.
Two non-obvious measures are enumerated: monotonicity and complexity. The monotonicity indicates, on average, the extent to which the coordinate is monotonic. For this purpose, we employed an average Spearman rank correlation, which can be described as follows:
$$\mathrm{monotonicity} = \frac{1}{M} \sum_{m=1}^{M} \mathrm{corr}\big(\mathrm{rank}(X_m),\, 1 \ldots N\big),$$
where $X_m$ is the m-th coordinate, $M$ is the number of coordinates, and $N$ is the sequence length.
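A small, dependency-free Python sketch of the monotonicity measure is given below. It follows the formula literally (the signed correlations are averaged), uses mean ranks for ties, and would fail with a division by zero for a perfectly constant coordinate; the helper names and the toy coordinates are hypothetical.

```python
def average_ranks(values):
    """Ranks starting at 1; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

def monotonicity(coordinates):
    """Mean Spearman correlation between each coordinate and the frame index 1..N."""
    frame_index = list(range(1, len(coordinates[0]) + 1))
    scores = [pearson(average_ranks(c), frame_index) for c in coordinates]
    return sum(scores) / len(scores)

rising = [0.1, 0.4, 0.5, 2.0]     # toy coordinate, strictly increasing
falling = [3.0, 2.5, 1.0, -1.0]   # toy coordinate, strictly decreasing
```

Note that, taken literally, a rising and a falling coordinate cancel each other out in the average.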
Complexity, on the other hand, is how we estimate the variability of poses in the sequence. For that purpose, we employed PCA, which identifies eigenposes as a new basis for the sequence. The corresponding eigenvalues describe how much of the overall variance is described by each of the eigenposes. Therefore, we decided to take the remainder of the fraction of variance described by the sum of the five largest eigenvalues ($\lambda_i$) as a term describing how complex (or rather simple) the sequence is: the simpler the sequence, the more variance is described with a few eigenposes. Therefore, our complexity measure is simply given as:
$$\mathrm{complexity} = 1 - \sum_{i=1}^{5} \lambda_i \Big/ \sum_{i=1}^{M} \lambda_i,$$
where $M$ is the number of coordinates.
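The complexity measure can be computed, for instance, from the singular values of the mean-centered pose matrix, since their squares are proportional to the PCA eigenvalues. The sketch below assumes frames in rows and coordinates in columns; the function name and the synthetic test data are illustrative only.

```python
import numpy as np

def complexity(poses, n_keep=5):
    """1 minus the fraction of variance captured by the n_keep largest PCA eigenvalues."""
    centered = poses - poses.mean(axis=0)
    s = np.linalg.svd(centered, compute_uv=False)   # singular values, sorted descending
    eigvals = s ** 2                                # proportional to PCA eigenvalues; the scale cancels
    return 1.0 - eigvals[:n_keep].sum() / eigvals.sum()

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 200)
basis = np.column_stack([np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)])
simple = basis @ rng.normal(size=(2, 12))   # synthetic motion spanned by two eigenposes
noisy = rng.normal(size=(200, 20))          # synthetic motion with no dominant eigenposes
c_simple, c_noisy = complexity(simple), complexity(noisy)
```

A sequence spanned by at most five eigenposes yields a complexity of (numerically) zero, whereas unstructured data yields a value well above zero.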

3.4. Quality Evaluation

The natural criterion for the reconstruction task is root mean square error (RMSE), which, in our case, is calculated only for the times and markers where the gaps occur:
$$\mathrm{RMSE} = \sqrt{\frac{1}{|W|} \sum_{i \in W} \big(\hat{X}_i - X_i\big)^2},$$
where $W$ is a gap map, logically indexing the locations of gaps, $\hat{X}$ is a reconstructed coordinate, and $X$ is the original coordinate.
Additionally, we calculated RMSEs for individual gaps. The local RMSE is a variant of the above formula, and is simply given as:
$$\mathrm{RMSE}_k = \sqrt{\frac{1}{|w_k|} \sum_{i \in w_k} \big(\hat{X}_i - X_i\big)^2},$$
where $w_k \subset W$ is a single gap map logically indexing the location of the k-th gap, $\hat{X}$ is a reconstructed coordinate, and $X$ is the original coordinate. $\mathrm{RMSE}_k$ is intended to reveal variability in reconstruction capabilities; hence, we used it to obtain statistical descriptors: mean, median, mode, quartiles, and interquartile range.
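The difference between the overall RMSE and the per-gap RMSE can be shown on a toy one-dimensional series. In this sketch, the gap maps are represented as index ranges and the values are invented for illustration.

```python
import math

def rmse(reconstructed, original, indices):
    """Root mean square error restricted to the given sample indices."""
    squared = [(reconstructed[i] - original[i]) ** 2 for i in indices]
    return math.sqrt(sum(squared) / len(squared))

# Toy series with two gaps; the reconstruction is off by 1 in the first gap and by 2 in the second.
original = [float(i) for i in range(12)]
reconstructed = list(original)
gaps = [range(2, 5), range(8, 10)]              # w_1 and w_2
for i in gaps[0]:
    reconstructed[i] += 1.0
for i in gaps[1]:
    reconstructed[i] += 2.0

gap_map = [i for gap in gaps for i in gap]      # W, the union of all gaps
overall = rmse(reconstructed, original, gap_map)
per_gap = [rmse(reconstructed, original, gap) for gap in gaps]
```

Here the per-gap values are 1 and 2, while the overall RMSE is the square root of 11/5, which shows that the overall figure alone hides the variability between gaps.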
A more complex evaluation of regression models can be based on information criteria. These quality measures incorporate the squared error and the number of tunable parameters, as they were designed to express a tradeoff between the number of tunable parameters and the obtained error. The two most popular ones are the Bayesian Information Criterion (BIC) and the Akaike Information Criterion (AIC). BIC is calculated as:
$$\mathrm{BIC} = n \log(\mathrm{MSE}) + p \log(n),$$
whereas the AIC formula is as follows:
$$\mathrm{AIC} = n \log(\mathrm{MSE}) + 2p,$$
where the mean squared error is $\mathrm{MSE} = \mathrm{RMSE}^2$, $n$ is the number of testing data, and $p$ is the number of tunable parameters.
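Both criteria are straightforward to evaluate. The sketch below uses the natural logarithm, which is an assumption (the text does not state the base), and compares two hypothetical models with identical error but different parameter counts; all numbers are made up.

```python
import math

def bic(mse, n, p):
    """Bayesian Information Criterion: n log(MSE) + p log(n)."""
    return n * math.log(mse) + p * math.log(n)

def aic(mse, n, p):
    """Akaike Information Criterion: n log(MSE) + 2p."""
    return n * math.log(mse) + 2 * p

# Hypothetical comparison: same MSE on n = 100 test samples, different model sizes.
small_model = {"aic": aic(4.0, 100, 275), "bic": bic(4.0, 100, 275)}
large_model = {"aic": aic(4.0, 100, 3000), "bic": bic(4.0, 100, 3000)}
```

For equal errors, the model with fewer tunable parameters obtains the lower (better) value of both criteria, and BIC penalizes the parameters more strongly than AIC whenever log(n) exceeds 2.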

3.5. Experimental Protocol

During the experiments, we simulated gap occurrence in perfectly reconstructed source sequences. We simulated gaps of different average lengths—10, 20, 50, 100 and 200 samples (0.1, 0.2, 0.5, 1, and 2 s, respectively). The assumed gap sizes were chosen to represent situations of various levels of difficulty, from short-and-simple to difficult ones, when gaps are long. For every gap length, we performed 100 simulation iterations, where the training and testing data do not intermix between simulation runs. The steps performed in every iteration are as follows:
  • We introduce two gaps of assumed length (on average) to the random markers at random moments; actual values are stored as testing data;
  • The model is trained using the remaining part of the sequence (all but gaps);
  • We reconstruct (predict) the gaps using the pool of methods;
  • The resulting values are stored for evaluation.
We report the results as RMSE and statistical descriptors of $\mathrm{RMSE}_k$ for every considered reconstruction technique. Additionally, we verified the correlation between RMSE and the variability descriptors of the sequences. This is intended to reveal the sources of difficulty in predicting the marker trajectories.

Gap Generation Procedure

The gap contamination procedure that we employed introduces distortions into the sequences in a controlled way. The parameters characterizing the experiment are the average length and the number of occurrences of gaps. The sequence of operations distorting the signal is as follows: first, we draw the moments to contaminate, and then we select a random marker. The durations of distortions and intervals follow a Poisson process: the average length of a distortion is set according to the gap length considered in the experiment, whereas the interval length results from the sequence length and the number of intervals, of which there are three for two gaps per sequence (ahead of the first gap, in between, and after the second gap).
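One possible reading of this procedure is sketched below in NumPy: gap and interval lengths are drawn from Poisson distributions, and each gap is assigned to a random marker. The function name, the mask representation, and the clipping of intervals so that all gaps fit into the sequence are assumptions of this sketch and may differ from the procedure actually used.

```python
import numpy as np

def make_gap_mask(n_frames, n_markers, mean_gap_len, n_gaps=2, rng=None):
    """Boolean (frames x markers) mask; True marks a simulated missing sample."""
    rng = np.random.default_rng() if rng is None else rng
    mask = np.zeros((n_frames, n_markers), dtype=bool)
    gap_lens = np.maximum(1, rng.poisson(mean_gap_len, size=n_gaps))
    free = n_frames - int(gap_lens.sum())
    if free < n_gaps + 1:
        raise ValueError("sequence too short for the requested gaps")
    # n_gaps gaps leave n_gaps + 1 intervals; their mean length follows from the sequence length.
    mean_interval = free / (n_gaps + 1)
    pos = 0
    for k, length in enumerate(gap_lens):
        length = int(length)
        room = n_frames - pos - int(gap_lens[k:].sum())   # frames that may still be left intact
        interval = min(int(rng.poisson(mean_interval)), room)
        pos += interval                                   # undistorted interval before the gap
        marker = rng.integers(n_markers)                  # random marker to contaminate
        mask[pos:pos + length, marker] = True
        pos += length
    return mask

# Example: two gaps of about 50 samples in a 1000-frame recording of 53 markers.
mask = make_gap_mask(n_frames=1000, n_markers=53, mean_gap_len=50,
                     rng=np.random.default_rng(0))
```

The masked samples would then be stored as testing data and replaced with NaN values before the reconstruction methods are run.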

4. Results and Discussion

The section comprises two parts. First, we present RMSE results; they illustrate the performance of each of the considered gap reconstruction methods. The second part is the interpretation of results, searching for the aspects of Mocap sequence that might affect the resulting performance.

4.1. Gap Reconstruction Efficiency

The detailed numerical values are presented in Table 3 for the first sequence as an example. In the table, we also emphasize the best result for each measure and gap size. For clarity, the numerical outcomes of the experiment are only presented in this section with representative examples. To see the complete set of results in tabular form, please refer to Appendix A. The complete results for the gap reconstruction are also demonstrated in a visual form in Figure 7. Additionally, zoomed variants of the plot fragments (annotated with dashed squares) for gaps of 10–50 samples are presented in Figure 8.
The first observation, regarding the performance measures, is the fact that the results are very coherent, regardless of which measure was used. This is shown in Figure 7, where all the symbols coherently denote statistical descriptors scale. It is also clearly visible in the values emphasized in Table 3, where all measures but one (mode) indicate the same best (smallest) results. Hence, we can use a single quality measure; in our case, we assumed RMSE for further analysis.
Analyzing the results for several sequences, various observations regarding the performance of the considered methods can be noted. These are listed below:
  • For short gaps, interpolation methods outperform any of the NN-based methods.
  • For gaps that are 50 samples long, the results become less obvious, and the NN results are no worse than, or (usually) better than, those of the interpolation methods.
  • For gaps of 50 samples or longer, linear FFNN usually performed better than any other method (including the non-linear FFNN_tanh) for most of the sequences.
  • In very rare short-gap cases, RNNs performed better than FFNN_lin but, in general, the simpler FFNN_lin outperformed the more complex NN models.
  • There are two situations (walking and falling) in which FFNN_lin performed no better than, or worse than, the interpolation methods. This occurred for sequences with larger monotonicity values in Table 2. These sequences also have increased velocity/acceleration/jerk values; however, the ‘running’ sequence has similar values of these parameters and FFNN_lin performs best there, so the kinematic/dynamic parameters should not be considered the cause.
Looking at the results of various NN architectures, it might be surprising that the sophisticated RNNs often returned worse results than the relatively simple FFNN, especially for relatively long gaps. One might rather expect RNNs to outperform other methods, since they should be able to model longer-term dependencies in the motion. Presumably, the source of such a result is the limited amount of training data, which, depending on the length of the source file, varies between hundreds and thousands of registered coordinates. Therefore, solvers are unable to find good values for a large number of parameters; see Table 4 for the formulas and numbers of learnable parameters for an exemplary case in which the input comprises 30 values (the coordinates of four siblings and a parent at the current and previous frames).
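To give a feeling for the disproportion in the number of learnable parameters, the following sketch uses the standard textbook formulas for fully connected, LSTM, GRU, and bidirectional LSTM layers. It is not a reproduction of Table 4: the number of recurrent hidden units is not stated in this section, so the value of 16 used below is purely hypothetical.

```python
def fc_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

def lstm_params(n_in, n_hid):
    """Four gate/candidate blocks, each with input weights, recurrent weights, and biases."""
    return 4 * (n_hid * (n_in + n_hid) + n_hid)

def gru_params(n_in, n_hid):
    """Three blocks (update, reset, candidate) in the basic GRU formulation."""
    return 3 * (n_hid * (n_in + n_hid) + n_hid)

def bilstm_params(n_in, n_hid):
    """Two independent LSTMs, one per direction."""
    return 2 * lstm_params(n_in, n_hid)

N_IN, N_FC, N_OUT = 30, 8, 3   # 30 inputs, 8 FC neurons, 3 output coordinates (from the text)
N_HID = 16                     # hypothetical recurrent layer size, not taken from the article

ffnn_total = fc_params(N_IN, N_FC) + fc_params(N_FC, N_OUT)
lstm_total = lstm_params(N_IN, N_HID) + fc_params(N_HID, N_FC) + fc_params(N_FC, N_OUT)
gru_total = gru_params(N_IN, N_HID) + fc_params(N_HID, N_FC) + fc_params(N_FC, N_OUT)
bilstm_total = bilstm_params(N_IN, N_HID) + fc_params(2 * N_HID, N_FC) + fc_params(N_FC, N_OUT)
```

Under these assumptions, the FFNN has 275 learnable parameters, while even a small recurrent layer adds thousands, which has to be set against the hundreds to thousands of training samples available per sequence.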
An obvious solution to this issue would be to increase the amount of training data. We could achieve this by employing very long recordings or by using numerous recordings. In the former case, it would be difficult to obtain long enough recordings; the latter differs from the case we address, in which we only obtain a fresh mocap recording and reconstruct it with the minimal model given by FBM. Training the predictive model in advance with a massive amount of data is, of course, an interesting solution, but it would come at the cost of generality. For every marker configuration, a separate set of predicting NNs would need to be trained, so the result would only be practical for standardized body models.
Considering the length of the training sequences, its contribution to the final results seems far less important than other factors, at least within the range of considered cases. The analysis of its influence is illustrated in Figure 9. Since the MSE results are entangled, we employed two additional information criteria, the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC), which disentangle the results by accounting for the number of trainable parameters. For every sequence and every NN model, we obtain a series of five results, which decrease as the training sequence grows longer when we have shorter gaps (i.e., the annotated quintuple in the Figure). In Figure 9, this is most conveniently observed in the AIC/BIC plots since, for each model, the number of parameters remains the same (Table 4), so we can easily compare the results of the testing sequences. The zoomed versions (to the right) reveal differences at appropriate scales for the RNN results.
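The way such criteria account for model size can be sketched as follows. The text does not state which exact variant was used, so this example assumes the common least-squares form for Gaussian residuals (valid up to an additive constant); the function names and the numbers passed to them are hypothetical.

```python
import math


def aic_from_mse(mse, n_samples, n_params):
    """AIC for a least-squares fit: n * ln(MSE) + 2k (constant terms dropped)."""
    return n_samples * math.log(mse) + 2 * n_params


def bic_from_mse(mse, n_samples, n_params):
    """BIC for a least-squares fit: n * ln(MSE) + k * ln(n)."""
    return n_samples * math.log(mse) + n_params * math.log(n_samples)


# Placeholder values only: two models with the same error but different sizes.
n = 1000          # number of evaluated samples (hypothetical)
mse = 4.0         # identical MSE for both models (hypothetical)
small_k, large_k = 93, 24515

print("AIC small/large:", aic_from_mse(mse, n, small_k), aic_from_mse(mse, n, large_k))
print("BIC small/large:", bic_from_mse(mse, n, small_k), bic_from_mse(mse, n, large_k))
```

With equal MSE, the model with more parameters receives the larger (worse) score, and the BIC penalty per parameter exceeds that of AIC whenever the sample count is above about eight.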
Looking at the results, we observe that, regardless of the length of the training sequence, the MSE (AIC/BIC) of the NN model remains at the same order of magnitude; this is clearly visible in the Figure, where we have very similar values for each gap size for the various sequences (represented as different marker shapes) for each of the NN types (represented by color). The most notable reduction in the error with increased sequence length is probably observed when the sequence (Seq. 1, static) is several times longer than the others. However, we cannot observe this difference for the shorter sequences in our data with notably different lengths (e.g., walking vs. running). The quality of prediction could likely be improved if the recordings were longer, but, in everyday practice, motion capture sequences are only minutes long, so one should not expect the RNN results to improve notably relative to those for FFNN.
The observations hold for both FFNN models and all RNNs. These ambiguous outcomes confirm the results shown in [40], where the quality of results does not depend on the length of the training data in a straightforward way.

4.2. Motion Factors Affecting Performance

In this section, we try to identify which features (parameters) of the input sequences correlate with the performance of the gap-filling methods. The results presented here are concise; we only present and discuss the most conclusive ones. The complete tables containing correlation values for all gap sizes are presented in Appendix B.
First, a generalized view of the correlation between gap-filling outcomes and input sequence characteristics is given in Table 5. It contains Pearson correlation coefficients (CCs) between RMSE and the input sequence characteristic parameters, averaged across all the considered gap sizes. Additionally, to aid the interpretation of the results, in Table 6, we provide CCs between RMSE and the descriptive parameters for the whole sequences for all the test recordings.
Knowing that correlation, as a statistical measure, makes little sense for a sparse dataset, we treat it as a kind of measurement of co-linearity between the measures. However, for some of the parameters, the (high) correlation values are accompanied by satisfactorily low p-values; these are given in Appendix B.
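The averaging of Pearson coefficients across gap sizes described above can be sketched as follows. The helper names are hypothetical and the numbers are arbitrary placeholders chosen only to make the example run; they are not values from Tables 2, 5, or 6.

```python
import math


def pearson_cc(xs, ys):
    """Pearson correlation coefficient of two equally long samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


def mean_cc_over_gaps(feature, rmse_by_gap):
    """Average the per-gap-size CCs between one feature and the RMSE."""
    ccs = [pearson_cc(feature, rmse) for rmse in rmse_by_gap.values()]
    return sum(ccs) / len(ccs)


# Placeholder data: one feature value and one RMSE per test sequence.
feature = [0.2, 1.1, 1.4, 0.5, 0.6, 1.9]
rmse_by_gap = {
    10: [0.4, 0.9, 1.0, 0.5, 0.5, 1.2],
    50: [3.0, 9.5, 8.0, 4.1, 3.3, 14.0],
}

print(f"mean CC across gap sizes: {mean_cc_over_gaps(feature, rmse_by_gap):.3f}")
```

With only six sequences, each coefficient rests on six points, which is why the text treats the values as an indication of co-linearity rather than as a strong statistical statement.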
Looking into the results in Table 5, we observe that all the considered sequence parameters are related, to some extent, to RMSE. However, for all the gap-filling methods, we identified two key parameters that have higher CCs than the others. These are acceleration and monotonicity, which seem to be promising candidate measures for describing the susceptibility of sequences to the employed reconstruction methods.
Regarding inter-parameter correlations in Table 6, we can observe that most of the measures are correlated with each other. This is expected, since kinematic/dynamic parameters are connected with the location of the markers over time, so values such as entropy, position standard deviation, velocity, acceleration, and jerk are correlated (for the derivatives, the smaller the difference in the derivative order, the higher the CCs).
On the other hand, the two less typical measures, monotonicity and complexity, are different; therefore, their correlation with the other measures is less predictable. Complexity appeared to have a notable negative correlation with most of the typical measures. Monotonicity is more interesting. Although it is only moderately correlated with the remaining measures, it still has quite a high CC with the RMSEs for all the gap reconstruction methods. Therefore, we can suppose that it describes an aspect of the sequence that is independent of the other measures and is related to susceptibility to the gap reconstruction procedures.

5. Summary

In this article, we addressed the issue of filling the gaps that occur in the mocap signal. We considered this to be a regression problem and reviewed the results of several NN-based regressors, which were compared with several interpolation and low-rank matrix completion (mSVD) methods.
Generally, in the case of short gaps, the interpolation methods returned the best results, but as the gaps became longer, some of the NNs gained an advantage. We reviewed five variants of neural networks. Surprisingly, the tests revealed that simple linear FFNNs, using momentary (current and previous sample) and local (from neighboring markers) coordinates as input data, outperformed quite advanced recurrent NNs for the longer gaps. For the shorter gaps, RNNs offered better results, but all the NNs were outperformed by interpolations. The boundary between 'long' and 'short' gaps lies at gaps 50 samples long. Finally, we were able to identify which factors of the input mocap sequence influence the reconstruction errors.
The approach to the NNs given here does not incorporate skeletal information. Instead, the kinematic structure is based on the FBM framework and all the predictions are performed with the local data, as obtained from FBM. Currently, none of the analyzed approaches considers body constraints such as limb length or size, but we can easily obtain such information from the FBM model. We plan to apply this as an additional processing stage in the future. We also plan to test more sophisticated NN architectures, such as combined LSTM convolution, or averaged multiregressions.

Supplementary Materials

The following are available at, The motion capture sequences.

Author Contributions

Conceptualization, P.S.; methodology, P.S., M.P.; software, P.S., M.P.; investigation, P.S.; resources, M.P.; data curation, M.P.; writing—original draft preparation, P.S.; writing—review and editing, P.S., M.P.; visualization, P.S. All authors have read and agreed to the published version of the manuscript.


Funding

The research described in the paper was performed within the statutory project of the Department of Graphics, Computer Vision and Digital Systems at the Silesian University of Technology, Gliwice (RAU-6, 2021). The APC was covered from statutory research funds. M.P. was supported by grant no WND-RPSL.01.02.00-24-00AC/19-011 funded under the Regional Operational Programme of the Silesia Voivodeship in the years 2014–2020.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The motion capture sequences are provided as Supplementary Files accompanying the article.


Acknowledgments

The research was supported with motion data by the Human Motion Laboratory of the Polish-Japanese Academy of Information Technology.

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.


Abbreviations

The following abbreviations are used in this manuscript:
BILSTM | bidirectional LSTM
CC | correlation coefficient
FC | fully connected
FBM | functional body mesh
FFNN | feed forward neural network
GRU | gated recurrent unit
HML | Human Motion Laboratory
IK | inverse kinematics
KF | Kalman filter
LS | least squares
LSTM | long short-term memory
Mocap | MOtion CAPture
MSE | mean square error
NARX-NN | nonlinear autoregressive exogenous neural network
NaN | not a number
NN | neural network
OMC | optical motion capture
PCA | principal component analysis
PJAIT | Polish-Japanese Academy of Information Technology
RMSE | root mean squared error
RNN | recurrent neural network
STDDEV | standard deviation
SVD | singular value decomposition

Appendix A. Performance Results for All Sequences

Table A1. Quality measures for the walking (No. 2) sequence.
mean(RMSE_k) | 12.398 | 23.213 | 7.659 | 9.014 | 6.495 | 3.442 | 0.810 | 1.621 | 1.697 | 3.442
median(RMSE_k) | 10.865 | 21.290 | 6.327 | 8.262 | 5.956 | 2.051 | 0.511 | 1.087 | 1.180 | 2.051
mode(RMSE_k) | 3.499 | 4.068 | 1.755 | 1.140 | 2.344 | 0.536 | 0.056 | 0.237 | 0.239 | 0.536
stddev(RMSE_k) | 6.930 | 12.645 | 4.634 | 4.744 | 3.371 | 3.505 | 0.938 | 1.773 | 1.788 | 3.505
iqr(RMSE_k) | 8.986 | 12.914 | 3.644 | 4.297 | 3.444 | 3.180 | 0.652 | 1.293 | 1.334 | 3.180
mean(RMSE_k) | 13.743 | 27.978 | 10.155 | 11.171 | 8.396 | 9.071 | 2.591 | 4.798 | 4.904 | 9.071
median(RMSE_k) | 12.334 | 24.575 | 7.568 | 9.116 | 6.209 | 6.508 | 1.823 | 3.767 | 3.728 | 6.508
mode(RMSE_k) | 2.654 | 5.774 | 3.242 | 5.247 | 2.352 | 0.401 | 0.314 | 0.316 | 0.382 | 0.401
stddev(RMSE_k) | 6.723 | 16.042 | 8.161 | 6.609 | 8.827 | 8.020 | 2.828 | 4.290 | 4.175 | 8.020
iqr(RMSE_k) | 7.454 | 15.726 | 4.545 | 5.667 | 2.491 | 6.791 | 1.571 | 3.308 | 3.921 | 6.791
mean(RMSE_k) | 19.168 | 36.769 | 19.788 | 19.867 | 18.831 | 33.944 | 16.673 | 21.757 | 21.607 | 33.944
median(RMSE_k) | 16.432 | 32.752 | 15.196 | 15.655 | 14.926 | 23.652 | 12.952 | 16.134 | 15.996 | 23.652
mode(RMSE_k) | 5.905 | 13.574 | 6.336 | 7.173 | 6.100 | 5.500 | 4.293 | 3.782 | 3.921 | 5.500
stddev(RMSE_k) | 10.486 | 16.289 | 13.174 | 12.408 | 13.037 | 25.484 | 12.659 | 14.438 | 13.993 | 25.484
iqr(RMSE_k) | 12.421 | 22.207 | 13.308 | 11.413 | 12.903 | 29.918 | 12.189 | 17.991 | 18.129 | 29.918
mean(RMSE_k) | 32.287 | 66.701 | 50.195 | 49.019 | 49.453 | 63.445 | 46.476 | 50.803 | 50.693 | 63.445
median(RMSE_k) | 23.318 | 56.329 | 38.960 | 37.001 | 39.074 | 51.683 | 35.447 | 40.065 | 40.418 | 51.683
mode(RMSE_k) | 8.122 | 22.940 | 14.125 | 15.094 | 14.334 | 12.943 | 12.407 | 12.074 | 12.493 | 12.943
stddev(RMSE_k) | 22.397 | 35.709 | 35.107 | 34.707 | 34.685 | 41.564 | 34.503 | 35.371 | 35.700 | 41.564
iqr(RMSE_k) | 18.933 | 41.446 | 39.727 | 40.427 | 40.813 | 63.062 | 39.062 | 49.784 | 50.440 | 63.062
mean(RMSE_k) | 87.084 | 121.229 | 108.733 | 111.164 | 107.192 | 75.307 | 91.826 | 69.585 | 70.031 | 75.307
median(RMSE_k) | 59.288 | 104.710 | 91.987 | 89.523 | 91.019 | 68.567 | 80.427 | 63.559 | 61.704 | 68.567
mode(RMSE_k) | 26.007 | 46.150 | 23.032 | 23.675 | 22.813 | 42.408 | 21.841 | 21.984 | 21.602 | 42.408
stddev(RMSE_k) | 71.160 | 57.197 | 66.944 | 71.470 | 63.693 | 26.502 | 53.401 | 39.296 | 40.746 | 26.502
iqr(RMSE_k) | 61.864 | 71.116 | 90.839 | 90.285 | 90.685 | 42.057 | 88.500 | 65.862 | 66.873 | 42.057
Table A2. Quality measures for the running (No. 3) sequence.
mean(RMSE_k) | 9.939 | 23.049 | 7.675 | 7.581 | 6.105 | 2.221 | 0.476 | 0.985 | 0.942 | 2.221
median(RMSE_k) | 8.661 | 20.122 | 6.973 | 6.485 | 5.540 | 1.743 | 0.346 | 0.831 | 0.720 | 1.743
mode(RMSE_k) | 1.933 | 6.022 | 1.838 | 1.236 | 1.106 | 0.234 | 0.079 | 0.149 | 0.151 | 0.234
stddev(RMSE_k) | 5.919 | 11.837 | 4.214 | 4.245 | 3.797 | 1.714 | 0.439 | 0.692 | 0.691 | 1.714
iqr(RMSE_k) | 7.005 | 15.692 | 5.106 | 4.850 | 3.513 | 1.835 | 0.286 | 0.835 | 0.799 | 1.835
mean(RMSE_k) | 10.331 | 25.124 | 9.324 | 9.440 | 6.919 | 5.676 | 1.274 | 2.601 | 2.589 | 5.676
median(RMSE_k) | 8.695 | 23.641 | 7.664 | 7.948 | 5.424 | 4.496 | 0.968 | 1.988 | 1.853 | 4.496
mode(RMSE_k) | 2.547 | 6.946 | 2.438 | 1.512 | 1.953 | 0.661 | 0.237 | 0.453 | 0.438 | 0.661
stddev(RMSE_k) | 6.215 | 11.425 | 6.552 | 5.753 | 5.787 | 4.017 | 1.010 | 1.889 | 2.021 | 4.017
iqr(RMSE_k) | 8.168 | 12.490 | 4.481 | 5.442 | 3.111 | 3.995 | 1.017 | 2.154 | 2.061 | 3.995
mean(RMSE_k) | 14.767 | 31.801 | 17.835 | 15.504 | 14.637 | 27.624 | 8.608 | 14.842 | 16.431 | 27.624
median(RMSE_k) | 9.523 | 25.412 | 10.904 | 10.501 | 8.853 | 25.122 | 6.834 | 12.894 | 13.844 | 25.122
mode(RMSE_k) | 3.229 | 9.379 | 4.119 | 2.888 | 3.306 | 2.559 | 0.896 | 1.291 | 1.737 | 2.559
stddev(RMSE_k) | 18.345 | 22.596 | 25.456 | 18.049 | 18.231 | 18.865 | 8.914 | 11.837 | 12.760 | 18.865
iqr(RMSE_k) | 6.432 | 16.838 | 6.719 | 7.811 | 7.903 | 20.224 | 6.920 | 9.883 | 11.590 | 20.224
mean(RMSE_k) | 25.165 | 49.288 | 44.780 | 40.344 | 42.251 | 83.854 | 37.303 | 51.072 | 55.958 | 83.854
median(RMSE_k) | 18.493 | 41.944 | 33.811 | 31.168 | 32.177 | 77.220 | 32.103 | 46.438 | 50.903 | 77.220
mode(RMSE_k) | 4.901 | 11.780 | 8.178 | 5.555 | 4.181 | 4.989 | 4.549 | 3.554 | 3.884 | 4.989
stddev(RMSE_k) | 27.594 | 35.231 | 50.041 | 35.158 | 38.271 | 41.350 | 25.286 | 27.575 | 27.272 | 41.350
iqr(RMSE_k) | 13.060 | 29.863 | 24.844 | 24.922 | 25.449 | 47.512 | 25.816 | 26.432 | 29.725 | 47.512
mean(RMSE_k) | 88.708 | 129.262 | 125.767 | 123.634 | 125.213 | 235.787 | 119.848 | 146.780 | 185.085 | 235.787
median(RMSE_k) | 70.845 | 113.902 | 108.387 | 105.181 | 107.987 | 233.618 | 103.952 | 128.657 | 171.109 | 233.618
mode(RMSE_k) | 20.092 | 53.434 | 39.113 | 39.722 | 38.728 | 96.336 | 38.444 | 36.027 | 74.145 | 96.336
stddev(RMSE_k) | 63.969 | 65.135 | 70.990 | 70.695 | 71.285 | 73.293 | 66.963 | 77.021 | 70.628 | 73.293
iqr(RMSE_k) | 67.200 | 73.343 | 87.747 | 82.080 | 89.947 | 77.986 | 83.010 | 64.869 | 47.085 | 77.986
Table A3. Quality measures for the sitting (No. 4) sequence.
mean(RMSE_k) | 3.272 | 3.386 | 1.463 | 1.737 | 1.210 | 1.218 | 0.478 | 0.617 | 0.606 | 1.218
median(RMSE_k) | 2.996 | 2.987 | 1.351 | 1.682 | 1.108 | 0.948 | 0.339 | 0.475 | 0.429 | 0.948
mode(RMSE_k) | 0.437 | 0.558 | 0.197 | 0.212 | 0.249 | 0.072 | 0.059 | 0.041 | 0.043 | 0.072
stddev(RMSE_k) | 1.896 | 1.767 | 0.806 | 0.846 | 0.642 | 1.094 | 0.483 | 0.530 | 0.537 | 1.094
iqr(RMSE_k) | 2.282 | 2.025 | 0.991 | 1.301 | 0.702 | 1.049 | 0.260 | 0.467 | 0.480 | 1.049
mean(RMSE_k) | 3.106 | 3.429 | 1.708 | 1.797 | 1.475 | 3.057 | 0.942 | 1.515 | 1.559 | 3.057
median(RMSE_k) | 2.911 | 3.319 | 1.519 | 1.572 | 1.318 | 2.434 | 0.739 | 1.230 | 1.169 | 2.434
mode(RMSE_k) | 0.522 | 0.497 | 0.300 | 0.240 | 0.271 | 0.211 | 0.126 | 0.155 | 0.161 | 0.211
stddev(RMSE_k) | 1.577 | 1.750 | 1.122 | 0.962 | 0.812 | 2.415 | 0.838 | 1.153 | 1.311 | 2.415
iqr(RMSE_k) | 2.233 | 2.263 | 1.038 | 1.069 | 0.934 | 2.762 | 0.781 | 0.979 | 0.995 | 2.762
mean(RMSE_k) | 4.383 | 5.355 | 5.064 | 4.697 | 4.895 | 12.767 | 4.902 | 7.260 | 7.710 | 12.767
median(RMSE_k) | 3.982 | 4.831 | 4.007 | 3.623 | 3.803 | 11.036 | 3.652 | 5.788 | 6.343 | 11.036
mode(RMSE_k) | 0.482 | 0.417 | 0.313 | 0.422 | 0.277 | 0.267 | 0.332 | 0.267 | 0.240 | 0.267
stddev(RMSE_k) | 2.276 | 3.254 | 3.793 | 3.568 | 3.778 | 8.741 | 3.880 | 5.667 | 6.265 | 8.741
iqr(RMSE_k) | 2.978 | 3.833 | 5.160 | 4.098 | 4.999 | 11.116 | 5.269 | 6.546 | 6.801 | 11.116
mean(RMSE_k) | 11.904 | 16.468 | 18.440 | 17.539 | 18.222 | 33.439 | 18.245 | 23.435 | 24.033 | 33.439
median(RMSE_k) | 8.596 | 13.132 | 15.903 | 14.109 | 15.147 | 30.517 | 15.467 | 20.365 | 20.691 | 30.517
mode(RMSE_k) | 0.643 | 0.711 | 0.927 | 0.743 | 0.950 | 1.324 | 1.170 | 1.139 | 1.121 | 1.324
stddev(RMSE_k) | 9.980 | 13.839 | 14.495 | 14.484 | 14.524 | 17.840 | 14.459 | 15.569 | 15.542 | 17.840
iqr(RMSE_k) | 7.816 | 11.087 | 13.380 | 12.476 | 13.201 | 23.419 | 13.054 | 15.405 | 14.280 | 23.419
mean(RMSE_k) | 31.439 | 41.811 | 44.331 | 43.711 | 44.219 | 66.280 | 44.321 | 54.030 | 54.156 | 66.280
median(RMSE_k) | 26.422 | 36.792 | 40.178 | 39.257 | 40.099 | 71.201 | 39.395 | 55.235 | 54.311 | 71.201
mode(RMSE_k) | 1.783 | 2.342 | 2.592 | 2.372 | 2.558 | 0.972 | 2.819 | 0.875 | 0.912 | 0.972
stddev(RMSE_k) | 20.198 | 25.924 | 26.514 | 26.496 | 26.480 | 30.443 | 26.659 | 26.183 | 26.001 | 30.443
iqr(RMSE_k) | 22.947 | 30.188 | 29.510 | 29.617 | 29.241 | 37.209 | 29.316 | 29.572 | 28.094 | 37.209
Table A4. Quality measures for the boxing (No. 5) sequence.
mean(RMSE_k) | 2.321 | 2.697 | 1.087 | 1.316 | 0.885 | 0.848 | 0.484 | 0.461 | 0.507 | 0.848
median(RMSE_k) | 2.036 | 2.476 | 1.001 | 1.173 | 0.783 | 0.666 | 0.276 | 0.317 | 0.322 | 0.666
mode(RMSE_k) | 0.505 | 0.309 | 0.270 | 0.303 | 0.218 | 0.036 | 0.043 | 0.035 | 0.034 | 0.036
stddev(RMSE_k) | 1.174 | 1.354 | 0.521 | 0.613 | 0.456 | 0.712 | 0.765 | 0.420 | 0.473 | 0.712
iqr(RMSE_k) | 1.449 | 1.769 | 0.504 | 0.705 | 0.542 | 0.709 | 0.307 | 0.341 | 0.490 | 0.709
mean(RMSE_k) | 2.295 | 3.030 | 1.341 | 1.458 | 1.070 | 2.818 | 0.797 | 1.309 | 1.519 | 2.818
median(RMSE_k) | 2.022 | 2.780 | 1.242 | 1.353 | 0.934 | 2.282 | 0.608 | 0.983 | 1.071 | 2.282
mode(RMSE_k) | 0.826 | 0.700 | 0.326 | 0.402 | 0.303 | 0.273 | 0.106 | 0.125 | 0.126 | 0.273
stddev(RMSE_k) | 1.161 | 1.308 | 0.541 | 0.606 | 0.549 | 1.965 | 0.819 | 0.930 | 1.249 | 1.965
iqr(RMSE_k) | 1.415 | 1.704 | 0.736 | 0.732 | 0.494 | 2.153 | 0.491 | 1.038 | 1.333 | 2.153
mean(RMSE_k) | 3.211 | 4.183 | 3.609 | 3.109 | 3.306 | 11.957 | 3.262 | 6.083 | 7.562 | 11.957
median(RMSE_k) | 2.546 | 3.503 | 2.661 | 2.500 | 2.634 | 10.384 | 2.736 | 5.271 | 6.318 | 10.384
mode(RMSE_k) | 0.699 | 1.102 | 0.699 | 0.538 | 0.480 | 0.444 | 0.542 | 0.513 | 0.546 | 0.444
stddev(RMSE_k) | 2.460 | 2.788 | 3.404 | 2.614 | 2.747 | 7.236 | 2.235 | 3.802 | 4.994 | 7.236
iqr(RMSE_k) | 1.743 | 1.595 | 1.968 | 1.540 | 2.062 | 9.821 | 2.059 | 5.074 | 7.302 | 9.821
mean(RMSE_k) | 8.175 | 13.241 | 17.438 | 15.384 | 17.357 | 31.336 | 17.608 | 23.779 | 26.421 | 31.336
median(RMSE_k) | 6.398 | 11.337 | 14.702 | 12.285 | 14.627 | 27.834 | 14.823 | 22.008 | 24.825 | 27.834
mode(RMSE_k) | 0.864 | 1.156 | 1.085 | 0.973 | 1.090 | 0.514 | 0.912 | 0.632 | 0.490 | 0.514
stddev(RMSE_k) | 5.837 | 9.220 | 12.123 | 11.372 | 12.169 | 18.465 | 12.075 | 14.128 | 14.876 | 18.465
iqr(RMSE_k) | 6.261 | 12.033 | 16.415 | 16.672 | 16.414 | 25.577 | 16.361 | 18.637 | 19.008 | 25.577
mean(RMSE_k) | 36.693 | 54.330 | 64.743 | 63.732 | 64.805 | 61.507 | 65.477 | 56.493 | 57.666 | 61.507
median(RMSE_k) | 33.631 | 50.764 | 61.017 | 60.170 | 61.057 | 60.782 | 62.218 | 55.492 | 57.030 | 60.782
mode(RMSE_k) | 4.592 | 9.116 | 10.042 | 9.788 | 9.974 | 8.998 | 10.077 | 8.616 | 8.609 | 8.998
stddev(RMSE_k) | 21.768 | 26.954 | 29.620 | 29.819 | 29.609 | 20.171 | 29.798 | 22.097 | 21.740 | 20.171
iqr(RMSE_k) | 21.992 | 31.408 | 36.039 | 35.205 | 36.075 | 24.945 | 36.485 | 29.814 | 28.505 | 24.945
Table A5. Quality measures for the falling (No. 6) sequence.
mean(RMSE_k) | 15.455 | 15.022 | 7.818 | 8.772 | 6.166 | 3.827 | 0.994 | 1.851 | 1.968 | 3.827
median(RMSE_k) | 13.186 | 13.571 | 6.947 | 8.341 | 5.616 | 2.359 | 0.618 | 1.107 | 1.145 | 2.359
mode(RMSE_k) | 2.760 | 3.139 | 2.310 | 2.880 | 2.110 | 0.244 | 0.105 | 0.145 | 0.149 | 0.244
stddev(RMSE_k) | 11.270 | 8.163 | 3.494 | 3.852 | 2.555 | 4.023 | 1.138 | 2.039 | 2.551 | 4.023
iqr(RMSE_k) | 9.203 | 10.174 | 3.101 | 4.009 | 3.520 | 3.723 | 0.789 | 1.795 | 1.813 | 3.723
mean(RMSE_k) | 16.206 | 16.199 | 8.940 | 10.261 | 7.897 | 10.937 | 3.694 | 5.981 | 6.392 | 10.937
median(RMSE_k) | 14.108 | 14.897 | 8.130 | 9.530 | 7.106 | 7.613 | 2.089 | 3.687 | 4.319 | 7.613
mode(RMSE_k) | 4.659 | 2.388 | 2.143 | 1.822 | 2.821 | 0.953 | 0.339 | 0.756 | 0.832 | 0.953
stddev(RMSE_k) | 8.511 | 7.184 | 6.211 | 5.915 | 4.455 | 9.596 | 4.383 | 6.169 | 6.567 | 9.596
iqr(RMSE_k) | 9.496 | 8.219 | 4.321 | 5.520 | 3.642 | 10.133 | 3.402 | 5.298 | 5.154 | 10.133
mean(RMSE_k) | 28.149 | 30.795 | 32.292 | 30.356 | 30.213 | 43.423 | 28.314 | 29.543 | 31.603 | 43.423
median(RMSE_k) | 18.927 | 18.873 | 16.214 | 15.491 | 14.312 | 29.262 | 14.172 | 17.724 | 19.705 | 29.262
mode(RMSE_k) | 5.585 | 3.883 | 3.061 | 4.112 | 2.789 | 4.507 | 1.345 | 2.710 | 2.615 | 4.507
stddev(RMSE_k) | 25.916 | 29.395 | 37.233 | 35.345 | 35.587 | 42.053 | 35.660 | 31.333 | 31.914 | 42.053
iqr(RMSE_k) | 15.417 | 17.126 | 31.871 | 21.094 | 29.043 | 38.828 | 27.009 | 24.239 | 28.168 | 38.828
mean(RMSE_k) | 55.641 | 72.172 | 81.983 | 76.503 | 81.814 | 104.667 | 81.794 | 76.620 | 81.261 | 104.667
median(RMSE_k) | 42.728 | 57.532 | 66.468 | 62.277 | 66.311 | 86.119 | 68.990 | 59.710 | 66.757 | 86.119
mode(RMSE_k) | 7.967 | 8.688 | 10.247 | 7.912 | 9.749 | 7.809 | 9.146 | 6.892 | 7.449 | 7.809
stddev(RMSE_k) | 43.593 | 53.283 | 57.286 | 57.268 | 58.033 | 69.796 | 55.712 | 52.857 | 54.185 | 69.796
iqr(RMSE_k) | 52.533 | 72.029 | 82.980 | 85.218 | 86.618 | 85.060 | 92.529 | 71.859 | 75.211 | 85.060
mean(RMSE_k) | 168.542 | 199.701 | 214.118 | 209.626 | 213.731 | 198.998 | 212.497 | 165.908 | 161.330 | 198.998
median(RMSE_k) | 145.399 | 185.636 | 190.954 | 187.446 | 191.565 | 196.704 | 189.560 | 169.458 | 163.676 | 196.704
mode(RMSE_k) | 43.924 | 47.226 | 58.636 | 49.156 | 60.128 | 38.386 | 60.706 | 33.396 | 32.207 | 38.386
stddev(RMSE_k) | 92.157 | 103.406 | 108.703 | 110.129 | 108.684 | 94.898 | 108.542 | 77.915 | 77.491 | 94.898
iqr(RMSE_k) | 102.432 | 114.007 | 119.186 | 116.515 | 119.440 | 153.708 | 117.857 | 121.289 | 120.480 | 153.708

Appendix B. Correlations between RMSE and Sequence Parameters

Table A6. Correlation between RMSE and entropy of input sequence.
Table A7. Correlation between RMSE and standard deviation of input sequence.
Table A8. Correlation between RMSE and velocity of input sequence.
Table A9. Correlation between RMSE and acceleration of input sequence.
Table A10. Correlation between RMSE and jerk of input sequence.
Table A11. Correlation between RMSE and monotonicity of input sequence.
Table A12. Correlation between RMSE and complexity of input sequence.


References

  1. Kitagawa, M.; Windsor, B. MoCap for Artists: Workflow and Techniques for Motion Capture; Elsevier: Amsterdam, The Netherlands; Focal Press: Boston, MA, USA, 2008.
  2. Menache, A. Understanding Motion Capture for Computer Animation, 2nd ed.; Morgan Kaufmann: Burlington, MA, USA, 2011.
  3. Mündermann, L.; Corazza, S.; Andriacchi, T.P. The evolution of methods for the capture of human movement leading to markerless motion capture for biomechanical applications. J. Neuroeng. Rehabil. 2006, 3, 6.
  4. Szczęsna, A.; Błaszczyszyn, M.; Pawlyta, M. Optical motion capture dataset of selected techniques in beginner and advanced Kyokushin karate athletes. Sci. Data 2021, 8, 13.
  5. Świtoński, A.; Mucha, R.; Danowski, D.; Mucha, M.; Polański, A.; Cieślar, G.; Wojciechowski, K.; Sieroń, A. Diagnosis of the motion pathologies based on a reduced kinematical data of a gait. Przegląd Elektrotechniczny 2011, 87, 173–176.
  6. Lachor, M.; Świtoński, A.; Boczarska-Jedynak, M.; Kwiek, S.; Wojciechowski, K.; Polański, A. The Analysis of Correlation between MOCAP-Based and UPDRS-Based Evaluation of Gait in Parkinson’s Disease Patients. In Brain Informatics and Health; Ślęzak, D., Tan, A.H., Peters, J.F., Schwabe, L., Eds.; Number 8609 in Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2014; pp. 335–344.
  7. Josinski, H.; Świtoński, A.; Stawarz, M.; Mucha, R.; Wojciechowski, K. Evaluation of rehabilitation progress of patients with osteoarthritis of the hip, osteoarthritis of the spine or after stroke using gait indices. Przegląd Elektrotechniczny 2013, 89, 279–282.
  8. Windolf, M.; Götzen, N.; Morlock, M. Systematic accuracy and precision analysis of video motion capturing systems—Exemplified on the Vicon-460 system. J. Biomech. 2008, 41, 2776–2780.
  9. Jensenius, A.; Nymoen, K.; Skogstad, S.; Voldsund, A. A Study of the Noise-Level in Two Infrared Marker-Based Motion Capture Systems. In Proceedings of the 9th Sound and Music Computing Conference, SMC 2012, Copenhagen, Denmark, 11–14 July 2012; pp. 258–263.
  10. Skurowski, P.; Pawlyta, M. On the Noise Complexity in an Optical Motion Capture Facility. Sensors 2019, 19, 4435.
  11. Skurowski, P.; Pawlyta, M. Functional Body Mesh Representation, A Simplified Kinematic Model, Its Inference and Applications. Appl. Math. Inf. Sci. 2016, 10, 71–82.
  12. Herda, L.; Fua, P.; Plankers, R.; Boulic, R.; Thalmann, D. Skeleton-based motion capture for robust reconstruction of human motion. In Proceedings of Computer Animation 2000, Philadelphia, PA, USA, 3–5 May 2000; pp. 77–83, ISSN: 1087-4844.
  13. Aristidou, A.; Lasenby, J. Real-time marker prediction and CoR estimation in optical motion capture. Vis. Comput. 2013, 29, 7–26.
  14. Perepichka, M.; Holden, D.; Mudur, S.P.; Popa, T. Robust Marker Trajectory Repair for MOCAP using Kinematic Reference. In Motion, Interaction and Games; Association for Computing Machinery: New York, NY, USA, 2019; MIG’19; pp. 1–10.
  15. Lee, J.; Shin, S.Y. A hierarchical approach to interactive motion editing for human-like figures. In Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques, Los Angeles, CA, USA, 8–13 August 1999; ACM Press/Addison-Wesley Publishing Co.: New York, NY, USA, 1999; pp. 39–48.
  16. Howarth, S.J.; Callaghan, J.P. Quantitative assessment of the accuracy for three interpolation techniques in kinematic analysis of human movement. Comput. Methods Biomech. Biomed. Eng. 2010, 13, 847–855.
  17. Reda, H.E.A.; Benaoumeur, I.; Kamel, B.; Zoubir, A.F. MoCap systems and hand movement reconstruction using cubic spline. In Proceedings of the 2018 5th International Conference on Control, Decision and Information Technologies (CoDIT), Thessaloniki, Greece, 10–13 April 2018; pp. 1–5.
  18. Liu, G.; McMillan, L. Estimation of missing markers in human motion capture. Vis. Comput. 2006, 22, 721–728.
  19. Lai, R.Y.Q.; Yuen, P.C.; Lee, K.K.W. Motion Capture Data Completion and Denoising by Singular Value Thresholding. In Eurographics 2011—Short Papers; Avis, N., Lefebvre, S., Eds.; The Eurographics Association: Geneve, Switzerland, 2011.
  20. Gløersen, Ø.; Federolf, P. Predicting Missing Marker Trajectories in Human Motion Data Using Marker Intercorrelations. PLoS ONE 2016, 11, e0152616.
  21. Tits, M.; Tilmanne, J.; Dutoit, T. Robust and automatic motion-capture data recovery using soft skeleton constraints and model averaging. PLoS ONE 2018, 13, e0199744.
  22. Piazza, T.; Lundström, J.; Kunz, A.; Fjeld, M. Predicting Missing Markers in Real-Time Optical Motion Capture. In Modelling the Physiological Human; Magnenat-Thalmann, N., Ed.; Number 5903 in Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2009; pp. 125–136.
  23. Wu, Q.; Boulanger, P. Real-Time Estimation of Missing Markers for Reconstruction of Human Motion. In Proceedings of the 2011 XIII Symposium on Virtual Reality, Uberlandia, Brazil, 23–26 May 2011; pp. 161–168.
  24. Li, L.; McCann, J.; Pollard, N.S.; Faloutsos, C. DynaMMo: Mining and summarization of coevolving sequences with missing values. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; Association for Computing Machinery: New York, NY, USA, 2009; pp. 507–516.
  25. Li, L.; McCann, J.; Pollard, N.; Faloutsos, C. BoLeRO: A Principled Technique for Including Bone Length Constraints in Motion Capture Occlusion Filling. In Proceedings of the 2010 ACM SIGGRAPH/Eurographics Symposium on Computer Animation; Eurographics Association: Aire-la-Ville, Switzerland, 2010; pp. 179–188.
  26. Burke, M.; Lasenby, J. Estimating missing marker positions using low dimensional Kalman smoothing. J. Biomech. 2016, 49, 1854–1858.
  27. Wang, Z.; Liu, S.; Qian, R.; Jiang, T.; Yang, X.; Zhang, J.J. Human motion data refinement unitizing structural sparsity and spatial-temporal information. In Proceedings of the IEEE 13th International Conference on Signal Processing (ICSP), Chengdu, China, 6–10 November 2017; pp. 975–982.
  28. Aristidou, A.; Cohen-Or, D.; Hodgins, J.K.; Shamir, A. Self-similarity Analysis for Motion Capture Cleaning. Comput. Graph. Forum 2018, 37, 297–309.
  29. Zhang, X.; van de Panne, M. Data-driven autocompletion for keyframe animation. In Proceedings of the 11th Annual International Conference on Motion, Interaction, and Games, New York, NY, USA, 8–10 November 2018; Association for Computing Machinery: New York, NY, USA, 2018; pp. 1–11.
  30. Hornik, K. Approximation capabilities of multilayer feedforward networks. Neural Netw. 1991, 4, 251–257.
  31. Fragkiadaki, K.; Levine, S.; Felsen, P.; Malik, J. Recurrent Network Models for Human Dynamics. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 4346–4354, ISSN: 2380-7504.
  32. Harvey, F.G.; Yurick, M.; Nowrouzezahrai, D.; Pal, C. Robust motion in-betweening. ACM Trans. Graph. 2020, 39, 60:1–60:12.
  33. Mall, U.; Lal, G.R.; Chaudhuri, S.; Chaudhuri, P. A Deep Recurrent Framework for Cleaning Motion Capture Data. arXiv 2017, arXiv:1712.03380.
  34. Kucherenko, T.; Beskow, J.; Kjellström, H. A Neural Network Approach to Missing Marker Reconstruction in Human Motion Capture. arXiv 2018, arXiv:1803.02665.
  35. Holden, D. Robust solving of optical motion capture data by denoising. ACM Trans. Graph. 2018, 37, 165:1–165:12.
  36. Ji, L.; Liu, R.; Zhou, D.; Zhang, Q.; Wei, X. Missing Data Recovery for Human Mocap Data Based on A-LSTM and LS Constraint. In Proceedings of the 2020 IEEE 5th International Conference on Signal and Image Processing (ICSIP), Nanjing, China, 23–25 October 2020; pp. 729–734.
  37. Kaufmann, M.; Aksan, E.; Song, J.; Pece, F.; Ziegler, R.; Hilliges, O. Convolutional Autoencoders for Human Motion Infilling. arXiv 2020, arXiv:2010.11531.
  38. Torres, J.F.; Hadjout, D.; Sebaa, A.; Martínez-Álvarez, F.; Troncoso, A. Deep Learning for Time Series Forecasting: A Survey. Big Data 2021, 9, 3–21.
  39. Shahid, F.; Zameer, A.; Muneeb, M. Predictions for COVID-19 with deep learning models of LSTM, GRU and Bi-LSTM. Chaos Solitons Fractals 2020, 140, 110212.
  40. Siami-Namini, S.; Tavakoli, N.; Namin, A.S. The Performance of LSTM and BiLSTM in Forecasting Time Series. In Proceedings of the 2019 IEEE International Conference on Big Data (Big Data), Los Angeles, CA, USA, 9–12 December 2019; pp. 3285–3292.
  41. Czekalski, P.; Łyp, K. Neural network structure optimization in pattern recognition. Stud. Inform. 2014, 35, 17–32.
  42. Srebro, N.; Jaakkola, T. Weighted low-rank approximations. In Proceedings of the Twentieth International Conference on International Conference on Machine Learning, Washington, DC, USA, 21–24 August 2003; AAAI Press: Washington, DC, USA, 2003; pp. 720–727.
Figure 1. Stages of the motion capture pipeline: actor (a); registered markers (b); body mesh (c); mesh matched skeleton (d).
Figure 1. Stages of the motion capture pipeline: actor (a); registered markers (b); body mesh (c); mesh matched skeleton (d).
Sensors 21 06115 g001
Figure 2. Outline of the body model (a), and corresponding parts hierarchy annotated with parents and siblings (b).
Figure 2. Outline of the body model (a), and corresponding parts hierarchy annotated with parents and siblings (b).
Sensors 21 06115 g002
Figure 3. Schematic of FFNN.
Figure 3. Schematic of FFNN.
Sensors 21 06115 g003
Figure 4. Usage of recurrent NNs in sequence to sequence task: (a) folded, (b) unfolded unidirectional variant, (c) unfolded bidirectional variant.
Figure 4. Usage of recurrent NNs in sequence to sequence task: (a) folded, (b) unfolded unidirectional variant, (c) unfolded bidirectional variant.
Sensors 21 06115 g004
Figure 5. LSTM (left) and GRU (right) neurons in detail.
Figure 5. LSTM (left) and GRU (right) neurons in detail.
Sensors 21 06115 g005
Figure 6. Proposed RNN-FC architecture for the regression task.
Figure 6. Proposed RNN-FC architecture for the regression task.
Sensors 21 06115 g006
Figure 7. Results for most of the quality measures for all the test sequences. Bars denote R M S E ; for R M S E k : ⋄ denotes mean value, × denotes median, ∘ denotes mode, whiskers indicate IQR; standard deviation is not depicted here; dash-outlined areas are zoomed in Figure 8.
Figure 7. Results for most of the quality measures for all the test sequences. Bars denote R M S E ; for R M S E k : ⋄ denotes mean value, × denotes median, ∘ denotes mode, whiskers indicate IQR; standard deviation is not depicted here; dash-outlined areas are zoomed in Figure 8.
Sensors 21 06115 g007
Figure 8. Results for most of the quality measures for all the test sequences (zoomed variant for gaps 10, 20, and 50). Bars denote RMSE; for RMSE_k: ⋄ denotes mean value, × denotes median, ∘ denotes mode, whiskers indicate IQR; standard deviation is not depicted here.
Figure 9. Influence of training sequence length on the quality of obtained results for NN methods: Akaike Information Criterion (AIC), Bayesian Information Criterion (BIC) and MSE.
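Figure 9 compares the NN methods by AIC, BIC and MSE. The sketch below shows one way such criteria can be derived from an MSE value, assuming the common least-squares (Gaussian residual) forms AIC = n·ln(MSE) + 2k and BIC = n·ln(MSE) + k·ln(n), both up to an additive constant. The exact variant used for the figure is not given in this section, and the function names and sample numbers are illustrative only.

```python
import math


def aic_from_mse(mse: float, n_samples: int, n_params: int) -> float:
    """AIC of a least-squares fit, up to an additive constant (assumed form)."""
    return n_samples * math.log(mse) + 2 * n_params


def bic_from_mse(mse: float, n_samples: int, n_params: int) -> float:
    """BIC of a least-squares fit, up to an additive constant (assumed form)."""
    return n_samples * math.log(mse) + n_params * math.log(n_samples)


if __name__ == "__main__":
    # Hypothetical values: identical error, different numbers of parameters.
    n_samples, mse = 5000, 4.0
    for label, n_params in [("smaller model", 275), ("larger model", 22023)]:
        aic = aic_from_mse(mse, n_samples, n_params)
        bic = bic_from_mse(mse, n_samples, n_params)
        print(f"{label}: AIC={aic:.1f}, BIC={bic:.1f}")
```

With equal MSE, both criteria favor the model with fewer learnable parameters, which is why they are informative alongside the raw error when the training data are scarce.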
Table 1. List of mocap sequence scenarios used for the testing.
No. | Name | Scenario | Duration | Difficulty
1 | Static | Actor stands in the middle of the scene, looking around and shifting from one foot to another, freely swinging arms | 32 s | varied motions
2 | Walking | Actor stands still at the edge of the scene, then walks straight for 6 m, then stands still | 7 s | low dynamics, easy
3 | Running | Actor stands in the middle of the scene, then goes backwards to the edge of the scene and runs for 6 m, then goes backwards to the middle of the scene | 16 s | moderate dynamics
4 | Sitting | Actor stands in the middle of the scene, then sits on a stool, and, after a few seconds, stands again | 15 s | occlusions
5 | Boxing | Actor stands in the middle of the scene, and performs some fast boxing punches | 14 s | high dynamics
6 | Falling | Actor stands on a 0.5 m elevation in the middle of the scene, then walks to the edge of the platform, then falls on the mattress, lies for 2 s and stands | 16 s | high dynamics, occlusions
Table 2. Input sequence characteristics.
No | Entropy $H(X)$ [bits/marker] | Stddev $\sigma_X$ [mm/coordinate] | Velocity $\partial X/\partial t$ [m/s/marker] | Acc. $\partial^2 X/\partial t^2$ [m/s²/marker] | Jerk $\partial^3 X/\partial t^3$ [m/s³/marker] | Monotonicity [-] | Complexity [-]
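Table 2 characterizes each input sequence by, among others, velocity, acceleration and jerk, i.e., the first, second and third time derivatives of marker position. A minimal sketch of such an estimate for a single marker follows, assuming uniform sampling and simple forward differences. The estimator and normalization actually used in the article are not given in this section; `fps`, the function names, and the sample trajectory are hypothetical.

```python
import math


def finite_diff(series, dt):
    """Forward difference of a list of 3D vectors sampled every dt seconds."""
    return [
        tuple((b - a) / dt for a, b in zip(p, q))
        for p, q in zip(series, series[1:])
    ]


def mean_magnitude(vectors):
    """Mean Euclidean norm of a list of 3D vectors."""
    return sum(math.sqrt(sum(c * c for c in v)) for v in vectors) / len(vectors)


def kinematic_summary(positions, fps):
    """Mean velocity, acceleration and jerk magnitudes for one marker.

    positions: list of (x, y, z) in metres, at least 4 samples.
    fps: sampling rate in frames per second (assumed uniform).
    """
    dt = 1.0 / fps
    velocity = finite_diff(positions, dt)
    acceleration = finite_diff(velocity, dt)
    jerk = finite_diff(acceleration, dt)
    return {
        "velocity": mean_magnitude(velocity),
        "acceleration": mean_magnitude(acceleration),
        "jerk": mean_magnitude(jerk),
    }


if __name__ == "__main__":
    # Hypothetical marker moving along x at constant speed.
    trajectory = [(0.5 * t, 0.0, 0.0) for t in range(6)]
    print(kinematic_summary(trajectory, fps=2))
```

For a marker moving at constant speed, the acceleration and jerk terms vanish, so higher values in those columns point to more abrupt motion.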
Table 3. Quality measures for the static (No. 1) sequence.
mean(RMSE_k) | 3.280 | 4.869 | 2.175 | 2.290 | 1.708 | 0.971 | 0.243 | 0.468 | 0.512 | 0.971
median(RMSE_k) | 2.746 | 4.399 | 2.035 | 2.120 | 1.614 | 0.893 | 0.205 | 0.406 | 0.391 | 0.893
mode(RMSE_k) | 0.993 | 1.821 | 0.626 | 0.861 | 0.455 | 0.099 | 0.000 | 0.045 | 0.036 | 0.099
stddev(RMSE_k) | 1.893 | 2.209 | 0.939 | 0.989 | 0.573 | 0.695 | 0.216 | 0.336 | 0.458 | 0.695
iqr(RMSE_k) | 2.123 | 2.905 | 0.881 | 0.901 | 0.684 | 0.692 | 0.235 | 0.370 | 0.434 | 0.692
mean(RMSE_k) | 3.187 | 4.775 | 2.371 | 2.351 | 1.903 | 2.694 | 0.933 | 1.525 | 1.738 | 2.694
median(RMSE_k) | 2.828 | 4.709 | 2.274 | 2.235 | 1.779 | 2.147 | 0.764 | 1.251 | 1.287 | 2.147
mode(RMSE_k) | 0.605 | 0.584 | 0.540 | 0.381 | 0.415 | 0.052 | 0.005 | 0.026 | 0.023 | 0.052
stddev(RMSE_k) | 1.442 | 1.871 | 0.891 | 0.898 | 0.826 | 1.831 | 0.664 | 1.045 | 1.483 | 1.831
iqr(RMSE_k) | 1.841 | 2.394 | 1.103 | 1.013 | 0.813 | 1.983 | 0.866 | 1.173 | 1.437 | 1.983
mean(RMSE_k) | 3.401 | 5.434 | 4.233 | 3.445 | 3.958 | 9.207 | 4.572 | 6.027 | 6.573 | 9.207
median(RMSE_k) | 2.906 | 5.154 | 3.776 | 3.118 | 3.496 | 8.733 | 3.888 | 5.512 | 5.733 | 8.733
mode(RMSE_k) | 1.326 | 1.393 | 0.831 | 1.066 | 1.000 | 1.169 | 0.400 | 0.800 | 0.793 | 1.169
stddev(RMSE_k) | 1.688 | 2.168 | 2.430 | 1.921 | 2.448 | 4.464 | 2.852 | 3.174 | 3.764 | 4.464
iqr(RMSE_k) | 1.421 | 2.216 | 2.169 | 1.642 | 2.282 | 6.078 | 2.418 | 3.770 | 4.373 | 6.078
mean(RMSE_k) | 4.233 | 7.134 | 9.460 | 6.721 | 9.302 | 21.812 | 11.236 | 13.587 | 16.108 | 21.812
median(RMSE_k) | 3.658 | 6.329 | 8.333 | 5.953 | 8.198 | 21.129 | 10.345 | 12.875 | 14.785 | 21.129
mode(RMSE_k) | 1.517 | 2.252 | 1.377 | 1.465 | 1.400 | 3.266 | 2.546 | 1.986 | 1.937 | 3.266
stddev(RMSE_k) | 2.132 | 3.143 | 5.114 | 3.692 | 5.230 | 11.305 | 5.472 | 6.825 | 9.556 | 11.305
iqr(RMSE_k) | 2.215 | 3.473 | 5.650 | 4.217 | 5.700 | 14.536 | 6.850 | 8.029 | 11.019 | 14.536
mean(RMSE_k) | 9.062 | 17.303 | 30.204 | 24.837 | 30.135 | 55.099 | 31.616 | 41.676 | 48.789 | 55.099
median(RMSE_k) | 8.683 | 16.200 | 28.352 | 22.655 | 28.462 | 49.641 | 29.914 | 38.410 | 42.155 | 49.641
mode(RMSE_k) | 2.404 | 3.973 | 5.523 | 4.263 | 5.010 | 8.510 | 6.518 | 6.459 | 6.033 | 8.510
stddev(RMSE_k) | 4.013 | 7.631 | 13.450 | 12.743 | 13.503 | 29.934 | 13.511 | 22.022 | 28.463 | 29.934
iqr(RMSE_k) | 5.084 | 9.413 | 18.231 | 16.895 | 18.436 | 48.864 | 17.125 | 36.315 | 46.222 | 48.864
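Each row of Table 3 is a descriptive statistic of the RMSE_k values for one method. The sketch below shows how the five statistics could be computed with the Python standard library. It assumes RMSE_k is available as a plain list of per-instance errors and estimates the mode as the center of the fullest histogram bin, since continuous values rarely repeat exactly. The mode estimator used in the article is not stated in this section; `bins`, the function names, and the sample values are hypothetical.

```python
import statistics


def histogram_mode(values, bins=10):
    """Center of the most populated equal-width bin (assumed mode estimator)."""
    lo, hi = min(values), max(values)
    if lo == hi:
        return lo
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # Clamp the maximum value into the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    best = counts.index(max(counts))
    return lo + (best + 0.5) * width


def summarize(values):
    """Mean, median, mode, sample standard deviation and IQR of the errors."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": histogram_mode(values),
        "stddev": statistics.stdev(values),
        "iqr": q3 - q1,
    }


if __name__ == "__main__":
    # Hypothetical per-instance errors, not taken from the article.
    rmse_k = [1.0, 2.0, 2.1, 2.2, 3.0, 4.0, 9.0]
    for name, value in summarize(rmse_k).items():
        print(f"{name}({'RMSE_k'}) = {value:.3f}")
```

A single large error, as in the sample list, pulls the mean and standard deviation up while leaving the median and IQR largely unchanged, which is one reason to report both families of statistics.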
Table 4. Number of learnable parameters for each NN type.
NN Type | Number of Learnable Parameters | Value for Exemplary Case
FFNN | hiddenLayerSize × inputvectorSize + hiddenLayerSize + 3 × hiddenLayerSize + 3 | 275
LSTM | 4 × hiddenRecurrentNeurons × inputvectorSize + 4 × hiddenRecurrentNeurons × hiddenRecurrentNeurons + 4 × hiddenRecurrentNeurons + 3 × hiddenRecurrentNeurons + 3 | 22,023
GRU | 3 × hiddenRecurrentNeurons × inputvectorSize + 3 × hiddenRecurrentNeurons × hiddenRecurrentNeurons + 3 × hiddenRecurrentNeurons + 3 × hiddenRecurrentNeurons + 3 | 16,563
BILSTM | 8 × hiddenRecurrentNeurons × inputvectorSize + 8 × hiddenRecurrentNeurons × hiddenRecurrentNeurons + 8 × hiddenRecurrentNeurons + 3 × 2 × hiddenRecurrentNeurons + 3 | 47,043
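The formulas in Table 4 can be evaluated directly. The sketch below transcribes them into Python. The sizes used (hiddenLayerSize = 8, inputvectorSize = 30, hiddenRecurrentNeurons = 60) are assumptions obtained by back-solving the tabulated FFNN, LSTM and GRU values; they are not stated in this section, and the function names are illustrative. The trailing `3 × … + 3` terms are read here as a three-output fully connected layer, which is also an interpretation.

```python
def ffnn_params(hidden_layer_size: int, input_vector_size: int) -> int:
    """FFNN row of Table 4: hidden weights + hidden biases + 3-output layer."""
    hidden = hidden_layer_size * input_vector_size + hidden_layer_size
    output = 3 * hidden_layer_size + 3
    return hidden + output


def recurrent_params(gates: int, hidden: int, input_size: int,
                     directions: int = 1) -> int:
    """Recurrent rows of Table 4.

    gates: 4 for LSTM, 3 for GRU; directions: 2 for the bidirectional variant.
    """
    per_direction = gates * (hidden * input_size + hidden * hidden + hidden)
    output = 3 * directions * hidden + 3
    return directions * per_direction + output


if __name__ == "__main__":
    # Assumed sizes, inferred from the tabulated values (not given in the text).
    h, n, r = 8, 30, 60
    print("FFNN  ", ffnn_params(h, n))
    print("LSTM  ", recurrent_params(4, r, n))
    print("GRU   ", recurrent_params(3, r, n))
    # With these assumed sizes the BILSTM formula as printed evaluates to
    # 44,043, whereas Table 4 lists 47,043.
    print("BILSTM", recurrent_params(4, r, n, directions=2))
```

Under these assumed sizes the FFNN, LSTM and GRU results agree with the "Value for Exemplary Case" column. The BILSTM formula as printed gives 44,043 rather than the tabulated 47,043, and the cause of that difference is not evident from this section.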
Table 5. Correlation between RMSE and sequence parameters (averaged for all gap sizes).
Table 6. Correlation between sequence parameters.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Skurowski, P.; Pawlyta, M. Gap Reconstruction in Optical Motion Capture Sequences Using Neural Networks. Sensors 2021, 21, 6115.