1. Introduction
Estimates of human motion for monitoring and analysis are useful in many circumstances. In sports and fitness applications, quantitative data on a person’s kinematics can help them train more effectively. Monitoring workers’ postures may help them avoid injuries. Rehabilitation may benefit from quantitative data on a patient’s capabilities, enabling therapists to customize training programs and observe progress. In addition, exoskeletons and prosthetics can benefit from a quantitative understanding of human motion, since they must work cooperatively with the body. Each of these applications is becoming more prevalent, resulting in a need for databases of accurate human motion and for methods of using sensors to estimate human kinematics.
The expanding need for accurate human motion analysis has resulted in a number of motion capture datasets using various techniques, such as optical motion capture, cameras, and wearable sensors [1,2,3,4,5,6,7]. Optical motion capture using markers is a widely accepted, popular methodology for ground-truth pose data. A widely used dataset is the CMU motion capture dataset collected with optical motion capture [3]. This dataset was the largest of its time in terms of duration and number of motions, containing more than 9 h of data from 144 participants. The KIT motion capture project [5] introduced what was then the largest optical motion capture dataset, containing 55 participants and approximately 11 h of data [8]. KIT is important because it focuses not only on human motion but also on interaction with objects and the environment.
The largest and most expansive motion capture dataset to date is the Archive of Motion Capture as Surface Shapes (AMASS), the most recent step towards much larger motion capture datasets for deep learning and computer vision. It “amasses” many of the previously mentioned motion capture datasets [1,2,3,5]. Since the marker placements of these datasets differ, it unifies them as surface shapes using the SMPL and MoSh algorithms [9,10]. It contains 43.6 h of motion capture data on 450 subjects, making it by far the largest publicly available motion capture dataset. However, it shares the same constraint as the other motion capture datasets: researchers instruct people to perform actions in areas designed and constrained by the researchers.
Another valuable recent dataset is the Human3.6M dataset [4], which uses both optical motion capture and other sensors. This dataset is heavily benchmarked for human motion prediction using deep learning [11,12,13,14,15,16,17].
While optical motion capture remains the gold standard for accuracy, motion capture has also been facilitated by advancements in camera-based pose estimation. The Kinect from Microsoft and OpenPose from CMU are two popular examples [18,19]. These systems do not require marker sets on the human body and can be used in a wide range of locations. However, their 3D pose accuracy is generally lower.
There are also motion capture solutions using inertial sensors, and these have several advantages [20,21]. Critically, they are usable in outdoor, real-world settings and are not limited to indoor, small-volume laboratory environments. This difference is essential because it allows for the collection of unscripted motion while people perform day-to-day activities. For example, inertial measurement units have been used for activity assessment [22], sports training [23], and monitoring neurological disorders [24]. Thus, inertial sensors can be used to create datasets of natural motion in real-world settings, like work environments and the outdoors. For example, Reference [25] collected inertial motion capture data in environments like parks, where the participants rode bikes and climbed jungle gyms. Notably, Reference [7] used multi-view cameras and XSens inertial sensors to create a non-marker-based motion capture dataset in a larger area than is possible with optical motion capture. However, despite the range of applications that inertial motion capture systems could benefit, the largest open datasets are from References [7,26], which total 90 min and 50 min, respectively, according to Reference [26].
Separate from datasets of human motion, another area of the literature concerns inferring full-body kinematics from a small number of inertial sensors (i.e., sparse inertial sensors). Full-body motion capture using inertial measurement units (IMUs) requires placing an IMU on each body segment, which typically means at least 15 sensors (including the head, hands, and feet). This can be cumbersome and prohibitively complex for most applications. Instead, using a much smaller number of sensors (1–7) and inferring the wearer’s activities or kinematics is a promising and more practical method.
There have been multiple previous approaches to full-body human motion inference based on sparse inertial sensors. Many groups have taken a hybrid approach, combining video data with inertial sensor data [27,28,29,30]. Most recently, in Reference [30], the authors combine inertial sensors with a single video camera: convolutional neural networks detect 2D poses [19], which are then lifted to 3D postures with the help of the inertial sensor data.
Depth cameras and optical motion capture have been used alongside inertial sensors as well [31,32]. In Reference [31], the authors use a Kinect and six inertial sensors for full-body motion inference. In Reference [32], the authors propose a real-time system that makes use of six sparse inertial sensors and five reflectors for optical motion capture. Beyond inertial sensors, human motion inference has also been studied using magnetic sensors [33,34], foot pressure sensors [35], accelerometers [36,37], and inertial-ultrasonic motion sensors [38].
Still other groups have inferred kinematics based solely on a small number of inertial sensors, using both machine learning and optimization algorithms. In Reference [39], the authors use Gaussian Processes to learn mappings between data from four inertial sensors, placed on the wrists and ankles, and optical motion capture data. In additional experiments, the authors estimate walking postures with even fewer sensors. However, in more recent works, Gaussian Processes have mostly been replaced by neural networks, which scale better with training data and generalize better to unseen data.
More recently, Reference [40] made use of densely connected neural networks to predict postures using five inertial sensors in several different configurations. Reference [40] used an XSens MVN Link suit, allowing them to test multiple configurations. However, they only have around 2 h of data collected in lab conditions, and they only predict a single posture at a time.
In Reference [25], the authors present an offline optimization method that minimizes orientation, acceleration, and anthropometric errors, jointly optimizing postures over multiple frames using six inertial sensors. One interesting finding from this paper is the necessity of an acceleration term, which significantly benefits posture approximation. They also use an anthropometric term to ensure the generated poses are human-like. Overall, they generate accurate human motion across a variety of postures. However, their method is computationally expensive and offline.
Most recently, Reference [26] presents an advancement on Reference [25] that uses deep learning instead of offline optimization. They use a two-layer bidirectional Long Short-Term Memory (LSTM) architecture to predict single postures using 20 past and 5 future frames. Their system uses six inertial sensors and runs in real time. One interesting point is that they use the AMASS dataset to generate synthetic inertial sensor data, which allows them to generate 45 h of training data. However, because they use synthetic data, they have to fine-tune on manually collected real inertial sensor data to perform well on their test set.
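For intuition, a windowed bidirectional recurrent predictor along these lines might look like the sketch below; the layer sizes, input features, and output head are our illustrative assumptions, not the implementation from Reference [26].

```python
import torch
import torch.nn as nn

class WindowedBiLSTM(nn.Module):
    """Predicts the pose at the current frame from a window of
    20 past frames, the current frame, and 5 future frames."""
    def __init__(self, n_inputs, n_outputs, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(n_inputs, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_outputs)

    def forward(self, window):        # window: (batch, 26, n_inputs)
        out, _ = self.lstm(window)    # out: (batch, 26, 2 * hidden)
        return self.head(out[:, 20])  # pose estimate at the current (21st) frame
```

Because the window includes five future frames, such a predictor incurs a small latency at test time, which is one motivation for the concurrent sequence-to-sequence framing we adopt later.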
Despite the previous work on human motion datasets and motion inference, there remain opportunities for improvement in both areas. Previous human motion datasets have been largely scripted, and may contain large amounts of data on uncommon motions while leaving out other behaviors, such as subconscious habitual motions. Thus, previous datasets may not be sufficient for applications in day-to-day life, where people move and live in real-world environments. To aid the study of exoskeletons, robotics, and human-computer interaction, there is a need for datasets that contain significant amounts of unscripted, natural human motion. A truly natural motion dataset would not be directed by a researcher, and would not be constrained to a highly limited environment containing only a limited number of objects with which people can interact.
Additionally, there remain opportunities for improvement in motion inference algorithms using sparse inertial sensors. With the field of deep learning growing rapidly, it is useful to understand how well new algorithms perform relative to previous works. Furthermore, the relative accuracy of different sensor configurations has received little prior study.
In this paper, we make three primary contributions. First, we present a new dataset of full-body kinematics recorded as regular people went about their lives and performed their jobs, which we refer to as the Virginia Tech Natural Motion Dataset. Second, we evaluate two deep learning algorithms and their variants for motion inference. Specifically, we conduct motion inference using Sequence-to-Sequence (Seq2Seq) and Transformer models, which have not previously been used for human motion inference, to understand their accuracy. Third, we compare the relative accuracy of motion inference using different sparse sensor configurations, and for the upper body alone versus the whole body.
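To make this sequence-to-sequence framing concrete, one training pair can be pictured as in the sketch below; the window length, segment counts, and quaternion orientation representation are illustrative assumptions rather than our exact preprocessing.

```python
import torch

T = 90           # frames per training sequence (illustrative window length)
N_SPARSE = 4     # instrumented segments in one hypothetical configuration
N_FULL = 23      # full-body segments (assumed XSens-style segment count)

# Input: per frame, an orientation (quaternion, 4 values) and an
# acceleration (3 values) for each sparse segment.
x = torch.randn(T, N_SPARSE * (4 + 3))
# Target: a concurrent orientation for every body segment.
y = torch.randn(T, N_FULL * 4)
```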
3. Results and Discussion
In this section, we first describe our dataset, and then provide results on the performance of our algorithms. We describe quantitative results for our algorithms, then show qualitative results to visualize postures predicted by the models.
3.1. Dataset
Our dataset of natural human motion contains 40.6 h of data, which is more than any other dataset we know of apart from AMASS [8], which “amasses” data from a multitude of different datasets. Additionally, to our knowledge, it is the only dataset containing natural human motion and the only dataset with the kinematics of manual material handlers in a retail environment.
Table 1 gives an overview of our dataset in comparison to two other large datasets, and Table A1 in Appendix A provides additional details. Our dataset includes 35.1 million frames, with 17 subjects of ages 20–58.
3.2. Evaluation Overview
To understand the performance of our models, we conducted studies using quantitative and qualitative evaluations. The quantitative evaluation was in two stages. The first stage covers the performance on the two test sets (regular test set and special test set) using the mean angle difference metric described in Section 2.4.4. We ran these benchmarks on the two test sets a single time, after tuning the model hyperparameters on the separate validation set. We then evaluate each model using histograms of the mean angle difference metric to understand the performance better. For qualitative evaluation, we view various representative postures to determine how well the model generalizes and where it fails. Example postures are taken from both of the test sets and from the validation set.
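As an illustration of the kind of computation involved, a minimal sketch of a mean angle difference between two sets of orientations is shown below. It assumes unit-quaternion orientations and should be read as a sketch of the idea, not as the exact metric defined in Section 2.4.4.

```python
import numpy as np

def mean_angle_difference_deg(q_pred, q_true):
    """Mean geodesic angle (in degrees) between predicted and ground-truth
    segment orientations, given unit quaternions of shape (frames, segments, 4)."""
    # The absolute dot product handles the q / -q double cover of rotations.
    dots = np.clip(np.abs(np.sum(q_pred * q_true, axis=-1)), 0.0, 1.0)
    return np.degrees(2.0 * np.arccos(dots)).mean()
```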
3.3. Quantitative Evaluation
We present the model performance on our regular test set for upper-body inference with the four configurations in Table 2, and the model performance for full-body inference in Table 3. Overall, the models performed fairly similarly, resulting in mean angular differences of around 10–15° with respect to the ground truth.
Note that the output of the model includes the known segments, mainly for convenience during training and when performing forward kinematics. This was also done by other groups in previous work [25,26]. However, the models did not predict these known segments with perfect accuracy: the mean angular differences for the known segments combined were small but nonzero across the different configurations. Thus, the inclusion of these segments slightly reduces the mean angular difference reported in this section. We present the results including these values because the models were trained and optimized using all segments, including the known ones.
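A hypothetical helper for reporting the metric with and without the known segments is sketched below, reusing mean_angle_difference_deg from the earlier sketch; the known-segment indices are illustrative assumptions, not our actual segment ordering.

```python
import numpy as np

# Illustrative indices of the known (input) segments in the output ordering.
KNOWN_SEGMENTS = np.array([0, 4, 8])

def split_metric(q_pred, q_true, known=KNOWN_SEGMENTS):
    """Report the mean angle difference over all, inferred-only, and
    known-only segments, given arrays of shape (frames, segments, 4)."""
    unknown = np.setdiff1d(np.arange(q_pred.shape[1]), known)
    return {
        "all": mean_angle_difference_deg(q_pred, q_true),
        "inferred_only": mean_angle_difference_deg(q_pred[:, unknown], q_true[:, unknown]),
        "known_only": mean_angle_difference_deg(q_pred[:, known], q_true[:, known]),
    }
```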
Comparing the models to each other, different models perform best in different configurations. For upper-body inference, the Transformer Encoder performs best in Configurations 1 and 4 and is nearly the best in Configuration 3. However, it is around 3° worse than the other models for Configuration 2, which could potentially be the result of a local minimum during training and validation. The other three models perform nearly as well as the Transformer Encoder across all configurations.
For full-body inference, the Transformer model is the best with Configurations 1 and 3, and nearly the best in Configuration 2. However, it is almost 5° worse than the other models for Configuration 4. Again, the remaining models perform almost as well as the best one in each configuration, with the exception of the Transformer Encoder, which is noticeably worse than the others for Configuration 3.
Overall, Configuration 2 has better accuracy than the other configurations for both full-body and upper-body motion inference, across all the models. It is interesting to compare Configuration 2 to Configuration 1, which was used in previous works (References [25,26,30]); Configuration 2 uses a sensor on the sternum instead of on the head. Configuration 2 performs 1.5–3° better than Configuration 1 across all models, suggesting that other studies could improve their accuracy by choosing different sensor locations.
Surprisingly, Configuration 3 provides results for upper-body and full-body motion inference very similar to those from Configuration 1. Configuration 1 uses sensors on both the pelvis and head, while Configuration 3 instead uses one sensor on the sternum. Thus, Configuration 3 uses one fewer sensor than Configurations 1 and 2, for a total of only three sensors in the upper-body scenario or five in the full-body scenario. In comparison, Configuration 4 also uses one fewer sensor location (but uses the pelvis instead of the sternum) and displays accuracy that is lower by 2–5° on average compared to the other configurations. Even with this lower accuracy, the mean angular error remains modest for both the upper body and the full body for most models, which is sufficient for many applications.
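For reference, the following mapping summarizes our reading of the four upper-body sensor configurations as discussed in this section; the exact sensor sets are defined in Section 2, so this mapping should be treated as an assumption, and the full-body scenario is assumed to add lower-leg sensors to each configuration.

```python
# Assumed upper-body sensor sets for the four configurations discussed here.
UPPER_BODY_CONFIGS = {
    1: ["pelvis", "head", "left_wrist", "right_wrist"],     # used in prior work [25,26,30]
    2: ["pelvis", "sternum", "left_wrist", "right_wrist"],  # sternum replaces head
    3: ["sternum", "left_wrist", "right_wrist"],            # one fewer sensor, sternum root
    4: ["pelvis", "left_wrist", "right_wrist"],             # one fewer sensor, pelvis root
}
```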
As mentioned previously, we also conducted a quantitative evaluation using a special test set containing data from participant 13, who performed activities absent from our training data: napping and physical exercises, such as Frankensteins, burpees, pushups, jumping jacks, and jogging. We present model performance on the special test set for upper-body inference in Table 4, and the results for full-body inference in Table 5.
Compared to the regular test set, the special test set has accuracies that are 7–10° worse for the upper-body scenario, and ∼10° worse for the full-body scenario. Worse accuracies are expected here, as the special test set contains unseen motions that we were certain were not in our training data, but these values are promising. Adding dynamic motions to the dataset would likely improve them.
As with the regular test set, Configuration 2 results in the best accuracy. Configuration 3 again has accuracies several degrees worse than Configuration 2, and Configuration 4 is even worse, although by smaller amounts than with the regular test set.
With upper-body inference, the Seq2Seq model with the bidirectional encoder and attention performs about as well as the Transformer across all configurations. With full-body inference, the Seq2Seq model with the bidirectional encoder and attention seems to perform best overall. The Transformer does particularly poorly in Configuration 4 for the full body, around 5° worse than the other models.
Compressing the performance of different models to a single number can be misleading, so we also report histograms of sequence angular error in degrees, for upper-body motion inference in Figure 5 and for full-body motion inference in Figure 6. We also generated histograms using the special test set, and these are available in Appendix B.
For the upper-body scenario, it is noticeable that for each model, except the Transformer Encoder, the end of the distribution’s tail drops off faster when using Configuration 2 as compared to the other configurations. Configuration 4 has the worst performance for each model and the longest tail in each case. It is also evident from the histograms that Configuration 3 performs better than Configuration 4, again supporting the conclusion that the sternum is more informative as a root segment for upper-body motion inference than the pelvis.
For full-body inference, the end of the tail in Configuration 2 again drops off faster compared to the other configurations, although the effect is not as pronounced as with the upper body. The center of the distributions for Configuration 2 is also lower, in agreement with Table 3. Overall, the different configurations have histograms that are much more similar than they are for upper-body inference. However, Configuration 4 performs noticeably worse when using the Transformer (Figure 6d). Configurations 3 and 4 use the same number of sensors, but Configuration 3 has a shorter tail for most models, the Transformer Encoder excepted.
For both the upper-body and full-body scenarios, the histogram tails have maximum values of around 23–25° for Configuration 2, and 25–28° for Configuration 3. Even on our special test set, the histogram tails have maximum values of around 22–29° for Configuration 2, and 25–30° for Configuration 3. Therefore, our algorithms’ performance is very promising, as the inferred kinematics are not grossly different from the true kinematics, even if they are not perfect.
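A minimal sketch of how such per-sequence histograms could be produced is shown below; it assumes a collection `sequences` yielding (prediction, ground truth) pairs for one model and configuration, reuses mean_angle_difference_deg from the earlier sketch, and uses arbitrary binning rather than the settings behind Figures 5 and 6.

```python
import matplotlib.pyplot as plt

# `sequences` is assumed to yield (q_pred, q_true) quaternion arrays,
# one pair per test sequence, for a single model and configuration.
errors = [mean_angle_difference_deg(q_p, q_t) for q_p, q_t in sequences]

plt.hist(errors, bins=30)
plt.xlabel("Sequence mean angular error (degrees)")
plt.ylabel("Number of sequences")
plt.show()
```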
3.4. Qualitative Evaluation
Though quantitative evaluation is useful for viewing concise metrics of the models’ performance, qualitative evaluation is necessary to build intuition for how the models make predictions and where they fail. Our goal in this section is to visualize the postures that the models predict and examine what they get wrong. Qualitative evaluation is performed in almost every study of human motion inference, such as in References [26,27,35]. We use Configuration 2 from the full-body scenario to investigate how the models differ, since it gave the best results of the four configurations. The upper-body scenario has very similar outputs, so we present only the full-body results to avoid repetition.
In each of the figures in this section (Figures 7–12), we label the different rows with letters (a, b, c, d), representing different postures that the model must infer using sparse segment orientations and accelerations. Our goal was to choose various representative postures, such as standing, sitting, and bending, to understand where the models fail and succeed. The first column is the reference (ground truth) posture, and the remaining columns are predictions from each model. We provide descriptions of the postures in the captions.
The first set of postures comes from participant 5 in our validation set (Figure 7). Of note is that each model correctly predicts that the participant is sitting down in each posture. In posture (a), the participant is sitting and typing on a laptop in their lap. Each model has subtle inaccuracies in inferring this motion, but most seem to infer realistic motion. In posture (b), the participant has their elbows on the table, but the Transformer incorrectly infers that the participant is lifting their right elbow, and the Transformer Encoder infers inaccurate orientations of their feet. In posture (c), the participant is reaching down into their bag; the Seq2Seq model infers that the person is reaching further back than they are in reality. In posture (d), the participant is sketching instead of typing, and the Transformer Encoder and Transformer both infer inaccurate foot orientations.
Another set of postures comes from worker three, who was working at a home improvement store (Figure 8). Similar to Figure 7, the models can all determine that the participant is standing up rather than sitting down. In posture (a), the participant has their legs crossed while leaning on something; each model except the Transformer Encoder accurately models the crossed legs. In postures (b) and (c), the participant is doing a one-legged lift and a split-legged lift with a heel raised, respectively. Each model infers this correctly, though there is some inaccuracy in the right arm orientation for posture (b). In posture (d), the Seq2Seq model and Transformer Encoder fail to infer accurate upper arm kinematics for both arms. The Transformer is reasonably accurate in inferring the right arm’s orientation but is less accurate for the left arm. Each model also fails to infer that the participant is on their toes.
The third set of postures is from worker two, who was also working at a home improvement store (Figure 9). This set is interesting because it demonstrates the range of motions for which the models can perform inference, and where they fail. In posture (a), the participant is walking with their hand on a cart behind them, and each model infers the posture accurately. In posture (b), the participant is kneeling on the ground while reaching for something. The models vary in how extreme they predict the knee flexion to be, but each seems to correctly infer that the participant is kneeling. In posture (c), the participant is putting on a vest or jacket. This posture is particularly challenging because of the unusual way the arms move during the activity: the Transformer Encoder incorrectly predicts that the participant is sitting down, and each model infers the upper arm orientations inaccurately, with varying degrees of error. Finally, in posture (d), the participant is lifting something with both hands. There are varying degrees of inaccuracy between the models, and each seems to be conservative about how far apart the arms are.
We also present a typical set of cases where the mean angular error is high when using the Transformer, in Figure 10. These inaccurate predictions are very few in number in our test set and validation set. In Figure 10a, the Transformer predicts that the person is sitting down while they are instead operating machinery (a stand-up forklift); the accelerations from the forklift possibly cause this error. In (b), the model correctly predicts that the participant is reaching across their chest, but incorrectly predicts the amount of knee flexion and hip rotation in the posture. In (c), the participant is reaching across their chest with one hand; the Transformer output looks reasonable, but this case still resulted in a high angular error, likely due to errors in the arms and hip rotation. Finally, in (d), the participant is performing an action with high elbow flexion. The Transformer correctly predicts this but has a large amount of error in predicting the leg orientations.
Our special test set contains further examples where the mean angular error is high. The first set of cases relates to the participant napping, doing pushups, doing Frankensteins, and doing mountain climbers, shown in Figure 11. In addition, we share two of the worst failure cases from our special test set; these postures are from the participant performing burpees (Figure 12).
The special test set pushed the models to their limits in interesting ways. Because no one in our training data exercised, the models had lower accuracy when handling these unseen cases. Data on people lying down and napping were also not present in our training dataset, because we mainly collected data in workplaces. However, considering the complete absence of Frankensteins, pushups, and lying down from our training set, the models generalize reasonably well, and their outputs for these exercises are reasonable. Burpees, on the other hand, are clearly at the limit of what the models can handle.
3.5. Comparison to Prior Works
Our results compare favorably to previous work in this area. Other approaches to full-body motion inference, such as References [26,40], also make use of neural networks for human motion inference using sparse sensors. In Reference [26], the authors predict Skinned Multi-Person Linear model (SMPL, [9]) parameters for a single frame, using 20 past frames and five future frames at test time, with a bidirectional LSTM network. They synthesize their IMU data using AMASS [8] and then fine-tune on a real IMU dataset that is 90 min in length. An important point is that their system was used in a real-time demonstration; our models can also run in real time, but we did not develop a demonstration. Overall, their real-time system and use of SMPL model parameters are impressive. Like our system, theirs has lower accuracy with dynamic and unusual motions.
In contrast, we predict segment orientations instead of SMPL model parameters. We also frame the problem as a sequence-to-sequence problem, using Seq2Seq models and Transformers rather than an LSTM that predicts a single frame of poses. We do not use future frames; we map a sequence of sparse segment orientation and acceleration data to a concurrent sequence of full-body or upper-body orientations.
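As a hedged illustration of this concurrent mapping, a minimal encoder-style model in PyTorch might look as follows; the dimensions, the use of nn.TransformerEncoder, and the omission of positional encoding are illustrative simplifications, not our exact architecture.

```python
import torch
import torch.nn as nn

class SparseToFullBody(nn.Module):
    """Maps a sequence of sparse IMU features to a concurrent sequence of
    full-body segment orientations, with no future frames required."""
    def __init__(self, n_in, n_out, d_model=128, n_heads=4, n_layers=4):
        super().__init__()
        self.embed = nn.Linear(n_in, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.out = nn.Linear(d_model, n_out)

    def forward(self, x):  # x: (batch, frames, n_in)
        return self.out(self.encoder(self.embed(x)))
```

Because every output frame is conditioned on the whole input sequence but paired with a concurrent frame, this framing avoids the test-time latency introduced by future-frame windows.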
In comparison to Reference [26], we found that our histograms have shorter tails, which is worth noting as this may lead to fewer failure cases in endpoint applications. Their histogram tails extend past 50°, whereas ours stay below 30° even for the special test set. However, this may be due to different underlying motions in each of our datasets, and to the fact that they use joint angles instead of segment orientations to measure angular differences. Finally, our dataset for training and validation is, to the best of our knowledge, the largest real inertial motion capture dataset, and is not synthetic like the IMU data in Reference [26]. Our view is that our dataset will be complementary to other large datasets, such as AMASS and KIT, and may also enable novel approaches to problems, because we collected more data per participant on average, in real-world environments.
3.6. Limitations and Future Work
Although our natural motion dataset contains around 40 h of human motion in unconstrained real-world environments, the number of participants (N = 17) is still limited compared to other motion capture datasets [5,8]. The dataset is also not evenly split between males and females, which we would like to address in the future. It is unknown how many unique postures are in the dataset, and manual labeling by humans will be necessary. In future work, this dataset can be expanded and labeled.
Another limitation is that the XSens suit only records kinematics. Force sensors could add information about how people interact with their environment; FootSee remains a fascinating case study in how force sensors can be used to infer human motion [35]. Future work could incorporate force sensors in people’s shoes to add information and expand use cases. In addition, synchronized on-body or off-body cameras could be used to capture additional context about the environment or improve human motion inference accuracy. Other approaches have used off-body cameras and inertial sensors for hybrid motion capture [27,28,29,30]. One exciting approach, called EgoCap, uses two on-body fish-eye cameras for full-body motion capture [66], and a similar approach could be used in conjunction with inertial sensors.
Finally, our dataset was limited in the activities it captured. While the dataset contains 40 h of unscripted, natural human motion that was captured entirely in real-world environments, it did not fully capture all possible activities of daily living, and did not capture many dynamic motions. Data in future training sets should capture activities of daily living more comprehensively and account for more extreme motions from sports or exercising.
An exciting possible direction for future motion capture studies is to make them “occupation”-focused. Previous large-scale datasets have collected data from people in enclosed laboratory spaces where participants perform defined actions. We believe the future of large-scale motion capture projects should be in real-world environments with workers, such as manual material handlers, nurses, factory workers, farmers, and firefighters. Our natural motion dataset is an attempt at collecting natural motion in real-world environments, and we aim to use it to design and study tools and assistive devices that can help people.
4. Conclusions
In summary, we introduced the largest full-body inertial motion capture dataset that we know of, and our dataset is available for both biomechanics and machine learning studies. We also explored Seq2Seq and Transformer models for inferring upper-body and full-body kinematics using different configurations of sparse sensors. Our models produce mean angular errors of 10–15° for both the upper body and full body, and worst-case errors of less than 30°. Overall, our approach yields reasonable kinematic estimates for typical activities of daily living.
Our study of human motion inference has shown that a wide range of human motion can be inferred effectively from limited segment information using Seq2Seq models and Transformers. Importantly, we found that the sternum is a useful segment for upper-body and full-body motion inference: as a general segment, it provides more information than the head, and, as a root segment, it provides more information than the pelvis. Using the sternum as a root segment also allows us to perform accurate human motion inference with one fewer segment in both full-body and upper-body motion inference. This is a novel contribution of our work because, to the best of our knowledge, other research has not shown or attempted to show a direct benefit of using the sternum as a root segment. This finding also has practical implications for implementing a small sensor set: an inertial sensor may not move very much if attached to a person’s glasses or hat, but requiring someone to wear something on their head may be more intrusive than a sensor clipped to the front or back of their shirt.
We also conducted an extensive study of upper-body motion inference, while other works have focused only on the entire body (an exception is Reference [36]). Using just the upper body resulted in accuracies similar to the full body. This is an important contribution, as many use cases do not require full-body motion. For example, in stroke rehabilitation exercises where the patient is seated at a table, upper-body kinematics could be inferred with only two inertial sensors on the patient’s wrists and one clipped to the back of their shirt. Such an application could help alleviate privacy concerns over the use of cameras for motion capture while giving constructive feedback about the patient’s posture and motion.
We found that some postures are hard to model with sparse segment information, such as putting on a jacket or reaching for a high object, and these postures led to larger errors. However, even in these cases, our models tend to predict plausible postures. In other words, when viewing failure cases in our validation and regular test sets, we did not see cases where limbs were tangled together or unrealistic joint angles were predicted. Thus, these models are promising for real-world applications and additional use with future datasets.