Towards Single Camera Human 3D-Kinematics

Markerless estimation of 3D kinematics has great potential to clinically diagnose and monitor movement disorders without referrals to expensive motion capture labs; however, current approaches are limited by performing multiple de-coupled steps to estimate the kinematics of a person from videos. Most current techniques work in a multi-step approach by first detecting the pose of the body and then fitting a musculoskeletal model to the data for accurate kinematic estimation. Errors in the training data of the pose detection algorithms, model scaling, as well as the requirement of multiple cameras limit the use of these techniques in a clinical setting. Our goal is to pave the way toward fast, easily applicable and accurate 3D kinematic estimation. To this end, we propose a novel approach for direct 3D human kinematic estimation (D3KE) from videos using deep neural networks. Our experiments demonstrate that the proposed end-to-end training is robust and outperforms 2D and 3D markerless motion capture based kinematic estimation pipelines in terms of joint angle error by a large margin (35%, from 5.44 to 3.54 degrees). We show that D3KE is superior to the multi-step approach and can run at video framerate speeds. This technology shows the potential for clinical analysis from mobile devices in the future.


Introduction
3D human kinematics involves measuring joint angles between body segments, which is essential in the day-to-day practice of experts. Skilled physicians can judge, just by looking at a specific motion of their patient, whether it is healthy or abnormal. Skilled sports coaches can help their coachees achieve better performance and lower injury risk by evaluating their movements through observation. However, these visual examinations of human kinematics remain inherently subjective, leading to variation between and within human observers. Modern systems and sensors could reduce these variations through more objective observations. Yet, these systems make the measurement of human motion more costly and more time-consuming. A system with the availability and ease of use of visual estimation would help physicians and coaches make more objective observations more often, ultimately raising their own and their subjects' quality of life. Digital cameras have made the estimation of human kinematics more accessible but come at the cost of reduced accuracy. Compared to the more traditional optical motion capture (OMC) systems, markerless motion capture (MMC) systems do not require specialized cameras and markers attached to the subject being monitored, but use normal RGB cameras in combination with image-based automatic pose estimation algorithms. Instead of specific markers, pose estimation algorithms detect the centers of major joints of the human body, such as the shoulders, hips, and knees. These detected centers are usually referred to as key points.
Multiple commonly used markerless motion capture methods rely on 2D pose estimation methods [27,45,46,26]. Often these methods still need more than one camera to generate a good estimation of the keypoints in 3D, which requires additional cameras to be set up. On the other hand, an increasing number of methods use single-view (monocular) 3D pose estimation methods [21,32,43], which estimate a 3D pose from a single camera. This makes MMC systems faster and more accessible, as they do not require the additional time and expertise to place markers on the subject or calibrate multiple cameras. However, MMC systems assume that current pose estimation algorithms can accurately replace marker-based motion capture systems for, e.g., biomechanical applications [52,13,60].
Commonly used pose estimation algorithms introduce mistakes in kinematic estimation pipelines due to systematic errors in their predictions. To detect key points, most pose estimation methods are trained on a combination of images of a person and ground truth annotations which map pixels in the image to their corresponding joint center. These ground truth annotations are often made manually by non-expert annotators, leading to errors in the training data caused by personal biases and to inaccuracies in the pose estimations [13]. For example, Needham et al. [41] compared three often used pose estimation algorithms, OpenPose [10], DeepLabCut [38] and AlphaPose [17], against an OMC system and showed errors in the estimation of joint centers of 30 mm to 50 mm with variations of 12 mm to 25 mm in marker placement. Cronin [13] provides an overview of additional problems with 2D pose estimation for kinematic analysis. These differences are most likely due to a difference between the application that pose estimation algorithms are often developed for and their application to, e.g., the biomedical domain, which has different accuracy requirements [52]. Wade et al. [60] proposed to solve this problem by re-annotating existing large-scale datasets; this, however, is a time-consuming process, considering for example that the COCO keypoint dataset (https://cocodataset.org/#keypoints-2020, accessed on 2 December 2022) consists of more than 250,000 labeled poses. For the evaluation of pose estimation algorithms, these labeling errors will just appear as a baseline error that all algorithms trained on the same data will have. However, for applications in the biomedical domain and in situations such as kinematic estimation, where the pose is just an intermediate step, errors can propagate to subsequent tasks.
Errors in the estimated pose cannot be corrected by most kinematic estimation pipelines because they all roughly follow a 'multi-step' approach. The 'multi-step' approach consists of:
• Detection of the 3D pose (in one or more steps);
• (Optional) modeling of the pose with a (musculo)skeletal model;
• Calculation of kinematics and/or downstream tasks such as gait parameters or dynamics.
For example, Kidzinski et al. [27] used OpenPose to first predict key points from a video and then trained a convolutional neural network (CNN) to predict the walking parameters of patients with cerebral palsy. Liao et al. [32] first model the 2D pose in OpenPose, then create a 3D pose using data-driven matching and finally estimate 3D gait parameters. Noteboom et al. [43] first used VideoPose3D [49] to estimate a 3D pose, followed by modeling in OpenSim [54] for the estimation of dynamics from a single camera. Pagnon et al. [45,46] developed the handy Pose2Sim tool, which first combines 2D OpenPose pose estimations from multiple cameras into a 3D pose and then models it in OpenSim. Because the pose estimation step is de-coupled from kinematic estimation, errors in pose estimation propagate through to the estimation of kinematics. Uchida and Seth [57] showed that 20 mm of marker uncertainty leads to a variation of 15.9° in peak ankle plantarflexion angle and impacts downstream tasks such as joint moment estimation. Della Croce et al. [14] showed a variation in marker placement precision of 13 mm to 25 mm, which leads to differences in estimated joint angles of up to 10°. Fonseca et al. [18] showed that misplacement of markers of up to 10 mm can lead to errors of 7°, depending on the marker. With estimation errors of 30 mm to 50 mm in keypoint estimation [41], it is to be expected that these errors will substantially influence kinematic estimation from markerless motion capture. Low-pass filters [40,45] or bi-directional Kalman filters [40] have been applied to compensate for noisy key point estimations, but they cannot correct for faults in keypoint detection. Subsequent modeling and kinematic calculation steps can only compensate for these inaccuracies. This 'multi-step' approach is probably inspired by the steps of a traditional OMC method, as in traditional OMC systems the pose detection step is done using a different system and is thus isolated from the other steps.
In camera-based kinematic estimation pipelines, however, the de-coupling of individual steps is no longer necessary.
Deep neural networks have often demonstrated their ability to outperform multi-step systems by implicitly learning individual steps through end-to-end training between an input and the desired output [55,29]. The main strength of deep neural networks lies in their ability to break down a highly complex task, in this case the estimation of kinematics from videos, into a sequence of simpler tasks, without the need for intermediate 'hand-crafted' representations [30,2]. Because neural networks are fully differentiable, an error in estimation during training can influence all stages of the network and adjust them accordingly [2]. This allows deep neural networks to directly estimate kinematics.
In this work, we challenge the notion of the classical multi-step approach of pose estimation, fitting of a musculoskeletal model and kinematic estimation. To this end, we propose a novel end-to-end method that allows for direct estimation of human kinematics, which is directly optimized for kinematic estimation while treating pose estimation only as an auxiliary task to constrain the estimations of the network. Figure 1 shows a general overview of our method.
Figure 1. Overview of the proposed direct 3D human kinematics estimation (D3KE). Instead of using the common 'multi-step' approach of predicting pose, fitting it to a model, and estimating kinematics, our D3KE directly estimates the kinematics. Errors in earlier steps of the multi-step approach propagate to later steps; in contrast, our method can correct for errors occurring anywhere between input and output.
Contributions. To the best of our knowledge, we are the first to present an end-to-end trainable network that directly generates joint angles, joint positions, scale factors and marker positions of a biomechanical model from a monocular video. We propose a method that directly regresses from a video to joint angles and scales using deep neural networks. We investigate the influence of various temporal smoothing methods to increase the accuracy of our algorithm. We introduce a novel type of network layer that allows for the calculation of the 3D pose from estimated kinematics during the training process to train the network simultaneously on the pose and kinematic labels.

Materials and Methods
Our method, which we call direct 3D kinematic estimation (D3KE), takes videos from a single camera as input and directly estimates joint angles. The proposed method first coarsely estimates kinematics per frame by using a convolutional neural network, and then it uses a sequence network with temporal relations across frames to refine the kinematic estimations at each frame. An overview of our method is shown in Figure 2. Both networks estimate the scale of body segments, joint angles, and a rotation matrix from the pelvis to the ground; these serve as input for a skeletal-model layer in both networks that allows for additional supervision on the pose of a subject. In this section, we first describe the deep learning architecture, including a detailed description of the skeletal-model layer. We then describe how the ground truth data was generated and which pre-processing and hyperparameters were used for training. Lastly, we describe the dataset used for training and testing our method.
2.1. Network Structure. Convolutional neural networks (CNNs) have shown good accuracy for 2D and 3D pose estimation [10,51,39,42,48] from single input images. Conventionally, 2D CNNs are used for pose estimation tasks; they take a single image as input and predict the pose of one or multiple people in the image. For our method, we choose a per-frame convolutional network to coarsely predict the joint angle and scaling parameters. Inspired by [51], we choose a standard pre-trained ResNeXt-50 [62] as our convolutional backbone.
To fine-tune the per-frame predicted joint angles and scaling parameters, we add a sequential network. Sequential networks are used in pose estimation to 'lift' an estimated 2D pose to 3D [11,12]. Recent research combines temporal information with lifting to improve accuracy during frames where one limb occludes another in the view of the camera (self-occlusion) or where not all key points were detected [31,33,49]. In contrast to CNNs, these sequential networks do not take a single frame as input but exploit temporal dependencies in the data for their prediction. As the convolutional network outputs per-frame estimates, it cannot take temporal information into account. We therefore add a sequential network to our architecture to refine a sequence of estimations made by the convolutional model. Inspired by works on temporal lifting, we experimentally evaluate three sequential networks to refine the predicted joint angles and scale factors: an LSTM [23], a Temporal Convolutional Network (TCN) [49] and a Transformer [31].
Both the convolutional and the sequential network contain a specialized layer that allows each network to perform the kinematic transformations of a musculoskeletal model. Therefore, at train time both networks can be supervised not only on the estimated joint angles but also on a resulting pose.
Both convolutional and sequential networks are supervised by losses of joint positions, marker positions, body scales and joint angles. The overall objective function L is expressed in Equation (1), where λ1, λ2, λ3, λ4 are the weights of the individual losses. We use the root-relative L1 loss in Equation (2) to define the loss of marker position L_marker and the loss of joint position L_joint: the respective root positions ŷ_root and y_root are first subtracted from the estimations ŷ and the labels y. For the loss of body scales L_body and joint angles L_angle, we calculate the L1 norm.
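The weighted objective described above can be sketched in plain Python as follows. This is a minimal illustration and not the authors' implementation: the function names, the data layout (lists of 3D points), the use of the mean as the L1 reduction, and the pairing of λ1 to λ4 with the joint, marker, body-scale and angle losses in the order the text lists them are all assumptions. The default weights are the values reported with the training hyperparameters.

```python
def l1(pred, target):
    """Mean absolute difference between two equally long flat lists."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def root_relative_l1(pred_pts, true_pts, root_index=0):
    """L1 loss after subtracting each set's own root point from every point."""
    pr, tr = pred_pts[root_index], true_pts[root_index]
    pred_flat = [c - r for pt in pred_pts for c, r in zip(pt, pr)]
    true_flat = [c - r for pt in true_pts for c, r in zip(pt, tr)]
    return l1(pred_flat, true_flat)


def total_loss(pred, true, weights=(1.0, 2.0, 0.1, 0.06)):
    """Weighted sum of the four loss terms (pairing of weights is assumed)."""
    w_joint, w_marker, w_body, w_angle = weights
    return (w_joint * root_relative_l1(pred["joints"], true["joints"])
            + w_marker * root_relative_l1(pred["markers"], true["markers"])
            + w_body * l1(pred["scales"], true["scales"])
            + w_angle * l1(pred["angles"], true["angles"]))
```

Because both point sets are expressed relative to their own root, a prediction that is merely translated with respect to the ground truth yields a zero position loss, which matches the root-relative formulation in the text.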
The objective of a neural network during training is to minimize the loss function; in our case, the difference between estimated and ground truth joint angles. However, this joint angle loss cannot capture the underlying relations and constraints of individual angles, dictated by the human musculoskeletal system. Intuitively, small changes in the angles of spine, shoulder, and elbow can accumulate and lead to large differences in the position of the hand, as illustrated in Figure 3. To address this issue, we propose to use a skeletal-model layer to perform the kinematic transform of a musculoskeletal model.
Figure 3. Our skeletal-model layer uses an internal representation of a skeletal model to convert the predicted joint angles and scale factors to the positions of individual markers on segments of the skeletal model. This allows our method to be supervised during training not only on errors (losses) in the estimation of joint angles but also on errors in the resulting pose. On the right, we show the additional error that is created between estimations (gray) and ground truth (blue). This auxiliary estimation of the pose as 3D marker positions helps to constrain the estimation of joint angles, as small changes in proximal joints can have a large effect on a marker at more distal positions.
Skeletal-Model Layer. The skeletal-model layer allows us to convert predicted joint angles into marker positions on a skeletal model and add them as an additional loss term. This loss term represents the cumulative effect of small joint angle changes on the final pose, indirectly imposing the constraints of a skeletal model on the predictions of the network. As the skeletal-model layer does not contain any learnable parameters, i.e., it cannot change during network training, the accuracy of the predicted pose is completely determined by the input to the skeletal-model layer; thus, the pose prediction is only an auxiliary task.
A skeletal model consists of body segments, motions between different body segments (joints) and points defined by a vector from the center of their anchor body segment (markers). Given body scales β, joint angles θ and the rotation matrix R_ground←pelvis, we use the skeletal-model layer to calculate marker positions and joint positions. In the following, variables with a hat (x̂) denote estimated values, and variables without a hat (x) denote the predefined variables from the musculoskeletal model. First, the translation part T in the transformation from the joint to the body depends on the subject's body scale. For example, if the subject has longer legs, the center of the femur will be farther from the hip joint. We can update the translation part T̂ using the ratio between the predicted body scales β̂ and the default body scales β in Equation (3), where ⊙ is elementwise multiplication and ⊘ is elementwise division.
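Written out, the scaling step described above plausibly takes the following form. This is a reconstruction from the surrounding definitions rather than a quotation of Equation (3), and the operator symbols for elementwise multiplication and division are assumed.

```latex
\hat{T} = T \odot \hat{\beta} \oslash \beta
```

For instance, if the predicted scale of a segment along one axis is 1.1 times its default scale, the corresponding component of the translation grows by 10%.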
Then, we create a matrix to represent the spatial transformation of motions R_motion using Equation (4), where A1, A2, A3 are the predefined axes and θ̂1, θ̂2, θ̂3 the predicted angles per degree of freedom per joint in axis-angle notation. G(A, θ) is the standard function converting an axis-angle representation to a 3 × 3 transformation matrix.
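The axis-angle conversion G(A, θ) mentioned above is commonly implemented with Rodrigues' rotation formula. The sketch below shows that formula and one way to combine the three per-degree-of-freedom rotations of a joint. The function names are hypothetical, and the multiplication order used in `motion_rotation` is an assumption, since Equation (4) is not reproduced here.

```python
import math


def axis_angle_to_matrix(axis, angle):
    """Rodrigues' formula: rotation by `angle` (radians) about `axis`."""
    norm = math.sqrt(sum(a * a for a in axis))
    x, y, z = (a / norm for a in axis)  # normalize so any non-zero axis works
    c, s = math.cos(angle), math.sin(angle)
    t = 1.0 - c
    return [
        [t * x * x + c,     t * x * y - s * z, t * x * z + s * y],
        [t * x * y + s * z, t * y * y + c,     t * y * z - s * x],
        [t * x * z - s * y, t * y * z + s * x, t * z * z + c],
    ]


def matmul3(a, b):
    """Product of two 3 x 3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def motion_rotation(axes, angles):
    """Compose one rotation per degree of freedom (order is an assumption)."""
    result = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    for axis, angle in zip(axes, angles):
        result = matmul3(result, axis_angle_to_matrix(axis, angle))
    return result
```

All operations are differentiable with respect to the angles, which is what allows a layer of this kind to pass gradients from a pose loss back to the predicted joint angles.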
Then, we can calculate the estimated transformation from the body to its parent body R̂_parent←child in Equation (9) using Equation (8), with O_parent, O_child denoting the predefined orientations and T̂_parent, T̂_child the predicted translations from the joint to the parent/child. F(O_a) is the conversion from Euler angles to a 3 × 3 rotation matrix.
We measure the spatial transform by traversing from the root (pelvis) to the leaf nodes (hands and feet) in level order. In D3KE, we directly infer the rotation matrix from the pelvis to the ground, R_ground←pelvis. R_ground←pelvis can initially be expressed as in Equation (10), where I denotes the identity matrix. The rotation part of R_child←joint is also a 3 × 3 identity matrix in our musculoskeletal model. Our method aims to predict the root-relative position, so the translation part can be ignored during the prediction. Moreover, in our musculoskeletal model, only the joint angles of the pelvis are unbounded, i.e., in (−∞, ∞). Predicting three unbounded angles to form the rotation matrix in Equation (4) suffers from discontinuity [65]. Thus, we directly predict the rotation matrix R_ground←pelvis.
Last, a marker with a vector d from the center of the body is also dependent on the body scales. The predicted vector d̂ is updated in Equation (11). The position of the predicted point is calculated in Equation (12) with R̂_parent←child and d̂.
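The level-order traversal and the marker placement described above can be illustrated with a small forward-kinematics sketch. It is a simplified stand-in rather than the skeletal-model layer itself: the body names, the dictionary layout, and the reduction of each parent-to-child transform to one rotation plus one translation are assumptions, and the root is kept at the origin with identity rotation to give root-relative positions.

```python
from collections import deque


def mat_vec(m, v):
    return [sum(m[i][k] * v[k] for k in range(3)) for i in range(3)]


def mat_mul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def scale_offset(offset, pred_scale, default_scale):
    """Elementwise offset * pred_scale / default_scale."""
    return [o * p / d for o, p, d in zip(offset, pred_scale, default_scale)]


def forward_kinematics(children, local_rot, local_trans, root):
    """Return {body: (R, t)} in the root frame, visiting bodies in level order."""
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    world = {root: (identity, [0.0, 0.0, 0.0])}
    queue = deque([root])
    while queue:
        parent = queue.popleft()
        r_parent, t_parent = world[parent]
        for child in children.get(parent, []):
            # Chain the child's local transform onto the parent's transform.
            r_child = mat_mul(r_parent, local_rot[child])
            moved = mat_vec(r_parent, local_trans[child])
            t_child = [a + b for a, b in zip(t_parent, moved)]
            world[child] = (r_child, t_child)
            queue.append(child)
    return world


def marker_position(world, body, offset):
    """Marker position: body rotation applied to the offset, plus body origin."""
    rot, trans = world[body]
    return [a + b for a, b in zip(trans, mat_vec(rot, offset))]
```

Because every child transform is multiplied onto its parent's transform, a small rotation error near the root shifts all markers further down the chain, which is the accumulation effect the pose loss is meant to penalize.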
2.2.1. Ground Truth Generation. For training our method, we need to create custom ground truth data that contains all outcomes that our network predicts, since they are not available in public datasets. Most pose estimation datasets provide only video and marker positions from an optical motion capture (OMC) system. For training our method, we need the joint angles and the scales of individual bones, a rotation matrix of the pelvis to the ground, as well as the marker positions corresponding to them. To generate these, we model the OMC data, represented as a 3D human mesh model, in the OpenSim software [54] and use inverse kinematics [36,1] to generate joint angles. The following describes each step in more detail. First, we create a general (musculo)skeletal model to fit the data using the OpenSim software [54,16]. As we are interested in capturing the complete motion of the human subject, we model the full body. With the OpenSim software [16,54], we create a full-body musculoskeletal model (MSM) by merging existing models of upper limbs and lower limbs [3,4,15,24,63] and the thoracolumbar spine [8,7,9]. For the sake of aesthetics, we add wrist and hand [20,44] models to the MSM, which are not used for ground truth generation. The full-body model contains all bones in a skeletal system from the head to the feet and from the upper arms to the hands. We do not model every degree of freedom between vertebrae, to avoid expensive computation and the requirement of at least three markers to measure the motions of one vertebra. Instead, we separate the spine from the fifth lumbar to the first cervical vertebra into nine segments.
Then, we fit our data to the musculoskeletal model. Instead of using the OMC marker data directly, we use OMC markers converted to 3D human mesh representations using the MoSh++ [34] method, which makes scaling the model to individual participants more time efficient and allows us to define an arbitrary number of virtual markers. We fit our data to the musculoskeletal model by first defining virtual markers on the vertices of the 3D mesh representation. We then use these virtual markers as input for the OpenSim software. Next, we use the OpenSim internal scaling tool to scale the proportions of individual body segments according to the distances between virtual markers on the 3D mesh. As the sizes of individual body parts vary across individuals, this step must be conducted individually for each subject in the dataset. We define the ratio in dimensions between the default and scaled body segments as scaling factors. Finally, we use the inverse kinematics solver for the calculation of joint angles. During this process, the MSM is moved for each time step to a position that minimizes the sum of weighted squared errors between the virtual markers on the 3D mesh and the markers defined on the musculoskeletal model. All joint angles where segments had a higher squared error than 2 cm were disregarded in the analysis.
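The inverse kinematics step described above corresponds to a weighted least-squares problem solved at every time step. In generic notation (the symbols q, w_i, x_i^mesh and x_i^model are introduced here for illustration and are not taken from the paper), it can be written as:

```latex
\min_{q} \; \sum_{i \in \text{markers}} w_i \, \bigl\lVert x_i^{\text{mesh}} - x_i^{\text{model}}(q) \bigr\rVert^{2}
```

Here q collects the joint angles of the scaled model, x_i^mesh is the position of the i-th virtual marker on the 3D mesh, x_i^model(q) is the position of the matching marker on the musculoskeletal model in pose q, and w_i is the weight assigned to that marker.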
The final ground truth values were the calculated joint angles, the scaling factors per segment as well as the virtual marker positions. Additionally, a pelvis rotation matrix was generated for each frame, since the pelvis functions as the relative position of the model to the ground that is free to move in all directions.

Data Preparation and Hyperparameters.
To generate the input for our network, each video frame was cropped and augmented. We use the pre-trained Faster R-CNN [50] with ResNet-50 [22] backbone to extract a square bounding box of the person in videos and resize it to 256 × 256 pixels as the input image size. During training, we apply data augmentation with scaling, rotation, translation and noise to simulate occlusions similar to [51].
Our model was trained using the following hyperparameters and loss. For the ResNeXt model, we use an Adam optimizer with weight decay [35] of 0.001 and a batch size of 64. The learning rate exponentially decays in two steps, from 5 × 10⁻⁴ to 3.33 × 10⁻⁵ over 28 epochs and from 3.33 × 10⁻⁶ to 10⁻⁶ over 2 epochs. For both sequential and convolutional networks, we experimentally set the loss weights to λ1 = 1.0, λ2 = 2.0, λ3 = 0.1 and λ4 = 0.06. Due to memory constraints, we do not train the convolutional and sequential models simultaneously but in succession, by first training the convolutional model and then refining its predictions using the sequential model.
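The two-phase exponential decay described above can be sketched as a function of the epoch index. The start and end values are the ones stated in the text; that the rate is updated once per epoch and that each end value is reached exactly at the last epoch of its phase are assumptions, and the function names are hypothetical.

```python
def exponential_lr(epoch, start_lr, end_lr, num_epochs):
    """Learning rate at `epoch`, decaying geometrically from start to end."""
    if num_epochs <= 1:
        return start_lr
    fraction = epoch / (num_epochs - 1)
    return start_lr * (end_lr / start_lr) ** fraction


def two_phase_lr(epoch):
    """Phase 1: 28 epochs from 5e-4 to 3.33e-5; phase 2: 2 epochs from 3.33e-6 to 1e-6."""
    if epoch < 28:
        return exponential_lr(epoch, 5e-4, 3.33e-5, 28)
    return exponential_lr(epoch - 28, 3.33e-6, 1e-6, 2)
```

In a training loop, `two_phase_lr(epoch)` would be evaluated once at the start of each of the 30 epochs and written into the optimizer's learning rate.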
2.3. Software Tools. All training was conducted in Python using the PyTorch library [47]. The pre-trained ResNeXt and Faster R-CNN networks were obtained from the torchvision library [59]. All code for training and generation of ground truth will be made available in a GitHub repository: https://github.com/bittnerma/Direct3DKinematicEstimation.

2.4. Data. We trained and tested D3KE on the BML-MoVi database [19]. BML-MoVi is an extensive motion capture and video dataset that contains recordings of 90 actors who each perform 20 kinds of everyday movements as well as a random one. Motions were captured using inertial measurement units as well as a Qualisys optical motion capture system, and videos were recorded using two computer-vision cameras. For this study, we used recordings from the calibrated Point Grey cameras (PG1, PG2) during recording session F, as the full set of optical markers was used during this session. In accordance with the anatomical plane that each camera is viewing during the initial T-pose of the participants, we will refer to the camera view captured by PG1 as the frontal and that captured by PG2 as the sagittal camera view. For the generation of ground truth virtual markers, we use the 3D mesh representations of the Qualisys data that are provided in the larger AMASS dataset [37]. For analysis of the data, we divided the BML-MoVi database into 63 participants for training, 16 participants for testing, and three participants for validation. This is common practice in the supervised training of deep neural networks [64]. At training time, the validation set is used to evaluate the accuracy of kinematic estimation after each training iteration on a portion of the data the network does not have access to, to prevent overfitting on the training set.

Experiments
3.1. Experiment 1: Direct vs. Multi-Step Estimation. To evaluate the accuracy of our direct 3D kinematic estimation approach (D3KE) for joint angle estimation, we compare its performance against multiple versions of the multistep approach.
3.1.1. Experiment 1-A: 3D Pose Based Kinematic Estimation. We first compare our direct estimation of kinematics with a 3D pose estimation multi-step baseline. To create a fair comparison between direct and multi-step estimations, we implement a custom multi-step approach (CMS) that is trained on the same data as our direct approach. For the CMS, we combine a 3D human pose estimation method with subsequent musculoskeletal modeling in OpenSim. We modify the metric-scale heatmaps [51] of the convolutional network to predict marker positions and SMPL keypoint positions in metric scale. As for D3KE, we exploit a sequence network to refine marker positions at the target frame. More specifically, the convolutional network initially infers marker positions under a calibration pose (T-pose), and OpenSim utilizes the predicted marker data for body scaling, where the general musculoskeletal model is scaled to the participant's body size. Refined marker positions are then used to run inverse kinematics with the scaled musculoskeletal model to obtain joint angles. The main difference between the CMS approach and D3KE is that CMS uses multiple steps to estimate the kinematics and is only supervised on the marker/pose estimation task, while D3KE is directly trained on the kinematic estimation task; this way, we can compare direct vs. multi-step estimation of kinematics. We use multiple metrics for the comparison of D3KE and the CMS. The mean per bony landmark position error (MPBLPE) is used to evaluate bony landmark positions. Bony landmarks are markers placed where bones are close to the surface, such as the elbow. This metric is inspired by the mean per joint position error (MPJPE), which is often used in 3D pose estimation. MPBLPE first aligns estimations and ground truth at the root position and then calculates the average Euclidean distance. We directly evaluate the body scale factors by the root mean square error RMSE_body on the scalars predicted by the network.
However, to present the results in a more intuitive format, we choose the axis along the longest dimension in each body scale and convert the scale of the axis into millimeters and calculate the mean absolute error (MAE body ).
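The MPBLPE computation described above (root alignment followed by the average Euclidean distance) can be sketched as follows. The function name and the list-of-points layout are hypothetical, and whether the root landmark itself is included in the average is an assumption; here it is included.

```python
import math


def mpblpe(pred, true, root_index=0):
    """Mean Euclidean distance between root-aligned landmark sets (input units)."""
    pr, tr = pred[root_index], true[root_index]
    total = 0.0
    for p, t in zip(pred, true):
        # Express both points relative to their own root before comparing.
        diff = [(pc - prc) - (tc - trc) for pc, prc, tc, trc in zip(p, pr, t, tr)]
        total += math.sqrt(sum(d * d for d in diff))
    return total / len(pred)
```

Averaging this per-frame value over all frames of a sequence gives a single position error per recording.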

Experiment 1-B: 2D-Pose Based Kinematic Estimation.
In the previous experiment, we evaluate the multi-step baseline with the 3D body pose estimation method. However, the use of fully trained 2D pose estimation algorithms is common in kinematic estimation works [40,45,46]. Therefore, we conduct experiments to compare our method with these 2D-based kinematic estimation methods. In contrast to our CMS method, which estimates the 3D pose from a single camera, 2D pose estimation methods require at least two calibrated cameras for the estimation of 3D keypoints. The use of an additional camera to generate the pose could be an advantage, which the CMS method does not have. We chose a naive implementation of the OpenPose algorithm [10,25], which has been used extensively in related work [40,45,46]. Additionally, we test the MediaPipe implementation of the BlazePose algorithm [6] as a more modern 2D algorithm. MediaPipe is easy to use since it is available as a Python library; in contrast to OpenPose, it runs faster, allows for additional smoothing of its predictions, provides more key points, and uses a different set of keypoint labels.
For OpenPose and MediaPipe, we project the key points to 3D using the BML-MoVi camera parameters (https://github.com/saeed1262/MoVi-Toolbox, accessed on 2 August 2022). For OpenPose, we filled in missing points (due to self-occlusion) using linear interpolation. For MediaPipe, we chose the highest model complexity (2) and enabled landmark smoothing across consecutive video frames. For modeling and inverse kinematics, we follow the same steps as for the CMS method, only redefining the positions of markers on the OpenSim model to fit the provided key points.
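Filling gaps by linear interpolation, as mentioned above for missing OpenPose key points, can be sketched for a single coordinate of a single key point over time. The function name is hypothetical, missing detections are represented as `None`, and holding the nearest valid value at the start and end of a sequence is an assumption, since the text does not say how such edge gaps were handled.

```python
def interpolate_missing(values):
    """Fill None entries by linear interpolation between the nearest valid neighbours."""
    filled = list(values)
    valid = [i for i, v in enumerate(filled) if v is not None]
    if not valid:
        return filled
    # Leading and trailing gaps have only one neighbour, so hold that value.
    for i in range(valid[0]):
        filled[i] = filled[valid[0]]
    for i in range(valid[-1] + 1, len(filled)):
        filled[i] = filled[valid[-1]]
    # Interior gaps: weight the two surrounding detections by temporal distance.
    for left, right in zip(valid, valid[1:]):
        for i in range(left + 1, right):
            w = (i - left) / (right - left)
            filled[i] = (1 - w) * values[left] + w * values[right]
    return filled
```

The function would be applied separately to the x and y coordinate series of every key point before the 2D detections are combined into 3D.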
We compare against an average across both camera views for CMS and D3KE, as MediaPipe and OpenPose need at least two cameras to work. We evaluate performance based on the mean absolute error MAE_angle (°), the standard deviation of errors SD_angle (°) and the smoothness of the predictions as the mean velocity of the angle MV_angle (°/s). The mean velocity error is calculated by the derivative of the landmark position and joint angle data with respect to time.
3.2. Experiment 2: Sequential Network Variants. Since we have multiple options for the sequential networks, we evaluate three to determine whether the additional modeling of temporal dependencies in the data improves the accuracy of our method or not. For subsequent smoothing and reduction of self-occlusion artifacts of the estimations, we test three different networks, including an LSTM [23], temporal convolutional networks (TCNs) [49], and a lifting Transformer [31], as the sequential network. As smoothing is known to improve the accuracy of multi-step approaches [40], we also evaluate combinations of our CMS model with these sequential networks.
For the LSTM, we implement a bidirectional architecture with a hidden size of 128, three recurrent layers and a dropout probability of 0.1. For TCNs, we follow [49] to exploit 243 frames as the receptive field and make the momentum of batch normalization decay from 0.1 to 0.001. For the lifting Transformer, we use a hidden size of 256 and 8 parallel attention heads in the self-attention layer and a channel size of 512 in the convolutional layer. Each sequential network is trained with a sequence length of 243 frames and a batch size of 128 over 50 epochs with the Adam optimizer [28]. The learning rate exponentially decays from 10⁻³ to 5 × 10⁻⁶.
We use the same metrics for the comparison of individual network variants as we used for the comparison of D3KE and CMS. In addition, to investigate the smoothness of the predicted sequences, we estimate the mean velocity (MV) on bony landmark positions and joint angles, denoted as MV_BL and MV_angle.
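A smoothness measure of this kind can be approximated with finite differences. The sketch below computes the mean absolute frame-to-frame change of a one-dimensional series (for example one joint angle over time) and scales it by the frame rate. The exact definition used for MV_BL and MV_angle is not spelled out in the text, so the use of absolute first differences, the function name and the `fps` parameter are assumptions.

```python
def mean_velocity(series, fps):
    """Mean absolute first difference of a 1D series, scaled to units per second."""
    if len(series) < 2:
        return 0.0
    steps = [abs(b - a) for a, b in zip(series, series[1:])]
    return fps * sum(steps) / len(steps)
```

For a joint angle series given in degrees, the result is in degrees per second; applying the same function to the difference between predicted and ground truth series would give a velocity error instead.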
3.3. Experiment 3: Processing Speed. One important property of our proposed method for clinical applications is its processing speed. As applications for camera-based kinematic estimation should form an alternative to visual examinations in the future, it should ideally be able to run fast enough to estimate kinematics from video frames as fast as they are collected by a camera, mostly between 15 and 30 frames per second.
We compare the running time on Windows 10 with a four-core CPU, 52 GB RAM and an NVIDIA T4 GPU. We compare the CMS and D3KE methods, as they both use the same type of convolutional network. We choose the lifting Transformer as the sequential architecture in both the CMS and D3KE. For the CMS method, OpenSim is executed in parallel on four cores. We report results in frames per second.
3.4. Experiment 4: Generalization Performance. The goal of camera-based kinematic estimation is ultimately to create tools for researchers and clinicians to analyze and diagnose human movement; these tools should not discriminate between different subjects and movements. We analyze whether our method generalizes to different subjects, movements, and joints.
To assess how well D3KE generalizes, we compare the estimates of the proposed method to the ground truth on the time series of each of the 16 participants in the test set with respect to the performed movement, the joint, the camera view and the individual participant. For this test, we use the best-performing model from Experiment 1, which uses the lifting Transformer.
For all time series of joint angles, mean absolute error (MAE) and Pearson's correlation coefficient (ρ) were calculated between the estimation from D3KE and the ground truth.
Central tendencies in the data are reported as the median and interquartile range of MAE, RMSE and ρ, as the data are not normally distributed, as assessed through visual inspection and confirmed by the Shapiro-Wilk test. For completeness, mean and standard deviation are also reported. The absolute values of ρ were categorized as weak, moderate, strong and excellent for ρ ≤ 0.35, 0.35 < ρ ≤ 0.67, 0.67 < ρ ≤ 0.90 and 0.90 < ρ, respectively [56].
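The evaluation metrics and the categorization of ρ described above can be sketched as follows. The sketch uses only NumPy for brevity and is not the analysis code of the study; the function names and the toy knee-angle signal are hypothetical.

```python
import numpy as np


def mae(estimate, reference):
    """Mean absolute error between two time series."""
    diff = np.asarray(estimate, dtype=float) - np.asarray(reference, dtype=float)
    return float(np.mean(np.abs(diff)))


def rmse(estimate, reference):
    """Root mean squared error between two time series."""
    diff = np.asarray(estimate, dtype=float) - np.asarray(reference, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


def pearson_rho(estimate, reference):
    """Pearson's correlation coefficient between two time series."""
    return float(np.corrcoef(estimate, reference)[0, 1])


def categorize_rho(rho):
    """Map |rho| to the categories used in the text."""
    r = abs(rho)
    if r <= 0.35:
        return "weak"
    if r <= 0.67:
        return "moderate"
    if r <= 0.90:
        return "strong"
    return "excellent"


def median_iqr(values):
    """Median and interquartile range, for non-normally distributed errors."""
    q1, med, q3 = np.percentile(values, [25, 50, 75])
    return float(med), float(q3 - q1)


# Toy example: an estimate with a constant 2 degree offset from the reference.
t = np.linspace(0.0, 2.0 * np.pi, 100)
reference = 30.0 + 30.0 * np.sin(t)
estimate = reference + 2.0

example_mae = mae(estimate, reference)
example_rmse = rmse(estimate, reference)
example_category = categorize_rho(pearson_rho(estimate, reference))
```

The toy example also shows why both error and correlation are reported: a constant offset leaves the correlation perfect while still producing a non-zero MAE and RMSE.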
3.5. Software and Tools. All data analysis was conducted in Python 3 [58], using the pandas library to generate descriptive statistics, the SciPy library to calculate MAE and RMSE, and the Pingouin library to calculate ρ.
As shown in Table 1, D3KE performs better than the CMS methods in terms of joint angles and body scales, the two factors that are key to kinematic estimation. Notably, D3KE reduces MAE_angle by 37.4% relative to the CMS method when both use the Transformer architecture. Although the proposed method has a slightly larger MP-BLPE than the CMS, this metric is not related to the kinematic estimation and is only used as one of the losses during training in the proposed method (e.g., MP-BLPE is not minimized). This indicates a gain in accuracy when directly estimating kinematics from the video instead of following the multi-step approach of first estimating pose and then estimating kinematics.
Table 1. Comparison of bony landmark positions (MP-BLPE), body scales (MAE_body) and joint angles (MAE_angle) between the estimation and the ground truth across all participants, movements, joints and camera views. For the custom multi-step approach (CMS) as well as our proposed method, we compare convolutional networks with different temporal networks. All versions of the proposed method show superior performance for the prediction of body scales and joint angles. All CMSs show superior performance in estimating marker positions. Each method group performs better on the task it was optimized for, highlighting the importance of direct optimization. Bold numbers indicate the best performance.
In Table 2, we list RMSE and MAE for body scales of selected segments. The results show that the CMS performs better than D3KE on lower limbs, and D3KE performs better than the CMS on upper limbs. The CMS and the proposed method have comparable performance in scale estimation of the pelvis and lower limbs.
Table 3.
We find that algorithms that are trained on noisy labels and use fewer key points perform worse than ours. We see a clear difference in the mean velocity of the estimations between the unsmoothed OpenPose estimations and the smoothed MediaPipe estimations. Our proposed method D3KE still performs better, showing that even in an ideal scenario (CMS, i.e., no noise in the labels, enough markers, same distribution of training data) direct estimation is preferable.
The results of our investigation into reducing the noise in the estimations using temporal smoothing are shown in Table 4. The results show that all temporal models improve the smoothness of the sequence. The LSTM achieves the best performance on MV_BL and MV_angle among all temporal models; this is contrary to the results in Table 1, in which using a Transformer as the sequential model yielded the best results.
Table 5.
Since the skeletal-model layer must traverse body segments in level order, our proposed method is slower than the CMS for a batch size of 1. However, the support of mini-batch computation in the skeletal-model layer allows D3KE to run faster than the CMS at larger batch sizes, showing that our method can reach video framerate speeds on a competent GPU.
Figure 4 shows that both CMS and D3KE have relatively little variation across different movements and different participants, yet larger variations across individual joints. This is also reflected in the median MAEs and ρ per joint (Table 6): median MAEs vary within a range of 3.8° across joints, while they vary by under 1° across movements and participants. This shows that D3KE generalizes well to different participants and movements.
We visualize the estimation of musculoskeletal models from our proposed method with the Transformer architecture in Figure 5. We also show the comparison between estimation and ground truth of the left knee angle as an example of joint angle estimation quality.
We averaged the predicted sequence of scaling factors to scale the model and visualized it in OpenSim using the predicted joint angles as inputs. From the figure, we can see that the proposed method achieves results that agree with the single-view input video.
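The level-order traversal of the skeletal-model layer mentioned in the processing-speed results can be illustrated with a toy example. The sketch below is not the layer used in D3KE: it assumes a planar five-segment tree with made-up offsets, one rotation per segment, and a simplified scaling rule in which the attachment point of a child is stretched by the scale factor of its parent. It only shows why the loop over segments is sequential while the batch dimension is processed in parallel.

```python
import numpy as np

# Hypothetical planar skeleton, listed in level order (parents before children):
# 0 pelvis, 1 right femur, 2 left femur, 3 right tibia, 4 left tibia.
PARENTS = [-1, 0, 0, 1, 2]
# Hypothetical attachment offsets in the parent frame (metres).
OFFSETS = np.array([[0.0, 0.0], [0.1, 0.0], [-0.1, 0.0], [0.0, -0.4], [0.0, -0.4]])


def local_transform(angle, offset):
    """Build batched 2D homogeneous transforms from angles (B,) and offsets (B, 2)."""
    batch = angle.shape[0]
    c, s = np.cos(angle), np.sin(angle)
    T = np.zeros((batch, 3, 3))
    T[:, 0, 0], T[:, 0, 1] = c, -s
    T[:, 1, 0], T[:, 1, 1] = s, c
    T[:, 0, 2], T[:, 1, 2] = offset[:, 0], offset[:, 1]
    T[:, 2, 2] = 1.0
    return T


def forward_kinematics(angles, scales):
    """Return segment origins (B, N, 2) from angles and scales of shape (B, N)."""
    batch, n_segments = angles.shape
    world = [None] * n_segments
    for j in range(n_segments):              # sequential: a child needs its parent
        p = PARENTS[j]
        parent_scale = scales[:, [p]] if p >= 0 else np.ones((batch, 1))
        local = local_transform(angles[:, j], parent_scale * OFFSETS[j])
        # The matrix product is evaluated for the whole batch at once.
        world[j] = local if p < 0 else world[p] @ local
    return np.stack([T[:, :2, 2] for T in world], axis=1)


# Three frames: neutral pose, right hip rotated by 90 degrees, right femur scaled x2.
angles = np.zeros((3, 5))
scales = np.ones((3, 5))
angles[1, 1] = np.pi / 2
scales[2, 1] = 2.0
positions = forward_kinematics(angles, scales)
```

The loop over segments cannot be parallelized because each transform depends on its parent, which is consistent with the lower speed at a batch size of 1; the cost of that loop is shared by all frames in a batch.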

Discussion
In summary, we compared a direct approach of estimating joint angles from video images to the more traditional multi-step approach found in most recent works. The traditional method first estimates key points from a video of a subject, then calculates joint angles using a (musculo)skeletal model through an inverse-kinematics process. We developed a method, D3KE, consisting of a convolutional neural network and a sequential network, both of which include a specialized layer that performs the kinematic transforms of a (musculo)skeletal model. This layer allows for direct optimization of the predicted joint angles and treats the prediction of key points only as an auxiliary task. We compared our direct estimation approach against naive implementations of algorithms often used in the related literature, as well as against a self-implemented custom multi-step approach (CMS) that is trained on the same data as our direct approach. We show that direct estimation of kinematics yields higher accuracy in predicted joint angles than the traditional multi-step approach. Our results indicate that direct estimation can help the future development of algorithms for fast and accessible kinematic analysis for researchers and clinicians.

Direct vs. Multi-Step Estimation.
5.1.1. 3D-Pose Kinematic Estimation. To compare direct estimation vs. multi-step estimation, we compared our D3KE method against a 3D-pose based multi-step approach (CMS) with comparable network architecture and trained on the same training data. Compared to the CMS, our proposed method improves the accuracy of joint angle estimation. For all model combinations, we can see an improvement of about 35% in accuracy for the estimation of joint angles. Our results support the feasibility of our proposed method. It delivers improvements by directly optimizing the predicted joint angles and scaling factors while using the pose estimates only as an auxiliary task. The auxiliary task effectively imposes a constraint on the network estimation. We show that direct optimization is preferable to the multi-step approach when using videos from a single-camera view. We expect that, using additional specialized layers, a network might be able to directly optimize for individual muscle forces with comparable accuracy from a monocular video.
5.1.2. 2D-Pose Kinematic Estimation. We compared our direct approach (D3KE) and the self-trained multi-step approach (CMS) against two multi-step approaches commonly used in the literature. Compared to the more traditional implementations of the OpenPose and MediaPipe algorithms, both our proposed method and our CMS method show superior performance. From Tables 4 and 6, we see that estimations from D3KE are far smoother than those of the traditional methods. This is likely due to multiple reasons. The predicted key points of OpenPose suffer from systematic errors due to inaccuracies in their training data [13], which can explain the drop in performance. For MediaPipe, the accuracy of the labels in the training data is not known, as the paper only states that the annotators were human [6], not whether they had expertise in labeling anatomical key points.
The lack of smoothing for the predicted OpenPose key points can also contribute to its overall worse performance. MediaPipe, which uses internal smoothing, shows better results in comparison. In general, we expected worse performance from OpenPose and MediaPipe as they only predict 18 and 33 key points, respectively, while we supervise a total of 77. Both traditional methods predict key points representing joint centers and not markers on body segments, which makes it hard to distinguish rotations between different body segments during the musculoskeletal modeling step. OpenPose shows by far the noisiest estimation, and the effect of smoothing on the MediaPipe estimation is clear; still, our proposed method and the implemented CMS work best, probably due to the constraint imposed by the additional markers.
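The idea of optimizing joint angles and scaling factors directly, with the marker positions acting only as an auxiliary constraint, can be sketched as a weighted multi-task loss. The sketch below is an assumption-laden illustration rather than the loss used in D3KE: the L1 and Euclidean error terms, the weights and the name `direct_kinematics_loss` are all hypothetical.

```python
import numpy as np


def direct_kinematics_loss(pred, target, w_angle=1.0, w_scale=1.0, w_marker=0.1):
    """Weighted sum of a joint-angle term, a body-scale term and an auxiliary marker term.

    pred and target are dicts with keys "angles" (A,), "scales" (S,)
    and "markers" (M, 3). The weights are hypothetical.
    """
    angle_loss = np.mean(np.abs(pred["angles"] - target["angles"]))
    scale_loss = np.mean(np.abs(pred["scales"] - target["scales"]))
    # Auxiliary task: mean Euclidean distance between predicted and true markers.
    marker_loss = np.mean(np.linalg.norm(pred["markers"] - target["markers"], axis=-1))
    return float(w_angle * angle_loss + w_scale * scale_loss + w_marker * marker_loss)


target = {
    "angles": np.zeros(4),
    "scales": np.ones(3),
    "markers": np.zeros((5, 3)),
}
# A prediction that is wrong by one degree on every angle but otherwise perfect.
pred = {
    "angles": np.ones(4),
    "scales": np.ones(3),
    "markers": np.zeros((5, 3)),
}
loss_perfect = direct_kinematics_loss(target, target)
loss_angle_error = direct_kinematics_loss(pred, target)
```

In a multi-step pipeline only a marker or key-point term of this kind is optimized, and angle errors are never seen by the network; here they contribute to the objective directly.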

Network Variants.
We compared three commonly used sequential networks to improve our estimation. We show that all sequential networks improve our estimation accuracy. One possible explanation for this is that the network is better able to handle 'self-occlusion' artifacts. The estimation of a 3D pose from a single camera is an ill-posed problem, as a 3D point projected to a 2D image can originate from any position along the ray that falls on the image sensor and forms the corresponding pixel. This can lead to self-occlusion, such as the torso and left arm occluding the right arm during a right arm swing in frames recorded from the left sagittal view. During self-occlusion, it is difficult for frame-based networks to make a good estimate, as they lack temporal information about previous angles of the arm to extrapolate from. Sequential networks, on the other hand, have access to temporal information, which can allow for more accurate estimations. Although we only see a slight increase of 0.001° in the MAE of the joint angle estimation, we can see a clear improvement of the sequential models in the smoothness of the predicted angles (Table 4). This might be due to the network learning to interpolate motions during occurrences of self-occlusion.
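The depth ambiguity underlying this ill-posedness can be made concrete with a simple pinhole projection. The sketch below assumes an idealized pinhole camera without lens distortion; the focal length, principal point and 3D coordinates are hypothetical values chosen for illustration.

```python
import numpy as np


def project(point_3d, focal_length=1000.0, principal_point=(640.0, 360.0)):
    """Project a 3D point in camera coordinates (metres) to pixel coordinates."""
    x, y, z = point_3d
    u = focal_length * x / z + principal_point[0]
    v = focal_length * y / z + principal_point[1]
    return np.array([u, v])


# Two hypothetical wrist positions on the same ray through the camera centre.
wrist_near = np.array([0.2, -0.1, 2.0])
wrist_far = 1.5 * wrist_near

pixel_near = project(wrist_near)
pixel_far = project(wrist_far)      # same pixel, although the depth differs by 1 m
```

Both points land on the same pixel, so a single frame cannot tell them apart; temporal context or a body model is needed to resolve the depth.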

Processing Speed.
Compared to the multi-step baseline, CMS, our D3KE approach shows increased calculation speeds for larger batch sizes. Both CMS and D3KE make use of the same ResNeXt50 architecture, which should show approximately the same performance increase with increasing batch sizes for both methods. D3KE could be expected to be slower, as it also has the additional time cost of calculating the pose from the estimated kinematics in the skeletal model layer. However, due to its multi-step nature, CMS has to perform an additional inverse kinematics calculation. This calculation seems to form a bottleneck in the processing speed of the CMS approach restricting it to a framerate of 8 fps. Other multi-step algorithms will most likely encounter the same problem. In the case of OpenPose, which runs at about 4 fps [6], even lower frame rates can be expected for a complete pipeline. This shows the advantage in the processing time of our direct approach.
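The bottleneck argument can be made explicit with a short calculation. The sketch below assumes two idealized execution models, one in which the stages run one after another for each frame and one in which they are perfectly pipelined; the stage speeds are hypothetical and are not measurements from our experiments.

```python
def sequential_fps(stage_fps):
    """Throughput when all stages run one after another for each frame."""
    # Per-frame processing times (1 / fps) add up.
    return 1.0 / sum(1.0 / fps for fps in stage_fps)


def pipelined_fps(stage_fps):
    """Upper bound when stages overlap perfectly: the slowest stage dominates."""
    return min(stage_fps)


# Hypothetical multi-step pipeline: pose network at 100 fps, inverse kinematics at 10 fps.
stages = [100.0, 10.0]
fps_sequential = sequential_fps(stages)
fps_pipelined = pipelined_fps(stages)
```

In both models the overall frame rate cannot exceed that of the slowest stage, so speeding up the pose network alone does not help a pipeline that is limited by its inverse kinematics step.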
For a method to be usable in everyday life, it should run reasonably fast. Processing speeds between 15 and 30 frames per second are favorable, as they show that a method can process a video as fast as its frames are collected. However, our results might not directly translate to every real-world scenario. To process multiple images simultaneously as batches, D3KE currently requires GPUs that are not available in mobile devices, which prevents it from being portable. In addition, we use the Faster R-CNN object detection network to crop our images. This step was not included in the processing speed evaluation, as it is highly dependent on the chosen object-detection algorithm. However, with inference speeds of 12 fps, the Faster R-CNN object detection would form a bottleneck in applying our method in real-time applications. Given the speed of development in the field of object detection, Faster R-CNN can by now be regarded as an old algorithm, and newer and faster object detectors should be used instead. The YOLOv7 algorithm [61], which performs object detection at up to 286 fps, could be considered. In general, the current architecture is not optimized for speed or for a specific technology, and we are using off-the-shelf, fairly standard convolutional and sequential architectures. For these architectures, smaller and faster alternatives might be found in the future. When optimizing for all these points, we predict the proposed method could run on mobile devices within a few years, effectively making a 3D kinematic analysis instrument available to anyone with a mobile phone or tablet.
5.4. Generalization Performance. Our method generalizes well on the tested data. As shown in Figure 4 and Table 6, the estimation variations across participants and movements are small. We can conclude that our method shows the ability to generalize to different camera views, participants, and performed movements within the tested dataset.
Our results indicate that D3KE could be generally applied to a variety of people and movements, including clinical and sports applications, e.g., physiotherapists and athletes, when trained on sufficient additional data.
Although we show good generalization performance on the BMLMovi database, it is difficult to estimate how well our method will generalize in a real-world scenario. In machine learning settings, training data is often not representative of the task of the network in the real world [5] and can introduce biases if the network is applied to scenarios that are very different from the one represented in the training data. Unfortunately, there is currently a lack of deep learning datasets for kinematic analysis [52,40,13,60]. In addition, while the BMLMovi database is excellent for training neural networks due to the large number of participants performing movements and the diversity of execution styles, it might not be extensive enough to train a network for biomedical applications in the real world. To evaluate the current method fully, however, such an extensive dataset would be necessary. In general, we expect a drop in accuracy when our method is applied to a scenario different from the BMLMovi database. As we train on just two calibrated cameras, we expect our method to be most vulnerable to alternative camera positions that do not show people in either frontal or sagittal view. Future research should investigate the stability of direct estimation methods when applied to data that differs significantly from the training data.
5.5. Future Work. To improve the accuracy of the algorithm and provide further insight into the strengths and weaknesses of monocular joint angle estimation, a new dataset with dedicated annotation is needed. A dataset specifically designed for the estimation of joint angles and/or kinetics could improve the accuracy of the algorithm.
This dataset could be established with a large number of camera views, including top-down views for better estimation of movements in the transverse plane. Participants would perform movements that exercise the full ranges of motion (ROMs) of individual joints, including the upper extremities, as well as movements that are relevant for health care professionals, such as physiotherapy exercises and other clinical tests. In addition, the inclusion of abnormal movement patterns could give better insights into the clinical relevance of newly developed methods.
Transfer learning could be explored to apply D3KE in settings where little training data are available. Videos that are very different from the BMLMovi training data, such as videos in which people wear more clothes, are in different surroundings, or are filmed from a different camera view, will most probably yield worse accuracy than shown in this paper. Transfer learning of a pre-trained D3KE on a minimal portion of a dataset could be investigated as an alternative to the time-consuming collection of a novel dataset.
The capabilities of D3KE as an adapter for kinetic analysis of a movement in OpenSim could be explored. Given data similar to BMLMovi or successful transfer learning on relevant data beforehand, our method provides an easy way to skip the tedious steps of scaling and running inverse kinematics on an MSM. This enables the quick generation of MSMs for kinetic analysis from just a single video. Even if this kinematic estimation comes at the cost of reduced accuracy, it could provide coarse insights into collected data, which can later be confirmed through finer analysis with the manually scaled MSMs.
D3KE could be made more generally applicable if the underlying model of the skeletal-model layer were not fixed, as it currently is. Future iterations could explore combinations of the PyTorch and OpenSim Python libraries to allow training a network on a self-defined model, or to allow a pre-trained model to be refined through transfer learning for, e.g., estimation of only the joint angles around the shoulder.
Existing Explainable AI tools should be applied to better understand the inner workings of D3KE. Deep neural networks are capable of high accuracy estimation, because of their ability to break down highly complex tasks into simpler tasks [2], but understanding what these simpler tasks are is non-trivial. Research in Explainable AI has generated tools and frameworks that allow one to better understand the basis of the final predictions of a network [53]. Applying these tools could help users and researchers alike to better understand the biases and limitations of our method. D3KE can still predict the joint angles even if these joints are occluded; this means it must make assumptions. What these assumptions are and how they came to be are important to estimate the trustworthiness of this algorithm in a real-world scenario.

Conclusions
In this paper, we present a novel end-to-end neural network for the estimation of segment joint angles of the human body. In contrast to previous methods, we directly regress the joint angles and the scale of individual segments from the input video. We trained our method from scratch on the BMLMovi database and compared it against a 3D pose estimation method to which we applied the inverse kinematics tool of OpenSim to obtain the kinematics.
We conclude that direct estimation of joint angles is preferable in a single-camera setting, as it is more accurate than the common approach of fitting an estimated pose to a musculoskeletal model and performing inverse kinematics. By allowing the network to directly optimize for the joint angles and scaling factors, our method is less prone to errors in the key point labels used to predict key point locations for pose estimation. In addition, the use of a sequential model is important when designing a neural network architecture for kinematic estimation, as it allows predictions to be smoothed over time to create better estimates of limb position and joint angles during self-occlusion.
While using deep learning for biomedical solutions is still in its infancy, the presented method shows that training networks from scratch for specialized tasks is a viable way to estimate joint angles from a single-camera video. With further advancements in the underlying algorithms as well as in computational performance, we predict that the methodology we have presented will assist biomedical and clinical practitioners in measuring and monitoring human movement in the near future.