Article

Lightweight Three-Dimensional Pose and Joint Center Estimation Model for Rehabilitation Therapy

1 Korea Electronics Technology Institute, Gwangju 61011, Republic of Korea
2 Department of ICT Convergence System Engineering, Chonnam National University, 77, Yongbong-ro, Buk-gu, Gwangju 500-757, Republic of Korea
* Author to whom correspondence should be addressed.
Electronics 2023, 12(20), 4273; https://doi.org/10.3390/electronics12204273
Submission received: 30 July 2023 / Revised: 1 October 2023 / Accepted: 11 October 2023 / Published: 16 October 2023
(This article belongs to the Section Artificial Intelligence)

Abstract

In this study, we propose a novel transformer-based model with independent tokens for estimating three-dimensional (3D) human pose and shape from monocular videos, specifically focusing on its application in rehabilitation therapy. The main objective is to recover pixel-aligned, rehabilitation-customized 3D human poses and body shapes directly from monocular images or videos, which is a challenging task owing to inherent ambiguity. Existing human pose estimation methods rely heavily on an initialized mean pose and shape as prior estimates and employ parameter regression with iterative error feedback. Moreover, video-based approaches struggle to capture joint-level rotational motion and to ensure local temporal consistency, even though they enhance single-frame features by modeling the overall changes in image-level features. To address these limitations, we introduce two types of characterization tokens specifically designed for rehabilitation therapy: joint-rotation and camera tokens. These tokens progressively interact with the image features through the transformer layers and encode prior knowledge of human 3D joint rotations (i.e., position information derived from large-scale data). By updating these tokens, we can estimate the SMPL parameters for a given image. Furthermore, we incorporate a temporal model that effectively captures the rotational temporal information of each joint, thereby reducing jitter in local parts. The performance of our method is comparable with that of the current best-performing models. In addition, we analyze the structural differences among the models to create a pose classification model for rehabilitation. Leveraging ResNet-50 and transformer architectures, we achieve a PA-MPJPE of 49.0 mm on the 3DPW dataset.

1. Introduction

Capturing human body pose movements is necessary for various fields, such as rehabilitation exercise analysis, human–computer interaction, film production, digital avatars [1,2,3,4,5,6], animation [7,8,9,10,11,12], VR/AR [13,14,15,16,17,18], etc. To date, automatic methods that reconstruct a human body in three dimensions (3D) have been explored to capture human pose movements. Typically, research on classifying human body postures involves estimating a human’s 3D pose and shape from one or more color images [19,20,21,22,23,24,25,26]. The pose estimation methods demonstrated to date have shown impressive results in estimating a 3D human that fits well with the image features extracted from camera views. In the early stages of research, a method based on deep neural networks (DNNs) was proposed for human pose estimation [27]. In the DeepPose study, pose estimation was defined as a DNN-based regression problem targeting the prediction of body joint locations, and a cascade of multiple DNN regressors was introduced for this purpose. Since then, research has also been conducted to overcome the difficulty of estimating poses in videos; the DCPose study addressed this problem with the following three modules [28]: a pose temporal merger (PTM) module that aggregates the initial keypoint predictions of neighboring frames, considering the pose spatial context, to narrow the search range for pose prediction; a pose residual fusion (PRF) module that computes pose residuals from the keypoint heatmaps and uses them as additional temporal cues; and a pose correction network (PCN) module that refines the initial keypoint predictions, adjusting the final pose estimate based on the pose residual features.
However, estimating pose from objects viewed from different angles or placed in a 3D scene is a considerable challenge owing to low estimation accuracies, which can be addressed using marker-based motion capture systems. The existing marker-based motion capture systems [29,30,31,32,33,34] provide accurate movement information. However, the associated fitting process is time- and power-intensive; thus, such systems have restricted applicability. As a result, non-marker-based motion capture [35,36,37,38,39,40], which is based on high-accuracy RGB images and video processing techniques, has gained attention, with extensive research being conducted in the fields of deep learning and computer vision.
In particular, monocular 3D human pose and shape estimation techniques, which utilize the skinned multi-person linear model (SMPL) [41] and various datasets with 3D annotations, have advanced significantly. Representative methods include the video inference for human body pose and shape estimation (VIBE) [42], temporally consistent mesh recovery (TCMR) [43], and multi-level attention encoder–decoder network (MAED) [44] models, which estimate the SMPL parameters using an iterative error feedback technique starting from an initialized mean pose and shape. The drawback of these approaches lies in their excessive reliance on image features to detect changes in overall human motion within video footage. Additionally, they have limitations in reflecting the rotational motion of each joint and in temporally extending single-frame features. Figure 1 shows a common characteristic of the existing algorithms, i.e., the temporal use of single-frame features.
However, using only single-frame features in these algorithms is disadvantageous: this approach can neither compensate for the rotational motion characteristics of the joints nor ensure consistency in joint movements. To address these shortcomings, it is essential to understand 3D human reconstruction for rehabilitation poses from a causal perspective. The primary causes of changes in image pixels and human body appearance are not background variations but 3D joint rotations, which are reflected through the human skeletal dynamics and the observer viewpoint. The 3D relative rotations of each joint and the human body shape are independent of specific images and observer viewpoints. In other words, joint rotations are not directly observable; they are concepts independent of images and viewpoints.
Therefore, researchers in this field have proposed a 3D human pose and shape estimation model based on three independent tokens, namely joint, camera, and shape tokens, which account for these factors. This approach excels in the field of pose estimation, demonstrating high accuracy. However, our study focuses on rehabilitation therapy, wherein precise posture estimation is crucial. In rehabilitation therapy, the classification of user postures places significant emphasis on joint orientation and angles. Accordingly, we utilize two types of independent tokens for rehabilitation therapy that encode 3D joint rotation and camera information. These initialized tokens learn prior knowledge and interdependencies from large-scale training data, thus eliminating the need for iterative regression or biomechanical topology decoders. The tokens interact with the 2D image evidence through the transformer architecture and update themselves, ultimately providing posterior estimates of the pose, shape, and camera parameters. Our model abstracts joint-rotation tokens from image pixels to represent the motion state of each joint and establishes temporal correlations to capture the temporal rotational movement of each joint independently. As a result, it ensures overall temporal consistency and coherence.

2. Literature Review: Related Work

2.1. VIBE

As shown in Table 1, VIBE extracts frame features from a given video and learns a temporal encoder using bidirectional gated recurrent units (GRUs) to output latent variables that incorporate past and future frame information. VIBE utilizes the extracted latent variable features to estimate SMPL body-model parameters at each time step. SMPL represents the pose and shape of a human body using Θ, which consists of pose parameters (θ) and shape parameters (β). The pose parameters comprise the global body rotation and the relative rotations of 23 joints in axis-angle form. The shape parameters are the first 10 coefficients of the PCA [45] shape space. VIBE accepts a video sequence as input and computes the Θ values.
In this process, VIBE estimates the pose parameters at each frame and a single body-shape value for the entire input sequence through average pooling; this module is called the temporal generator G. Then, the output Θ values and a sample Θ_real from AMASS [46] are fed to the motion discriminator (D_M) to distinguish between fake and real examples. The GRU is designed to benefit from past video pose information for improved prediction of future frames, particularly in cases where the pose is ambiguous or the body is partially occluded, and past information can help resolve and constrain the pose estimate. The collective loss of the proposed temporal encoder consists of 2D keypoint (x), 3D keypoint (X), pose (θ), and shape (β) components (assuming the corresponding labels are available), and this collective loss is further combined with an adversarial D_M loss. Specifically, the aggregate loss of G is evaluated as follows:
L_G = L_{3D} + L_{2D} + L_{SMPL} + L_{adv}
Here, each element is computed as:
L_{3D} = \sum_{t=1}^{T} \lVert X_t - \hat{X}_t \rVert_2, \quad L_{2D} = \sum_{t=1}^{T} \lVert x_t - \hat{x}_t \rVert_2, \quad L_{SMPL} = \lVert \beta - \hat{\beta} \rVert_2 + \sum_{t=1}^{T} \lVert \theta_t - \hat{\theta}_t \rVert_2
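To make the composition of these terms concrete, the following minimal PyTorch sketch assembles the generator loss from predicted and ground-truth quantities; the tensor names, shapes, and the use of a mean-squared error are illustrative assumptions rather than the released VIBE implementation.

```python
import torch
import torch.nn.functional as F

def vibe_generator_loss(pred_x3d, gt_x3d, pred_x2d, gt_x2d,
                        pred_beta, gt_beta, pred_theta, gt_theta, adv_term):
    """Assemble L_G = L_3D + L_2D + L_SMPL + L_adv from predictions and labels.

    Assumed shapes: (T, J, 3) for 3D joints, (T, J, 2) for 2D keypoints,
    (10,) for shape, and (T, 72) for per-frame pose parameters.
    """
    l_3d = F.mse_loss(pred_x3d, gt_x3d)        # 3D joint position term
    l_2d = F.mse_loss(pred_x2d, gt_x2d)        # 2D reprojection term
    l_smpl = F.mse_loss(pred_beta, gt_beta) + F.mse_loss(pred_theta, gt_theta)
    return l_3d + l_2d + l_smpl + adv_term     # adv_term is supplied by D_M (see below)
```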
The motion discriminator determines whether a generated pose sequence corresponds to a realistic motion. Self-attention is used to integrate the latent codes produced by its GRU layers, and finally, a linear layer predicts the probability that Θ represents a plausible human motion. The adversarial loss term is backpropagated to the generator and used for training. The adversarial loss component applied during backpropagation to G is expressed as:
L_{adv} = \mathbb{E}_{\Theta \sim p_G} \left[ \left( D_M(\hat{\Theta}) - 1 \right)^2 \right]
Moreover, the objective of D_M is defined as:
L_{D_M} = \mathbb{E}_{\Theta \sim p_R} \left[ \left( D_M(\Theta) - 1 \right)^2 \right] + \mathbb{E}_{\Theta \sim p_G} \left[ D_M(\hat{\Theta})^2 \right]
Here, p_R denotes an actual motion sequence sourced from the AMASS dataset, and p_G refers to a synthesized motion sequence. With training on authentic poses, D_M becomes adept at understanding credible body pose arrangements and thus mitigates the need for a separate single-frame discriminator. VIBE also experiments with a motion prior model (MPoser), as shown in Figure 2. MPoser extends the variational body pose prior (VPoser) [47], which learns a latent representation of plausible human poses, to temporal sequences of human motion. MPoser is used as a regularization term to penalize unrealistic sequences. Finally, VIBE uses a self-attention mechanism to overcome the limitations of recurrent networks. A recurrent network updates its hidden state while processing the input sequence such that the last hidden state summarizes the information of the whole sequence. In contrast, the self-attention mechanism amplifies the contribution of the most important frames in the final representation. As a result, the representation r of the input sequence Θ̂ becomes a learned convex combination of the hidden states.
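A rough PyTorch sketch of such a motion discriminator, with GRU layers, attention pooling into a convex combination of hidden states, and a linear scoring head, is given below; the layer sizes are assumptions, not the published VIBE configuration.

```python
import torch
import torch.nn as nn

class MotionDiscriminator(nn.Module):
    """Sketch of a VIBE-style motion discriminator D_M: GRU layers followed by
    attention pooling and a linear head that scores a pose-parameter sequence."""

    def __init__(self, pose_dim=72 + 10, hidden=1024):
        super().__init__()
        self.gru = nn.GRU(pose_dim, hidden, num_layers=2, batch_first=True)
        self.attention = nn.Sequential(nn.Linear(hidden, 1024),
                                       nn.Tanh(),
                                       nn.Linear(1024, 1))
        self.head = nn.Linear(hidden, 1)

    def forward(self, theta_seq):                          # (B, T, pose_dim)
        h, _ = self.gru(theta_seq)                          # hidden state for every frame
        weights = torch.softmax(self.attention(h), dim=1)   # convex combination weights
        r = (weights * h).sum(dim=1)                         # attention-pooled representation
        return self.head(r)                                  # real/synthetic score per sequence
```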

2.2. TCMR

Traditional 3D human pose estimation methods rely heavily on static features, resulting in jittering in the output. With TCMR, even when the same per-frame errors occur, a relatively smooth result is obtained by exploiting temporal data. The operation mechanism of TCMR is described in Figure 3.
From a given series of T RGB frames (I1, I2, …, and IT), static image features are extracted for each frame via a pre-trained ResNet. Next, a “global average pooling” is applied to the ResNet output, resulting in f1, f2, …, and fT. In this case, the network weights of the ResNet are shared by all the frames. TCMR uses a bidirectional GRU derived from the extracted static features of all the input frames to calculate the temporal features of the current frame. A bidirectional GRU consists of two unidirectional GRUs, which are denoted by G(all) and extract the temporal features with opposite time directions from the input static features.
The initial inputs of the two GRUs are f1 and fT, and the initial hidden states are initialized to zero tensors. The two GRUs then repeatedly update the hidden states by aggregating the static features of the next frame. Unlike VIBE, TCMR does not add a residual connection from the static feature of the current frame; thus, the temporal features are not dominated by that static feature. TCMR compiles the temporal features retrieved from all the frames (g(all)), the preceding frames (g(past)), and the subsequent frames (g(future)) for the final 3D mesh prediction, as demonstrated in Figure 4. During integration, each temporal feature is passed through a ReLU activation function and a fully connected layer to adjust the channel dimension to 2048.
Figure 4 illustrates the integration of temporal features for estimating the 3D human mesh of the current frame. TCMR performs temporal encoding with PoseForecast, which predicts additional temporal features for the current target pose from the past and future frames using two additional GRUs, labeled G_past and G_future. G_past and G_future receive f1 and fT, respectively, as initial inputs, update their hidden states, and produce the temporal features of the past and future frames. Thus, TCMR receives the images as input, extracts the features with ResNet, and stores them on disk. Three temporal encoders then extract the temporal features. TCMR accepts the frames preceding the current frame as inputs and creates a feature from the past frames; in other words, it forecasts the current pose by analyzing only the past frames. Unlike other algorithms, TCMR does not use a residual connection; instead, it relies only on the values derived from the temporal encoders. TCMR likewise creates a feature from the future frames; that is, it forecasts the current pose by analyzing only the future frames. By integrating the aforementioned three streams with the attention module, TCMR avoids preferentially using only the static features of the current frame and focuses more on temporal features, which significantly reduces temporal errors. When the VIBE and MEVA [48] models are adopted, large motion changes result in large errors; thus, TCMR is more effective than these conventional methods.
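The three temporal streams can be sketched as follows; the hidden sizes, the mid-frame indexing, and the fusion interface are assumptions for illustration and do not reproduce the released TCMR code.

```python
import torch
import torch.nn as nn

class TCMRTemporalEncoder(nn.Module):
    """Rough sketch of TCMR's three temporal streams: a bidirectional GRU over
    all frames plus two unidirectional GRUs that forecast the mid-frame pose
    from past-only and future-only frames. Dimensions are illustrative."""

    def __init__(self, feat_dim=2048, hidden=1024):
        super().__init__()
        self.gru_all = nn.GRU(feat_dim, hidden, bidirectional=True, batch_first=True)
        self.gru_past = nn.GRU(feat_dim, hidden, batch_first=True)
        self.gru_future = nn.GRU(feat_dim, hidden, batch_first=True)
        # project each stream back to the 2048-d channel size before attention fusion
        self.proj = nn.ModuleList(
            [nn.Sequential(nn.ReLU(), nn.Linear(2 * hidden, feat_dim))] +
            [nn.Sequential(nn.ReLU(), nn.Linear(hidden, feat_dim)) for _ in range(2)])

    def forward(self, f):                                  # f: (B, T, feat_dim), T > 2
        mid = f.shape[1] // 2                              # current (target) frame index
        g_all, _ = self.gru_all(f)
        g_past, _ = self.gru_past(f[:, :mid])              # past frames only
        g_fut, _ = self.gru_future(f[:, mid + 1:].flip(1)) # future frames, reversed in time
        streams = [g_all[:, mid], g_past[:, -1], g_fut[:, -1]]
        return [p(s) for p, s in zip(self.proj, streams)]  # g(all), g(past), g(future)
```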

2.3. MAED

The MAED framework is built on the SMPL model, a parametric model that maps a set of shape and pose parameters to a 3D human-body mesh. SMPL is a parameterized 3D human body model with N = 6890 vertices and K = 23 joints and takes shape and pose parameters as inputs. The shape parameter belongs to the PCA shape space and consists of the first 10 coefficients that control the shape of the human body (height, weight, etc.). The pose parameters control the joint poses and represent the relative rotation of each joint; the 3D joints themselves are obtained from the mesh via linear regression. Figure 5 shows the architecture of the proposed network.
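For concreteness, a minimal sketch of querying an SMPL model is given below; it assumes the publicly available smplx Python package and locally downloaded SMPL model files (the model path is a placeholder) and is not the MAED implementation.

```python
import torch
import smplx  # pip install smplx; requires downloaded SMPL model files

# Map shape (beta, 10 coefficients) and pose (global orientation plus 23 joint
# rotations in axis-angle form) parameters to a 6890-vertex mesh.
smpl = smplx.create('/path/to/models', model_type='smpl', gender='neutral')

betas = torch.zeros(1, 10)          # first 10 PCA shape coefficients
body_pose = torch.zeros(1, 69)      # 23 joints x 3 axis-angle values
global_orient = torch.zeros(1, 3)   # root rotation

out = smpl(betas=betas, body_pose=body_pose, global_orient=global_orient)
print(out.vertices.shape)           # (1, 6890, 3) mesh vertices
print(out.joints.shape)             # joint locations regressed from the mesh
```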
The model receives a video clip of length T as input and extracts basic features using a CNN backbone for each frame. The global pooling layer at the end of the CNN is removed, resulting in T feature maps of size (h × w × d). Each feature map is reshaped into a 1D sequence of size (hw × d), and trainable embeddings are added to each sequence. The sequences to which these embeddings are added are arranged in a matrix, and these fundamental features are spatiotemporally modeled using the proposed spatial–temporal encoder (STE). The encoded vector corresponding to the previously added embedding is extracted as the output of the STE. The proposed kinematic topology decoder is then used to estimate the shape, pose, and camera parameters from the output of the STE. These predicted parameters are subsequently fed to SMPL to compute the 3D joints and their 2D projections, while the STE simultaneously performs spatial–temporal modeling. In video-based computer vision tasks, the loss of spatial information due to global pooling operations severely constrains detailed human pose estimation. To address this issue, MAED serializes the input video clips in several ways and designs three attention variants: multi-head spatial self-attention (MSA-S), multi-head temporal self-attention (MSA-T), and multi-head coupled self-attention (MSA-C). Three types of spatial–temporal encoder blocks are further designed, as shown in Figure 6, which endow the encoder with global spatial awareness and temporal inference capabilities. Finally, multiple STE blocks are stacked to form the STE.
Because conventional MSA can learn only 1D attention, the order of the input dimensions affects the meaning of the learned attention. The three variants proposed in the MAED study have similar model structures; however, the orders of their input dimensions differ. MSA-S identifies important spatial information in each frame, namely human joints and limbs. As shown in the blue box in Figure 6a, each self-attention head outputs a (T × N × N) heat map, calculated using the scaled dot product. However, this setting does not capture the temporal relationship between frames because the patches in one frame do not interact with those in other frames. The function of MSA-T is similar to that of MSA-S; however, MSA-T reshapes the input matrix from (T × N × d) to (N × T × d). As shown in the green box in Figure 6b, each head of the MSA-T outputs an (N × T × T) heat map. Each score reflects the attention between a patch at a particular location and the patches at the same location in different frames. MSA-T explicitly models the temporal dimension but ignores the spatial relationship between patches in the same frame.
MSA-C flattens the patch and frame dimensions simultaneously, implying that the input matrix is reshaped from (T × N × d) to (TN × d), as shown in the yellow box in Figure 6c. This process results in a (TN × TN) heat map, in which every patch interacts with every other patch in the video clip. As shown in Figure 6, MAED designs three types of STE blocks based on these MSA variants. The coupling block models the coupled spatiotemporal information with MSA-C, followed by multi-layer perceptron (MLP) layers. However, this coupled modeling increases the computational complexity. The parallel and series blocks connect MSA-S and MSA-T in parallel or in series, respectively. For the parallel block, the outputs of the two branches can simply be computed as the element-wise average of the MSA-S and MSA-T outputs. In addition, attention weights for the temporal and spatial branches are calculated to obtain dynamically balanced spatial and temporal information.
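The three serialization schemes can be illustrated with plain tensor reshapes; the snippet below uses a generic nn.MultiheadAttention layer as a stand-in for MAED's attention blocks, and the clip, patch, and channel sizes are arbitrary.

```python
import torch
import torch.nn as nn

# Patch tokens for a clip: T frames, N patches per frame, d channels.
T, N, d = 8, 49, 256
x = torch.randn(T, N, d)
attn = nn.MultiheadAttention(embed_dim=d, num_heads=8, batch_first=True)

# MSA-S: attention within each frame -> T attention maps of shape (N, N)
spatial, _ = attn(x, x, x)                 # batch dim = frames, sequence = patches

# MSA-T: attention across frames at the same patch location -> N maps of (T, T)
xt = x.permute(1, 0, 2)                    # reshape (T, N, d) -> (N, T, d)
temporal, _ = attn(xt, xt, xt)

# MSA-C: flatten patches and frames jointly -> one (T*N, T*N) attention map
xc = x.reshape(1, T * N, d)
coupled, _ = attn(xc, xc, xc)
```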

3. Methods

The three models introduced above (VIBE, TCMR, and MAED) temporally extend single-frame features. The left diagram in Figure 7 illustrates this common aspect of the existing algorithms. However, using single-frame features leads to other issues, including the inability to correct the rotational motion characteristics of each joint and inconsistency in joint movements. An independent token (INT) model [49] was presented in a previous study to mitigate these shortcomings. The INT model, shown in Figure 7 (right), is a transformer-based model that represents each joint (left elbow, right elbow, right knee, etc.) and performs estimation with three types of independent tokens (joint-rotation tokens, a shape token, and a camera token).
The INT model uses a temporal model to capture the time-based rotation information of each joint. Based on the learned results, the INT model helps reduce jitter in local parts of the body mesh. Therefore, it is well suited to capturing the individual rotational motion of the joints and to maintaining the temporal consistency and plausibility of each joint rotation.
Therefore, in this study, we compare the results obtained using our proposed model framework with those of the INT model. Pose classification for rehabilitation, however, considers only a small number of people (typically one) and no significant changes in the background. In addition, we focus on joint and position information rather than on models of the human body and conduct the study under these assumptions. In other words, whereas the INT model adjusts the posture model using the Shape_Token together with the Cam_Token, this study uses only the Joint_Token and Cam_Token. The changes in the image pixels (that is, the changes in a person’s body) can be divided into joint rotation and viewpoint (camera) information. These two tokens are updated to encode prior knowledge of 3D human joint rotations, to learn the location information extracted from large-scale data through the transformer layers, and to estimate the SMPL parameters. For each rehabilitation image, our model verifies that the joint rotations are well maintained (whether consistent or changing). The model is schematically illustrated in Figure 8.
The process is the same as that of the conventional algorithms until the image passes through the ResNet; a feature map is then extracted, flattened, projected with a linear layer, and combined with position embeddings. Next, we append the two types of tokens, which are randomly initialized. In terms of dimensions, one token is allocated to each of the 24 joint rotations, and the token sequence is completed with a single camera token. Both types of tokens are updated through the transformer. The joint-rotation tokens are decoded into SMPL 3D rotations through a 6D rotation representation, and the camera token is decoded into the camera parameter vector.
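The pipeline described in this paragraph can be summarized by the following simplified PyTorch sketch; the backbone choice, embedding width, transformer depth, and token count are illustrative assumptions rather than the exact configuration used in our experiments.

```python
import torch
import torch.nn as nn
import torchvision

class TwoTokenPoseModel(nn.Module):
    """Simplified sketch: ResNet feature map -> flatten + linear projection +
    position embedding -> prepend joint-rotation and camera tokens ->
    transformer encoder. Dimensions are illustrative."""

    def __init__(self, embed_dim=1536, num_joints=24, depth=4):
        super().__init__()
        backbone = torchvision.models.resnet50(weights=None)
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])  # keep feature map
        self.proj = nn.Linear(2048, embed_dim)
        self.pos_embed = nn.Parameter(torch.zeros(1, 14 * 14, embed_dim))
        self.joint_tokens = nn.Parameter(torch.zeros(1, num_joints, embed_dim))
        self.cam_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        layer = nn.TransformerEncoderLayer(embed_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, img):                               # img: (B, 3, 448, 448)
        fmap = self.backbone(img)                         # (B, 2048, 14, 14)
        patches = self.proj(fmap.flatten(2).transpose(1, 2)) + self.pos_embed
        b, nj = img.shape[0], self.joint_tokens.shape[1]
        extra = torch.cat([self.joint_tokens, self.cam_token], dim=1).expand(b, -1, -1)
        out = self.encoder(torch.cat([extra, patches], dim=1))
        return out[:, :nj], out[:, nj:nj + 1]             # updated joint / camera tokens
```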
Joint_token is defined as a torch.nn.Parameter initialized to torch.zeros(1, 48, 1536). Cam_token is likewise a torch.nn.Parameter initialized to torch.zeros(1, 1, 1536). The joint3d_head module performs a linear transformation from an input tensor of size 1536 to an output tensor of size 12. The joint3d_cam module is the same type of linear transformation, with input and output tensor sizes of 1536 and 3, respectively.
The loss is calculated on the joint tokens using the mean squared error: we square the difference between the new and previous joint-token values and take the mean of these squared differences.
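A compact sketch of the heads and the joint-token loss described here is given below; how the updated and previous token values are obtained within the training loop is an assumption.

```python
import torch
import torch.nn as nn

joint3d_head = nn.Linear(1536, 12)  # joint head: 1536 -> 12, as stated in the text
joint3d_cam = nn.Linear(1536, 3)    # camera head: 1536 -> 3, as stated in the text

def joint_token_loss(new_tokens: torch.Tensor, prev_tokens: torch.Tensor) -> torch.Tensor:
    """Mean squared error between updated and previous joint-token values."""
    return torch.mean((new_tokens - prev_tokens) ** 2)
```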
The transformer model has several arguments, including img_size (size of the input images, with a default value of 448), patch size (size of patches into which the image is divided, with a default value of 32 × 32), channels (number of input channels, typically three for RGB images), number of classes (number of classes to classify, with a default value of 1000, which can be adjusted for different classification tasks), and embed_dim (embedding dimension of the model, with a default value of 768).
In the code, this is implemented as batch_size = x.shape[0], where the input images are fed from an open training dataset. The model’s architecture and hyperparameters are established in the class initialization function, which configures the patch embedding, the transformer blocks, normalization, the classifier head, and other components.
The main components within the class include (a minimal skeleton combining them is sketched after this list):
  • A layer for converting the input images into patches;
  • A class token that contains information on the entire image;
  • Position embeddings that contain information on the position of the patches;
  • A sequence composed of multiple transformer blocks;
  • A normalization layer for the final output;
  • A layer applied before the final output;
  • The batch_size value is determined based on the shape of the input image.
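For reference, a minimal skeleton that wires the listed components together is shown below; the default argument values follow those quoted above, while the block count, head count, and classification-style forward pass are assumptions.

```python
import torch
import torch.nn as nn

class VisionTransformer(nn.Module):
    """Skeleton matching the listed components: patch embedding, class token,
    position embeddings, transformer blocks, final norm, and classifier head."""

    def __init__(self, img_size=448, patch_size=32, channels=3,
                 num_classes=1000, embed_dim=768, depth=12, nhead=12):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        self.patch_embed = nn.Conv2d(channels, embed_dim, kernel_size=patch_size,
                                     stride=patch_size)              # image -> patches
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))  # whole-image token
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
        layer = nn.TransformerEncoderLayer(embed_dim, nhead, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth) # transformer blocks
        self.norm = nn.LayerNorm(embed_dim)                          # final normalization
        self.head = nn.Linear(embed_dim, num_classes)                # classifier head

    def forward(self, x):                                  # x: (B, channels, H, W)
        batch_size = x.shape[0]                            # as noted in the text
        patches = self.patch_embed(x).flatten(2).transpose(1, 2)
        cls = self.cls_token.expand(batch_size, -1, -1)
        tokens = torch.cat([cls, patches], dim=1) + self.pos_embed
        tokens = self.norm(self.blocks(tokens))
        return self.head(tokens[:, 0])                     # classify from the class token
```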

4. Experiment and Results

In this experiment, the performance of the trained models was evaluated using the 3DPW [50] and Human3.6M (H3.6M) datasets [51] in terms of the following two standard metrics (Table 2), for which a minimal reference implementation is sketched after the list:
  • Procrustes-aligned mean-per-joint-position error (PA-MPJPE);
  • Mean-per-joint-position error (MPJPE).
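Both metrics average the Euclidean joint error per frame, with PA-MPJPE first aligning the prediction to the ground truth by a similarity transform. A minimal NumPy reference sketch, not tied to any particular benchmark toolkit, is as follows.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error, in the units of the input (e.g., mm).
    pred, gt: (J, 3) arrays of joint positions."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def pa_mpjpe(pred, gt):
    """MPJPE after Procrustes alignment (rotation, scale, translation) of the
    prediction to the ground truth."""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    p, g = pred - mu_p, gt - mu_g
    u, s, vt = np.linalg.svd(p.T @ g)        # cross-covariance SVD
    r = vt.T @ u.T
    if np.linalg.det(r) < 0:                 # avoid reflections
        vt[-1] *= -1
        s[-1] *= -1
        r = vt.T @ u.T
    scale = s.sum() / (p ** 2).sum()
    aligned = scale * p @ r.T + mu_g
    return mpjpe(aligned, gt)
```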

4.1. Comparison with Other Human Pose Estimation Methods

For a fair comparison, we evaluate the models using the H3.6M training set. PA-MPJPE and MPJPE show excellent results on H3.6M, where the performance of our model is similar to that of the state-of-the-art techniques. In particular, a 0.7 mm improvement in PA-MPJPE is achieved on the 3DPW dataset. When trained with 3DPW, our model exhibits significantly improved performance, indicating that accurate SMPL pose and feature labels are crucial for improving generalization to outdoor environments. To show the qualitative results of the pose estimation and mesh reconstruction for rehabilitation, representative samples of the 3DPW dataset are used, and the performance of our model is compared with that of the INT model, the current state-of-the-art video-based technique (see Figure 9). The INT model performs well in most frames but produces unsuitable results in the joint-side samples. Conversely, our model produces an accurate pixel-aligned mesh for joint prediction. In Figure 9, the solid lines mark the areas in which the performance of our model is similar to that of the existing models, whereas the dotted lines mark the areas in which our model outperforms them. In the images highlighted with dotted lines around the knee area, the 3D model is accurately positioned at the center. In the images highlighted with dotted lines around the face area, our model is slightly biased to the left relative to the direction of the person’s view. These results suggest the advantages of our model in neck joint estimation. In addition, for rehabilitation exercises, focusing on joint points is more advantageous than regressing the entire pose.

4.2. Core Analysis

For a detailed comparison relevant to rehabilitation therapy, we evaluate the models using the MOYO [52] and RICH [53] training datasets, which are open datasets consisting of postures for rehabilitation and postures for improving physical abilities. We therefore adopt these data as performance indicators for rehabilitation treatment. PA-MPJPE and MPJPE show excellent results in the case of RICH, for which the performance of our model is similar to that of the state-of-the-art techniques. In particular, we achieve a 0.3% performance improvement in terms of PA-MPJPE on the RICH dataset. These results indicate an improvement in the performance of our model trained with the RICH dataset.
Table 3 shows that the performance of our model is almost similar to that of the INT model. In the case of learning with the MOYO dataset, only negligible differences are observed in the results obtained using our model and the INT model. The results derived using the RICH dataset are analyzed through qualitative comparisons, and the following trends are observed.

4.3. Observations

We analyze the performance of the method for estimating a rehabilitation pose on the input images (Figure 10). Pose classification using the existing INT model results in a character image with a smooth overall shape, whereas the pose classification output obtained using our model focuses more on the joint parts during pose estimation and classification.
The MOYO dataset contains highly sophisticated data points, which are favorable for accurate image classifications. The characteristics of our algorithm are distinctly visible on the representative RICH dataset. This image reveals the advantage of our algorithm—the joint rotation (elbow, knee, etc.) identified using our proposed model is distinct from that extracted using the INT model. Furthermore, the foot shape characteristics are clearly visible.

5. Conclusions

Traditional 3D human pose estimation techniques have been extensively studied to recover SMPL meshes that agree well with the input images. To realize superior image- and video-based estimation performance, research on 3D human pose estimation technology has primarily focused on developing heavy and complex models. In this paper, by contrast, we propose a model based on a simple token design to solve the 3D human pose and shape estimation problem for rehabilitation pose classification. We introduce two tokens that encode the 3D rotations of human joints and the camera parameters for SMPL-based human mesh reconstruction. To compare the performance of this model with that of the INT model, which additionally assigns a token to body shape (i.e., the conventional method), we conducted an experiment that places more weight on the joint tokens and analyzed the causality of the system. The proposed method is optimized for maintaining the temporal rotation consistency of each joint and for measuring joints with less shaking, as required when classifying a rehabilitation pose. Our model outperforms the state-of-the-art comparators on the 3DPW and RICH benchmarks and achieves comparable results on the H3.6M and MOYO datasets. The qualitative results show that our model is optimized for rehabilitation video data and can generate an accurate human mesh. These findings suggest that further investigation of token selection for individual features is required to develop posture estimation technologies for individual domains.

Author Contributions

Conceptualization, Y.K. and J.K.; methodology, Y.K. and G.K.; software, Y.K. and G.K.; validation, G.K. and C.Y.; formal analysis, G.K.; investigation, C.Y.; resources, C.Y. and J.K.; data curation, J.L. and C.Y.; writing—original draft preparation, Y.K.; writing—review and editing, J.K. and J.L.; visualization, J.L. and Y.K.; supervision, J.K. and C.Y.; project administration, J.K. and J.L.; funding acquisition, J.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Ministry of Science and ICT (MSIT), Korea, under the Innovative Human Resource Development for Local Intellectualization support program (IITP-2023-RS-2022-00156287) supervised by the Institute for Information and Communications Technology Planning and Evaluation (IITP). This research was also supported as a “Technology Commercialization Collaboration Platform Construction” project by the INNOPOLIS FOUNDATION (Project Number: 1711177250).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Pooyandeh, M.; Han, K.-J.; Sohn, I. Cybersecurity in the AI-Based Metaverse: A Survey. Appl. Sci. 2022, 12, 12993. [Google Scholar] [CrossRef]
  2. Wang, G.; Badal, A.; Jia, X.; Maltz, J.S.; Mueller, K.; Myers, K.J.; Niu, C.; Vannier, M.; Yan, P.; Yu, Z.; et al. Development of metaverse for intelligent healthcare. Nat. Mach. Intell. 2022, 411, 922–929. [Google Scholar] [CrossRef] [PubMed]
  3. Mozumder, M.A.I.; Sheeraz, M.M.; Athar, A.; Aich, S.; Kim, H.C. Overview: Technology Roadmap of the Future Trend of Metaverse based on IoT, Blockchain, AI Technique, and Medical Domain Metaverse Activity. In Proceedings of the 2022 24th International Conference on Advanced Communication Technology (ICACT), Pyeongchang-gun, Republic of Korea, 13–16 February 2022; pp. 256–261. [Google Scholar]
  4. Chaudhary, M.Y. Augmented Reality, Artificial Intelligence, and the Re-Enchantment of the World: With Mohammad Yaqub Chaudhary, “Augmented Reality, Artificial Intelligence, and the Re-Enchantment of the World”; and William Young, “Reverend Robot: Automation and Clergy”. Zygon 2019, 54, 454–478. [Google Scholar] [CrossRef]
  5. Ali, S.; Abdullah; Armand, T.P.T.; Athar, A.; Hussain, A.; Ali, M.; Yaseen, M.; Joo, M.-I.; Kim, H.-C. Metaverse in Healthcare Integrated with Explainable AI and Blockchain: Enabling Immersiveness, Ensuring Trust, and Providing Patient Data Security. Sensors 2023, 23, 565. [Google Scholar] [CrossRef] [PubMed]
  6. Afrashtehfar, K.I.; Abu-Fanas, A.S.H. Metaverse, Crypto, and NFTs in Dentistry. Educ. Sci. 2022, 12, 538. [Google Scholar] [CrossRef]
  7. Hertzmann, A. Can Computers Create Art? Arts 2018, 7, 18. [Google Scholar]
  8. Ahmad, S.F.; Rahmat, M.K.; Mubarik, M.S.; Alam, M.M.; Hyder, S.I. Artificial Intelligence and Its Role in Education. Sustainability 2021, 13, 12902. [Google Scholar] [CrossRef]
  9. Reitmann, S.; Neumann, L.; Jung, B. Blainder—A blender ai add-on for generation of semantically labeled depth-sensing data. Sensors 2021, 21, 2144. [Google Scholar] [CrossRef]
  10. Papastratis, I.; Chatzikonstantinou, C.; Konstantinidis, D.; Dimitropoulos, K.; Daras, P. Artificial Intelligence Technologies for Sign Language. Sensors 2021, 21, 5843. [Google Scholar] [CrossRef]
  11. Pataranutaporn, P.; Danry, V.; Leong, J.; Punpongsanon, P.; Novy, D.; Maes, P.; Sra, M. AI-generated characters for supporting personalized learning and well-being. Nat. Mach. Intell. 2021, 3, 1013–1022. [Google Scholar] [CrossRef]
  12. Jiang, S.; Ma, J.W.; Liu, Z.Y.; Guo, H.X. Scientometric Analysis of Artificial Intelligence (AI) for Geohazard Research. Sensors 2022, 22, 7814. [Google Scholar] [CrossRef] [PubMed]
  13. Gandedkar, N.H.; Wong, M.T.; Darendeliler, M.A. Role of Virtual Reality (VR), Augmented Reality (AR) and Artificial Intelligence (AI) in Tertiary Education and Research of Orthodontics: An Insight. Semin. Orthod. 2021, 27, 69–77. [Google Scholar] [CrossRef]
  14. Hu, L.; Tian, Y.; Yang, J.; Taleb, T.; Xiang, L.; Hao, Y. Ready player one: UAV-clustering-based multi-task offloading for vehicular VR/AR gaming. IEEE Netw. 2019, 33, 42–48. [Google Scholar] [CrossRef]
  15. Pan, Y.; Zhang, L. Roles of artificial intelligence in construction engineering and management: A critical review and future trends. Autom. Constr. 2021, 122, 103517. [Google Scholar] [CrossRef]
  16. Minopoulos, G.M.; Memos, V.A.; Stergiou, K.D.; Stergiou, C.L.; Psannis, K.E. A Medical Image Visualization Technique Assisted with AI-Based Haptic Feedback for Robotic Surgery and Healthcare. Appl. Sci. 2023, 13, 3592. [Google Scholar] [CrossRef]
  17. Zhang, C.; Wang, X.; Fang, S.; Shi, X. Construction and Application of VR-AR Teaching System in Coal-Based Energy Education. Sustainability 2022, 14, 16033. [Google Scholar] [CrossRef]
  18. Monterubbianesi, R.; Tosco, V.; Vitiello, F.; Orilisi, G.; Fraccastoro, F.; Putignano, A.; Orsini, G. Augmented, Virtual and Mixed Reality in Dentistry: A Narrative Review on the Existing Platforms and Future Challenges. Appl. Sci. 2022, 12, 877. [Google Scholar] [CrossRef]
  19. Badiola-Bengoa, A.; Mendez-Zorrilla, A. A systematic review of the application of camera-based human-pose estimation in the field of sport and physical exercise. Sensors 2021, 21, 5996. [Google Scholar] [CrossRef]
  20. Jalal, A.; Akhtar, I.; Kim, K. Human Posture Estimation and Sustainable Events Classification via Pseudo-2D Stick Model and K-ary Tree Hashing. Sustainability 2020, 12, 9814. [Google Scholar] [CrossRef]
  21. Nguyen, H.; Nguyen, T.; Scherer, R.; Le, V. Unified End-to-End YOLOv5-HR-TCM Framework for Automatic 2D/3D Human Pose Estimation for Real-Time Applications. Sensors 2022, 22, 5419. [Google Scholar] [CrossRef] [PubMed]
  22. Chung, J.L.; Ong, L.Y.; Leow, M.C. Comparative Analysis of Skeleton-Based Human-pose estimation. Future Internet 2022, 14, 380. [Google Scholar] [CrossRef]
  23. Patil, A.K.; Balasubramanyam, A.; Ryu, J.Y.; Chakravarthi, B.; Chai, Y.H. An open-source platform for human-pose estimation and tracking using a heterogeneous multi-sensor system. Sensors 2021, 21, 2340. [Google Scholar] [CrossRef]
  24. Martinez, J.; Hossain, R.; Romero, J.; Little, J.J. A Simple Yet Effective Baseline for 3d Human-pose estimation. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2659–2668. [Google Scholar]
  25. Andriluka, M.; Pishchulin, L.; Gehler, P.; Schiele, B. 2D Human-pose estimation: New Benchmark and State of the Art Analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 24–27 June 2014. [Google Scholar]
  26. Wang, J. Deep 3D human-pose estimation: A review. Comput. Vis. Image Underst. 2021, 210, 103225–103246. [Google Scholar] [CrossRef]
  27. Toshev, A.; Szegedy, C. DeepPose: Human Pose Estimation via Deep Neural Networks. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 1653–1660. [Google Scholar]
  28. Liu, Z.; Chen, H.; Feng, R.; Wu, S.; Ji, S.; Yang, B.; Wang, X. Deep Dual Consecutive Network for Human Pose Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021. [Google Scholar]
  29. Ganguly, A.; Rashidi, G.; Mombaur, K. Comparison of the Performance of the Leap Motion Controller™ with a Standard Marker-Based Motion Capture System. Sensors 2021, 21, 1750. [Google Scholar] [CrossRef]
  30. Zhao, Y.S.; Jaafar, M.H.; Mohamed, A.S.A.; Azraai, N.Z.; Amil, N. Ergonomics Risk Assessment for Manual Material Handling of Warehouse Activities Involving High Shelf and Low Shelf Binning Processes: Application of Marker-Based Motion Capture. Sustainability 2022, 14, 5767. [Google Scholar] [CrossRef]
  31. Filippeschi, A.; Schmitz, N.; Miezal, M.; Bleser, G.; Ruffaldi, E.; Stricker, D. Survey of Motion Tracking Methods Based on Inertial Sensors: A Focus on Upper Limb Human Motion. Sensors 2017, 17, 1257. [Google Scholar] [CrossRef]
  32. Khan, M.H.; Zöller, M.; Farid, M.S.; Grzegorzek, M. Marker-Based Movement Analysis of Human Body Parts in Therapeutic Procedure. Sensors 2020, 20, 3312. [Google Scholar] [CrossRef] [PubMed]
  33. Moro, M.; Marchesi, G.; Hesse, F.; Odone, F.; Casadio, M. Markerless vs. Marker-Based Gait Analysis: A Proof of Concept Study. Sensors 2022, 22, 2011. [Google Scholar] [CrossRef]
  34. Klishkovskaia, T.; Aksenov, A.; Sinitca, A.; Zamansky, A.; Markelov, O.A.; Kaplun, D. Development of Classification Algorithms for the Detection of Postures Using Non-Marker-Based Motion Capture Systems. Appl. Sci. 2020, 10, 4028. [Google Scholar] [CrossRef]
  35. Fang, W.; Zheng, L.; Deng, H.; Zhang, H. Real-Time Motion Tracking for Mobile Augmented/Virtual Reality Using Adaptive Visual-Inertial Fusion. Sensors 2017, 17, 1037. [Google Scholar] [CrossRef]
  36. Adolf, J.; Dolezal, J.; Kutilek, P.; Hejda, J.; Lhotska, L. Single Camera-Based Remote Physical Therapy: Verification on a Large Video Dataset. Appl. Sci. 2022, 12, 799. [Google Scholar] [CrossRef]
  37. Song, J.; Kook, J. Mapping Server Collaboration Architecture Design with OpenVSLAM for Mobile Devices. Appl. Sci. 2022, 12, 3653. [Google Scholar] [CrossRef]
  38. Muhammad, K.; Khan, N.; Lee, M.Y.; Imran, A.S.; Sajjad, M. School of the future: A comprehensive study on the effectiveness of augmented reality as a tool for primary school children’s education. Appl. Sci. 2021, 11, 5277. [Google Scholar]
  39. Jung, S.; Song, J.G.; Hwang, D.J.; Ahn, J.Y.; Kim, S. A study on software-based sensing technology for multiple object control in AR video. Sensors 2010, 10, 9857–9871. [Google Scholar] [CrossRef]
  40. Schmitz, A.; Ye, M.; Shapiro, R.; Yang, R.; Noehren, B. Accuracy and repeatability of joint angles measured using a single camera markerless motion capture system. J. Biomech. 2014, 47, 587–591. [Google Scholar] [CrossRef]
  41. Loper, M.; Mahmood, N.; Romero, J.; Pons-Moll, G.; Black, M.J. SMPL: A skinned multi-person linear model. ACM Trans. Graph. 2015, 34, 1–16. [Google Scholar] [CrossRef]
  42. Kocabas, M.; Athanasiou, N.; Black, M.J. VIBE: Video Inference for Human Body Pose and Shape Estimation. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 5252–5262. [Google Scholar]
  43. Choi, H.; Moon, G.; Lee, K.M. Beyond Static Features for Temporally Consistent 3D Human Pose and Shape from a Video. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, 19–25 June 2021. [Google Scholar]
  44. Wan, Z.; Li, Z.; Tian, M.; Liu, J.; Yi, S.; Li, H. Encoder-decoder with Multi-level Attention for 3D Human Shape and Pose Estimation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021. [Google Scholar]
  45. Tung, H.Y.F.; Tung, H.W.; Yumer, E.; Fragkiadaki, K. Self-supervised learning of motion capture. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 5242–5252. [Google Scholar]
  46. Mahmood, N.; Ghorbani, N.; Troje, N.F.; Pons-Moll, G.; Black, M.J. AMASS: Archive of Motion Capture As Surface Shapes. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
  47. Pavlakos, G.; Choutas, V.; Ghorbani, N.; Bolkart, T.; Osman, A.A.A.; Tzionas, D.; Black, M.J. Expressive Body Capture: 3D Hands, Face, and Body from a Single Image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  48. Luo, Z.; Golestaneh, S.A.; Kitani, K.M. 3D human motion estimation via motion compression and refinement. In Proceedings of the Asian Conference on Computer Vision, Kyoto, Japan, 30 November–4 December 2020. [Google Scholar]
  49. Yang, S.; Heng, W.; Liu, G.; Luo, G.; Yang, W.; Yu, G. Capturing the motion of every joint: 3D human pose and shape estimation with independent tokens. In Proceedings of the ICLR 2023 International Conference on Learning Representations, International Conference on Learning Representations (ICLR 2023), Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  50. von Marcard, T.; Henschel, R.; Black, M.J.; Rosenhahn, B.; Pons-Moll, G. Recovering accurate 3D human pose in the wild using imus and a moving camera. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 601–617. [Google Scholar]
  51. Ionescu, C.; Papava, D.; Olaru, V.; Sminchisescu, C. Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 36, 1325–1339. [Google Scholar] [CrossRef] [PubMed]
  52. Huang, C.H.P.; Yi, H.; Höschle, M.; Safroshkin, M.; Alexiadis, T.; Polikovsky, S.; Scharstein, D.; Black, M.J. Capturing and inferring dense full-body human-scene contact. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 13274–13285. [Google Scholar]
  53. Tripathi, S.; Müller, L.; Huang, C.H.P.; Taheri, O.; Black, M.J.; Tzionas, D. 3D human-pose estimation via intuitive physics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 4713–4725. [Google Scholar]
Figure 1. Image feature-based overall temporal learning (VIBE, TCMR, and MAED).
Figure 2. The architecture of the motion discriminator, D_M, is designed with GRU layers that precede a self-attention layer. D_M provides a real or synthetic probability for each given input sequence. * The marking indicates that multiple models are intertwined.
Figure 3. Comprehensive workflow of TCMR. The golden-colored result, Θint, is utilized during the inference phase, and it is regressed from the accumulated temporal feature.
Figure 4. Temporal feature integration to estimate a 3D human mesh for the current frame.
Figure 5. Overview of the MAED model. The upper segment demonstrates the model’s workflow, and the lower segment showcases the designs of our proposed spatial–temporal encoder and kinematic topology decoder.
Figure 6. STE block and MSA variants.
Figure 7. (Left) Mainstream temporal human mesh methods adopt a temporal encoder to mix temporal information from past and future frames and then regress the SMPL parameters from the temporally enhanced feature for each frame. (Right) The INT model first acquires the tokens of each joint in time dimensions and then separately captures the motion of each joint using a shared temporal encoder.
Figure 8. Our model extracts feature maps through an image encoder and sends the learnable joint-rotation and camera tokens to the transformer. Finally, our model converts the joint-rotation tokens, feature tokens, and camera tokens into SMPL parameters for 3D mesh reconstruction and 2D reprojection onto the image plane.
Figure 9. Qualitative comparisons between our model and the reproduced INT model using the 3DPW dataset.
Figure 10. Qualitative comparisons between our model and the reproduced INT model using the MOYO and RICH datasets.
Table 1. Taxonomic summary of three recent algorithms for 3D human pose estimation.

| | VIBE [42] | TCMR [43] | MAED [44] |
|---|---|---|---|
| Overview | Extracts frame features from videos; utilizes bidirectional GRUs to incorporate temporal information and generate latent variables. | Does not rely on static image features; predicts 3D human poses using temporal data. | Built upon the SMPL model; models spatial and temporal features based on image features. |
| Objective | Estimates SMPL human body model parameters (pose and shape); represents human body movements. | Combines temporal features of the current frame with those of previous and subsequent frames to estimate a 3D human mesh. | Models spatial and temporal information to estimate a 3D human mesh; utilizes SMPL to compute 3D joints and their 2D projections. |
| Improvements | Enhances pose estimation by utilizing past and future frame information; combines the comprehensive loss for each element with the corresponding motion discriminator loss. | Utilizes bidirectional GRUs to extract temporal features; integrates temporal features to predict the 3D mesh for the current frame. | Employs various multi-head self-attention methods; models spatial and temporal information; stacks STE blocks to build a complex model. |
Table 2. Comparison of video-based human pose estimation models using the 3DPW and Human3.6M datasets. ↓ means that lower values are better.

| Models | Backbone | 3DPW [50] PA-MPJPE [↓] | 3DPW [50] MPJPE [↓] | H3.6M [51] PA-MPJPE [↓] | H3.6M [51] MPJPE [↓] |
|---|---|---|---|---|---|
| VIBE (Kocabas et al., 2020) [42] | ResNet-50 (from SPIN) | 52.9 | 83.2 | 42.4 | 65.4 |
| TCMR (Choi et al., 2021) [43] | ResNet-50 (from SPIN) | 56.8 | 96.1 | 41.7 | 62.1 |
| MAED (Wan et al., 2021) [44] | ResNet-50 | 50.7 | 93.1 | 38.7 | 56.4 |
| INT (Yang et al., 2023) [49] | ResNet-50 | 49.7 | 90.0 | 39.1 | 57.1 |
| Our Model | ResNet-50 | 49.0 | 87.9 | 39.3 | 57.0 |
Table 3. Comparison of video-based human pose estimation models (INT and our proposed model) on the MOYO and RICH datasets. ↓ means that lower values are better.

| Models | Backbone | MOYO PA-MPJPE [↓] | MOYO MPJPE [↓] | RICH PA-MPJPE [↓] | RICH MPJPE [↓] |
|---|---|---|---|---|---|
| INT (Yang et al., 2023) [49] | ResNet-50 | 36.8 | 74.3 | 48.2 | 80.7 |
| Our Model | ResNet-50 | 36.7 | 74.4 | 47.8 | 80.4 |