Journal of Imaging
  • Article
  • Open Access

29 September 2023

Synthesizing Human Activity for Data Generation

1 Faculdade de Engenharia, Universidade do Porto, 4200-465 Porto, Portugal
2 Instituto de Engenharia de Sistemas e Computadores, Tecnologia e Ciência, 4200-465 Porto, Portugal
3 School of Engineering, Polytechnic of Porto, 4200-072 Porto, Portugal
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Machine Learning for Human Activity Recognition

Abstract

Gathering sufficiently representative data, such as data on human actions, shapes, and facial expressions, is costly and time-consuming, yet such data are required to train robust models. This has led to techniques such as transfer learning and data augmentation; however, these are often insufficient. To address this, we propose a semi-automated mechanism for generating and editing visual scenes with synthetic humans performing various actions, with features such as background modification and manual adjustment of the 3D avatars that allow users to create data with greater variability. We also propose a two-fold evaluation methodology for assessing the results obtained with our method: (i) applying an action classifier to the output data produced by the mechanism and (ii) generating masks of the avatars and the actors and comparing them through segmentation. The avatars were robust to occlusion, and their actions were recognizable and faithful to those of the respective input actors. The results also showed that even though the action classifier concentrates on the pose and movement of the synthetic humans, it strongly depends on contextual information to precisely recognize the actions. Generating avatars for complex activities also proved problematic both for action recognition and for the clean and precise formation of the masks.

1. Introduction

Inferring human activity from images is a long-standing problem in computer vision [1]. Over the last two decades, researchers have tackled it via the prediction of 2D content, such as keypoints, silhouettes, and part segmentations [2]. More recently, however, interest has shifted toward retrieving 3D meshes of human bodies, including facial and hand expressiveness, as a result of developments in statistical body models [3]. Models of human bodies [3,4,5,6] are central to this trend due to their flexibility in accurately representing a wide range of poses without being overly complex, and they allow researchers to analyze and recreate intricate human actions.
Human actions, behaviors, and interactions with the environment are highly diverse because of the wide range of poses, motions, and facial expressions humans are capable of, and because subtle changes in any of these can correspond to very different actions. Datasets must therefore be substantial in size to be sufficiently representative of human actions [7]. Creating such large datasets, however, is costly and time-consuming due to the manual labor involved in labeling data, and the presence of people also raises significant privacy and legal concerns. One way to mitigate the lack of data is transfer learning [8], which consists of pre-training models on large general-purpose datasets and subsequently fine-tuning them on smaller datasets; by embedding the knowledge of a large-scale dataset in the pre-trained model, we reduce the amount of data required for the intended application. Another is data augmentation, which artificially generates new labeled samples and thus enlarges the original dataset without further manual labeling. An example is presented in [9], which augments datasets with annotated synthetic images exhibiting new poses. A related approach, applied indirectly (i.e., building a dataset from scratch rather than extending an existing one), is SURREAL [10], a fully synthetic dataset created with the Skinned Multi-Person Linear (SMPL) body model and driven by 3D motion captures, where parameters such as illumination, camera viewpoint, and background were varied to increase diversity.
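To make the transfer-learning recipe concrete, the following is a minimal sketch of the pre-train-then-fine-tune pattern using a torchvision ResNet-18 backbone with ImageNet weights; the backbone, the number of target classes, and the hyperparameters are illustrative and are not taken from any of the works cited here.

# Minimal transfer-learning sketch: reuse a backbone pre-trained on a large
# general-purpose dataset and fine-tune only a small head on the target task.
import torch
import torch.nn as nn
from torchvision import models

# Backbone whose weights already encode large-scale visual knowledge.
backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Freeze the pre-trained layers so the small target dataset only updates the
# new head, reducing the amount of labeled data required.
for param in backbone.parameters():
    param.requires_grad = False

num_target_classes = 10  # hypothetical number of action classes
backbone.fc = nn.Linear(backbone.fc.in_features, num_target_classes)

optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One illustrative training step on a dummy batch.
images = torch.randn(4, 3, 224, 224)
labels = torch.randint(0, num_target_classes, (4,))
optimizer.zero_grad()
loss = criterion(backbone(images), labels)
loss.backward()
optimizer.step()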
The usage of synthetic data is important in multiple scenarios and applications. For instance, accurate virtual humans can be used for human motion analysis and recognition [11,12] and in other major application areas of 3D mesh reconstruction, such as virtual and augmented reality, to create realistic, life-like avatars for interactive experiences [13]; ergonomic analysis [14]; and video games and animation [15,16]. There are also applications that insert real humans into virtual environments. A notable example is virtual production [17], where virtual and physical film-making techniques are combined to produce media that include human actors in photo-realistic virtual environments, saving time and resources that would otherwise be required in post-production. Nowadays, the use of synthetic and virtual data is even more prominent and is visible throughout television media. Synthetic data can also be a valuable resource for training deep learning models. For instance, in [18], synthetic datasets and real datasets augmented with synthetic humans were used to show that mixing synthetic humans with real backgrounds can indeed be used to train pose-estimation networks. Another work [19] explores pre-training an action-recognition network on purely synthetic datasets and then transferring it to real datasets whose categories differ from those in the synthetic data, showing competitive results. A survey of the application domains of synthetic data in human analysis is presented in [20], further highlighting its practical use in multiple scenarios.
In this paper, we propose a semi-automated framework to generate dynamic scenes featuring synthetic humans performing diverse actions. The framework offers background manipulation as well as avatar size and placement adjustments, facilitating the creation of datasets with heightened variability and customization, and it can aid users in creating mixed-reality or purely virtual datasets for specific human activity tasks. We also propose an evaluation methodology for assessing the resulting synthetic videos, as well as the synthetic human models. Experimental results showed that the action classifier used to assess the framework's outputs relied primarily on the pose and motion of the avatars to identify the action, although the background and the presence of objects with which the actors interact also affect recognition. We also observed that the Part Attention REgressor (PARE) model performs better when activities are less complex and when there is less partial occlusion in the input videos.
The remaining sections of this paper are organized as follows: Section 2 describes related work on 3D human pose estimation, reconstruction of human meshes, and data augmentation, while Section 3 introduces our framework, explaining each component in detail, how the PARE method generates the avatars, and the additional features of our platform. Section 4 presents our evaluation methods, as well as their results and interpretation. Lastly, in Section 5, we present our main conclusions.

3. Avatar Generation Application

Nowadays, the usage of synthetic data has become prominent in multiple application scenarios, ranging from virtual and augmented reality to training data that can be used successfully to train machine learning models. When looking particularly into human behavior analysis and related topics, it is noticeable how important sufficiently large and varied datasets are for models to generalize well. With this in mind, we propose a framework for the semi-automatic manipulation and generation of visual scenes with synthetic humans performing actions.
Our framework allows users to select input videos with people performing actions and automatically extracts their pose to generate a virtual avatar performing the same actions. The users can then manipulate these synthetic humans and place them in arbitrary visual or virtual scenes by means of a web application. Figure 1 illustrates the workflow of the designed web application, showing the possible use cases. The process starts with an input video containing a person performing an action, on which human detection, tracking, and synthesis are performed; these results are cached. The users can then manipulate the scene and the synthetic human extracted from the input video.
Figure 1. Schematic of the Avatar-Generation Application.
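In outline, this workflow can be sketched as follows; the function names, the cache layout, and the returned data structures in this sketch are simplified stand-ins rather than the actual code of our application.

# Illustrative outline of the workflow in Figure 1 (simplified stand-ins,
# not the application's actual code).
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path("cache")

def detect_and_track_humans(video_path):
    # Stand-in for the detection/tracking stage: per-frame boxes keyed by track ID.
    return {"actor_0": [[50, 40, 210, 400]]}

def synthesize_avatars(video_path, detections):
    # Stand-in for the synthesis stage: SMPL pose/shape parameters per tracked actor.
    return {"actor_0": {"pose": [0.0] * 72, "shape": [0.0] * 10}}

def process_video(video_path):
    """Detect, track, and synthesize humans once; reuse cached results afterwards."""
    key = hashlib.md5(str(video_path).encode()).hexdigest()
    cache_file = CACHE_DIR / f"{key}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())
    detections = detect_and_track_humans(video_path)
    avatars = synthesize_avatars(video_path, detections)
    result = {"detections": detections, "avatars": avatars}
    CACHE_DIR.mkdir(exist_ok=True)
    cache_file.write_text(json.dumps(result))
    return result

# Scene manipulation (new background, resizing, repositioning) then operates on
# the cached avatar parameters rather than re-running the heavy models.
scene = process_video("videos/basketball.mp4")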
For synthetic human generation, we chose the PARE algorithm, as it uses the well-known ResNet-50 [54] backbone and is partially robust to occlusion. The algorithm first performs human detection and tracking and then renders each detected human in its original place. This way, the synthetic humans generated by our mechanism replicate the exact actions and positions observed in the input video relative to the captured actor. Internally, the algorithm takes the tracking of each actor and regresses SMPL parameters, which describe the body shape and pose of the synthetic human. The PARE model comprises two integral components: learning to regress 3D body parameters and learning attention weights per body part. First, we take the frames and the stored bounding boxes and pass them through the PARE model, which yields the predicted camera parameters, the 3D vertex coordinates of the SMPL model, the predicted pose and shape parameters, and the 2D and 3D joint positions. The model also learns attention weights for each body part, which serve as a guide, allowing it to focus on specific regions of interest within the input image. By directing attention to particular body parts, the model can extract more accurate and detailed information, enhancing the overall quality of the generated synthetic humans.
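The part-attention component can be illustrated with a short, self-contained sketch: a softmax over spatial locations turns each body part's attention map into a distribution that pools the backbone's feature volume into one feature vector per part. The tensor shapes, the number of parts, and the pooling rule below are illustrative and are not taken from the actual PARE implementation.

# Minimal sketch of per-body-part spatial attention in the spirit of PARE's
# part-attention aggregation (shapes and aggregation rule are illustrative).
import torch
import torch.nn.functional as F

B, C, H, W = 1, 128, 14, 14        # batch, feature channels, spatial grid
P = 24                              # assumed number of body parts

features = torch.randn(B, C, H, W)      # 2D feature volume from the backbone
part_logits = torch.randn(B, P, H, W)   # one attention map per body part

# Softmax over the spatial locations turns each map into a distribution, so
# every part "looks at" the image regions most relevant to it.
attention = F.softmax(part_logits.view(B, P, H * W), dim=-1).view(B, P, H, W)

# Attention-weighted pooling: one feature vector of size C per body part,
# which would then be fed to the SMPL parameter regressor.
part_features = torch.einsum("bphw,bchw->bpc", attention, features)
print(part_features.shape)   # torch.Size([1, 24, 128])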
With these data, we are able to render synthetic human bodies like the one depicted in Figure 2. By making use of the predicted camera parameters, shape, and pose, we can manipulate the resulting avatars and place them on the image plane. We can then use our web application to change the size and position of the generated synthetic human, as well as the background on which it is placed. As the algorithm supports the detection of multiple humans, users can even generate and manipulate crowded scenes with synthetic humans.
Figure 2. Example of a 3D mesh.
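As a rough illustration of how the predicted weak-perspective camera lets us place an avatar on the image plane and apply the size and position edits mentioned above, consider the following sketch; the camera convention, the normalization to pixel coordinates, and the editing helper are assumptions made for this example rather than the exact operations used by PARE or by our application.

# Sketch of placing a reconstructed mesh on the image plane with a
# weak-perspective camera (assumed convention) and of simple user edits.
import numpy as np

def project_weak_perspective(vertices, cam, img_size):
    """vertices: (N, 3) in the body frame; cam = (scale, tx, ty)."""
    s, tx, ty = cam
    xy = vertices[:, :2] + np.array([tx, ty])   # translate in camera space
    xy = s * xy                                  # isotropic scaling
    # Map from normalized [-1, 1] coordinates to pixel coordinates.
    return (xy + 1.0) * 0.5 * img_size

def edit_avatar(cam, scale_factor=1.0, shift=(0.0, 0.0)):
    """User edits: resize the avatar and move it within the frame."""
    s, tx, ty = cam
    return (s * scale_factor, tx + shift[0], ty + shift[1])

verts = np.random.randn(6890, 3) * 0.2   # stand-in for SMPL mesh vertices
cam = (0.9, 0.05, -0.1)                  # stand-in weak-perspective camera
pixels = project_weak_perspective(verts, edit_avatar(cam, 1.2, (0.1, 0.0)), 512)
print(pixels.shape)                      # (6890, 2) pixel coordinates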
Overall, our proposal builds on existing technology and provides a tool that allows users to semi-automatically generate visual data with synthetic humans performing actions based on real scenes. This significantly reduces the manual effort required for data collection and enables the training and testing of machine learning models in diverse simulated environments, which can be crucial in many application scenarios.

4. Results

We employed two evaluation methods to assess the quality and usability of the data obtained from our web application: (i) the use of an action-recognition algorithm on our outputs and (ii) the evaluation of the avatars, in terms of body resemblance, through segmentation. We conducted extensive experiments using publicly available videos, each containing one of the following actions: basketball dribble, archery, boxing, and push-ups. The basketball video was from the 3DPW [55] dataset, the archery and boxing videos were from the UCF101 [56] dataset, and the push-up video was from the HMDB51 [57] dataset (the resulting videos obtained by inserting synthetic humans can be found at https://mct.inesctec.pt/synthesizing-human-activity-for-data-generation, accessed on 26 September 2023).

4.1. Action Recognition

For the first experimental phase, we used MMAction2 [37], a framework designed for action recognition that outputs the top five predicted labels with their respective scores. MMAction2 is known for supporting a comprehensive set of pre-trained action-recognition models and for being flexible and easy to use. Among the several pre-trained models available within the framework, we selected the Temporal Segment Networks (TSN) model because it operates on short video segments to capture spatial and temporal cues. The frames presented in Table 1 illustrate the cases on which we tested the action-recognition algorithm.
Table 1. First frame from the output video containing the generated avatar in the input video’s background (first row); the output video containing the generated avatar in a new background (second row); the output video containing the generated avatar in different sizes and positions (third row).
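For reference, the following is a minimal sketch of how a TSN recognizer can be queried through MMAction2's Python API. The init_recognizer and inference_recognizer helpers live in mmaction.apis, but their exact signatures and return formats differ between MMAction2 releases, so the config and checkpoint paths and the handling of the result below should be treated as assumptions.

from mmaction.apis import init_recognizer, inference_recognizer

# Illustrative paths; substitute the TSN config and checkpoint of the installed release.
config = "configs/recognition/tsn/tsn_r50_1x1x3_100e_kinetics400_rgb.py"
checkpoint = "checkpoints/tsn_r50_kinetics400.pth"

model = init_recognizer(config, checkpoint, device="cuda:0")
result = inference_recognizer(model, "outputs/basketball_avatar.mp4")

# In older releases the result is a list of (label, score) pairs for the top
# predictions; print the top five and check whether the label we consider
# correct appears among them.
for label, score in result[:5]:
    print(label, round(float(score), 3))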
Table 2 displays the scores of the labels given by MMAction2 that we considered correct, i.e., those corresponding to the action performed by the actors in the original videos, alongside the scores obtained when the avatars substitute the respective actors. It is essential to highlight that the scores presented in the table are among the top five predictions, not necessarily the top one. MMAction2 was unsuccessful in every case containing a new background; i.e., the correct action was not among its top five predictions.
Table 2. Scores (in percentages) of the correct labels of the original four videos and of videos containing the avatars in the original background.
We also performed a particular test on the basketball-dribbling video in which the ball was not present in the background; we did not repeat this test for the other videos because our attempts to remove the objects with which the actors interacted were unsuccessful. In this test case, the action classifier was not able to correctly predict the labels we accepted for this action: dribbling basketball and playing basketball. Thus, the remaining tests concerning this video used the background in which the ball appears.
The last stage of this evaluation method consisted of re-adjusting the avatars in terms of placement and size; i.e., for three different sizes of the synthetic humans (original, smaller, and bigger), they were placed more to the left, more to the right, upwards, and downwards. Table 3 exhibits the scores of the action labels given by MMAction2 (which we considered correct) for these cases, where the cells colored in grey represent cases in which none of the classifier's top five predictions was correct. As a reference value, we used the results obtained with the avatars at the same size and placement as the input actors on the original background. The results for the basketball-dribbling and boxing videos were very similar to their respective reference values. For the archery video, the scores improved for the smaller avatars and in three other cases compared with the avatar at its actual size in the original position. A possible explanation lies in perspective and visual hints; i.e., placing the avatars in different locations and at different sizes may alter the interpretation of what is happening, allowing the model to be more confident in its prediction of the action. Even so, the larger avatars placed upwards and downwards showed inferior results, analogous to the first experiment's output. Lastly, for the push-up video, the classifier could correctly label only two of the twelve cases, namely when we placed the avatar more to the left and upwards, with a bigger size. A possible explanation for these two deviations is that the avatars stand out more in the frame due to their size and placement.
Table 3. Scores (in percentage) of the correct labels given by the MMAction2 to all the experiments regarding the avatars.
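The twelve size/placement combinations per video can be enumerated as in the following sketch; the concrete scale factors and pixel offsets are illustrative placeholders, not the exact values used in our experiments.

# Enumerating the twelve size/placement variants tested per action video
# (scale factors and pixel offsets are illustrative).
from itertools import product

scales = {"original": 1.0, "smaller": 0.75, "bigger": 1.25}
shifts = {"left": (-50, 0), "right": (50, 0), "up": (0, -50), "down": (0, 50)}

variants = [(size, factor, place, offset)
            for (size, factor), (place, offset) in product(scales.items(), shifts.items())]
print(len(variants))  # 3 sizes x 4 placements = 12 combinations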

4.2. Segmentation

The next experimental phase consisted of evaluating segmentation results to analytically assess the fit of the PARE model. We generated masks of the actors and avatars using Detectron2 [38], employing two segmentation models already included in the framework: Instance Segmentation and Panoptic Segmentation. For Instance Segmentation, we utilized the Mask R-CNN [58] architecture with the ResNet-50 backbone, while for Panoptic Segmentation, we employed the Panoptic FPN architecture [59] with the ResNet-101 backbone. Table 4 illustrates the segmentation results for the four aforementioned cases.
Table 4. Generated mask of the actor and avatar in the first frame from the original video (first row) and from the output video containing the generated avatar in the input video’s background (second row) using Detectron2.
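The following is a minimal sketch of producing person masks with Detectron2's model zoo, shown here only for the Instance Segmentation case (Mask R-CNN with the ResNet-50 FPN backbone); the score threshold and input path are assumptions made for the example.

# Person masks with Detectron2 (Instance Segmentation, Mask R-CNN R-50 FPN).
import cv2
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5      # assumed threshold

predictor = DefaultPredictor(cfg)
frame = cv2.imread("frames/avatar_frame_000.png")  # assumed path to a first frame
outputs = predictor(frame)

instances = outputs["instances"].to("cpu")
person_masks = instances.pred_masks[instances.pred_classes == 0]  # COCO class 0 = person
print(person_masks.shape)  # (num_people, H, W) boolean masks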
Afterward, we calculated the intersection over union (IoU) between the actor and avatar masks, since it quantifies the accuracy and efficacy of our segmentation process. Table 5 displays the results obtained for the masks of the four videos.
Table 5. IoU results using the Instance and Panoptic Segmentation.
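For completeness, the IoU between a pair of binary masks (actor versus avatar) can be computed as in the following reference implementation; this is a straightforward re-implementation of the metric rather than the exact script used in our evaluation.

# Intersection over union of two binary masks.
import numpy as np

def mask_iou(mask_a, mask_b):
    """Both inputs are boolean arrays of identical shape."""
    intersection = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return float(intersection) / float(union) if union > 0 else 0.0

# Toy example: two 4x4 masks overlapping on one column -> IoU = 4 / 12 ≈ 0.33
a = np.zeros((4, 4), dtype=bool); a[:, :2] = True
b = np.zeros((4, 4), dtype=bool); b[:, 1:3] = True
print(round(mask_iou(a, b), 2))  # 0.33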
Table 6 shows the positive effect of caching the avatars' information for the four actions we tested, measured on a GTX 1080 graphics processing unit (GPU).
Table 6. Process time for the avatar generation, before and after the avatars are cached, using a GTX 1080 GPU.

5. Conclusions

In this article, we propose a semi-automated mechanism that generates scenes with synthetic humans performing various actions. To achieve this, we implemented a web application that allows users to select input videos of humans performing actions and automatically extracts a 3D model of each person that can be inserted into other videos or backgrounds, with the synthetic humans generated by the PARE algorithm. The application also allows users to manipulate the 3D model's position and scale, enabling further customization of the output scene.
We also introduced two evaluation methodologies to assess our proposal. The first assesses whether our outputs can pass for videos of actual humans performing actions. To do so, we applied the MMAction2 framework to videos processed with our proposal and analyzed whether the predicted actions matched the original actions of the input videos. The results showed that this was achieved for simple actions but failed in cases where the actions involved interaction with other objects. The second evaluation methodology consisted of assessing the PARE-generated models by comparing segmentation masks. We observed that for complex actions, the resulting segmentation masks could not be reliably used to assess the 3D models, whereas for simpler actions this type of assessment can indeed be used. The avatars' appearance may be visually similar to the background or other objects in the scene, leading to possible confusion for the algorithm and difficulties in accurately segmenting the avatars. This suggests that further research on assessment methodologies for objectively evaluating the quality of 3D-generated models is required.
Lastly, our contribution extends beyond works like SURREAL by providing a unique platform combining personalization, realism, and flexibility. Furthermore, our platform goes beyond the capabilities of SURREAL’s work by empowering users to generate realistic content that reflects their personal preferences and creative vision. We understand that each user has unique requirements and desires when creating virtual content, and our platform embraces this diversity by offering a wide range of customization options. By providing a more personalized approach, we enable users to tailor their generated content to specific scenarios or styles.

Author Contributions

Conceptualization, P.C., L.C.-R. and A.P.; methodology, P.C., L.C.-R. and A.P.; software, A.R.; investigation, A.R.; writing—original draft, A.R.; writing—review and editing, P.C., L.C.-R. and A.P.; supervision, P.C., L.C.-R. and A.P. All authors have read and agreed to the published version of the manuscript.

Funding

The work was funded by the European Union’s Horizon Europe research and innovation program under Grant Agreement No 101094831 (Project Converge-Telecommunications and Computer Vision Convergence Tools for Research Infrastructures). Américo was funded by National Funds through the Portuguese funding agency, FCT—Fundação para a Ciência e a Tecnologia, within the PhD grant SFRH/BD/146400/2019.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The resulting dataset obtained by inserting synthetic humans into other backgrounds can be found at the public link https://mct.inesctec.pt/synthesizing-human-activity-for-data-generation (accessed on 26 September 2023).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Nie, B.X.; Wei, P.; Zhu, S.C. Monocular 3D Human Pose Estimation by Predicting Depth on Joints. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar]
  2. Tian, Y.; Zhang, H.; Liu, Y.; Wang, L. Recovering 3D Human Mesh from Monocular Images: A Survey. arXiv 2022, arXiv:2203.01923. [Google Scholar] [CrossRef] [PubMed]
  3. Pavlakos, G.; Choutas, V.; Ghorbani, N.; Bolkart, T.; Osman, A.A.A.; Tzionas, D.; Black, M.J. Expressive Body Capture: 3D Hands, Face, and Body from a Single Image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 10975–10985. [Google Scholar]
  4. Loper, M.; Mahmood, N.; Romero, J.; Pons-Moll, G.; Black, M.J. SMPL: A Skinned Multi-Person Linear Model. ACM Trans. Graph. 2015, 34, 1–16. [Google Scholar] [CrossRef]
  5. Romero, J.M.; Tzionas, D.; Black, M.J. Embodied Hands: Modeling and Capturing Hands and Bodies Together. ACM Trans. Graph. 2017, 36, 245. [Google Scholar] [CrossRef]
  6. Li, T.; Bolkart, T.; Black, M.J.; Li, H.; Romero, J. Learning a model of facial shape and expression from 4D scans. ACM Trans. Graph. 2017, 36, 194:1–194:17. [Google Scholar] [CrossRef]
  7. Ionescu, C.; Papava, D.; Olaru, V.; Sminchisescu, C. Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 36, 1325–1339. [Google Scholar] [CrossRef] [PubMed]
  8. Yin, X.; Yu, X.; Sohn, K.; Liu, X.; Chandraker, M. Feature Transfer Learning for Face Recognition With Under-Represented Data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  9. Rogez, G.; Schmid, C. MoCap-guided Data Augmentation for 3D Pose Estimation in the Wild. In Proceedings of the Advances in Neural Information Processing Systems; Lee, D., Sugiyama, M., Luxburg, U., Guyon, I., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2016; Volume 29. [Google Scholar]
  10. Varol, G.; Romero, J.; Martin, X.; Mahmood, N.; Black, M.J.; Laptev, I.; Schmid, C. Learning from Synthetic Humans. In Proceedings of the CVPR, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  11. Aggarwal, J.K.; Cai, Q. Human motion analysis: A review. Comput. Vis. Image Underst. 1999, 73, 428–440. [Google Scholar] [CrossRef]
  12. Cao, Q.; Shen, L.; Xie, W.; Parkhi, O.M.; Zisserman, A. VGGFace2: A Dataset for Recognising Faces across Pose and Age. In Proceedings of the 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), Xi’an, China, 15–19 May 2018; pp. 67–74. [Google Scholar] [CrossRef]
  13. Hilton, A.; Beresford, D.; Gentils, T.; Smith, R.; Sun, W. Virtual people: Capturing human models to populate virtual worlds. In Proceedings of the Computer Animation 1999, Geneva, Switzerland, 26–29 May 1999; pp. 174–185. [Google Scholar] [CrossRef]
  14. Reed, M.P.; Raschke, U.; Tirumali, R.; Parkinson, M.B. Developing and implementing parametric human body shape models in ergonomics software. In Proceedings of the 3rd International Digital Human Modeling Conference, Tokyo, Japan, 20–22 May 2014. [Google Scholar]
  15. Huang, Z.; Xu, Y.; Lassner, C.; Li, H.; Tung, T. ARCH: Animatable Reconstruction of Clothed Humans. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  16. Suma, E.A.; Lange, B.; Rizzo, A.S.; Krum, D.M.; Bolas, M. FAAST: The Flexible Action and Articulated Skeleton Toolkit. In Proceedings of the 2011 IEEE Virtual Reality Conference, Singapore, 19–23 March 2011; pp. 247–248. [Google Scholar] [CrossRef]
  17. Grau, O.; Price, M.C.; Thomas, G.A. Use of 3d techniques for virtual production. In Proceedings of the Videometrics and Optical Methods for 3D Shape Measurement, San Jose, CA, USA, 22 December 2000; Volume 4309, pp. 40–50. [Google Scholar]
  18. Hoffmann, D.T.; Tzionas, D.; Black, M.J.; Tang, S. Learning to train with synthetic humans. In Proceedings of the Pattern Recognition: 41st DAGM German Conference, DAGM GCPR 2019, Dortmund, Germany, 10–13 September 2019; pp. 609–623. [Google Scholar]
  19. Kim, Y.w.; Mishra, S.; Jin, S.; Panda, R.; Kuehne, H.; Karlinsky, L.; Saligrama, V.; Saenko, K.; Oliva, A.; Feris, R. How Transferable are Video Representations Based on Synthetic Data? Adv. Neural Inf. Process. Syst. 2022, 35, 35710–35723. [Google Scholar]
  20. Joshi, I.; Grimmer, M.; Rathgeb, C.; Busch, C.; Bremond, F.; Dantcheva, A. Synthetic data in human analysis: A survey. arXiv 2022, arXiv:2208.09191. [Google Scholar]
  21. Kanazawa, A.; Black, M.J.; Jacobs, D.W.; Malik, J. End-to-End Recovery of Human Shape and Pose. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  22. Hagbi, N.; Bergig, O.; El-Sana, J.; Billinghurst, M. Shape Recognition and Pose Estimation for Mobile Augmented Reality. IEEE Trans. Vis. Comput. Graph. 2011, 17, 1369–1379. [Google Scholar] [CrossRef]
  23. Zhou, X.; Huang, Q.; Sun, X.; Xue, X.; Wei, Y. Towards 3D Human Pose Estimation in the Wild: A Weakly-Supervised Approach. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar]
  24. Kocabas, M.; Athanasiou, N.; Black, M.J. VIBE: Video Inference for Human Body Pose and Shape Estimation. arXiv 2019, arXiv:1912.05656. [Google Scholar] [CrossRef]
  25. Kocabas, M.; Huang, C.P.; Hilliges, O.; Black, M.J. PARE: Part Attention Regressor for 3D Human Body Estimation. arXiv 2021, arXiv:2104.08527. Available online: http://xxx.lanl.gov/abs/2104.08527 (accessed on 26 September 2023).
  26. Güler, R.A.; Neverova, N.; Kokkinos, I. Densepose: Dense human pose estimation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7297–7306. [Google Scholar]
  27. Baradel, F.; Groueix, T.; Weinzaepfel, P.; Brégier, R.; Kalantidis, Y.; Rogez, G. Leveraging MoCap Data for Human Mesh Recovery. In Proceedings of the 2021 International Conference on 3D Vision (3DV), London, UK, 1–3 December 2021. [Google Scholar] [CrossRef]
  28. Kolotouros, N.; Pavlakos, G.; Black, M.J.; Daniilidis, K. Learning to Reconstruct 3D Human Pose and Shape via Model-Fitting in the Loop. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
  29. Akhter, I.; Black, M.J. Pose-Conditioned Joint Angle Limits for 3D Human Pose Reconstruction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–15 June 2015. [Google Scholar]
  30. Zheng, C.; Wu, W.; Chen, C.; Yang, T.; Zhu, S.; Shen, J.; Kehtarnavaz, N.; Shah, M. Deep Learning-Based Human Pose Estimation: A Survey. ACM Comput. Surv. 2023, 56, 11. [Google Scholar] [CrossRef]
  31. Cai, Y.; Ge, L.; Liu, J.; Cai, J.; Cham, T.J.; Yuan, J.; Thalmann, N.M. Exploiting Spatial-Temporal Relationships for 3D Pose Estimation via Graph Convolutional Networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
  32. Wang, J.; Yan, S.; Xiong, Y.; Lin, D. Motion guided 3d pose estimation from videos. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 764–780. [Google Scholar]
  33. Zhang, H.; Tian, Y.; Zhou, X.; Ouyang, W.; Liu, Y.; Wang, L.; Sun, Z. PyMAF: 3D Human Pose and Shape Regression with Pyramidal Mesh Alignment Feedback Loop. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 11446–11456. [Google Scholar]
  34. Chen, K.; Gabriel, P.; Alasfour, A.; Gong, C.; Doyle, W.K.; Devinsky, O.; Friedman, D.; Dugan, P.; Melloni, L.; Thesen, T.; et al. Patient-Specific Pose Estimation in Clinical Environments. IEEE J. Transl. Eng. Health Med. 2018, 6, 1–11. [Google Scholar] [CrossRef] [PubMed]
  35. Erol, A.; Bebis, G.; Nicolescu, M.; Boyle, R.D.; Twombly, X. Vision-Based Hand Pose Estimation: A Review. Comput. Vis. Image Underst. 2007, 108, 52–73. [Google Scholar] [CrossRef]
  36. Fastovets, M.; Guillemaut, J.Y.; Hilton, A. Athlete Pose Estimation from Monocular TV Sports Footage. In Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition Workshops, Portland, OR, USA, 23–28 June 2013; pp. 1048–1054. [Google Scholar] [CrossRef]
  37. MMAction2 Contributors. OpenMMLab’s Next Generation Video Understanding Toolbox and Benchmark. 2020. Available online: https://github.com/open-mmlab/mmaction2 (accessed on 26 September 2023).
  38. Wu, Y.; Kirillov, A.; Massa, F.; Lo, W.Y.; Girshick, R. Detectron2. 2019. Available online: https://github.com/facebookresearch/detectron2 (accessed on 26 September 2023).
  39. Muhammad, Z.U.D.; Huang, Z.; Khan, R. A review of 3D human body pose estimation and mesh recovery. Digit. Signal Process. 2022, 128, 103628. [Google Scholar] [CrossRef]
  40. Pareek, T.G.; Mehta, U.; Gupta, A. A survey: Virtual reality model for medical diagnosis. Biomed. Pharmacol. J. 2018, 11, 2091–2100. [Google Scholar] [CrossRef]
  41. Wang, T.; Zhang, B.; Zhang, T.; Gu, S.; Bao, J.; Baltrusaitis, T.; Shen, J.; Chen, D.; Wen, F.; Chen, Q.; et al. RODIN: A Generative Model for Sculpting 3D Digital Avatars Using Diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 4563–4573. [Google Scholar]
  42. Cheok, A.; Weihua, W.; Yang, X.; Prince, S.; Wan, F.S.; Billinghurst, M.; Kato, H. Interactive theatre experience in embodied + wearable mixed reality space. In Proceedings of the International Symposium on Mixed and Augmented Reality, Darmstadt, Germany, 30 September–1 October 2002; pp. 59–317. [Google Scholar] [CrossRef]
  43. Chen, Y.; Tian, Y.; He, M. Monocular human pose estimation: A survey of deep learning-based methods. Comput. Vis. Image Underst. 2020, 192, 102897. [Google Scholar] [CrossRef]
  44. Joshi, P.; Tien, W.C.; Desbrun, M.; Pighin, F. Learning Controls for Blend Shape Based Realistic Facial Animation. In Proceedings of the ACM SIGGRAPH 2006 Courses, New York, NY, USA, 30 July–3 August 2006; SIGGRAPH ’06. p. 17-es. [Google Scholar] [CrossRef]
  45. Dantone, M.; Gall, J.; Leistner, C.; Van Gool, L. Human Pose Estimation Using Body Parts Dependent Joint Regressors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Portland, OR, USA, 23–28 June 2013. [Google Scholar]
  46. Bogo, F.; Kanazawa, A.; Lassner, C.; Gehler, P.; Romero, J.; Black, M.J. Keep it SMPL: Automatic Estimation of 3D Human Pose and Shape from a Single Image. In Proceedings of the Computer Vision—ECCV 2016, Amsterdam, The Netherlands, 11–14 October 2016. Lecture Notes in Computer Science. [Google Scholar]
  47. Cao, C.; Weng, Y.; Zhou, S.; Tong, Y.; Zhou, K. FaceWarehouse: A 3D Facial Expression Database for Visual Computing. IEEE Trans. Vis. Comput. Graph. 2014, 20, 413–425. [Google Scholar] [CrossRef]
  48. Paysan, P.; Knothe, R.; Amberg, B.; Romdhani, S.; Vetter, T. A 3D Face Model for Pose and Illumination Invariant Face Recognition. In Proceedings of the 2009 Sixth IEEE International Conference on Advanced Video and Signal Based Surveillance, Genova, Italy, 2–4 September 2009; pp. 296–301. [Google Scholar] [CrossRef]
  49. Sigal, L.; Balan, A.O.; Black, M.J. Humaneva: Synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion. Int. J. Comput. Vis. 2010, 87, 4–27. [Google Scholar] [CrossRef]
  50. Mahmood, N.; Ghorbani, N.; Troje, N.F.; Pons-Moll, G.; Black, M.J. AMASS: Archive of Motion Capture as Surface Shapes. In Proceedings of the International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 5442–5451. [Google Scholar]
  51. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the Computer Vision—ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; pp. 740–755, Proceedings, Part V 13. [Google Scholar]
  52. Devlin, J.; Chang, M.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv 2018, arXiv:1810.04805. Available online: http://xxx.lanl.gov/abs/1810.04805 (accessed on 26 September 2023).
  53. Blender. Available online: https://www.blender.org/ (accessed on 25 December 2022).
  54. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  55. von Marcard, T.; Henschel, R.; Black, M.; Rosenhahn, B.; Pons-Moll, G. Recovering Accurate 3D Human Pose in The Wild Using IMUs and a Moving Camera. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
  56. Soomro, K.; Zamir, A.R.; Shah, M. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv 2012, arXiv:1212.0402. [Google Scholar]
  57. Kuehne, H.; Jhuang, H.; Garrote, E.; Poggio, T.; Serre, T. HMDB: A large video database for human motion recognition. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 2556–2563. [Google Scholar] [CrossRef]
  58. He, K.; Gkioxari, G.; Dollar, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar]
  59. Kirillov, A.; Girshick, R.; He, K.; Dollar, P. Panoptic Feature Pyramid Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
