The Generation of Articulatory Animations Based on Keypoint Detection and Motion Transfer Combined with Image Style Transfer
Abstract
1. Introduction
1.1. Related Research
1.2. Key Technologies in Articulatory Animation Generation
- It employs a style-based generator architecture that explicitly controls the style and content of the generated images;
- It uses a mapping network to transform the random-noise distribution, which yields more diverse generation results;
- It introduces a feature-mapping technique that can synthesize high-quality, high-resolution images;
- It injects noise inputs, which sharpens image detail and makes the style more coherent (the mapping network and noise injection are sketched below);
- It can generate highly realistic scene images.
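To make the first two points above concrete, the following is a minimal PyTorch sketch of the mapping-network and noise-injection ideas. It is an illustrative simplification with placeholder layer sizes, not the official StyleGAN implementation.

```python
# Minimal sketch, assuming PyTorch; layer widths and depths are placeholders.
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    """Maps Gaussian noise z to an intermediate style code w (StyleGAN-style idea)."""
    def __init__(self, z_dim=512, w_dim=512, depth=8):
        super().__init__()
        layers = []
        for i in range(depth):
            layers += [nn.Linear(z_dim if i == 0 else w_dim, w_dim), nn.LeakyReLU(0.2)]
        self.net = nn.Sequential(*layers)

    def forward(self, z):
        return self.net(z)

class NoiseInjection(nn.Module):
    """Adds learned, per-channel-scaled noise to a feature map for stochastic detail."""
    def __init__(self, channels):
        super().__init__()
        self.scale = nn.Parameter(torch.zeros(1, channels, 1, 1))

    def forward(self, x):
        noise = torch.randn(x.shape[0], 1, x.shape[2], x.shape[3], device=x.device)
        return x + self.scale * noise

# Usage: style codes for a batch of four latent vectors, plus noise on a feature map.
z = torch.randn(4, 512)
w = MappingNetwork()(z)            # style codes that would condition the generator
feat = NoiseInjection(64)(torch.randn(4, 64, 32, 32))
```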
1.3. Main Innovation of Our Method
- Different occasions call for pronunciation animations in different styles, and drawing them by hand is very labor-intensive. Given a specific style image and a driving articulation animation, our system automatically generates an articulation animation in the target style, which greatly reduces the manual drawing workload;
- Our method combines keypoint extraction and keypoint registration to generate articulation animations automatically. First, the keypoint extraction network extracts keypoints from the target image; then, keypoints are extracted from the keyframes of the driving video; next, the two sets of keypoints are registered against each other (see the sketch after this list). Pixel motion in the target animation is then predicted through a dense motion field, and the animation is generated with a GAN;
- To improve the accuracy and visual quality of the generated animation, we also provide a manual configuration method. If the keypoints extracted from the target image and from the driving keyframe do not match well or contain errors, they can be calibrated by hand. This greatly improves the accuracy and visual quality of the generated video while adding only a very small amount of manual work;
- Given the articulatory animation for each vowel and consonant, we provide a method to combine them into syllable animations and to combine the animations of different syllables into word animations, helping English learners improve their pronunciation.
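As a concrete illustration of the registration step in the second point, the sketch below aligns driving-keyframe keypoints to target-image keypoints with a least-squares similarity transform. This is one common way to register two keypoint sets, not necessarily the exact procedure inside our network, and the keypoint coordinates are hypothetical examples.

```python
# Minimal sketch, assuming NumPy and OpenCV; the keypoint values are made up.
import numpy as np
import cv2

def register_keypoints(src_kp: np.ndarray, dst_kp: np.ndarray) -> np.ndarray:
    """Estimate a 2x3 similarity transform mapping src_kp (K, 2) onto dst_kp (K, 2)."""
    M, _inliers = cv2.estimateAffinePartial2D(src_kp.astype(np.float32),
                                              dst_kp.astype(np.float32),
                                              method=cv2.LMEDS)
    return M

# Hypothetical matched keypoints: driving keyframe -> target-style image.
driving_kp = np.array([[30, 40], [80, 42], [55, 90], [52, 120]], dtype=np.float32)
target_kp  = np.array([[35, 38], [86, 41], [60, 92], [57, 124]], dtype=np.float32)

M = register_keypoints(driving_kp, target_kp)
# Apply the affine transform manually: p' = A @ p + t.
aligned = driving_kp @ M[:, :2].T + M[:, 2]
print(aligned)  # driving keypoints expressed in the target image's coordinates
```

If the estimated alignment is poor, that is exactly the situation addressed by the manual calibration described in the third point.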
2. Method
2.1. Selection of Driving Video and Target-Style Image
- First, we collect publicly available articulatory animations to form our dataset; these animations serve both as driving animations and as training data;
- Second, we preprocess the dataset by detecting the articulatory region and cropping the region of interest, which cleans the training data and greatly improves subsequent training;
- Third, we design and train a deep learning network that performs keypoint detection, motion transfer, and style transfer. This network transfers the style of the target image onto the driving animation, producing an animation in the new style. During training, the network is given a series of video samples containing different instances of the same object and learns a latent representation of video motion; from the features of a single image frame, the model can then reconstruct the video;
- Fourth, we cascade the animations: consonant and vowel animations are cascaded into syllable animations, syllable animations into word animations, and word animations into sentence animations (a minimal sketch of this cascading step follows the list);
- Fifth, the animation is manually refined where needed, for example to reflect stress, tone, and liaison, so that a relatively complete system is obtained. The system framework is shown in Figure 3.
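To make the fourth step concrete, the sketch below cascades per-phoneme frame folders (all frames the same size, e.g. the 256 × 256 × 3 frames produced by the preprocessing in Section 3.1) into a single word animation, with a short crossfade at each junction. The folder names, example word, and crossfade length are placeholders rather than values from the paper.

```python
# Minimal sketch, assuming imageio; all frames must share the same shape.
import os
import glob
import numpy as np
import imageio.v2 as imageio

def load_frames(phoneme_dir):
    files = sorted(glob.glob(os.path.join(phoneme_dir, "*.png")))
    return [imageio.imread(f) for f in files]

def cascade(phoneme_dirs, crossfade=4):
    frames = []
    for d in phoneme_dirs:
        clip = load_frames(d)
        k = min(crossfade, len(frames), len(clip))
        for i in range(k):
            # Blend the tail of the previous clip with the head of the next one.
            a = (i + 1) / (k + 1)
            j = len(frames) - k + i
            frames[j] = ((1 - a) * frames[j] + a * clip[i]).astype(np.uint8)
        frames.extend(clip[k:])
    return frames

# e.g. the word "map": /m/ + /ae/ + /p/ (folder names are placeholders).
word_frames = cascade(["anim/m", "anim/ae", "anim/p"])
imageio.mimsave("map.mp4", word_frames, fps=25)  # requires the imageio-ffmpeg backend
```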
2.2. Keypoint Detection in Feature Space
2.3. Style Transfer Model for Articulation Image
2.4. Animation Generation for Syllables and Words
3. Experiments and Results
3.1. Dataset Preparation
- The articulation videos were extracted into frame-by-frame images in RGB format, and all frames of the same video were saved in the same directory. The number of frames per video varied, from a minimum of 10 to a maximum of 120;
- The original frames were large, most of the area was static, and the moving parts were concentrated in the lower-right corner. We therefore used software to automatically crop a square region (equal width and height) from the lower-right corner of each frame;
- We then resized the cropped region to 256 pixels in height and width with three channels, finally obtaining 256 × 256 × 3 RGB images (a preprocessing sketch follows this list).
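The three steps above can be reproduced with a short OpenCV script like the one below: extract frames from a video, crop a square anchored at the lower-right corner, and resize it to 256 × 256. The file paths and the initial crop side length are placeholders; the paper used its own tooling.

```python
# Minimal sketch, assuming OpenCV; paths and crop_side are placeholder values.
import os
import cv2

def preprocess_video(video_path, out_dir, crop_side=480, size=256):
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        h, w = frame.shape[:2]
        side = min(crop_side, h, w)
        roi = frame[h - side:h, w - side:w]          # square anchored at the lower-right corner
        roi = cv2.resize(roi, (size, size), interpolation=cv2.INTER_AREA)
        # OpenCV keeps frames in BGR order; convert with cv2.cvtColor if RGB arrays are needed in memory.
        cv2.imwrite(os.path.join(out_dir, f"{idx:04d}.png"), roi)
        idx += 1
    cap.release()

# One output directory per video, matching the layout described above.
preprocess_video("videos/vowel_ae.mp4", "frames/vowel_ae")
```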
3.2. Evaluation Method
3.3. Training
3.4. The Results of Keypoint Detection
3.5. Animation Generation
3.6. Discussion
- The process can generate high-resolution and high-quality images. The images are rich in detail and the quality is close to that of real pictures;
- The training process is more stable and can handle complex, high-resolution images;
- The generation process is deterministic: the same image is produced each time the same noise is input (illustrated in the sketch after this list);
- The model framework is simple and easy to implement.
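The determinism noted in the third point can be checked with a few lines of PyTorch: fixing the seed of the noise input reproduces the same output on every call. The single linear layer below is only a stand-in for the trained generation network, not the paper's model.

```python
# Minimal sketch, assuming PyTorch; the linear layer is a placeholder generator.
import torch
import torch.nn as nn

generator = nn.Linear(512, 256 * 256 * 3)

def generate(seed: int):
    g = torch.Generator().manual_seed(seed)   # fixed noise source
    z = torch.randn(1, 512, generator=g)      # same seed -> same noise vector
    with torch.no_grad():
        return generator(z)

# The same noise input yields an identical output every time.
assert torch.equal(generate(42), generate(42))
```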
4. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
| Types | Video Automatically Generated by the Model | Video Hand-Drawn in MAYA Software |
|---|---|---|
| Consonants | 56.22% | 43.78% |
| Vowels | 48.17% | 51.83% |