Lester: Rotoscope Animation through Video Object Segmentation and Tracking
Abstract
1. Introduction
2. Related Work
3. Methodology
3.1. Overview
3.2. Segmentation and Tracking
3.3. Contours Simplification
Algorithm 1 Contour simplification algorithm
Require: M, a list containing submasks; a minimum contour area; t, a tolerance value for simplification
Ensure: R, a list of contours where R_{i,j} is the simplified contour j of submask i
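The simplification step named in Algorithm 1 is based on the classic Douglas–Peucker line reduction. The following is a minimal pure-Python sketch of that idea, not the paper's actual implementation (which would typically use OpenCV's border-following contour extraction and `approxPolyDP`); the function names and the shoelace area filter are illustrative assumptions.

```python
import math

def perpendicular_distance(p, a, b):
    # Distance from point p to the line through segment endpoints a and b.
    (x, y), (x1, y1), (x2, y2) = p, a, b
    dx, dy = x2 - x1, y2 - y1
    if dx == 0 and dy == 0:
        return math.hypot(x - x1, y - y1)
    return abs(dy * x - dx * y + x2 * y1 - y2 * x1) / math.hypot(dx, dy)

def simplify_contour(points, t):
    # Douglas-Peucker: keep the endpoints and recurse on the farthest
    # interior point if it deviates from the chord by more than t.
    if len(points) < 3:
        return list(points)
    dmax, index = 0.0, 0
    for i in range(1, len(points) - 1):
        d = perpendicular_distance(points[i], points[0], points[-1])
        if d > dmax:
            dmax, index = d, i
    if dmax <= t:
        return [points[0], points[-1]]
    left = simplify_contour(points[:index + 1], t)
    right = simplify_contour(points[index:], t)
    return left[:-1] + right

def polygon_area(points):
    # Shoelace formula for the area of a closed polygon.
    n = len(points)
    s = sum(points[i][0] * points[(i + 1) % n][1]
            - points[(i + 1) % n][0] * points[i][1] for i in range(n))
    return abs(s) / 2

def simplify_submask_contours(contours, a, t):
    # Algorithm 1, sketched for one submask: discard contours whose area
    # is below the minimum a, then simplify each survivor with tolerance t.
    return [simplify_contour(c, t) for c in contours if polygon_area(c) >= a]
```

For example, a nearly straight five-point polyline collapses to its two endpoints once every interior point deviates from the chord by less than the tolerance, while a square's corners all survive because each lies far from the chord joining its neighbors.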
3.4. Finishing Details
4. Experiments and Results
5. Conclusions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
| ACR | Absolute Category Rating |
|---|---|
| GAN | Generative Adversarial Networks |
| HOG | Histogram of Oriented Gradients |
| MOS | Mean Opinion Score |
| SAM | Segment Anything Model |
| Subset | # | Shape | Color | Temporal | Overall |
|---|---|---|---|---|---|
| ITW-LR a | 5 | 3.20 | 3.24 | 3.84 | 3.36 |
| ITW-HR b | 14 | 3.81 | 4.03 | 4.33 | 3.99 |
| C-HR c | 6 | 4.70 | 4.60 | 4.67 | 4.67 |
| all | 25 | 3.90 | 4.01 | 4.31 | 4.02 |
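The per-criterion values in the "all" row are consistent (to two decimals for Shape, Color and Temporal) with a count-weighted mean of the three subsets; whether the paper aggregates this way or averages per video is an assumption. A sketch of that aggregation:

```python
# MOS values from the results table; weighting by clip count (#) is an
# assumption about how the "all" row was computed.
subsets = {
    "ITW-LR": (5,  {"Shape": 3.20, "Color": 3.24, "Temporal": 3.84, "Overall": 3.36}),
    "ITW-HR": (14, {"Shape": 3.81, "Color": 4.03, "Temporal": 4.33, "Overall": 3.99}),
    "C-HR":   (6,  {"Shape": 4.70, "Color": 4.60, "Temporal": 4.67, "Overall": 4.67}),
}

def weighted_mos(criterion):
    # Count-weighted mean of a criterion's MOS across all subsets.
    total = sum(n for n, _ in subsets.values())
    return sum(n * scores[criterion] for n, scores in subsets.values()) / total
```

Here `weighted_mos("Shape")` yields about 3.90, matching the table's "all" row for that column.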
© 2024 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Tous, R. Lester: Rotoscope Animation through Video Object Segmentation and Tracking. Algorithms 2024, 17, 330. https://doi.org/10.3390/a17080330