From Pixels to Motion: A Systematic Analysis of Translation-Based Video Synthesis Techniques
Abstract
1. Introduction
2. Translation-Based Video Synthesis
3. Image-to-Video Translation
4. Video-to-Video Translation
4.1. Paired Video-to-Video Translation
4.2. Unpaired Video-to-Video Translation
4.2.1. 3D GAN-Based Approaches
4.2.2. Temporal Constraint-Based Approaches
4.2.3. Optical Flow-Based Approaches
4.2.4. Content-Motion Disentanglement Learning Approaches
4.2.5. Extended Image-to-Image Approaches
5. Datasets, Evaluation Metrics, and Loss Functions
5.1. Datasets
5.2. Loss Functions
5.2.1. Adversarial Loss
5.2.2. Reconstruction Loss
5.2.3. Temporal Consistency Loss
5.2.4. Cycle Consistency Loss
5.2.5. Specialized Loss
5.3. Evaluation Metrics
5.3.1. Spatial Quality Metrics
5.3.2. Temporal Consistency Metrics
5.3.3. Semantic Consistency Metrics
5.3.4. Video Object Segmentation Metrics
5.3.5. Perceptual and Human Evaluation Metrics
6. Quantitative Comparison
7. Future Directions and Conclusions
7.1. Future Research Directions
7.2. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| ACD | Average Content Distance |
| CK+ | Extended Cohn–Kanade Dataset |
| CK++ | Augmented Cohn–Kanade Dataset |
| FID | Fréchet Inception Distance |
| FPS | Frames Per Second |
| FVD | Fréchet Video Distance |
| GAN | Generative Adversarial Network |
| GTA | Grand Theft Auto |
| I2I | Image-to-Image |
| I2V | Image-to-Video |
| IoU | Intersection over Union |
| IS | Inception Score |
| LPIPS | Learned Perceptual Image Patch Similarity |
| LSTM | Long Short-Term Memory |
| mIoU | Mean Intersection over Union |
| MSE | Mean Squared Error |
| PA | Pixel Accuracy |
| PSNR | Peak Signal-to-Noise Ratio |
| RNN | Recurrent Neural Network |
| SAF | Signal-to-Noise Ratio Aligned Fine-tuning |
| SfM | Structure from Motion |
| SIFT | Scale-Invariant Feature Transform |
| SNR | Signal-to-Noise Ratio |
| SPADE | Spatially-Adaptive Normalization |
| SSIM | Structural Similarity Index |
| TVS | Translation-based Video Synthesis |
| UCF-101 | University of Central Florida Action Recognition Dataset |
| V2V | Video-to-Video |
References
- Wang, T.C.; Liu, M.Y.; Zhu, J.Y.; Liu, G.; Tao, A.; Kautz, J.; Catanzaro, B. Video-to-Video Synthesis. In Proceedings of the Conference on Neural Information Processing Systems (NeurIPS), Montreal, QC, Canada, 3–8 December 2018. [Google Scholar]
- Shen, G.; Huang, W.; Gan, C.; Tan, M.; Huang, J.; Zhu, W.; Gong, B. Facial image-to-video translation by a hidden affine transformation. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; pp. 2505–2513. [Google Scholar]
- Tulyakov, S.; Liu, M.Y.; Yang, X.; Kautz, J. Mocogan: Decomposing motion and content for video generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1526–1535. [Google Scholar]
- Isola, P.; Zhu, J.Y.; Zhou, T.; Efros, A.A. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1125–1134. [Google Scholar]
- Zhu, J.Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2223–2232. [Google Scholar]
- Choi, Y.; Choi, M.; Kim, M.; Ha, J.W.; Kim, S.; Choo, J. Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8789–8797. [Google Scholar]
- Bashkirova, D.; Usman, B.; Saenko, K. Unsupervised video-to-video translation. arXiv 2018, arXiv:1806.03698. [Google Scholar]
- Chen, Y.; Pan, Y.; Yao, T.; Tian, X.; Mei, T. Mocycle-gan: Unpaired video-to-video translation. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; pp. 647–655. [Google Scholar]
- Fan, L.; Huang, W.; Gan, C.; Huang, J.; Gong, B. Controllable image-to-video translation: A case study on facial expression generation. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 3510–3517. [Google Scholar]
- Zhao, L.; Peng, X.; Tian, Y.; Kapadia, M.; Metaxas, D.N. Towards image-to-video translation: A structure-aware approach via multi-stage generative adversarial networks. Int. J. Comput. Vis. 2020, 128, 2514–2533. [Google Scholar] [CrossRef]
- Wang, Y.; Chen, Z.; Xiaoyu, C.; Wei, Y.; Zhu, J.; Chen, J. FrameBridge: Improving Image-to-Video Generation with Bridge Models. In Proceedings of the Forty-Second International Conference on Machine Learning, Vancouver, BC, Canada, 13–19 July 2025. [Google Scholar]
- Wei, X.; Zhu, J.; Feng, S.; Su, H. Video-to-video translation with global temporal consistency. In Proceedings of the 26th ACM International Conference on Multimedia, Seoul, Republic of Korea, 22–26 October 2018; pp. 18–25. [Google Scholar]
- Mallya, A.; Wang, T.C.; Sapra, K.; Liu, M.Y. World-consistent video-to-video synthesis. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020, Proceedings, Part VIII; Springer: Berlin/Heidelberg, Germany, 2020; pp. 359–378. [Google Scholar]
- Bansal, A.; Ma, S.; Ramanan, D.; Sheikh, Y. Recycle-gan: Unsupervised video retargeting. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 119–135. [Google Scholar]
- Liu, H.; Li, C.; Lei, D.; Zhu, Q. Unsupervised video-to-video translation with preservation of frame modification tendency. Vis. Comput. 2020, 36, 2105–2116. [Google Scholar] [CrossRef]
- Wang, T.C.; Liu, M.Y.; Tao, A.; Liu, G.; Kautz, J.; Catanzaro, B. Few-shot Video-to-Video Synthesis. In Proceedings of the Conference on Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 8–14 December 2019. [Google Scholar]
- Park, K.; Woo, S.; Kim, D.; Cho, D.; Kweon, I.S. Preserving semantic and temporal consistency for unpaired video-to-video translation. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; pp. 1248–1257. [Google Scholar]
- Liu, K.; Gu, S.; Romero, A.; Timofte, R. Unsupervised multimodal video-to-video translation via self-supervised learning. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Virtual, 5–9 January 2021; pp. 1030–1040. [Google Scholar]
- Szeto, R.; El-Khamy, M.; Lee, J.; Corso, J.J. HyperCon: Image-to-video model transfer for video-to-video translation tasks. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Virtual, 5–9 January 2021; pp. 3080–3089. [Google Scholar]
- Skorokhodov, I.; Tulyakov, S.; Elhoseiny, M. Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 3626–3636. [Google Scholar]
- Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; pp. 6629–6640. [Google Scholar]
- Unterthiner, T.; Van Steenkiste, S.; Kurach, K.; Marinier, R.; Michalski, M.; Gelly, S. Towards accurate generative models of video: A new metric & challenges. arXiv 2018, arXiv:1812.01717. [Google Scholar]
- Melnik, A.; Ljubljanac, M.; Lu, C.; Yan, Q.; Ren, W.; Ritter, H. Video Diffusion Models: A Survey. arXiv 2024, arXiv:2405.03150. [Google Scholar] [CrossRef]
- Xing, Z.; Feng, Q.; Chen, H.; Dai, Q.; Hu, H.; Xu, H.; Wu, Z.; Jiang, Y.G. A survey on video diffusion models. ACM Comput. Surv. 2024, 57, 1–42. [Google Scholar] [CrossRef]
- Sun, W.; Tu, R.C.; Liao, J.; Tao, D. Diffusion model-based video editing: A survey. arXiv 2024, arXiv:2407.07111. [Google Scholar]
- Ho, J.; Salimans, T.; Gritsenko, A.; Chan, W.; Norouzi, M.; Fleet, D.J. Video diffusion models. Adv. Neural Inf. Process. Syst. 2022, 35, 8633–8646. [Google Scholar]
- Guo, X.; Zheng, M.; Hou, L.; Gao, Y.; Deng, Y.; Wan, P.; Zhang, D.; Liu, Y.; Hu, W.; Zha, Z.; et al. I2v-adapter: A general image-to-video adapter for diffusion models. In Proceedings of the ACM SIGGRAPH 2024 Conference Papers, Denver, CO, USA, 27 July–1 August 2024; pp. 1–12. [Google Scholar]
- Lucey, P.; Cohn, J.F.; Kanade, T.; Saragih, J.; Ambadar, Z.; Matthews, I. The extended cohn-kanade dataset (ck+): A complete dataset for action unit and emotion-specified expression. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition-Workshops, San Francisco, CA, USA, 13–18 June 2010; pp. 94–101. [Google Scholar]
- Lu, H.; Yang, G.; Fei, N.; Huo, Y.; Lu, Z.; Luo, P.; Ding, M. VDT: General-purpose Video Diffusion Transformers via Mask Modeling. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024; Kim, B., Yue, Y., Chaudhuri, S., Fragkiadaki, K., Khan, M., Sun, Y., Eds.; Volume 2024, pp. 19259–19286. [Google Scholar]
- Soomro, K.; Zamir, A.R.; Shah, M. UCF101: A Dataset of 101 Human Action Classes From Videos in The Wild; Technical Report CRCV-TR-12-01; Center for Research in Computer Vision, University of Central Florida: Orlando, FL, USA, 2012. [Google Scholar]
- Wang, T.C.; Liu, M.Y.; Zhu, J.Y.; Tao, A.; Kautz, J.; Catanzaro, B. High-resolution image synthesis and semantic manipulation with conditional gans. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8798–8807. [Google Scholar]
- Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 3213–3223. [Google Scholar]
- Perazzi, F.; Pont-Tuset, J.; McWilliams, B.; Van Gool, L.; Gross, M.; Sorkine-Hornung, A. A benchmark dataset and evaluation methodology for video object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 724–732. [Google Scholar]
- Longuet-Higgins, H.C. A computer algorithm for reconstructing a scene from two projections. Nature 1981, 293, 133–135. [Google Scholar] [CrossRef]
- Schonberger, J.L.; Frahm, J.M. Structure-from-motion revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 4104–4113. [Google Scholar]
- Park, T.; Liu, M.Y.; Wang, T.C.; Zhu, J.Y. Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2337–2346. [Google Scholar]
- Balakrishnan, G.; Zhao, A.; Dalca, A.V.; Durand, F.; Guttag, J. Synthesizing images of humans in unseen poses. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8340–8348. [Google Scholar]
- Liang, F.; Kodaira, A.; Xu, C.; Tomizuka, M.; Keutzer, K.; Marculescu, D. Looking Backward: Streaming Video-to-Video Translation with Feature Banks. In Proceedings of the Thirteenth International Conference on Learning Representations, Singapore, 24–28 April 2025. [Google Scholar]
- Liang, F.; Wu, B.; Wang, J.; Yu, L.; Li, K.; Zhao, Y.; Misra, I.; Huang, J.B.; Zhang, P.; Vajda, P.; et al. Flowvid: Taming imperfect optical flows for consistent video-to-video synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 8207–8216. [Google Scholar]
- Johnson, J.; Alahi, A.; Fei-Fei, L. Perceptual losses for real-time style transfer and super-resolution. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 694–711. [Google Scholar]
- Richter, S.R.; Vineet, V.; Roth, S.; Koltun, V. Playing for data: Ground truth from computer games. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 102–118. [Google Scholar]
- Richter, S.R.; Hayder, Z.; Koltun, V. Playing for benchmarks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2213–2222. [Google Scholar]
- Dosovitskiy, A.; Fischer, P.; Ilg, E.; Hausser, P.; Hazirbas, C.; Golkov, V.; Van Der Smagt, P.; Cremers, D.; Brox, T. Flownet: Learning optical flow with convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 2758–2766. [Google Scholar]
- Ilg, E.; Mayer, N.; Saikia, T.; Keuper, M.; Dosovitskiy, A.; Brox, T. Flownet 2.0: Evolution of optical flow estimation with deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2462–2470. [Google Scholar]
- Wang, K.; Akash, K.; Misu, T. Learning Temporally and Semantically Consistent Unpaired Video-to-video Translation Through Pseudo-Supervision From Synthetic Optical Flow. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 22 February–1 March 2022; Volume 36, pp. 2477–2486. [Google Scholar]
- Shi, X.; Gao, Z.; Lausen, L.; Wang, H.; Yeung, D.Y.; Wong, W.K.; Woo, W.C. Deep learning for precipitation nowcasting: A benchmark and a new model. Adv. Neural Inf. Process. Syst. 2017, 30, 5622–5632. [Google Scholar]
- Jiang, H.; Sun, D.; Jampani, V.; Yang, M.H.; Learned-Miller, E.; Kautz, J. Super slomo: High quality estimation of multiple intermediate frames for video interpolation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 9000–9008. [Google Scholar]
- Sun, D.; Yang, X.; Liu, M.Y.; Kautz, J. Pwc-net: Cnns for optical flow using pyramid, warping, and cost volume. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8934–8943. [Google Scholar]
- Caba Heilbron, F.; Escorcia, V.; Ghanem, B.; Carlos Niebles, J. Activitynet: A large-scale video benchmark for human activity understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 961–970. [Google Scholar]
- Rivoir, D.; Pfeiffer, M.; Docea, R.; Kolbinger, F.; Riediger, C.; Weitz, J.; Speidel, S. Long-term temporally consistent unpaired video translation from simulated surgical 3D data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 10–17 October 2021; pp. 3343–3353. [Google Scholar]
- Thies, J.; Zollhöfer, M.; Nießner, M. Deferred neural rendering: Image synthesis using neural textures. ACM Trans. Graph. (TOG) 2019, 38, 1–12. [Google Scholar] [CrossRef]
- Pfeiffer, M.; Funke, I.; Robu, M.R.; Bodenstedt, S.; Strenger, L.; Engelhardt, S.; Roß, T.; Clarkson, M.J.; Gurusamy, K.; Davidson, B.R.; et al. Generating large labeled data sets for laparoscopic image processing tasks using unpaired image-to-image translation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Shenzhen, China, 13–17 October 2019; Springer: Berlin/Heidelberg, Germany, 2019; pp. 119–127. [Google Scholar]
- Zhang, J.; Xu, C.; Liu, L.; Wang, M.; Wu, X.; Liu, Y.; Jiang, Y. Dtvnet: Dynamic time-lapse video generation via single still image. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 300–315. [Google Scholar]
- Wang, K.; Wu, Q.; Song, L.; Yang, Z.; Wu, W.; Qian, C.; He, R.; Qiao, Y.; Loy, C.C. Mead: A large-scale audio-visual dataset for emotional talking-face generation. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 700–717. [Google Scholar]
- Aifanti, N.; Papachristou, C.; Delopoulos, A. The MUG facial expression database. In Proceedings of the 11th International Workshop on Image Analysis for Multimedia Interactive Services WIAMIS 10, Garda, Italy, 12–14 April 2010; pp. 1–4. [Google Scholar]
- Rössler, A.; Cozzolino, D.; Verdoliva, L.; Riess, C.; Thies, J.; Nießner, M. FaceForensics++: Learning to Detect Manipulated Facial Images. In Proceedings of the International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
- Ionescu, C.; Papava, D.; Olaru, V.; Sminchisescu, C. Human3.6M: Large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 36, 1325–1339. [Google Scholar] [CrossRef]
- Luvizon, D.C.; Picard, D.; Tabia, H. 2d/3d pose estimation and action recognition using multitask deep learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 5137–5146. [Google Scholar]
- Li, Z.; Dekel, T.; Cole, F.; Tucker, R.; Snavely, N.; Liu, C.; Freeman, W.T. Learning the depths of moving people by watching frozen people. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4521–4530. [Google Scholar]
- Huang, X.; Cheng, X.; Geng, Q.; Cao, B.; Zhou, D.; Wang, P.; Lin, Y.; Yang, R. The apolloscape dataset for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 954–960. [Google Scholar]
- Dai, A.; Chang, A.X.; Savva, M.; Halber, M.; Funkhouser, T.; Nießner, M. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5828–5839. [Google Scholar]
- Xu, J.; Mei, T.; Yao, T.; Rui, Y. Msr-vtt: A large video description dataset for bridging video and language. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 5288–5296. [Google Scholar]
- Srivastava, N.; Mansimov, E.; Salakhudinov, R. Unsupervised learning of video representations using lstms. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; PMLR: Lille, France, 2015; pp. 843–852. [Google Scholar]
- Xu, X.; Dehghani, A.; Corrigan, D.; Caulfield, S.; Moloney, D. Convolutional neural network for 3d object recognition using volumetric representation. In Proceedings of the 2016 First International Workshop on Sensing, Processing and Learning for Intelligent Machines (SPLINE), Aalborg, Denmark, 6–8 July 2016; pp. 1–5. [Google Scholar]
- Akkus, Z.; Ali, I.; Sedlář, J.; Agrawal, J.P.; Parney, I.F.; Giannini, C.; Erickson, B.J. Predicting deletion of chromosomal arms 1p/19q in low-grade gliomas from MR images using machine intelligence. J. Digit. Imaging 2017, 30, 469–476. [Google Scholar] [CrossRef] [PubMed]
- Vallieres, M.; Kay-Rivest, E.; Perrin, L.J.; Liem, X.; Furstoss, C.; Aerts, H.J.; Khaouam, N.; Nguyen-Tan, P.F.; Wang, C.S.; Sultanem, K.; et al. Radiomics strategies for risk assessment of tumour failure in head-and-neck cancer. Sci. Rep. 2017, 7, 10117. [Google Scholar] [CrossRef] [PubMed]
- Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; pp. 2672–2680. [Google Scholar]
- Salimans, T.; Goodfellow, I.; Zaremba, W.; Cheung, V.; Radford, A.; Chen, X. Improved techniques for training gans. Adv. Neural Inf. Process. Syst. 2016, 29, 2234–2242. [Google Scholar]
- Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef]
- Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 586–595. [Google Scholar]
- Carreira, J.; Zisserman, A. Quo vadis, action recognition? a new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6299–6308. [Google Scholar]
- Wu, X.; Sun, K.; Zhu, F.; Zhao, R.; Li, H. Human preference score: Better aligning text-to-image models with human preference. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 2096–2105. [Google Scholar]
- Villegas, R.; Yang, J.; Zou, Y.; Sohn, S.; Lin, X.; Lee, H. Learning to generate long-term future via hierarchical prediction. In Proceedings of the International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; PMLR: Sydney, Australia, 2017; pp. 3560–3569. [Google Scholar]
- Xing, Y.; He, Y.; Tian, Z.; Wang, X.; Chen, Q. Seeing and hearing: Open-domain visual-audio generation with diffusion latent aligners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 7151–7161. [Google Scholar]
- Ruan, L.; Ma, Y.; Yang, H.; He, H.; Liu, B.; Fu, J.; Yuan, N.J.; Jin, Q.; Guo, B. Mm-diffusion: Learning multi-modal diffusion models for joint audio and video generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 10219–10228. [Google Scholar]
- Kushwaha, S.S.; Tian, Y. Vintage: Joint video and text conditioning for holistic audio generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 13529–13539. [Google Scholar]
- Huang, Z.; He, Y.; Yu, J.; Zhang, F.; Si, C.; Jiang, Y.; Zhang, Y.; Wu, T.; Jin, Q.; Chanpaisit, N.; et al. Vbench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 21807–21818. [Google Scholar]
- Alanazi, S.; Asif, S.; Caird-daley, A.; Moulitsas, I. Unmasking deepfakes: A multidisciplinary examination of social impacts and regulatory responses. Hum.-Intell. Syst. Integr. 2025, 1–23. [Google Scholar] [CrossRef]
- Ma’arif, A.; Maghfiroh, H.; Suwarno, I.; Prayogi, D.; Lonang, S.; Sharkawy, A.N. Social, legal, and ethical implications of AI-Generated deepfake pornography on digital platforms: A systematic literature review. Soc. Sci. Humanit. Open 2025, 12, 101882. [Google Scholar] [CrossRef]
- Parraga, O.; More, M.D.; Oliveira, C.M.; Gavenski, N.S.; Kupssinskü, L.S.; Medronha, A.; Moura, L.V.; Simões, G.S.; Barros, R.C. Fairness in Deep Learning: A survey on vision and language research. ACM Comput. Surv. 2025, 57, 1–40. [Google Scholar] [CrossRef]
- Kotwal, K.; Marcel, S. Review of demographic fairness in face recognition. IEEE Trans. Biom. Behav. Identity Sci. 2025. [Google Scholar] [CrossRef]
- Ren, K.; Yang, Z.; Lu, L.; Liu, J.; Li, Y.; Wan, J.; Zhao, X.; Feng, X.; Shao, S. Sok: On the role and future of aigc watermarking in the era of gen-ai. arXiv 2024, arXiv:2411.11478. [Google Scholar] [CrossRef]
- Longpre, S.; Mahari, R.; Obeng-Marnu, N.; Brannon, W.; South, T.; Gero, K.I.; Pentland, A.; Kabbara, J. Position: Data Authenticity, Consent, & Provenance for AI are all broken: What will it take to fix them? In Proceedings of the Forty-First International Conference on Machine Learning, Vienna, Austria, 21–27 July 2024. [Google Scholar]
- Mastoi, Q.u.A.; Memon, M.F.; Jan, S.; Jamil, A.; Faique, M.; Ali, Z.; Lakhan, A.; Syed, T.A. Enhancing Deepfake Content Detection Through Blockchain Technology. Int. J. Adv. Comput. Sci. Appl. 2025, 16. [Google Scholar] [CrossRef]
- Hagendorff, T. The ethics of AI ethics: An evaluation of guidelines. Minds Mach. 2020, 30, 99–120. [Google Scholar] [CrossRef]

| Application | Dataset Name | Total Data | Resolution |
|---|---|---|---|
| Facial Expression | Cohn–Kanade (CK+) [28] | 593 video sequences, 123 subjects | 640 × 480 |
| | MEAD [54] | 60 actors, 8 emotions, 3 intensities | 1024 × 1024 |
| | MUG Facial Expression [55] | 1462 sequences, 86 subjects | 896 × 896 |
| | FaceForensics [56] | 1000 videos | 128 × 128 |
| Human Motion | Human3.6M [57] | 3.6 million 3D poses, 11 subjects, 15 actions | 1000 × 1000 (varies) |
| | Penn Action [58] | 2326 video sequences, 15 actions | 640 × 480 |
| | YouTube Dancing [16] | 1500 videos, 15,000 training clips | 256 × 256 or 512 × 512 (varies) |
| | Mannequin Challenge [59] | 3040 training, 292 test sequences | 1024 × 512 |
| Scene and Environment | Cityscapes [32] | 5000 annotated images, 50 cities | 2048 × 1024 |
| | Viper [41] | 77 video sequences, >250,000 frames, 5 conditions | 1920 × 1080 |
| | Apolloscape [60] | 73 scenes, >140,000 frames, 100–1000 frames/scene | 1920 × 1080 |
| | ScanNet [61] | 1513 scans, 2.5 million frames | 1296 × 968 |
| | DAVIS2017 [33] | 150 video sequences, 10,459 frames | 3840 × 2160 (4K), 480p for challenge |
| Medical | Laparoscopic [52] | 4097 video clips, 34 patients | 1920 × 1080 |
| | MRCT [7] | 225 MR, 234 CT images (2D slices) | 256 × 256 |
| General Purpose/Synthetic | UCF-101 [30] | 13,320 video clips, 101 actions | 320 × 240 |
| | MSR-VTT [62] | 10,000 video clips, 200,000 captions | 320 × 240 |
| | Moving MNIST [63] | 10,000 sequences, 20 frames/seq | 64 × 64 |
| | SkyTimelapse [53] | >10,000 videos | 256 × 256 |
| | Volumetric MNIST [64] | 10,000 sequences, 3D volumes | 30 × 84 × 84 or 28 × 28 × 28 |
| Task | Dataset | Method | Metric | Value |
|---|---|---|---|---|
| Urban Scene Segmentation | Cityscapes | Pix2pixHD [31] | FID (↓) | 5.57 |
| | | Vid2vid [1] | FID (↓)/mIoU (↑) | 4.66/61.2 |
| | | World-consistent vid2vid [13] | FID (↓)/mIoU (↑) | 49.89/64.8 |
| Synthetic-to-Real Domain Adaptation | Viper | CycleGAN [5] | mIoU (↑)/PA (↑) | 8.2/54.3% |
| | | Recycle-GAN [14] | mIoU (↑)/PA (↑) | 11.0/61.2% |
| | | MoCycle-GAN [8] | mIoU (↑)/PA (↑) | 13.2/68.1% |
| | | UVIT [18] | mIoU (↑)/PA (↑) | 13.71/68.06% |
| Human Motion Synthesis | Human3.6M | Villegas et al. [73] | PSNR (↑) | 19.2 |
| | | Zhao et al. [10] | PSNR (↑) | 22.6 |
| | UCF-101 | Video Diffusion [26] | FVD (↓) | 171 |
| | | FrameBridge [11] | FVD (↓) | 154 |
| Time Lapse Video Generation | SkyTimelapse | MoCoGAN [3] | FVD (↓) | 206.6 |
| | | StyleGAN-V [20] | FVD (↓) | 79.5 |
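For readers less familiar with the metrics reported above, the sketch below is a minimal illustration of how two of the simpler, frame-level metrics (PSNR and mIoU) are typically computed; it is not the exact evaluation protocol of any cited work. FID and FVD additionally require feature statistics from pretrained Inception/I3D networks and are omitted here; the array inputs and class count are hypothetical stand-ins.

```python
import numpy as np


def psnr(reference: np.ndarray, generated: np.ndarray, max_val: float = 255.0) -> float:
    """Peak Signal-to-Noise Ratio between a reference and a generated frame (higher is better)."""
    mse = np.mean((reference.astype(np.float64) - generated.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10((max_val ** 2) / mse)


def mean_iou(pred_labels: np.ndarray, true_labels: np.ndarray, num_classes: int) -> float:
    """Mean Intersection over Union over semantic classes (higher is better)."""
    ious = []
    for c in range(num_classes):
        pred_c = pred_labels == c
        true_c = true_labels == c
        union = np.logical_or(pred_c, true_c).sum()
        if union == 0:  # class absent from both maps; skip it
            continue
        ious.append(np.logical_and(pred_c, true_c).sum() / union)
    return float(np.mean(ious))


# Toy usage on random data (stand-ins for real frames and segmentation maps).
ref = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)
gen = np.clip(ref + np.random.normal(0, 5, ref.shape), 0, 255).astype(np.uint8)
print(f"PSNR: {psnr(ref, gen):.2f} dB")

seg_true = np.random.randint(0, 19, (256, 512))
seg_pred = seg_true.copy()
seg_pred[:64] = np.random.randint(0, 19, (64, 512))  # corrupt part of the prediction
print(f"mIoU: {mean_iou(seg_pred, seg_true, num_classes=19):.3f}")
```

In the segmentation rows of the table, mIoU and pixel accuracy are computed on label maps predicted by a segmentation network run over translated frames, so the code above would be applied to those predicted maps rather than to raw pixels.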
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).