Dance Motion-Guided Music Generation via Residual Vector Quantization †
Abstract
1. Introduction
- (1) We develop a new dance-to-music generation approach that adopts an RVQ-based encoder–decoder. Unlike existing methods that directly use music waveforms or MIDI, our method instead predicts the indices of latent-space vectors from the RVQ’s codebooks.
- (2)
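The residual quantization idea behind contribution (1) can be sketched as follows: each quantizer stage picks the codebook entry nearest to the current residual, and the model then only has to predict the resulting index sequences. This is a minimal pure-Python illustration with toy, hand-written codebooks; in practice (as in SoundStream-style codecs) the codebooks are learned jointly with the audio autoencoder.

```python
# Minimal residual vector quantization (RVQ) sketch.
# The codebooks here are toy values, not learned ones.

def nearest(vec, codebook):
    """Index of the codebook entry closest to vec (squared L2)."""
    def dist(entry):
        return sum((v - e) ** 2 for v, e in zip(vec, entry))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

def rvq_encode(vec, codebooks):
    """Quantize stage by stage; each stage codes the previous residual."""
    residual = list(vec)
    indices = []
    for cb in codebooks:
        idx = nearest(residual, cb)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, cb[idx])]
    return indices

def rvq_decode(indices, codebooks):
    """Reconstruct by summing the selected entry from every codebook."""
    out = [0.0] * len(codebooks[0][0])
    for idx, cb in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, cb[idx])]
    return out
```

Because each stage refines the residual of the previous one, adding codebooks monotonically improves reconstruction, while the prediction target stays a handful of small discrete index sequences rather than a raw waveform.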
2. Related Work
2.1. Music-to-Dance Generation
2.2. Dance-to-Music Generation
3. Proposed Method
3.1. Dance Dataset
3.2. Data Representations
3.3. Music Encoder and Decoder
3.4. Dance-Based Music Prediction
4. Experimental Section
4.1. Experimental Setup
4.2. Evaluation Metrics
4.3. Results
4.3.1. Beats Coverage Score (Higher = Better)
4.3.2. Beats Hit Score (Higher = Better)
4.4. Ablation Experiments
5. Discussion and Future Work
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Chen, K.; Tan, Z.; Lei, J.; Zhang, S.H.; Guo, Y.C.; Zhang, W.; Hu, S.M. ChoreoMaster: Choreography-Oriented Music-Driven Dance Synthesis. ACM Trans. Graph. 2021, 40, 1–13. [Google Scholar] [CrossRef]
- Ye, Z.; Wu, H.; Jia, J.; Bu, Y.; Chen, W.; Meng, F.; Wang, Y. ChoreoNet: Towards Music to Dance Synthesis with Choreographic Action Unit. In Proceedings of the 28th ACM International Conference on Multimedia; Association for Computing Machinery: New York, NY, USA, 2020; pp. 744–752. [Google Scholar] [CrossRef]
- Zhou, Y.; Li, Z.; Xiao, S.; He, C.; Huang, Z.; Li, H. Auto-Conditioned Recurrent Networks for Extended Complex Human Motion Synthesis. In Proceedings of the ICLR 2018 Conference Track 6th International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
- Huang, R.; Hu, H.; Wu, W.; Sawada, K.; Zhang, M.; Jiang, D. Dance Revolution: Long-Term Dance Generation with Music via Curriculum Learning. In Proceedings of the International Conference on Learning Representations (ICLR 2021), Vienna, Austria, 3–7 May 2021. [Google Scholar]
- Gan, C.; Huang, D.; Chen, P.; Tenenbaum, J.B.; Torralba, A. Foley Music: Learning to Generate Music from Videos. In Proceedings of the European Conference on Computer Vision (ECCV); Springer: Berlin/Heidelberg, Germany, 2020; pp. 758–775. [Google Scholar]
- Aggarwal, G.; Parikh, D. Dance2Music: Automatic Dance-driven Music Generation. arXiv 2021, arXiv:cs.SD/2107.06252. [Google Scholar] [CrossRef]
- Liang, X.; Li, W.; Huang, L.; Gao, C. DanceComposer: Dance-to-Music Generation Using a Progressive Conditional Music Generator. IEEE Trans. Multimed. 2024, 26, 10237–10250. [Google Scholar] [CrossRef]
- Zeghidour, N.; Luebs, A.; Omran, A.; Skoglund, J.; Tagliasacchi, M. SoundStream: An End-to-End Neural Audio Codec. IEEE/ACM Trans. Audio Speech Lang. Proc. 2021, 30, 495–507. [Google Scholar] [CrossRef]
- Zhu, Y.; Wu, Y.; Olszewski, K.; Ren, J.; Tulyakov, S.; Yan, Y. Discrete Contrastive Diffusion for Cross-Modal Music and Image Generation. In Proceedings of the Eleventh International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023; pp. 7828–7853. [Google Scholar]
- Zhu, Y.; Olszewski, K.; Wu, Y.; Achlioptas, P.; Chai, M.; Yan, Y.; Tulyakov, S. Quantized GAN for Complex Music Generation from Dance Videos. In Proceedings of the 17th European Conference on Computer Vision (ECCV 2022), Tel Aviv, Israel, 23–27 October 2022; pp. 182–199. [Google Scholar]
- Li, S.; Dong, W.; Zhang, Y.; Tang, F.; Ma, C.; Deussen, O.; Lee, T.Y.; Xu, C. Dance-to-Music Generation with Encoder-based Textual Inversion. In Proceedings of SIGGRAPH Asia 2024 Conference Papers; Association for Computing Machinery: New York, NY, USA, 2024; pp. 1–11. [Google Scholar]
- Li, R.; Yang, S.; Ross, D.A.; Kanazawa, A. AI Choreographer: Music Conditioned 3D Dance Generation with AIST++. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV); IEEE: New York, NY, USA, 2021; pp. 13381–13392. [Google Scholar] [CrossRef]
- Lee, H.Y.; Yang, X.; Liu, M.Y.; Wang, T.C.; Lu, Y.D.; Yang, M.H.; Kautz, J. Dancing to Music. In Proceedings of the Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2019; pp. 3581–3591. [Google Scholar]
- Zhuang, W.; Wang, C.; Chai, J.; Wang, Y.; Shao, M.; Xia, S. Music2Dance: DanceNet for Music-Driven Dance Generation. ACM Trans. Multimed. Comput. Commun. Appl. 2022, 18, 1–21. [Google Scholar] [CrossRef]
- Li, B.; Zhao, Y.; Zhelun, S.; Sheng, L. DanceFormer: Music Conditioned 3D Dance Generation with Parametric Motion Transformer. In Proceedings of the AAAI Conference on Artificial Intelligence; AAAI Press: Washington, DC, USA, 2022; Volume 36, pp. 1272–1279. [Google Scholar] [CrossRef]
- Niu, B.; Yang, R.; Zhang, Q.; Zhang, Y.; Fan, Y. CAFE-Dance: A Culture-Aware Generative Framework for Chinese Folk and Ethnic Dance Synthesis via Self-Supervised Cultural Learning. Big Data Cogn. Comput. 2025, 9, 307. [Google Scholar] [CrossRef]
- Li, X.; Li, R.; Fang, S.; Xie, S.; Guo, X.; Zhou, J.; Peng, J.; Wang, Z. Music-Aligned Holistic 3D Dance Generation via Hierarchical Motion Modeling. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV); IEEE: New York, NY, USA, 2025; pp. 14420–14430. [Google Scholar]
- Dong, B.; Lei, W.; Liu, L. FFD: Fine-Finger Diffusion Model for Music to Fine-grained Finger Dance Generation. In Proceedings of the Interspeech 2025; International Speech Communication Association (ISCA): Grenoble, France, 2025. [Google Scholar]
- Chen, Z.; Xu, H.; Song, G.; Xie, Y.; Zhang, C.; Chen, X.; Wang, C.; Chang, D.; Luo, L. X-Dancer: Expressive Music to Human Dance Video Generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV); IEEE: New York, NY, USA, 2025; pp. 10602–10611. [Google Scholar]
- Wang, H.; Song, Y.; Jiang, W.; Wang, T. A Music-Driven Dance Generation Method Based on a Spatial-Temporal Refinement Model to Optimize Abnormal Frames. Sensors 2024, 24, 588. [Google Scholar] [CrossRef] [PubMed]
- Han, B.; Li, Y.; Shen, Y.; Ren, Y.; Han, F. Dance2MIDI: Dance-driven multi-instrument music generation. Comput. Vis. Media 2024, 10, 791–802. [Google Scholar] [CrossRef]
- Sun, C.; Liu, G.; Fleming, C.; Yan, Y. Enhancing Dance-to-Music Generation via Negative Conditioning Latent Diffusion Model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2025; pp. 8321–8330. [Google Scholar]
- Lin, S.; Zukerman, M.; Yan, H. Dance to Music Generation Based on Residual Vector Quantization. In Proceedings of the IEEE BigData 2025; IEEE: New York, NY, USA, 2025; pp. 5175–5178. [Google Scholar]
- Ronchi, M.R.; Perona, P. Benchmarking and Error Diagnosis in Multi-instance Pose Estimation. In Proceedings of the IEEE International Conference on Computer Vision, ICCV 2017; IEEE: New York, NY, USA, 2017; pp. 369–378. [Google Scholar] [CrossRef]
- Loper, M.; Mahmood, N.; Romero, J.; Pons-Moll, G.; Black, M.J. SMPL: A Skinned Multi-Person Linear Model. ACM Trans. Graph. 2015, 34, 1–16. [Google Scholar] [CrossRef]
- Ho, C.; Tsai, W.T.; Lin, K.S.; Chen, H.H. Extraction and alignment evaluation of motion beats for street dance. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing; IEEE: New York, NY, USA, 2013; pp. 2429–2433. [Google Scholar]







| Method | Scope | Dataset | Limitations | Advantages |
|---|---|---|---|---|
| Dance2Music [6] | Dance2Music | AIST++ | Similarity-based method | Online generation |
| Foley Music [5] | Video2MIDI | Foley Music | Only MIDI | MIDI can generate different styles of music |
| D2M-GAN [10] | Video2Music | AIST++ and TikTok Dance-Music | Slow inference speed | Style control |
| CDCD [9] | Video2Music and Music2Video | AIST++ and TikTok Dance-Music | Slow inference speed | Both Video2Music and Music2Video |
| Wang et al. [20] | Music2Dance | DanceIt | Small dataset | Fixes abnormal frames |
| Niu et al. [16] | Music2Dance | Niu et al. | Chinese folk and ethnic dance only | First culture-aware framework |
| SoulNet [17] | Music2Dance | SoulDance | Large model size | Motions include hand and face |
| Dong et al. [18] | Music2Dance | DanceFingers-4K | Large model size | Finger movement |
| X-Dancer [19] | Music2Video | X-Dancer | Large model size | Generates background |
| Sun et al. [22] | Dance2Music | AIST++ and TikTok Dance-Music | Large model size | Uses both positive and negative conditioning |
| Module | Input Dim. | Output Dim. | Num. Layers | Parameter Size |
|---|---|---|---|---|
| Pose and Beat Projection | 73 | 256 | 1 | 18.94 K |
| Multi-Head Self-Attention Encoder | 256 | 256 | 2 | 1.57 M |
| LSTM Decoder | 256 | 1024 | 3 | 16.09 M |
| RVQ Classification Heads | 1024 | 1024 | 8 | 8.40 M |
| Total | – | – | – | 26.08 M |
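As a sanity check on the parameter counts in the table above, a dense layer with `in_dim` inputs and `out_dim` outputs has `in_dim × out_dim` weights plus `out_dim` biases. For the 73→256 projection this gives 18,944 ≈ 18.94 K, and eight 1024→1024 RVQ classification heads give about 8.40 M, both matching the table. The encoder and LSTM counts additionally depend on internals not stated here (feed-forward width, gate layout), so only the two plain linear modules are checked; the bias-included convention is an assumption.

```python
def linear_params(in_dim, out_dim, bias=True):
    """Parameter count of one dense layer: weights plus optional biases."""
    return in_dim * out_dim + (out_dim if bias else 0)

proj = linear_params(73, 256)           # pose-and-beat projection, 73 -> 256
heads = 8 * linear_params(1024, 1024)   # eight RVQ classification heads
print(proj, heads)                      # 18944 8396800
```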
| Method | Beats Coverage Score ↑ | Beats Hit Score ↑ |
|---|---|---|
| Dance2Music [6] | 83.5 | 82.4 |
| Foley Music [5] | 74.1 | 69.4 |
| D2M-GAN [10] | 88.2 | 84.7 |
| CDCD [9] | 89.0 | 83.8 |
| Ours | 89.6 | 85.2 |
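The two beat-alignment metrics in the tables can be computed from detected beat times. In the dance-to-music literature, Beats Coverage is commonly the fraction of ground-truth beats matched by a generated beat within some tolerance, and Beats Hit the fraction of generated beats that land near a ground-truth beat. The sketch below follows that common convention; it is an assumption rather than this paper's exact definition, and the 0.1 s tolerance is an arbitrary illustrative choice.

```python
def matched(beats, refs, tol=0.1):
    """Count beats that fall within tol seconds of some reference beat."""
    return sum(1 for b in beats if any(abs(b - r) <= tol for r in refs))

def beat_scores(gen_beats, gt_beats, tol=0.1):
    """(coverage, hit) as percentages under the convention above."""
    coverage = 100.0 * matched(gt_beats, gen_beats, tol) / len(gt_beats)
    hit = 100.0 * matched(gen_beats, gt_beats, tol) / len(gen_beats)
    return coverage, hit
```

For example, with ground-truth beats at [0.5, 1.0, 1.5, 2.0] s and generated beats at [0.52, 1.04, 1.95] s, coverage is 75% (the 1.5 s beat is missed) while hit is 100% (every generated beat is near some ground-truth beat).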
| Method | Beats Coverage Score ↑ | Beats Hit Score ↑ |
|---|---|---|
| Dance-to-Music [11] | 47.6 | |
| Ours | 38.4 | |
| Method | Beats Coverage Score ↑ | Beats Hit Score ↑ |
|---|---|---|
| Full model (5 × 10−5) | 89.1 | 84.9 |
| Full model (1 × 10−5) | | |
| Full model (5 × 10−4) | 88.6 | 84.8 |
| Full model without KB (5 × 10−5) | 88.9 | 83.2 |
| Full model without KB (1 × 10−5) | 89.0 | 84.5 |
| Full model without KB (5 × 10−4) | 88.3 | 84.4 |
| Method | Beats Coverage Score ↑ | Beats Hit Score ↑ |
|---|---|---|
| Full model (4 codebooks) | 72.2 | 70.9 |
| Full model (8 codebooks) | 89.6 | 85.2 |
| Full model (12 codebooks) | 89.4 | 85.3 |
| Method | Beats Coverage Score ↑ | Beats Hit Score ↑ | Test Accuracy ↑ |
|---|---|---|---|
| model with GRU | 85.7 | 82.9 | 72.4% |
| model with LSTM | 89.6 | 85.2 | 75.2% |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Lin, S.; Zukerman, M.; Yan, H. Dance Motion-Guided Music Generation via Residual Vector Quantization. Electronics 2026, 15, 2098. https://doi.org/10.3390/electronics15102098

