Article

Dance Motion-Guided Music Generation via Residual Vector Quantization †

Department of Electrical Engineering, The Center for Intelligent Multidimensional Data Analysis, City University of Hong Kong, Kowloon, Hong Kong SAR, China
* Author to whom correspondence should be addressed.
This paper is an extended version of our paper published in 2025 IEEE International Conference on Big Data.
Electronics 2026, 15(10), 2098; https://doi.org/10.3390/electronics15102098
Submission received: 12 April 2026 / Revised: 8 May 2026 / Accepted: 12 May 2026 / Published: 14 May 2026

Abstract

Music is a traditional and important element in human entertainment. We propose a novel deep learning-based method for generating music from human dance movements. We employ Residual Vector Quantization (RVQ) for music tokenization, using RVQ indices as the music representation, thereby reducing learning complexity. A cross-modal generation model integrating LSTM and attention mechanisms is designed to generate RVQ codebook indices from dance motions. Finally, the indices can be converted back to music waveforms through a music decoder. Experimental results demonstrate the feasibility of generating coherent music that aligns with dance dynamics, providing a new approach for cross-modal entertainment content creation.

1. Introduction

Dance, as a form of art that uses the human body as a medium to express emotions and ideas, has a long-standing relationship with music. Traditionally, choreographers and composers work together to create a harmonious combination of dance movements and music. However, this process is often time-consuming and requires a high level of professional skill and creativity from both parties. Machine learning has shown remarkable capabilities in pattern recognition, prediction, and generation tasks in recent years. Using machine learning, researchers have made significant progress in choreography generation [1,2,3,4], but these works mainly focus on dance motion generation. In contrast, little work has been devoted to generating corresponding accompaniment music from dance movements. Gan et al. [5] introduced a framework for Musical Instrument Digital Interface (MIDI) file prediction, but audio generated from MIDI files relies heavily on external synthesizers and lacks acoustic richness. Aggarwal et al. [6] put forward two distinct approaches for dance-to-music generation, catering to both offline and real-time use cases: a search-based offline strategy and a deep neural network-driven online approach that, unlike the offline mode, generates music “on-the-fly” as the dance video plays. They select a dance similarity matrix as input and note sequences as output, but this design limits the model’s ability to align music dynamics and style with the expressive details of the dance. Liang et al. [7] proposed a dance-to-music framework that generates rhythmically and stylistically consistent multi-track music from dance videos.
Residual Vector Quantization (RVQ), originally applied in the field of music recovery and generation [8], has been demonstrated in numerous experiments to be an efficient model.
In this paper, we propose a new background music generation method built on a music RVQ encoder–decoder framework. Specifically, the RVQ module can ensure the required bit rate of the output audio while compressing the continuous, high-dimensional audio waveforms into discrete vectors sampled from a fixed codebook. By doing so, it avoids the difficulty of modeling infinitely continuous features in audio spaces and instead allows us to generate discrete, finite RVQ codebook vectors. On the motion side, rather than using raw video frames as input, we adopt human keypoint representations, which similarly substantially reduce the difficulty of model training compared to video-based motion generation by leveraging the structured, low-dimensional nature of keypoint features to avoid redundant computational overhead.
For experimental data, we leverage the AIST++ dataset, where human dance is represented via sequences of 3D human-skeleton keypoints, and background music is represented as waveforms. The key contributions of this work are summarized as follows:
(1)
We develop a new dance-to-music generation approach that adopts an RVQ-based encoder–decoder. Unlike existing methods that directly use music waveforms or MIDI, our method instead predicts the indices of latent-space vectors from the RVQ’s codebooks.
(2)
We perform a comparative analysis of our method against other state-of-the-art approaches [5,6,9,10,11]. Quantitative evaluation results confirm that our method outperforms these existing alternatives.
The remainder of this paper is organized as follows. Section 2 reviews existing methods from related publications. In Section 3, we present the details of the new method to generate music based on dance motion. Section 4 contains extensive quantitative results of its comparison against existing methods and a discussion.

2. Related Work

In the field of dance and music generation, existing approaches primarily revolve around two core directions, both centered on the interplay between two key components: dance movements and background music. One direction focuses on generating specific dance movements from given background music, while the other aims to create background music that aligns with pre-existing dance movements.

2.1. Music-to-Dance Generation

Recurrent Neural Networks (RNNs) are naturally suited for motion generation tasks due to their ability to model sequential dependencies by updating hidden states with historical input information. Specifically, an RNN predicts the next dance frame by fusing its previous output and current hidden state [3,4].
In the conventional RNN training phase, the RNN relies on a strategy called teacher forcing: the RNN is recursively given a sequence of ground-truth motion data instead of its own previous outputs. This method increases training speed by using ground-truth data, but leads to error accumulation in the test phase because the RNN has no access to the ground truth at that stage [3]. Zhou et al. [3] addressed the error accumulation issue in conventional RNN-based motion generation by proposing an Auto-Conditioned Recurrent Neural Network (acRNN). A core innovation of this framework lies in its training strategy: instead of relying entirely on ground-truth motion data (as in standard teacher forcing), acRNN periodically replaces a portion of the ground-truth inputs with the model’s own generated outputs during the training process. Critically, the proportion of these self-generated inputs is not fixed; it is gradually increased as training progresses. Huang et al. [4] followed Zhou et al. [3] and used a dynamic auto-condition learning approach to address the RNN’s error accumulation, so that their neural network can generate long-term dance movements without freezing motion.
Li et al. [12] proposed the largest public dance–music paired dataset, AIST++. Unlike earlier datasets [4,13], which were limited to one-to-one dance–music mappings, AIST++ supports many-to-many pairings, allowing the same music clip to correspond to multiple distinct dance sequences. Li et al. [12] proposed the Full Attention Cross-Modal Transformer (FACT), which uses a short motion seed and musical data to synthesize future dance motions so that the network can learn the flexible relationships between musical data and dance. To address the common issue of motion freezing, its decoder predicts multiple consecutive frames (not just one) per step. This setup lets FACT learn flexible music–dance links and generate coherent dance motions.
Zhuang et al. [14] first extract the musical rhythm and melody and additionally incorporate a musical style classifier to capture the style of the track; the musical style, melody, and rhythm are then applied as explicit control signals.
These methods generate dance motions frame by frame, treating dance as a sequence of poses and training neural networks to learn pose-to-pose relationships. However, a short dance consists of hundreds of poses, making this approach prone to error accumulation and computationally expensive. Notably, recent approaches have begun transforming continuous dance sequences into discrete units to reduce the network’s learning burden. For instance, Ye et al. [2] decomposed dance into indivisible choreographic action units (CAUs) inspired by the human choreographers’ procedure. Lee et al. [13] used a VAE-GAN framework to model CAUs and their temporal organization; Li et al. [15] also simplified learning via a two-stage keyframe-based method and used splines to model dance motions. Across all music-to-dance generation methods, aligning dance rhythm with musical rhythm is a core focus. This is reflected in designs such as Chen et al.’s [1] embedding module, which is built to explicitly capture music–dance connections.
Niu et al. [16] introduced a culture-aware framework for generating Chinese ethnic folk dances and constructed the Helouwu ethnic dance dataset. They use cultural embedding vectors derived from cultural knowledge to model cultural semantics.
Li et al. [17] proposed the SoulNet and SoulDance dataset. SoulNet leverages a hierarchical Residual Vector Quantization (RVQ) architecture to model fine-grained motion dependencies across body, hand, and facial movements. Their SoulDance dataset contains high-quality motion capture data with hand and face movements.
Dong et al. [18] proposed an end-to-end framework for music-to-finger-dance generation. Previous dance generation methods typically ignore finger movements. They used a diffusion model to align the finger movement and the music. They established the large-scale finger-dance dataset (DanceFingers-4K).
Chen et al. [19] proposed a transformer–diffusion framework that can generate human image animation based on a single image and music. The output image contains a consistent background scene. Their own dataset contains the full- and half-body dance.
Wang et al. [20] developed a refinement learning model for generating music from video. Their framework contains a cross-modal alignment model for audio–pose relationship learning and a spatial refinement model for natural transitional dance movement.

2.2. Dance-to-Music Generation

Notably, Chuang et al. [5] introduced a framework for generating music. Their framework is designed to predict MIDI events by analyzing video footage of individuals playing musical instruments. To achieve this, they employed a pre-trained keypoints extraction network to extract 2D coordinates of human body keypoints from a video. Specifically, the network outputs the spatial positions of anatomically meaningful keypoints capturing the core structural topology of the human body in each frame. This keypoint set includes 25 markers capturing movements of the limbs and trunk, along with 24 keypoints per hand to detail fine-grained finger and wrist motions, which are critical for interpreting instrumental playing gestures. Unlike raw images, which contain abundant redundant information, keypoints are a more compact, motion-centric, and noise-robust representation for dance motion analysis and cross-modal generation tasks. These spatial features were then processed using a Graph Convolutional Neural Network (Graph CNN) architecture, which effectively models the relational dynamics between keypoints to generate meaningful MIDI tokens. Han et al. [21] proposed a dance-to-MIDI dataset called D2MIDI containing multi-instrument music and paired dance motion. Together with the dataset, Han et al. developed a deep learning framework for dance-to-music generation. Specifically, their sequentially structured framework contains three cascaded components: a motion encoder, a drum rhythm generator, and a MIDI generator. In parallel, Aggarwal et al. [6] addressed music generation for dance videos through two distinct approaches: a search-based offline method and a real-time online framework. The offline approach operates by first processing the entire dance video to generate cohesive music, leveraging the similarity matrices of pose and music to ensure structural consistency. Conversely, the online approach employs a Long Short-Term Memory (LSTM) network to extract sequential music features from MIDI files, capturing temporal dependencies in musical patterns, and a Convolutional Neural Network (CNN) to derive motion features from pose-similarity matrices that quantify relational dynamics between consecutive poses. This combination enables the model to generate the next frame of music in real time as the dance video progresses, maintaining synchronization with ongoing movements. Liang et al. [7] proposed a framework for multi-track music generation. Similar to the work of Chuang et al., they also use 2D keypoints extracted from video as the input to generate a MIDI file.
Li et al. [11] leverage arbitrary pre-trained text-to-music models to control the emotion and genre of the generated music. They utilize 2D human motion extracted from the videos. Zhu et al. further advanced this field with two key contributions: Dance2Music-GAN (D2M-GAN) [10] and Conditional Discrete Contrastive Diffusion (CDCD) [9]. D2M-GAN integrates RVQ into both its audio encoder and motion encoder, transforming high-dimensional, complex input data (audio signals and motion sequences) into discrete indices mapped to an RVQ codebook. This discretization simplifies cross-modal alignment by reducing data dimensionality. Similarly, CDCD adopts RVQ for extracting intermediate features, a strategy that alleviates the learning burden by constraining the feature space to a manageable set of discrete representations. A key distinction lies in their generation mechanisms: unlike D2M-GAN, which relies on a GAN-based framework, CDCD employs a discrete diffusion model to generate dance background music, a choice that often enhances the temporal coherence and diversity of the output.
Sun et al. [22] use both the normal forward-played dance videos and their reverse-played counterparts to better realize the temporal correlation and rhythmic consistency between dance videos and music.
In Table 1, we provide a summary of representative and recent related studies.
This paper extends our previously published four-page conference paper [23]. In this extended version, we provide a comprehensive description of our methodological approach and present additional experimental results that further validate and enrich our findings. These new results not only complement the initial analyses but also provide a broader context for understanding the implications of our research. By elaborating on key aspects of our methodology and incorporating further empirical evidence, we aim to contribute a more robust and nuanced perspective to the ongoing discourse in our field.

3. Proposed Method

In this section, we describe the details of our method. The entire process consists of two main stages. We first train a music generation network based on RVQ, and then we utilize the indices of latent-space vectors from RVQ’s codebooks as music features to train a network with a self-attention-based encoder and an LSTM decoder that learns the relationship between input dance sequences and output index sequences. The framework of our method is shown in Figure 1 and Figure 2 below.

3.1. Dance Dataset

We train our model by using the dance dataset, AIST++ [12]. AIST++ stands as one of the largest publicly available human dance motion datasets, encompassing 1408 sequences of 3D human dance motions, each paired with corresponding music data to support cross-modal music-to-dance research. For human pose representation, the dataset utilizes a widely used human skeleton format, which is composed of bones and joints. Joints mark key body points, and bones serve as links between distinct joints (see Figure 3 for illustration). It includes two types of human skeletons to accommodate different research needs: one follows the Common Objects In Context (COCO) format with 17 joints [24], and the other uses the Skinned Multi-Person Linear Model (SMPL) format with 24 joints [25]. To enable systematic cross-modal analysis between human motion and music data, Li et al. [12] processed AIST++ into non-overlapping subsets. Specifically, they selected 998 sequences from the full dataset and split this subset into separate training and testing datasets, laying a standardized foundation for evaluating models trained on music-to-dance generation tasks.

3.2. Data Representations

A dance motion sequence in AIST++, denoted as $S$, is a time series of human poses over time:
$$S = \{P_1, P_2, \ldots, P_T\},\qquad(1)$$
where $T$ is the total number of frames in the sequence, and $P_t$ represents the pose at time frame $t$. Each pose $P_t$ is defined by the 3D displacements of 24 joints relative to the previous frame $t-1$:
$$P_t = \left\{\Delta J_k(t)\right\}_{k = 1, 2, \ldots, 24}.\qquad(2)$$
Here, $\Delta J_k(t)$ denotes the 3D displacement of the $k$-th joint at frame $t$ relative to frame $t-1$, calculated as
$$\Delta J_k(t) = J_k(t) - J_k(t-1) = \left(\Delta x_k(t), \Delta y_k(t), \Delta z_k(t)\right) \in \mathbb{R}^3,\qquad(3)$$
where $J_k(t)$ and $J_k(t-1)$ are the 3D coordinates of the $k$-th joint at frames $t$ and $t-1$, respectively, and $\Delta x_k(t), \Delta y_k(t), \Delta z_k(t)$ are the displacement components in the three spatial dimensions. We show the structure of the SMPL skeleton, SMPL joint names, and indices in Figure 3.
The initial displacement (at $t = 1$) is set to zero for all joints:
$$\Delta J_k(1) = \mathbf{0} = (0, 0, 0) \in \mathbb{R}^3,\quad k = 1, 2, \ldots, 24,\qquad(4)$$
where $\mathbf{0}$ denotes the zero vector in 3D space.
In matrix form, the entire motion sequence can be compactly represented with the dimensionality
$$S \in \mathbb{R}^{T \times 72},\qquad(5)$$
where the second dimension in Equation (5) corresponds to the product of the number of joints and the 3 spatial coordinates.
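To make the representation concrete, the following minimal Python/NumPy sketch shows how the displacement features of Equations (3)-(5) could be computed from an array of absolute SMPL joint positions. The array layout and function name are illustrative assumptions, not the released preprocessing code.

```python
import numpy as np

def motion_displacements(joints: np.ndarray) -> np.ndarray:
    """Convert absolute 3D joint positions into per-frame displacements.

    joints: array of shape (T, 24, 3) holding the SMPL joint coordinates
        J_k(t) for every frame.
    Returns S of shape (T, 72); row t stacks Delta J_k(t) for all 24 joints,
        with the first row set to zero as in Equation (4).
    """
    deltas = np.zeros_like(joints)              # Delta J_k(1) = 0
    deltas[1:] = joints[1:] - joints[:-1]       # J_k(t) - J_k(t-1)
    return deltas.reshape(joints.shape[0], -1)  # (T, 24 * 3) = (T, 72)
```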

3.3. Music Encoder and Decoder

The music encoder and decoder implement a critical step in converting raw acoustic signals into discrete representations suitable for cross-modal alignment with dance motions. The original audio data from the AIST++ dataset is sampled at 44.1 kHz, but to ensure temporal synchronization with the dance motion sequences, we downsample the audio to 24 kHz. This adjustment facilitates precise timestamp alignment between the audio and motion data streams, a key prerequisite for coherent cross-modal modeling.
For the audio encoding and decoding process, we adopt the SoundStream architecture [8], a state-of-the-art neural audio codec known for its efficiency in capturing perceptual audio features. As shown in Figure 1, the encoder component of our framework comprises four consecutive convolution blocks, each configured with specific stride parameters: [ 2 ,   4 ,   5 ,   10 ] . The decoder block is symmetric to the encoder architecture, comprising a transposed convolution followed by the corresponding four residual units. The same strides as those used in the encoder are used in reverse order to reconstruct a waveform with the same resolution as the input. These stride values are strategically chosen to progressively reduce the temporal resolution of the input waveform while expanding the feature dimension, enabling the model to capture both local acoustic details and global structural patterns.
The encoder generates an embedding vector every 400 sample points within the 24 kHz audio stream. This results in a rate of 60 embedding vectors per second (calculated as 24 , 000 samples / second ÷ 400 samples / vector ), which is deliberately synchronized with the sampling rate of the dance motion data. This temporal alignment ensures that each motion frame corresponds to a consistent number of audio embeddings, laying the foundation for effective cross-modal feature fusion and subsequent music generation conditioned on dance movements.
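The relationship between the encoder strides and the token rate is simple arithmetic; the short sketch below only verifies that the strides (2, 4, 5, 10) imply one embedding per 400 samples and therefore 60 embeddings per second at 24 kHz.

```python
strides = (2, 4, 5, 10)        # encoder block strides (Section 3.3)
hop = 1
for s in strides:
    hop *= s                   # overall temporal downsampling factor
assert hop == 400              # one embedding every 400 audio samples

sample_rate = 24_000           # audio resampled from 44.1 kHz to 24 kHz
embeddings_per_second = sample_rate // hop
assert embeddings_per_second == 60   # matches the 60 fps motion data
```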
Let the sequence of embedding vectors generated by the music encoder be denoted as
$$E = \{e_1, e_2, \ldots, e_N\},\qquad(6)$$
where $e_n \in \mathbb{R}^d$ represents the $d$-dimensional embedding vector corresponding to the $n$-th time step in the audio signal.
The RVQ model employs a stack of $L$ codebooks, defined as
$$\mathcal{C} = \{C_1, C_2, \ldots, C_L\},\qquad(7)$$
where $C_l = \{c_{l,1}, c_{l,2}, \ldots, c_{l,M_l}\}$ denotes the $l$-th codebook containing $M_l$ codewords, with $c_{l,m} \in \mathbb{R}^d$ representing the $m$-th $d$-dimensional codeword in the $l$-th codebook.
For each embedding vector $e_n$ in the sequence, RVQ performs hierarchical quantization to produce a tuple of indices:
$$q_n = \mathrm{RVQ}(e_n; \mathcal{C}) = (q_{n,1}, q_{n,2}, \ldots, q_{n,L}),\qquad(8)$$
where $q_{n,l} \in \{1, 2, \ldots, M_l\}$ is the index of the selected codeword from the $l$-th codebook for the $n$-th embedding vector. Specifically, the RVQ block outputs the codeword closest to the input vector via querying the codebook, where the codeword is selected based on the minimum distance criterion between the input vector and all candidate codewords in the codebook. The residual between the input and the output is then passed as input to the next codebook. These cascaded codebooks gradually refine the residual error and capture increasingly fine-grained features of the original input vector. Through this layered, sequential quantization process, the RVQ block achieves high-fidelity representation of the input while maintaining a compact codebook for each layer, effectively balancing quantization accuracy and computational efficiency.
As a result, the entire embedding sequence $E$ is converted into a sequence of index tuples:
$$Q = \{q_1, q_2, \ldots, q_N\},\qquad(9)$$
where $Q$ represents the quantized sequence of codebook indices generated by RVQ from the original music embedding sequence. We also use the same STFT-based discriminator, multi-scale discriminator, and multi-period discriminator in SoundStream [8] for training.
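For readers unfamiliar with RVQ, the following simplified NumPy sketch illustrates the cascaded nearest-codeword lookup and residual propagation of Equation (8). It is a conceptual illustration only; the actual model uses the trained SoundStream quantizer [8], and the function and variable names here are hypothetical.

```python
import numpy as np

def rvq_encode(e: np.ndarray, codebooks: list[np.ndarray]) -> list[int]:
    """Quantize one embedding e (shape (d,)) with L cascaded codebooks.

    codebooks[l] has shape (M_l, d). Returns the index tuple
    (q_1, ..., q_L) of Equation (8)."""
    residual = e.copy()
    indices = []
    for C in codebooks:
        # nearest codeword under Euclidean distance
        dists = np.linalg.norm(C - residual, axis=1)
        idx = int(np.argmin(dists))
        indices.append(idx)
        # pass the quantization error on to the next codebook
        residual = residual - C[idx]
    return indices

def rvq_decode(indices: list[int], codebooks: list[np.ndarray]) -> np.ndarray:
    """Reconstruct the embedding as the sum of the selected codewords."""
    return sum(C[i] for C, i in zip(codebooks, indices))
```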

3.4. Dance-Based Music Prediction

Inspired by DanceRevolution [4], we employ an encoder based on a local self-attention mechanism, where the size of the receptive field is controlled by a window length w. This design allows the model to focus on local temporal patterns within the dance motion sequence while maintaining computational efficiency, as the attention operation is constrained within a sliding window of length w rather than being applied globally across the entire sequence. Such a strategy enables effective capture of short-to-medium range motion dependencies, which is critical for modeling rhythmic correlations between consecutive dance movements, while avoiding the quadratic complexity associated with full self-attention.
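As an illustration of the windowed attention idea, the sketch below builds a sliding-window attention mask of width w and applies it with a standard PyTorch multi-head attention layer. The exact masking scheme used in the encoder may differ, so this should be read as one plausible realization under assumed window and sequence sizes rather than the definitive implementation.

```python
import torch

def local_attention_mask(T: int, w: int) -> torch.Tensor:
    """Boolean mask of shape (T, T): frame t may only attend to frames within
    a window of length w centred on t; True marks blocked pairs."""
    idx = torch.arange(T)
    dist = (idx[None, :] - idx[:, None]).abs()
    return dist > w // 2          # True = masked out

# Example: pass the mask to a standard multi-head attention layer.
attn = torch.nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)
x = torch.randn(1, 240, 256)      # 4 s of motion at 60 fps (illustrative)
mask = local_attention_mask(240, w=60)
out, _ = attn(x, x, x, attn_mask=mask)
```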
The encoder is designed to transform two critical motion components (the displacement of each frame and kinematic beats) into a sequence of hidden variables that capture both temporal rhythm and spatial movement features. Formally, let the motion displacement sequence, formed by concatenating kinematic beats, be denoted as D = { Δ J 1 , Δ J 2 , , Δ J T , b k i n e m a t i c } . The b k i n e m a t i c , a key rhythmic marker, is the kinematic beat obtained according to [26], which is widely applied in works focused on music-driven dance synthesis, such as dancing to music [13]. Given that dancers typically execute significant movements during motion beat frames, we select frames with rapid direction changes, zero velocity, and peak angle values. The direction changes of all joints are determined by calculating the average angular difference in the velocity of each joint between two consecutive frames.
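The following heuristic sketch (NumPy, with hypothetical threshold values) illustrates how kinematic beat candidates could be derived from the per-joint displacements: it computes the average angular change of the joint velocities between consecutive frames and flags frames with sharp direction changes or near-zero velocity. The kinematic beats used in our experiments follow the procedure of [26]; this sketch is only an approximation of that idea.

```python
import numpy as np

def kinematic_beats(deltas: np.ndarray, angle_thresh: float = 1.0,
                    speed_thresh: float = 1e-3) -> np.ndarray:
    """Rough kinematic-beat indicator per frame.

    deltas: (T, 24, 3) per-joint velocities Delta J_k(t).
    Returns a binary vector b of length T, where b[t] = 1 marks a candidate
    beat frame (sharp direction change or near-zero velocity)."""
    speed = np.linalg.norm(deltas, axis=2)                   # (T, 24)
    # cosine of the angle between consecutive velocity vectors, per joint
    dot = (deltas[1:] * deltas[:-1]).sum(axis=2)             # (T-1, 24)
    denom = speed[1:] * speed[:-1] + 1e-8
    angle = np.arccos(np.clip(dot / denom, -1.0, 1.0))       # (T-1, 24)
    mean_angle = np.concatenate([[0.0], angle.mean(axis=1)]) # (T,)
    beats = ((mean_angle > angle_thresh) |
             (speed.mean(axis=1) < speed_thresh)).astype(np.float32)
    return beats
```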
The encoder maps this sequence to a hidden variable sequence:
$$H = \mathrm{Encoder}(D) = \{h_1, h_2, \ldots, h_T\},\qquad(10)$$
where $h_t \in \mathbb{R}^d$ denotes the hidden variable corresponding to frame $t$, and $d$ is the dimension of the latent space.
The decoder, based on a Long Short-Term Memory (LSTM) architecture, leverages these hidden variables to predict the codebook indices. Specifically, the LSTM decoder processes the hidden sequence $H$ in a sequential manner and outputs a classification vector, which is subsequently mapped to codebook indices through a softmax layer. The output layer contains $L$ independent classification heads, and each head is solely responsible for predicting the index of one codebook. This avoids mutual interference between indices of different codebooks and improves prediction accuracy. Here $\hat{p}_{t+1}$ is the predicted probability distribution over all possible codeword indices at frame $t+1$:
$$\hat{p}_{t+1} = \mathrm{LSTM\text{-}Decoder}(h_t, \hat{p}_t, s_t),\qquad(11)$$
where $s_t$ represents the hidden state of the LSTM at step $t$. The prediction at the first step is initialized as
$$\hat{p}_1 = \left(\hat{p}_{1,1}, \hat{p}_{1,2}, \ldots, \hat{p}_{1,L}\right),\qquad(12)$$
where
$$\hat{p}_{1,l} \sim \mathrm{Uniform}(1, M_l),\quad l = 1, 2, \ldots, L.\qquad(13)$$
The decoder initializes with $\hat{p}_1$ to maintain temporal consistency with the input sequence. This autoregressive framework enables the model to capture long-range temporal dependencies in motion dynamics, complementing the local pattern modeling of the encoder.
The index tuple $\hat{q}_n = (\hat{q}_{n,1}, \ldots, \hat{q}_{n,L})$ is then determined by selecting the maximum-probability index for each codebook. We use a cross-entropy loss for training:
$$\mathcal{L}_{\mathrm{dance}} = -\sum_{l=1}^{L} \log p_t^{(l)}(q_l),\qquad(14)$$
where $p_t^{(l)}(q_l)$ is the probability that the $l$-th head assigns to the ground-truth index $q_l$ at frame $t$.
The dance motion encoder adopts 2 layers with a model hidden dimension of 256, 8 attention heads, and a feed-forward dimension of 1024. Each attention head adopts the scaled dot-product attention mechanism, where the query and key feature dimension per head is set to 64, and the value feature dimension per head is also set to 64. The LSTM-based decoder consists of 3 layers with a hidden dimension of 1024. The input pose feature of the dance encoder is 73-dimensional, which is first projected to a 256-dimensional space. The final output contains 8 independent classification heads, each predicting 8-level RVQ codebook indices with 1024 categories. Table 2 summarizes the architectural details and parameter size of our dance-based music prediction network.
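A compact PyTorch sketch of this prediction network, using the layer sizes from Table 2, is given below. The sliding-window attention mask and the autoregressive feedback of the previous prediction are omitted for brevity, so this is a simplified rendering of the architecture rather than the exact training code; all names are illustrative.

```python
import torch
import torch.nn as nn

class DanceToRVQIndices(nn.Module):
    """Simplified sketch of the dance-based music prediction network (Table 2)."""

    def __init__(self, pose_dim=73, d_model=256, n_heads=8,
                 ff_dim=1024, n_codebooks=8, codebook_size=1024):
        super().__init__()
        self.proj = nn.Linear(pose_dim, d_model)       # pose + beat projection
        enc_layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=ff_dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.decoder = nn.LSTM(d_model, 1024, num_layers=3, batch_first=True)
        # one classification head per RVQ codebook
        self.heads = nn.ModuleList(
            [nn.Linear(1024, codebook_size) for _ in range(n_codebooks)])

    def forward(self, motion):                          # motion: (B, T, 73)
        h = self.encoder(self.proj(motion))             # (B, T, 256)
        z, _ = self.decoder(h)                          # (B, T, 1024)
        return [head(z) for head in self.heads]         # L logit maps (B, T, 1024)

def rvq_index_loss(logits, targets):
    """Sum of cross-entropies over the L codebook heads.
    logits: list of (B, T, 1024); targets: (B, T, L) ground-truth indices."""
    ce = nn.CrossEntropyLoss()
    return sum(ce(logits[l].transpose(1, 2), targets[..., l])
               for l in range(len(logits)))
```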

4. Experimental Section

We compared our method to five neural network-based state-of-the-art methods: Dance2Music [6], Foley Music [5], CDCD [9], D2M-GAN [10], and Dance-to-Music [11].

4.1. Experimental Setup

We first downsample the audio in AIST++ from 44.1 kHz to 24 kHz. There are 60 pieces of audio in AIST++. We use 50 of them for training and 10 for testing. There are 8 codebooks in RVQ, and each codebook has 1024 entries.
The training processes of the music decoder and the dance-to-music generation network are completely decoupled. Specifically, we first train the music encoder and decoder as a standalone module; subsequently, we fix the parameters of the pre-trained music encoder and decoder and integrate this frozen module into the dance-to-music generation network for its training. When we train the music encoder and decoder, we use a batch size of 12 and use 50 audio files as the training set. The RVQ uses 8 codebooks, and each codebook has 1024 vectors for quantization. The encoder component of our framework consists of four consecutive convolutional blocks, with each block configured to use a distinct stride from (2, 4, 5, 10). This stride design ensures that the RVQ token generation rate from the input audio is 60 tokens per second, a value intentionally aligned with the frame rate of the input dance motions. Such alignment eliminates temporal mismatches between the audio-derived tokens and dance motion sequences, as both modalities are represented at a consistent temporal granularity. This consistency is critical for the subsequent cross-modal generation network to establish precise rhythmic and temporal correlations between dance movements and music tokens.
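The two-stage schedule can be realized by freezing the pre-trained codec before the second stage. A minimal sketch follows; the module and variable names are illustrative, and the learning rate matches the optimum reported in Section 4.4.

```python
import torch

def freeze(module: torch.nn.Module) -> None:
    """Freeze a pre-trained module so its weights stay fixed during stage two."""
    for p in module.parameters():
        p.requires_grad = False
    module.eval()

# Hypothetical usage: the codec is frozen, only the prediction network is trained.
# freeze(music_encoder); freeze(music_decoder)
# optimizer = torch.optim.Adam(prediction_net.parameters(), lr=1e-5)
```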
We use an NVIDIA GeForce RTX 2080 Ti GPU and an Intel Core i7-9700K CPU, with PyTorch 2.6.0. Training the music encoder and decoder takes 5 days, and training the dance-to-music generation network takes 3 days. In terms of inference efficiency, our model can generate corresponding music for a 10 s dance sequence within 1 s.

4.2. Evaluation Metrics

Musical rhythm serves as a fundamental bridge between dance motions and generated audio, as the synchronization between bodily movements and auditory beats directly impacts the perceived quality of cross-modal alignment. Both Dance2Music [6] and Foley Music [5] rely on human studies to evaluate the quality of their generated music. In addition, Foley Music [5] clusters the spectrograms of generated music to evaluate the diversity of the generated sound, and Dance2Music [6] uses the accuracy of a note prediction task as a metric. To quantitatively assess rhythmic coherence, we use two specialized metrics, building upon established practices in cross-modal rhythm evaluation [10]. Let the number of generated music beats be $B_g$ and the number of ground-truth music beats be $B_t$; the beat coverage score is then $B_g / B_t$. The beat hit score measures the ratio of aligned beats to the total number of musical beats: a generated beat is treated as aligned if its temporal distance to the nearest ground-truth beat is less than 0.1 s, and with $B_a$ denoting the number of aligned beats, the beat hit score is $B_a / B_t$. These two evaluation metrics are also used by Dance-to-Music [11], CDCD [9], and D2M-GAN [10].
Formally, we denote the beats of the generated music as $B_g \in \{0,1\}^N$ and the beats of the ground-truth music as $B_t \in \{0,1\}^N$. We define the aligned beat vector $B_a \in \{0,1\}^N$ to indicate time steps where the generated and ground-truth beats are close enough. For each time step $n \in \{1, 2, \ldots, N\}$, the $n$-th element of $B_a$ (denoted as $B_a(n)$) is determined by
$$B_a(n) = \begin{cases} 1 & \text{if } B_t(n) = 1 \text{ and } |n - m| < 6, \\ 0 & \text{otherwise}, \end{cases}\qquad(15)$$
where $m$ is the index of the generated beat in $B_g$ closest to $B_t(n)$. The beat coverage score and the beat hit score can then be calculated by
$$\mathrm{Beats\ Coverage\ Score} = \frac{\sum_{n=1}^{N} B_g(n)}{\sum_{n=1}^{N} B_t(n)},\qquad(16)$$
$$\mathrm{Beats\ Hit\ Score} = \frac{\sum_{n=1}^{N} B_a(n)}{\sum_{n=1}^{N} B_t(n)}.\qquad(17)$$
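For reproducibility, the two metrics can be computed from binary beat vectors as in the sketch below (NumPy; a tolerance of 6 frames corresponds to 0.1 s at 60 frames per second). The beat-extraction step itself is not shown, and the function name is illustrative.

```python
import numpy as np

def beat_scores(gen_beats: np.ndarray, gt_beats: np.ndarray,
                tol: int = 6) -> tuple[float, float]:
    """Beats Coverage Score and Beats Hit Score from binary beat vectors.

    gen_beats, gt_beats: 0/1 arrays of equal length N; tol = 6 frames
    corresponds to the 0.1 s tolerance at 60 frames per second."""
    gen_idx = np.flatnonzero(gen_beats)
    gt_idx = np.flatnonzero(gt_beats)
    coverage = gen_beats.sum() / max(gt_beats.sum(), 1)
    hits = 0
    for n in gt_idx:
        # a ground-truth beat counts as hit if a generated beat lies within tol
        if gen_idx.size and np.abs(gen_idx - n).min() < tol:
            hits += 1
    hit_score = hits / max(gt_idx.size, 1)
    return coverage, hit_score
```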

4.3. Results

The quantitative results on the AIST++ test set are shown in Table 3. We compared our method with four state-of-the-art methods: Dance2Music [6], Foley Music [5], CDCD [9], and D2M-GAN [10]. In Figure 4, we also use a bar chart to illustrate the quantitative evaluation in Table 3.

4.3.1. Beats Coverage Score (Higher = Better)

Our method attains a Beats Coverage Score (BCS) of 89.6%, outperforming all baselines by clear margins. Compared to the lowest-performing baseline, Foley Music (BCS 74.1%), our method achieves an improvement of 15.5 percentage points. This large gap indicates that our approach captures the overall rhythmic density of the original music far better, avoiding the sparse or incomplete beat generation that plagues simpler cross-modal mapping methods. Against the mid-tier baseline, Dance2Music (BCS 83.5%), our method still achieves a gain of 6.1 percentage points, confirming that, even compared to methods specifically designed for dance-to-music tasks, our model retains the rhythmic information in the input dance more effectively. When benchmarked against the two most advanced baselines (from the recent works [9,10]), we outperform D2M-GAN (BCS 88.2%) by 1.4 percentage points and surpass CDCD (BCS 89.0%) by 0.6 percentage points. These gaps, although smaller, highlight our method’s ability to further optimize beat coverage.

4.3.2. Beats Hit Score (Higher = Better)

Our method’s Beats Hit Score (BHS) of 85.2% also leads all baselines, with gaps that underscore its strength in temporal synchronization. Relative to Foley Music (BHS 69.4%) and Dance2Music (BHS 82.4%), our method improves by 15.8 and 2.8 percentage points, respectively, meaning that our generated beats are far more likely to align with the original music’s temporal positions. Compared to the two top-performing baselines, we outperform D2M-GAN (BHS 84.7%) by 0.5 percentage points and achieve a more notable 1.4-percentage-point advantage over CDCD (BHS 83.8%).
To verify the effectiveness of our method on 2D motion input, we conduct an additional experiment using 2D dance sequences from the AIST++ dataset.
Table 4 shows that the proposed motion feature extraction and music generation algorithm captures the rhythmic characteristics of dance movements even with 2D input. Our method achieves a Beats Coverage Score of 49.2% on the AIST++ dataset with 2D dance sequence input, outperforming the Dance-to-Music [11] method (BCS 47.6%) by 1.6 percentage points, equivalent to a 3.4% relative improvement. This gain indicates that our approach more comprehensively captures the rhythmic characteristics of 2D dance movements and maps them to coherent music, although our Beats Hit Score of 38.4% is lower than that of the comparison method (44.0%). The leading performance in the Beats Coverage Score reflects the overall rhythmic consistency between dance and music, validating the effectiveness of our framework for 2D motion input, even though it was originally designed for 3D motion input.
Compared with the quantitative results under the 3D scenario, the model achieves markedly lower evaluation scores in the 2D scenario. Fundamentally, human skeletons in 3D space obey rigid physical constraints. The bone length between adjacent joints is fixed and invariant throughout the whole body movement. These stable anatomical priors yield consistent, physically plausible spatial correlations across all keypoints, enabling the model to learn and infer precise musical beat positions. In contrast, 2D keypoints retain only planar XY coordinates, discarding depth information along the Z-axis. Dance rhythm and beat regularity are highly dependent on 3D motion characteristics, including limb swinging, body rotation, and forward–backward displacement—all of which cannot be adequately represented in a 2D space.
Collectively, the quantitative results in the scenarios of 2D and 3D input highlight that our method delivers robust and comprehensive music generation performance compared to existing work on the 2D and 3D dance-to-music task.
Figure 5 illustrates the training loss and the accuracy on the test dataset: the red curve (left) shows the training loss, and the blue curve (right) shows the test accuracy.
In Figure 6, we illustrate a 6 s segment to visualize the synchronization between the generated beats and the ground-truth music beats in the 3D scenario. Benefiting from the kinematic beat features, our model effectively captures the rhythmic patterns of music and achieves precise beat alignment.

4.4. Ablation Experiments

Building upon the preliminary work presented in our conference paper [23], we conducted additional ablation experiments to systematically explore the importance of the kinematic beat (KB) feature in enhancing the model’s music–motion beat alignment capability. This supplementary experimental design aims to address a gap in the initial study by quantifying the specific contribution of the KB feature, a core component of our proposed framework, thereby strengthening the reliability and persuasiveness of our research conclusions. The KB feature is specifically designed to capture the fine-grained rhythmic alignment between music signals and motion dynamics. As one of the core input features to the music RVQ module in our model, it plays a vital role in facilitating the RVQ module to learn discriminative cross-modal representations that bridge music and motion. To rigorously quantify the impact of the KB feature on model performance, we constructed a dedicated ablation variant: the full model architecture was retained, except that the KB feature was removed from the input of the music RVQ module. The quantitative comparison results between the full model (with the KB feature) and the ablation variant (without the KB feature) under different hyperparameter settings are summarized in Table 5.
First, we analyzed the impact of the learning rate on model performance, as it is a critical hyperparameter that directly affects the convergence quality and final performance of deep learning models. As shown in Table 5, the full model achieved optimal performance when the learning rate was set to 1 × 10−5, attaining the highest Beats Coverage Score (89.6%) and Beats Hit Score (85.2%) among all tested learning rates. In contrast, when the learning rate was changed to 5 × 10−4 or 5 × 10−5, the model’s performance showed a noticeable decline (e.g., the Beats Coverage Score dropped to 88.6% at 5 × 10−4 and 89.1% at 5 × 10−5). This result highlights that an appropriate learning rate, 1 × 10−5 in this study, is crucial for the model to converge to a high-quality local optimum and thus achieve superior beat alignment capability.
More importantly, the indispensable role of the KB feature in improving model performance is clearly evidenced by the comparative results. Across all three tested learning rates (5 × 10−5, 1 × 10−5, and 5 × 10−4), the ablation variant without the KB feature consistently exhibited a lower Beats Coverage Score and Beats Hit Score than the full model under the same learning-rate condition. A typical example is observed at the optimal learning rate of 1 × 10−5: removing the KB feature led to a drop of 0.6 percentage points in the Beats Coverage Score (from 89.6% to 89.0%) and 0.7 percentage points in the Beats Hit Score (from 85.2% to 84.5%). Similar declining trends were also observed at the other two learning rates: at 5 × 10−5, the ablation variant showed decreases of 0.2 and 1.7 percentage points in the Beats Coverage Score and Beats Hit Score, respectively; at 5 × 10−4, the corresponding drops were 0.3 and 0.4 percentage points. These consistent performance degradations demonstrate that the KB feature effectively enhances the model’s ability to capture the rhythmic correlation between music and motion, thereby improving beat alignment accuracy.
In conclusion, the supplementary ablation experiments confirm two key findings. First, the KB feature plays a non-negligible and positive role in enhancing the model’s beat alignment performance, which validates the rationality of its integration into the music RVQ module. Second, the learning rate of 1 × 10−5 is determined as the optimal hyperparameter for the proposed model, resulting in stable and superior performance.
The quantitative comparison results of our model with different codebook quantities are shown in Table 6. In the comparison of different codebook quantities, the model with four codebooks produces far fewer musical beats, and the temporal deviation between generated beats and the ground truth increases significantly. The resulting music suffers from noticeable distortion, where the model is only capable of generating high-amplitude dominant musical beats. By contrast, the model with 12 codebooks achieves no substantial performance improvement over the optimal setting. However, the increased number of codebooks leads to a rise in GPU memory consumption and training time cost.
For sequence modeling, we use LSTM as the core model. Compared with other sequence modeling approaches, LSTMs have significant advantages in capturing long-term dependencies, which are crucial for the cross-modal task of dance–music generation. To further verify the rationality of choosing LSTM, we have conducted comparative experiments with other sequence models, and the results are shown in Table 7. Specifically, LSTM consistently outperforms GRU on both the Beats Coverage Score and the Beats Hit Score, yielding higher numerical values on both evaluation metrics and increasing test accuracy from 72.4% to 75.2%.
The curves of the loss functions and the test accuracy are shown in Figure 7. Experimental results show that GRU converges faster at the initial stage and consumes less memory, while LSTM achieves higher test accuracy and better generalization, which is more appropriate for our dance–music cross-modal modeling task.

5. Discussion and Future Work

Although our method achieves promising performance, it currently lacks the capability to introduce explicit control conditions, such as music style. Furthermore, the background music in the AIST++ dataset exclusively contains instrumental tracks without human vocals. In addition, the dance sequences in AIST++ do not cover style transitions within a single performance, and the original dataset also lacks real-world contextual information, such as stage environments. Moreover, our experiments have not evaluated the model on sophisticated and complex dance choreographies, nor have we tested its performance on long-duration dance sequences. By contrast, numerous dance videos circulating on online platforms feature vocal music and dynamic style shifts, intricate dance movements, realistic stage settings, and long-duration performances.
In future work, we will enhance the model with music-style embeddings and multi-condition control mechanisms to address the lack of controllability. We also intend to construct more comprehensive datasets that incorporate vocal background music, intra-performance style variations, realistic stage context, sophisticated dance choreographies, and long-duration dance sequences. Additionally, to verify the statistical significance of the experimental results, we will conduct sufficient repeated training runs to perform cross-validation, collect standard deviations, and perform p-value tests, which will further validate the reliability and robustness of our proposed method. Together, these improvements will further enhance the generalization ability and practical applicability of the dance-to-music generation model in real application scenarios.

6. Conclusions

This paper presented a novel background music generation method constructed on a music RVQ encoder–decoder framework, aiming to advance the task of dance-to-music synthesis. For the experimental setup, we utilized the AIST++ dataset, where human dance is represented by sequences of 3D human skeleton keypoints, and background music is formatted as waveforms to ensure the validity and performance of our method. Two key contributions of this work drive its value in the field. First, we have developed a distinctive dance-to-music generation approach that employs an RVQ-based encoder–decoder. Unlike conventional methods that directly adopt music waveforms or MIDI as input, our method predicts the indices of latent-space vectors from the RVQ’s codebooks, which optimizes the mapping between dance motion features and musical representations. Second, we have conducted a comprehensive comparative analysis between our method and other state-of-the-art solutions, including representative works in dance-to-music generation [5,6,9,10,11]. We also conducted ablation experiments to explore the effectiveness of the kinematic feature and the impact of different learning rates on model training. Quantitative evaluation results consistently confirm that our method outperforms these existing alternatives.

Author Contributions

Conceptualization, S.L. and H.Y.; methodology, S.L.; software, S.L.; validation, S.L.; formal analysis, S.L.; investigation, S.L. and H.Y.; resources, S.L.; data curation, S.L.; writing—original draft preparation, S.L.; writing—review and editing, H.Y. and M.Z.; visualization, S.L.; supervision, H.Y. and M.Z.; project administration, H.Y. and M.Z.; funding acquisition, H.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the Hong Kong Innovation and Technology Commission (InnoHK Project CIMDA), and City University of Hong Kong (Project 9610460).

Data Availability Statement

The dance data used in this study are available in the AIST++ dataset [12]. The other data presented in this article are not readily available because the data are part of an ongoing study. Requests to access the other data should be directed to the first author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Chen, K.; Tan, Z.; Lei, J.; Zhang, S.H.; Guo, Y.C.; Zhang, W.; Hu, S.M. ChoreoMaster: Choreography-Oriented Music-Driven Dance Synthesis. ACM Trans. Graph. 2021, 40, 1–13. [Google Scholar] [CrossRef]
  2. Ye, Z.; Wu, H.; Jia, J.; Bu, Y.; Chen, W.; Meng, F.; Wang, Y. ChoreoNet: Towards Music to Dance Synthesis with Choreographic Action Unit. In Proceedings of the 28th ACM International Conference on Multimedia; Association for Computing Machinery: New York, NY, USA, 2020; pp. 744–752. [Google Scholar] [CrossRef]
  3. Zhou, Y.; Li, Z.; Xiao, S.; He, C.; Huang, Z.; Li, H. Auto-Conditioned Recurrent Networks for Extended Complex Human Motion Synthesis. In Proceedings of the ICLR 2018 Conference Track 6th International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  4. Huang, R.; Hu, H.; Wu, W.; Sawada, K.; Zhang, M.; Jiang, D. Dance Revolution: Long-Term Dance Generation with Music via Curriculum Learning. In Proceedings of the International Conference on Learning Representations (ICLR 2021), Vienna, Austria, 3–7 May 2021. [Google Scholar]
  5. Gan, C.; Huang, D.; Chen, P.; Tenenbaum, J.B.; Torralba, A. Foley Music: Learning to Generate Music from Videos. In Proceedings of the European Conference on Computer Vision (ECCV); Springer: Berlin/Heidelberg, Germany, 2020; pp. 758–775. [Google Scholar]
  6. Aggarwal, G.; Parikh, D. Dance2Music: Automatic Dance-driven Music Generation. arXiv 2021, arXiv:cs.SD/2107.06252. [Google Scholar] [CrossRef]
  7. Liang, X.; Li, W.; Huang, L.; Gao, C. DanceComposer: Dance-to-Music Generation Using a Progressive Conditional Music Generator. IEEE Trans. Multimed. 2024, 26, 10237–10250. [Google Scholar] [CrossRef]
  8. Zeghidour, N.; Luebs, A.; Omran, A.; Skoglund, J.; Tagliasacchi, M. SoundStream: An End-to-End Neural Audio Codec. IEEE/ACM Trans. Audio Speech Lang. Proc. 2021, 30, 495–507. [Google Scholar] [CrossRef]
  9. Zhu, Y.; Wu, Y.; Olszewski, K.; Ren, J.; Tulyakov, S.; Yan, Y. Discrete Contrastive Diffusion for Cross-Modal Music and Image Generation. In Proceedings of the Eleventh International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023; pp. 7828–7853. [Google Scholar]
  10. Zhu, Y.; Olszewski, K.; Wu, Y.; Achlioptas, P.; Chai, M.; Yan, Y.; Tulyakov, S. Quantized GAN for Complex Music Generation from Dance Videos. In Proceedings of the 17th European Conference on Computer Vision (ECCV 2022), Tel Aviv, Israel, 23–27 October 2022; pp. 182–199. [Google Scholar]
  11. Li, S.; Dong, W.; Zhang, Y.; Tang, F.; Ma, C.; Deussen, O.; Lee, T.Y.; Xu, C. Dance-to-Music Generation with Encoder-based Textual Inversion. In Proceedings of SIGGRAPH Asia 2024 Conference Papers; Association for Computing Machinery: New York, NY, USA, 2024; pp. 1–11. [Google Scholar]
  12. Li, R.; Yang, S.; Ross, D.A.; Kanazawa, A. AI Choreographer: Music Conditioned 3D Dance Generation with AIST++. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV); IEEE: New York, NY, USA, 2021; pp. 13381–13392. [Google Scholar] [CrossRef]
  13. Lee, H.Y.; Yang, X.; Liu, M.Y.; Wang, T.C.; Lu, Y.D.; Yang, M.H.; Kautz, J. Dancing to Music. In Proceedings of the Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2019; pp. 3581–3591. [Google Scholar]
  14. Zhuang, W.; Wang, C.; Chai, J.; Wang, Y.; Shao, M.; Xia, S. Music2Dance: DanceNet for Music-Driven Dance Generation. ACM Trans. Multimed. Comput. Commun. Appl. 2022, 18, 1–21. [Google Scholar] [CrossRef]
  15. Li, B.; Zhao, Y.; Zhelun, S.; Sheng, L. DanceFormer: Music Conditioned 3D Dance Generation with Parametric Motion Transformer. In Proceedings of the AAAI Conference on Artificial Intelligence; AAAI Press: Washington, DC, USA, 2022; Volume 36, pp. 1272–1279. [Google Scholar] [CrossRef]
  16. Niu, B.; Yang, R.; Zhang, Q.; Zhang, Y.; Fan, Y. CAFE-Dance: A Culture-Aware Generative Framework for Chinese Folk and Ethnic Dance Synthesis via Self-Supervised Cultural Learning. Big Data Cogn. Comput. 2025, 9, 307. [Google Scholar] [CrossRef]
  17. Li, X.; Li, R.; Fang, S.; Xie, S.; Guo, X.; Zhou, J.; Peng, J.; Wang, Z. Music-Aligned Holistic 3D Dance Generation via Hierarchical Motion Modeling. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV); IEEE: New York, NY, USA, 2025; pp. 14420–14430. [Google Scholar]
  18. Dong, B.; Lei, W.; Liu, L. FFD: Fine-Finger Diffusion Model for Music to Fine-grained Finger Dance Generation. In Proceedings of the Interspeech 2025; International Speech Communication Association (ISCA): Grenoble, France, 2025. [Google Scholar]
  19. Chen, Z.; Xu, H.; Song, G.; Xie, Y.; Zhang, C.; Chen, X.; Wang, C.; Chang, D.; Luo, L. X-Dancer: Expressive Music to Human Dance Video Generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV); IEEE: New York, NY, USA, 2025; pp. 10602–10611. [Google Scholar]
  20. Wang, H.; Song, Y.; Jiang, W.; Wang, T. A Music-Driven Dance Generation Method Based on a Spatial-Temporal Refinement Model to Optimize Abnormal Frames. Sensors 2024, 24, 588. [Google Scholar] [CrossRef] [PubMed]
  21. Han, B.; Li, Y.; Shen, Y.; Ren, Y.; Han, F. Dance2MIDI: Dance-driven multi-instrument music generation. Comput. Vis. Media 2024, 10, 791–802. [Google Scholar] [CrossRef]
  22. Sun, C.; Liu, G.; Fleming, C.; Yan, Y. Enhancing Dance-to-Music Generation via Negative Conditioning Latent Diffusion Model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2025; pp. 8321–8330. [Google Scholar]
  23. Lin, S.; Zukerman, M.; Yan, H. Dance to Music Generation Based on Residual Vector Quantization. In Proceedings of the IEEE BigData 2025; IEEE: New York, NY, USA, 2025; pp. 5175–5178. [Google Scholar]
  24. Ronchi, M.R.; Perona, P. Benchmarking and Error Diagnosis in Multi-instance Pose Estimation. In Proceedings of the IEEE International Conference on Computer Vision, ICCV 2017; IEEE: New York, NY, USA, 2017; pp. 369–378. [Google Scholar] [CrossRef]
  25. Loper, M.; Mahmood, N.; Romero, J.; Pons-Moll, G.; Black, M.J. SMPL: A Skinned Multi-Person Linear Model. ACM Trans. Graph. 2015, 34, 1–16. [Google Scholar] [CrossRef]
  26. Ho, C.; Tsai, W.T.; Lin, K.S.; Chen, H.H. Extraction and alignment evaluation of motion beats for street dance. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing; IEEE: New York, NY, USA, 2013; pp. 2429–2433. [Google Scholar]
Figure 1. The overview of the music encoder and decoder.
Figure 2. The overview of the dance-based music prediction.
Figure 3. The SMPL [25] skeleton and its joint names and indices.
Figure 4. Quantitative evaluation on AIST++ dataset.
Figure 5. The red curve is the training loss of the dance-based music prediction network, and the blue curve is the test accuracy of the dance-based music prediction network. We sample the data once every 100 epochs.
Figure 6. The red dashed lines represent the generated music beats, while the blue solid lines denote the ground-truth music beats.
Figure 7. The loss curves and accuracy curves between GRU and LSTM. We sample the data once every 100 epochs.
Table 1. Summary of representative and recent related studies.
Method | Scope | Dataset | Limitations | Advantages
Dance2Music [6] | Music2Dance | AIST++ | Similarity-based method | Online generation
Foley Music [5] | Video2MIDI | Foley Music | Only MIDI | MIDI can generate different styles of music
D2M-GAN [10] | Video2Music | AIST++ and TikTok Dance-Music | Slow inference speed | Style control
CDCD [9] | Video2Music and Music2Video | AIST++ and TikTok Dance-Music | Slow inference speed | Both Video2Music and Music2Video
Wang et al. [20] | Music2Video | DanceIt | Dataset is small | Abnormal frames fix
Niu et al. [16] | Music2Video | Niu et al. | Chinese ethnic folk dance generation only | First culture-aware framework
SoulNet [17] | Music2Dance | SoulDance | Large model size | Motions include hand and face
Dong et al. [18] | Music2Dance | DanceFingers-4K | Large model size | Finger movement
X-dancer [19] | Music2Video | X-dancer | Large model size | Generate background
Sun et al. [22] | Dance2Music | AIST++ and TikTok dance video | Large model size | Use both positive and negative conditioning
Table 2. Dance-based music prediction network architecture details and parameter statistics.
Module | Input Dim. | Output Dim. | Num. Layers | Parameter Size
Pose and Beat Projection | 73 | 256 | 1 | 18.94 K
Multi-Head Self-Attention Encoder | 256 | 256 | 2 | 1.57 M
LSTM Decoder | 256 | 1024 | 3 | 16.09 M
RVQ Classification Heads | 1024 | 1024 | 8 | 8.40 M
Total | | | | 26.08 M
Table 3. Quantitative evaluation results for music generation on the AIST++ dataset; ↑ indicates that a higher value is a superior score; bold values denote the best performance for each column.
Method | Beats Coverage Score ↑ | Beats Hit Score ↑
Dance2Music [6] | 83.5 | 82.4
Foley Music [5] | 74.1 | 69.4
D2M-GAN [10] | 88.2 | 84.7
CDCD [9] | 89.0 | 83.8
Ours | 89.6 | 85.2
Table 4. Quantitative evaluation results for music generation on the AIST++ dataset using 2D dance sequences; ↑ indicates that a higher value is a superior score; bold values denote the best performance for each column.
Method | Beats Coverage Score ↑ | Beats Hit Score ↑
Dance-to-Music [11] | 47.6 | 44.0
Ours | 49.2 | 38.4
Table 5. Quantitative evaluation results for ablation experiments with different learning rates and impact of kinematic beat (KB); ↑ indicates that a higher value is a superior score; bold values denote the best performance for each column.
Method | Beats Coverage Score ↑ | Beats Hit Score ↑
Full model (5 × 10−5) | 89.1 | 84.9
Full model (1 × 10−5) | 89.6 | 85.2
Full model (5 × 10−4) | 88.6 | 84.8
Full model without KB (5 × 10−5) | 88.9 | 83.2
Full model without KB (1 × 10−5) | 89.0 | 84.5
Full model without KB (5 × 10−4) | 88.3 | 84.4
Table 6. Quantitative evaluation results for ablation experiments with different codebook quantities; ↑ indicates that a higher value is a superior score; bold values denote the best performance for each column.
Method | Beats Coverage Score ↑ | Beats Hit Score ↑
Full model (4 codebooks) | 72.2 | 70.9
Full model (8 codebooks) | 89.6 | 85.2
Full model (12 codebooks) | 89.4 | 85.3
Table 7. Quantitative evaluation results between Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRUs); ↑ indicates that a higher value is a superior score; bold values denote the best performance for each column.
Method | Beats Coverage Score ↑ | Beats Hit Score ↑ | Test Accuracy ↑
Model with GRU | 85.7 | 82.9 | 72.4%
Model with LSTM | 89.6 | 85.2 | 75.2%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
