1. Introduction
As an important vehicle of human intangible culture, dance not only carries a group’s historical memory and aesthetic heritage, but also serves as a living conduit for cross-cultural exchange. Chinese ethnic folk dance refers to dance forms that originate and circulate among the people, are shaped by folk culture, and primarily serve self-entertainment; it features distinct ethnic styles and regional characteristics, and it reflects each ethnic group’s unique lifestyles, cultural traditions, and religious beliefs [1,2]. Chinese ethnic folk dance is not only an artistic expression, but also a vehicle of cultural memory and a symbol of identity, conveying emotions, history, and cultural meanings through movement [3,4]. As shown in Figure 1, Chinese ethnic folk dance is classified by ethnicity into Han ethnic dance, Korean ethnic dance, Dai ethnic dance, Mongolian ethnic dance, Uyghur ethnic dance, and other forms, including those of the Dong, Yi, and Tujia, among others [5,6,7,8]. As shown in Figure 1g,h, the Han Helou Dance, which originates in the Lingnan region and is renowned as a “living fossil” of dance, traces its history to the Shang and Zhou periods [8]. With its distinctive bodily signifiers and cultural system, Chinese ethnic folk dance constitutes a key section of the diversity atlas of Chinese civilization [9,10]. In the process of modernization, these treasures of world culture face a crisis of discontinuity in transmission, and their safeguarding urgently requires technological intervention [11].
With the deep integration of digital cultural heritage preservation and cross-modal generation, dance generation has evolved from basic motion replay into a core engine for the living transmission of traditional arts. Current solutions fall into four paradigms: motion-unit decoupling (e.g., ChoreoNet) [12], cross-modal diffusion generation (e.g., EDGE’s dance diffusion model) [13], choreographic memory codebooks (e.g., Bailando++) [14], and multi-condition controlled generation (e.g., keyframe optimization in DanceCamAnimator) [15]. Although these methods [12,13,14,15] achieve notable progress on generic dances (e.g., hip-hop), they rely heavily on high-precision motion capture and manual annotation. When facing culturally distinctive movements in the Helou Dance, such as stepping the Dipper pattern (ta-gang-bu-dou) and five-direction prostration, they commonly exhibit semantic distortion and cross-modal mismatch. This limitation constrains faithful reconstruction of fine-grained movements in Chinese ethnic folk dance and weakens applicability to cultural transmission and cross-scene generalization.
Under a minimal-annotation regime, constructing a Chinese ethnic folk dance training set that is both culturally representative and scalable is the core challenge for improving the robustness of generation models. This challenge arises because existing datasets rely mainly on manual annotation or costly capture [16,17], which leads to insufficient coverage of minority dances and poor scalability, making it difficult to support fine-grained modeling of long-tail styles [18]. For example, although the AIST++ dataset [18] covers ten dance categories and reconstructs 3D motion from multi-view videos, its samples mainly focus on modern forms such as street dance and lack culture-rich annotation for Mongolian Andai or Dai Peacock dances, resulting in low pose-trajectory variance and curation costs exceeding several hundred RMB per minute. Therefore, the primary challenge is how to build, under zero-manual-label supervision, a data foundation that is both culturally representative and scalable to provide reliable support for generation models.
Transforming cultural semantics from narrative concepts into a learnable and interpretable attention mechanism is an urgent requirement for high-fidelity generation of ethnic dance [17,19]. The difficulty lies in the fact that traditional methods quantify generic poses but fail to capture cultural differences across ethnic dances, which yields low semantic consistency scores [20]. For example, although the Bailando framework constructs a choreographic memory codebook via VQ-VAE to encode dance units, it shows a markedly low overlap ratio between attention hotspots and key joints when handling Helou Dance actions such as “Invoking the Deity” (qing-shen), exposing a lack of cultural interpretability [21]. Therefore, current models quantify generic poses yet struggle to capture culture-specific differences in ethnic dance, which limits their semantic consistency and interpretability.
Under weak supervision, achieving tri-modal alignment of music, motion, and culture is the key bottleneck that drives progress in ethnic dance generation. The root of this challenge lies in current frameworks that handle music conditioning but overlook the nonlinear mapping of cultural semantics, which leads to unstable phase synchronization rates under low-SNR audio [17,19,22]. For example, although the MoFusion model [15] integrates music and motion within a diffusion framework, it lacks coupling among culture, motion, and music in tests on Yi or Uyghur ethnic dances, revealing insufficient relation learning in cross-modal attention. Consequently, under weak supervision, how to realize dynamic tri-modal alignment among musical rhythm, dance motion, and cultural semantics becomes the core bottleneck that affects generation stability and cultural consistency.
Motivated by the above challenges, and building on the Chinese traditional ethnic Helou Dance, this paper introduces a culture-aware framework designed to generate dance movements that reflect distinct ethnic characteristics. The main contributions of this work are summarized as follows:
It proposes a zero-manual-label cultural data construction method for ethnic dance that, through automatic skeleton extraction and fusion of cultural semantic labels guided by a curated cultural knowledge base, for the first time achieves cultural representativeness and scalability of dance data without any frame-level manual labels, providing a sustainable data foundation for culture-aware dance generation.
It designs a Culture-Aware Attention Mechanism (CAAM) that enables the generation model to adaptively capture ethnic dance features and visualize cultural semantic hotspots, improving cultural interpretability and performance consistency in dance generation.
It builds a music–motion–culture Tri-Modal Alignment Network (TMA-Net) that achieves dynamic coupling and temporal synchronization of tri-modal semantics under weak supervision, enhancing stability and cultural consistency of dance generation in low-annotation scenarios.
4. Experimental Setup
To validate the effectiveness of CAFE-Dance, we conduct systematic experiments on an in-house ethnic dance skeleton dataset and compare against multiple baselines and ablation configurations. This section introduces, in sequence, the experimental dataset, the model training environment, the hyperparameters, and the evaluation metrics.
4.1. Datasets
We construct the Helouwu ethnic dance dataset from publicly available online video resources. To enable systematic analysis, we collect 170 raw performance clips totaling approximately 7.5 h, with an average duration of about 2.5 min per video. The videos primarily originate from intangible cultural heritage performances, archival materials released by local cultural institutions, and documentary footage of folk cultural activities (excluding professionally studio-recorded content). Before use, all videos undergo manual quality screening to ensure visual clarity, performance completeness, and content relevance.
Regarding data use, we strictly adhere to academic ethical standards. All videos are sourced from public platforms or institutional websites that permit non-commercial public use and are restricted to this academic research. To protect individual rights, we anonymize video content involving identifiable individuals and exclude materials explicitly marked as non-redistributable or requiring additional authorization.
Based on ZDCM (Equation (1)), we construct the Helouwu ethnic dance dataset. Starting from the 170 raw Helouwu clips (about 7.5 h in total), we first resample each video to 30 Hz and apply a fixed-length temporal windowing strategy. Concretely, for each dance video stream, we extract the motion sequence (33 × 4 values per frame: 33 keypoints, each with 3D coordinates and visibility), sampled at the 30 Hz frame rate. We then segment each clip into approximately 5 s windows (corresponding to T = 150 frames at 30 Hz) with mild overlap, followed by denoising and duration normalization; segments shorter than 5 s are zero-padded and longer ones are center-cropped. In parallel, we extract audio features (22,050 Hz sampling) for the same temporal windows and the video spatiotemporal representation (Equation (2)). Then, following the cross-modal semantic alignment strategy in Figure 3 (Equation (3)), we generate the cultural label vector from the knowledge base, which contains 108 cultural attribute vectors (costume, props, rituals, etc.), and filter out samples whose confidence falls below a fixed threshold. After temporal windowing and confidence filtering, we obtain approximately 2400 motion–audio–culture segments of T = 150 frames each, covering the fine-grained cultural elements of Helou Dance. We empirically compared several candidate sampling rates (15, 25, 30, 45, 60 Hz) and window lengths (2, 3, 5, 8, 10 s); 30 Hz and a 5 s window achieved the best trade-off between accuracy, semantic completeness, and efficiency (see Appendix B for a detailed ablation).
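The fixed-length windowing and duration normalization described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names and the hop of 120 frames (1 s of overlap) are assumptions, and only the 30 Hz rate, the 150-frame window, zero-padding, and center-cropping come from the text.

```python
from typing import List, Sequence

FPS = 30            # resampling rate stated in the text
WINDOW = 5 * FPS    # T = 150 frames per 5 s window


def window_starts(n_frames: int, window: int = WINDOW, hop: int = 120) -> List[int]:
    """Start indices of fixed-length windows; hop < window yields overlap.

    hop=120 (1 s of overlap) is an illustrative assumption.
    """
    if n_frames <= window:
        return [0]
    return list(range(0, n_frames - window + 1, hop))


def fit_length(frames: Sequence[Sequence[float]], target: int = WINDOW) -> List[List[float]]:
    """Zero-pad short segments at the end; center-crop long ones."""
    n = len(frames)
    if n >= target:
        start = (n - target) // 2
        return [list(f) for f in frames[start:start + target]]
    dim = len(frames[0]) if n else 0
    padding = [[0.0] * dim for _ in range(target - n)]
    return [list(f) for f in frames] + padding


# Example: a 100-frame segment with 33 x 4 = 132 values per frame.
segment = [[1.0] * 132 for _ in range(100)]
normalized = fit_length(segment)
print(len(normalized), window_starts(400))
```

With these assumed settings, a 400-frame clip yields windows starting at frames 0, 120, and 240, and every normalized segment has exactly 150 frames.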
4.2. Implementation Details
Experiments in this paper use the PyTorch 2.5.1 library and run on a high-performance server with the Ubuntu 22.04 operating system, equipped with an 18-core AMD EPYC 9754 processor (Advanced Micro Devices, Inc., Santa Clara, CA, USA) and three NVIDIA GeForce RTX 5090 GPUs (NVIDIA Corporation, Santa Clara, CA, USA).
The hyperparameters used in our experiments are listed in Table 1; unless otherwise specified, all comparisons and ablations keep identical data splits and training settings. Training uses the Adam optimizer and a StepLR learning-rate scheduler; the single-label multi-class classification task adopts the cross-entropy loss. We evaluate on the validation set every epoch and report final results on the test set.
During training, we further apply a light-weight data augmentation pipeline to improve robustness while preserving the semantic content of Helou Dance movements. For each 5 s motion–audio clip, we randomly apply (i) temporal jitter via mild speed scaling, (ii) small spatial perturbations on the 3D keypoints via bounded in-plane rotations and isotropic scale jitter, (iii) bounded brightness and contrast jitter on the raw video frames used by ZDCM, and (iv) low-variance Gaussian noise on joint coordinates. In a pilot study with dance experts, these augmentations were confirmed to maintain the semantic correctness of the original Helou Dance actions. Notably, to avoid interference across configurations, all models use the same random split and seed and are compared under an identical training schedule and identical stopping criteria.
4.3. Evaluation Metrics
To comprehensively assess the overall performance of the CAFE-Dance framework on ethnic dance generation, we construct a systematic evaluation metric suite across three dimensions (technical, cultural, and overall quality) to ensure that motion naturalness, cultural consistency, and overall generation quality are all quantitatively verified.
At the technical dimension, we use Fréchet Inception Distance (FID), Beat Alignment Score (BAS), and Physical Plausibility (PP). FID measures the similarity between generated and real dances in feature distributions, defined as follows:

$$\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right)$$

Here, $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ denote the mean and covariance of the feature distributions for real and generated samples, respectively. A smaller FID indicates that the generated results are closer to the real distribution. BAS evaluates the synchronization between dance motion sequences and musical beats: it compares the temporal positions of the motion beats with those of the musical beats under an allowed synchronization tolerance, so a higher BAS indicates tighter synchronization. PP measures the physical plausibility of generated skeletal motions via the smoothness of the changes in joint angular velocity between consecutive frames t. A higher PP indicates that the motion sequence better conforms to human kinematics.
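The beat-alignment idea described above can be illustrated with a short sketch. It assumes a Gaussian-kernel form that is common in music-to-dance evaluation; the function name, the kernel, and the 0.1 s tolerance are illustrative assumptions and may differ from the exact definition used in this paper.

```python
import math
from typing import Sequence


def beat_alignment_score(motion_beats: Sequence[float],
                         music_beats: Sequence[float],
                         sigma: float = 0.1) -> float:
    """Mean Gaussian-weighted closeness of each motion beat to its nearest music beat.

    Beat times are in seconds; sigma is the synchronization tolerance
    (0.1 s is an assumed value, not taken from the paper).
    """
    if not motion_beats or not music_beats:
        return 0.0
    total = 0.0
    for t_motion in motion_beats:
        # Distance from this motion beat to the closest musical beat.
        nearest = min(abs(t_motion - t_music) for t_music in music_beats)
        total += math.exp(-(nearest ** 2) / (2.0 * sigma ** 2))
    return total / len(motion_beats)


# Hypothetical beat times: the second motion beat is 50 ms late.
score = beat_alignment_score([0.50, 1.05, 1.50], [0.5, 1.0, 1.5])
print(round(score, 3))
```

Under this form, perfectly aligned beats score 1, and the score decays smoothly toward 0 as motion beats drift away from the music.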
At the cultural dimension, we introduce two metrics, Cultural Feature Accuracy (CFA) and Style Consistency (SC), to quantify the capacity for cultural feature expression in generated dances. CFA measures the proportion of cultural semantic features present in generated dances that are correctly identified by a cultural recognition model:

$$\mathrm{CFA} = \frac{N_{\mathrm{correct}}}{N_{\mathrm{total}}}$$

Here, $N_{\mathrm{correct}}$ is the number of correctly identified cultural semantic features and $N_{\mathrm{total}}$ is the total number of such features present in the generated dances. This metric reflects the model’s accuracy in capturing cultural features. SC measures the similarity between generated motions and the target cultural style in the embedding space; it is computed between the feature representation of a generated sample and the mean embedding vector of the target cultural style. A higher SC indicates stronger style consistency in the generated dance.
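The two cultural metrics can be sketched as below. CFA is implemented as the plain ratio described in the text; for SC, cosine similarity is used as one plausible embedding-space similarity, which is an assumption. The function names and the toy labels and vectors are hypothetical.

```python
import math
from typing import Sequence


def cultural_feature_accuracy(predicted: Sequence[str], expected: Sequence[str]) -> float:
    """Fraction of cultural features that the recognition model identifies correctly."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        return 0.0
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)


def style_consistency(embedding: Sequence[float], style_mean: Sequence[float]) -> float:
    """Cosine similarity between a sample embedding and the mean style embedding.

    Cosine similarity is an assumed choice of similarity measure.
    """
    dot = sum(a * b for a, b in zip(embedding, style_mean))
    norm = math.sqrt(sum(a * a for a in embedding)) * math.sqrt(sum(b * b for b in style_mean))
    return dot / norm if norm > 0 else 0.0


# Toy labels named after Helou Dance actions mentioned in the text.
expected = ["qing-shen", "trembling step", "bell-ringing", "qing-shen"]
predicted = ["qing-shen", "trembling step", "qing-shen", "qing-shen"]
print(cultural_feature_accuracy(predicted, expected))
print(style_consistency([1.0, 0.0], [1.0, 0.0]))
```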
At the overall-quality dimension, we design an Overall Quality Score (OQS) to comprehensively evaluate the overall performance of generated dances. OQS jointly considers the weighted contributions of FID, BAS, and CFA, defined as follows:

$$\mathrm{OQS} = \alpha \left( 1 - \mathrm{FID}_{\mathrm{norm}} \right) + \beta \, \mathrm{BAS} + \gamma \, \mathrm{CFA}$$

Here, $\alpha$, $\beta$, and $\gamma$ are the empirically set weighting coefficients, and $\mathrm{FID}_{\mathrm{norm}}$ is FID normalized to [0, 1] by min–max scaling over all compared methods, so that a lower FID contributes a higher score. A higher OQS indicates a better balance between generation quality and cultural feature expression.
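A weighted score of this kind can be computed as in the sketch below. It assumes that the normalized FID enters as (1 − normalized FID), since lower FID is better; the weights (0.4, 0.3, 0.3), the method names, and all scores are hypothetical placeholders rather than values from this paper.

```python
from typing import Dict, List, Sequence, Tuple


def min_max(values: Sequence[float]) -> List[float]:
    """Scale values to [0, 1]; returns zeros when all values are equal."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def overall_quality(fid_norm: float, bas: float, cfa: float,
                    weights: Tuple[float, float, float] = (0.4, 0.3, 0.3)) -> float:
    """Weighted sum of inverted normalized FID, BAS, and CFA (weights are assumed)."""
    alpha, beta, gamma = weights
    return alpha * (1.0 - fid_norm) + beta * bas + gamma * cfa


# Hypothetical (FID, BAS, CFA) triples for three unnamed methods.
methods: Dict[str, Tuple[float, float, float]] = {
    "method_a": (1.0, 0.90, 0.80),
    "method_b": (2.0, 0.80, 0.60),
    "method_c": (3.0, 0.70, 0.50),
}
fid_norms = min_max([scores[0] for scores in methods.values()])
for (name, (_, bas, cfa)), fid_norm in zip(methods.items(), fid_norms):
    print(name, round(overall_quality(fid_norm, bas, cfa), 3))
```

Because the normalization runs over the compared methods, adding or removing a method changes every method's normalized FID and therefore its score.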
This paper compares four representative mainstream dance generation models: Bailando [14], EDGE [13], Dancing to Music [25], and Music2Dance [26]. Specifically, Bailando encodes dance units in a VQ-VAE choreographic memory codebook and composes them into sequences conditioned on music; EDGE is a diffusion-based model for music-conditioned dance generation; and Dancing to Music follows a decomposition-to-composition scheme that learns basic dance units and then composes them according to the input music. By contrast, our CAFE-Dance framework further integrates a cultural attention mechanism and a ZDCM module, achieving notable gains in cultural semantic interpretability and cross-modal consistency of motion generation.
5. Experimental Results
Figure 4 presents the CAFE-Dance-generated signature Helou Dance actions “trembling step” and “bell-ringing.” To comprehensively evaluate the proposed CAFE-Dance method on music-driven dance generation, this section designs a systematic validation protocol comprising four parts: first, quantitative and qualitative comparative analysis against mainstream methods to verify overall advantages; second, ablation studies to probe the contributions of key modules; third, evaluation of the ZDCM module for automatic cultural feature recognition; and finally, expert subjective evaluation to assess cultural authenticity and aesthetic quality. All experiments are conducted on the unified Helouwu ethnic dance dataset to ensure comparability and reliability.
5.1. Comparison and Analysis with Existing Methods
This study evaluates the effectiveness of the proposed CAFE-Dance method for the music-driven dance generation task from both quantitative and qualitative perspectives.
5.1.1. Quantitative Results Analysis
As shown in Table 2, this paper compares CAFE-Dance with four mainstream music-driven dance generation methods. All methods are run independently five times on the same test set, and results are reported as mean ± standard deviation. The evaluation metrics include generative quality (FID↓), beat alignment (Beat Score↑), cultural consistency (Cultural Acc.↑), motion smoothness (Smoothness↑), and overall score (Overall↑).
CAFE-Dance achieves the best results on all five metrics, with consistently small standard deviations across the five runs, indicating good stability. For generative quality, CAFE-Dance attains a lower FID than the best baseline, Music2Dance-Baseline, indicating that the generated distribution is closer to real samples. For beat alignment, its Beat Score exceeds that of Music2Dance-Baseline, suggesting more effective modeling of music–dance synchronization.
For cultural consistency, Cultural Acc. improves over Music2Dance-Baseline in both absolute and relative terms, validating the effectiveness of the cultural knowledge modeling module. For motion smoothness and the overall score, CAFE-Dance improves over the respective best baselines: Bailando for smoothness and Music2Dance-Baseline for overall. Paired t-tests show that these performance differences relative to the baselines are statistically significant, with large effect sizes (Cohen’s d), indicating the robustness of CAFE-Dance’s advantage; the per-metric values are listed in Table 2.
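The paired comparison described above can be reproduced in outline as follows. This is a minimal sketch using only the standard library: it returns the paired t statistic and the paired-samples effect size (Cohen's d computed on the differences), and the five per-run scores are hypothetical placeholders, not values from Table 2.

```python
import math
import statistics
from typing import Sequence, Tuple


def paired_t_and_cohens_d(ours: Sequence[float], baseline: Sequence[float]) -> Tuple[float, float]:
    """Paired t statistic (df = n - 1) and Cohen's d on the paired differences."""
    if len(ours) != len(baseline) or len(ours) < 2:
        raise ValueError("need two equally long samples with at least two runs")
    diffs = [a - b for a, b in zip(ours, baseline)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation
    if sd_diff == 0:
        raise ValueError("differences have zero variance")
    t_stat = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    cohens_d = mean_diff / sd_diff
    return t_stat, cohens_d


# Hypothetical Cultural Acc. values over five runs for two methods.
ours = [0.83, 0.84, 0.82, 0.85, 0.83]
baseline = [0.70, 0.72, 0.69, 0.71, 0.70]
t_stat, d = paired_t_and_cohens_d(ours, baseline)
print(round(t_stat, 2), round(d, 2))
```

With five runs the test has four degrees of freedom, so the t statistic is compared against the corresponding t distribution to obtain a p-value.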
Furthermore, to analyze performance dynamics during training, we plot the validation accuracy trend in Figure 5. The figure shows that CAFE-Dance not only converges faster but also reaches a clearly higher final validation accuracy than Bailando and EDGE, further confirming advantages in learning efficiency and generalization. In addition to these aggregate metrics, we further inspect typical failure cases where CAFE-Dance still struggles, such as severe occlusions and complex multi-person formations. A representative breakdown in a dense multi-person Helou Dance scene is visualized in Figure A1, and the corresponding failure categories and complexity-dependent statistics are summarized in Tables A4 and A5 in Appendix C. These analyses provide a complementary view of the system’s limitations alongside its strengths.
5.1.2. Qualitative Result Analysis
This part aims to verify, via visual comparison, the superiority of CAFE-Dance in visual realism, cultural expressiveness, and rhythmic consistency of generated dance motions.
We select representative music segments and generate dance sequences with CAFE-Dance, Bailando, EDGE, and Dancing to Music for visual comparison. Figure 6 shows a multi-dimensional performance radar chart that intuitively presents each method’s performance across different dimensions.
From Figure 6, CAFE-Dance stands out especially on Cultural Authenticity and Beat Alignment, forming a clear region of advantage. For Diversity and Smoothness, CAFE-Dance also maintains a high level, indicating that it preserves motion diversity without sacrificing coherence and naturalness. In contrast, Bailando and EDGE show evident weaknesses in Cultural Authenticity and Beat Alignment, with particularly poor performance on Cultural Authenticity.
Overall, CAFE-Dance not only leads comprehensively on quantitative metrics but also demonstrates stronger capability in visual quality and cultural expression, validating its effectiveness and practicality for music-driven dance generation.
5.2. Ablation Experiment
This experiment systematically verifies the contribution of each key module in the proposed CAFE-Dance framework to overall performance and investigates the mechanisms by which different components affect cultural expression, rhythmic synchronization, and visual fidelity. Specifically, the goals are to validate the effectiveness of the Culture-Aware Attention Mechanism (CAAM) in improving the accuracy of cultural feature recognition and cross-modal semantic consistency; to assess the impact of the Tri-Modal Alignment Network (TMA-Net) on Beat Alignment; and to analyze how the Weakly Supervised Optimization Module (WSOM) and the Cultural Prior constraint improve the stability of the generative distribution (FID) and the overall score.
The ablation study is conducted on the Helouwu ethnic dance dataset. All experiments use the same training–validation split to ensure comparability and fairness. The evaluation metrics include Cultural Accuracy, Beat Score, FID, and overall score. Each metric is averaged over three runs under identical hyperparameters. All ablation models start from the full framework and remove or replace a single module in turn to objectively evaluate the performance contribution of the corresponding component.
As shown in Figure 7, the full model achieves the best performance on all metrics: Cultural Accuracy reaches 0.83, Beat Score is 0.91, FID drops to 0.65, and the overall score (Overall) is 0.83. When the Culture-Aware Attention Mechanism is removed (w/o CAAM), Cultural Accuracy falls markedly to 0.52 (a 37.3% relative decrease), and the overall performance declines to 0.70, validating the module’s key role in modeling cross-modal semantic relations and cultural weighting. When TMA-Net is removed, rhythmic synchronization decreases significantly (Beat Score from 0.91 to 0.79), indicating that the temporal alignment network plays a central role in maintaining music–dance correspondence. After removing WSOM, FID rises from 0.65 to 0.71, showing that weak-supervision regularization effectively stabilizes training and suppresses spurious cultural patterns. Under the removal of the Cultural Prior, Cultural Accuracy declines to 0.48, and the overall score drops to 0.69, highlighting the importance of cultural constraints in guiding the learning direction of the feature space.
From the overall trend, the full model achieves the best performance on both cultural and rhythmic dimensions (CA = 0.83, BS = 0.91), and its overall score is about 3.7% higher than that of the strongest ablated variant (w/o WSOM, Overall = 0.80). When CAAM or the Cultural Prior is absent, the model’s cultural expressiveness drops markedly, indicating that cultural modeling is the primary driver of system performance gains. As shown in Figure 7, the right-to-left performance shift (FID → Cultural Accuracy) reflects a strong coupled dependency between the cultural and generative modalities. Further observation shows that using CAAM in conjunction with WSOM effectively reduces feature distribution shift, enabling a better balance between cultural expression and physical plausibility.
In summary, the ablation results demonstrate the complementary roles of the components in CAFE-Dance with respect to cultural awareness, temporal alignment, and generation stability, validating the effectiveness and necessity of the proposed multimodal culture-aware mechanism and providing a structural basis for subsequent cultural semantic modeling.
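The relative changes quoted in the ablation discussion follow from a simple ratio, shown here as a small worked check. The helper name is hypothetical; the metric values are the ones reported above for the full model and the ablated variants.

```python
def relative_change(new: float, reference: float) -> float:
    """Signed relative change of `new` with respect to `reference`."""
    if reference == 0:
        raise ValueError("reference must be non-zero")
    return (new - reference) / reference


# (value being compared, reference value), as reported in the ablation study.
ablation = {
    "Cultural Accuracy, w/o CAAM vs. full": (0.52, 0.83),
    "Beat Score, w/o TMA-Net vs. full": (0.79, 0.91),
    "FID, w/o WSOM vs. full": (0.71, 0.65),
    "Overall, full vs. w/o WSOM": (0.83, 0.80),
}

for name, (new, reference) in ablation.items():
    print(f"{name}: {relative_change(new, reference) * 100:+.2f}%")
```

The first entry reproduces the 37.3% decrease in Cultural Accuracy, and the last corresponds to the roughly 3.7% gain of the full model over the variant without WSOM.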
5.3. Automatic Cultural Annotation Performance Evaluation
This experiment aims to verify the effectiveness of the proposed ZDCM module in automatically recognizing cultural features of ethnic dance. As an essential component of the CAFE-Dance framework, ZDCM is responsible for learning cultural semantic features—such as costume elements, ritual actions, Music–Dance correlation, and formation choreography—under zero-manual-label conditions. The core goals are: (1) to validate the accuracy and stability of ZDCM on multi-class cultural feature recognition; (2) to evaluate its agreement with expert annotations in unlabeled settings; (3) to investigate recognition disparities across cultural feature categories and identify potential room for improvement.
We use Precision, Recall, and F1-Score to quantitatively evaluate recognition performance for each cultural feature. All experiments follow a unified network configuration and training hyperparameters to ensure reproducibility and fairness.
As shown in Figure 8, ZDCM achieves strong recognition performance across four cultural feature categories. In the Costume Elements dimension, it attains the highest scores (Precision = 0.92, F1 = 0.90); its F1 is about 5.9% above the mean F1 of 0.85, indicating stable capture of costume-related visual semantics. For Music–Dance Correlation, the model reaches an F1 of 0.86, reflecting an advantage in modeling cross-modal relations between rhythm and motion. Overall, the mean Precision, Recall, and F1 for ZDCM are 0.87, 0.84, and 0.85, respectively, validating its high robustness and generalization ability under zero-manual-label supervision.
Notably, the model performs slightly lower on the Formation Patterns dimension (F1 = 0.81), mainly because formation changes span long temporal windows and exhibit large inter-sample variation, making global structural cues vulnerable to occlusion and pose ambiguity. The overall trend shows consistently strong performance on low-level visual semantics (e.g., costume and ritual actions) and on mid-level cross-modal features (e.g., Music–Dance rhythmic coupling), indicating that ZDCM effectively learns cultural semantic representations in the absence of manual annotations.
In summary, the automatic cultural annotation results substantiate the reliability and practicality of ZDCM for cultural feature recognition, with performance approaching expert-annotation levels. It markedly reduces labor cost and provides a solid data foundation for subsequent cultural feature modeling and cross-cultural dance generation.
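For reference, the per-category Precision, Recall, and F1 used in this evaluation can be computed from confusion counts as in the sketch below. The function name and the counts are hypothetical; the counts were chosen only so that the result is close to the Costume Elements scores reported above.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall, and F1 from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts for one cultural feature category.
precision, recall, f1 = precision_recall_f1(tp=92, fp=8, fn=12)
print(round(precision, 2), round(recall, 2), round(f1, 2))
```

Because F1 is the harmonic mean of precision and recall, it is pulled toward the lower of the two, which is why a category with high precision can still show a noticeably lower F1.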
5.4. Computational Complexity Analysis
This experiment systematically evaluates the computational efficiency of CAFE-Dance, focusing on practical deployment metrics such as inference latency, resource footprint, and throughput. Under a unified hardware setup, we compare against Bailando, EDGE, Dancing to Music, and Music2Dance-Baseline, recording five key indicators: single-clip inference time, memory footprint, parameter count, FLOPs, and throughput.
As shown in Figure 9, CAFE-Dance performs best on the runtime efficiency metrics. Specifically, its single-clip inference time is lower than that of the best baseline, EDGE; its throughput reaches 35 clips/s, higher than that of EDGE; and its memory footprint is just 156 MB, about 17% lower than EDGE (189 MB). These results indicate that the model markedly improves inference efficiency while maintaining high performance.
In terms of computational complexity, the parameter count and FLOPs of CAFE-Dance (Figure 9) are slightly higher than those of the baselines, including EDGE. However, parameter count does not translate linearly to inference efficiency: Bailando, despite having the smallest parameter count, exhibits lower inference efficiency. This supports the view that CAFE-Dance achieves a better efficiency–performance trade-off through an optimized network architecture (e.g., parallel design and shortened temporal paths). We note that CAFE-Dance still faces challenges in extremely resource-constrained scenarios (e.g., embedded devices).
In summary, CAFE-Dance strikes a favorable balance between computational complexity and generation quality, providing an efficient solution for the practical deployment of music-driven dance generation.
5.5. Cross-Ethnic Generalization Experiment on Chinese Folk and Ethnic Dance
As shown in Figure 10, to validate the cross-cultural generalization of the CAFE-Dance framework across different ethnic dance types, we conduct rigorous statistical evaluations on four representative Chinese ethnic-dance datasets. All experiments adopt a fully cross-validated design: each dance type is tested independently five times, results are reported as mean ± standard deviation, and homogeneity-of-variance tests are performed to ensure statistical assumptions are met.
As shown in Table 3, we first conduct a one-way ANOVA, which reveals significant differences across dance types on all metrics, demonstrating that dance type has a statistically significant effect on generative performance.
Post hoc pairwise comparisons using Tukey HSD show that, for Cultural Accuracy, Helou Dance and Uyghur do not differ significantly, but both are significantly higher than Tibetan. This pattern may reflect the unique religious-cultural elements and bodily expressions in Tibetan dance, which pose greater challenges to the model.
For Beat Score, Uyghur performs best and does not differ significantly from Mongolian, but is significantly higher than Tibetan. This suggests that while the model handles rhythmic complexity effectively, learning specific cultural rhythm patterns varies by dance type.
From the perspective of variability, FID exhibits the smallest standard deviations, indicating the highest stability in generative quality, whereas Cultural Accuracy shows relatively larger standard deviations, reflecting variability in learning cultural features across samples. This pattern is consistent across dance types, implying that capturing cultural expression is more challenging than lower-level visual features.
Notably, Levene’s tests for homogeneity of variances are non-significant across groups, satisfying the ANOVA assumption and enhancing the reliability of the conclusions. The standard deviations across the five runs remain within 0.02–0.03, indicating high stability.
In terms of effect size, the partial $\eta^2$ for the effect of dance type on Cultural Accuracy corresponds to a large effect, indicating that dance-type differences explain a substantial share of the variance in Cultural Accuracy. This finding underscores the importance of accounting for dance-type characteristics in cross-cultural dance generation research.
Overall, the statistical results indicate that CAFE-Dance exhibits strong adaptability in cross-cultural generalization. Although there remains room to improve the fine-grained capture of certain cultural elements, the framework provides a reliable technical foundation for generating core ethnic-dance styles.
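The one-way ANOVA and the effect size used in this analysis can be sketched with the standard library alone. The function name and the toy groups are hypothetical; in a one-way design the partial eta squared coincides with eta squared, that is, the between-group sum of squares divided by the total sum of squares.

```python
import statistics
from typing import Sequence, Tuple


def one_way_anova(groups: Sequence[Sequence[float]]) -> Tuple[float, float]:
    """Return the F statistic and eta squared for a one-way design."""
    all_values = [v for g in groups for v in g]
    grand_mean = statistics.mean(all_values)
    # Between-group and within-group sums of squares.
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - statistics.mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    if ss_within == 0:
        raise ValueError("within-group variance is zero; F is undefined")
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    eta_squared = ss_between / (ss_between + ss_within)
    return f_stat, eta_squared


# Toy data: three groups of three runs each (not values from Table 3).
f_stat, eta_squared = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(f_stat, eta_squared)
```

The F statistic is then compared against an F distribution with (k − 1, N − k) degrees of freedom, where k is the number of dance types and N the total number of runs.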
5.6. Parameter Sensitivity Analysis
This experiment systematically investigates the sensitivity of the proposed CAFE-Dance framework to hyperparameter configurations in the multi-objective optimization. Specifically, we analyze the impact of weight coefficients in the Overall Quality score (OQS) formulation to validate the robustness of our weighting strategy.
The experiment is conducted on the Helouwu ethnic dance dataset with identical model architecture, training epochs, and optimizer settings across all trials. We systematically vary the loss weights while keeping their sum fixed. Four distinct weight configurations are evaluated, each representing a perturbation from the baseline obtained by adjusting one weight component by a fixed step while proportionally adjusting the remaining weights. All results are averaged over five independent runs.
As shown in
Table 4, the OQS demonstrates remarkable stability across all weight configurations, with a maximum deviation of only 0.01 from the baseline. This indicates that the overall performance is largely insensitive to moderate perturbations in the weight coefficients. The baseline configuration
achieves the optimal balance among the three evaluation metrics, yielding the lowest FID (0.65) while maintaining high Beat Score (0.91) and Cultural Accuracy (0.83).
When increasing to 0.5 (emphasizing generation fidelity), we observe a marginal degradation in Cultural Accuracy (0.82 vs. 0.83), suggesting that excessive focus on distribution matching may slightly compromise cultural expressiveness. Conversely, when elevating to 0.4 (prioritizing cultural alignment), Cultural Accuracy improves to 0.84 at the cost of slightly increased FID (0.68 vs. 0.65) and reduced Beat Score (0.90 vs. 0.91). This reveals an inherent trade-off between cultural specificity and other quality dimensions.
The consistent performance across weight configurations () underscores the robustness of our approach, which we attribute to the complementary nature of the Cultural Attention Alignment Module (CAAM) and Tri-Modal Alignment Network (TMA-Net). These components enable the model to maintain cultural authenticity and rhythmic synchronization without over-reliance on any single loss term.
Statistical analysis confirms that the observed OQS variations are not significant (paired t-test across all comparisons), further validating the stability of our method. The optimal configuration performs consistently in all five independent runs, with an OQS standard deviation of 0.008.
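A paired t-test over five matched runs can be sketched with the standard library alone. The per-run OQS values below are hypothetical placeholders, not results from the paper. The comparison against the two-sided 5% critical value for 4 degrees of freedom (2.776) is an assumed significance criterion.

```python
import math
import statistics


def paired_t_statistic(a, b):
    """Return the paired t statistic for two equally long lists of matched runs."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two matched samples with at least two runs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    if sd == 0.0:
        raise ValueError("all differences are identical; t is undefined")
    return statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))


# Two-sided 5% critical value of Student's t with 4 degrees of freedom (5 runs).
T_CRIT_DF4 = 2.776

# Hypothetical per-run OQS values (placeholders for illustration only).
baseline_oqs = [0.80, 0.81, 0.79, 0.80, 0.81]
variant_oqs = [0.79, 0.81, 0.80, 0.80, 0.80]

t_stat = paired_t_statistic(baseline_oqs, variant_oqs)
is_significant = abs(t_stat) > T_CRIT_DF4
```

With only five runs per configuration the test has limited power, so a non-significant result here is best read alongside the reported maximum OQS deviation of 0.01.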
5.7. Expert Subjective Evaluation
This experiment aims to evaluate the subjective performance of the proposed CAFE-Dance framework in terms of the cultural authenticity, ritual atmosphere, and overall perceptual quality of generated dances, with a focus on interpretability and aesthetic consistency at the level of cultural expression.
The subjective evaluation is independently conducted by three dance experts, each with over ten years of experience in ethnic dance teaching and research. The assessment uses a five-point Likert scale ranging from 1 (very poor) to 5 (excellent). The experiment is based on generated samples from the Helouwu ethnic dance dataset. After viewing the real performance and the corresponding samples generated by the three methods (CAFE-Dance, Bailando, and EDGE), each expert assigns independent scores on three criteria. To assess the consistency of the results, we compute Cohen’s Kappa for inter-rater agreement, which indicates a high level of consistency among expert ratings. At this stage, however, the expert panel does not yet include Helou Dance practitioners or cultural preservation organizations from Guangdong Province; incorporating their participatory feedback will be an important direction for future work to further strengthen the cultural validity of the evaluation.
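Cohen's Kappa is defined for two raters, so one common way to apply it to a three-expert panel is to average the statistic over all rater pairs. The sketch below takes that route as an assumption, since the text does not say how the three raters were combined. It also uses the unweighted form, which treats Likert scores as unordered categories. The ratings are hypothetical placeholders.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(r1, r2):
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    if len(r1) != len(r2) or not r1:
        raise ValueError("raters must score the same non-empty set of items")
    n = len(r1)
    observed = sum(a == b for a, b in zip(r1, r2)) / n
    c1, c2 = Counter(r1), Counter(r2)
    # Chance agreement from each rater's marginal score frequencies.
    expected = sum(c1[k] * c2[k] for k in c1.keys() | c2.keys()) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)


def mean_pairwise_kappa(raters):
    """Average Cohen's kappa over every pair of raters (assumed aggregation)."""
    pairs = list(combinations(raters, 2))
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)


# Hypothetical Likert scores (1-5) from three experts on six samples.
expert_a = [4, 4, 3, 5, 2, 4]
expert_b = [4, 4, 3, 5, 3, 4]
expert_c = [4, 5, 3, 5, 2, 4]

panel_kappa = mean_pairwise_kappa([expert_a, expert_b, expert_c])
```

A weighted kappa, which penalizes a 4-versus-5 disagreement less than a 1-versus-5 disagreement, may suit ordinal Likert data better; the unweighted form is used here only to keep the sketch short.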
As shown in Figure 11, CAFE-Dance significantly outperforms existing models on all three subjective metrics. Its Cultural Authenticity score is 4.2, representing improvements of 50% and 61.5% over Bailando (2.8) and EDGE (2.6), respectively; its Ritual Accuracy is 4.3, approaching the level of real performances (4.7), indicating that the generated dance closely matches real dancers in ritualized expression and postural coordination; its Overall Quality reaches 4.2, about a 35% gain over the best baseline, Bailando (3.1). These results substantiate the effectiveness of cultural modeling and cross-modal alignment in enhancing the perceptual quality of dance generation.
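The relative improvements quoted above follow directly from the reported mean scores. The short check below reproduces them; the helper name `relative_gain` is introduced only for this illustration.

```python
def relative_gain(ours, baseline):
    """Percentage improvement of `ours` over `baseline`."""
    return 100.0 * (ours - baseline) / baseline


# Mean expert scores as reported in the text.
gain_auth_bailando = relative_gain(4.2, 2.8)   # Cultural Authenticity vs. Bailando
gain_auth_edge = relative_gain(4.2, 2.6)       # Cultural Authenticity vs. EDGE
gain_quality_bailando = relative_gain(4.2, 3.1)  # Overall Quality vs. Bailando

summary = {
    "authenticity_vs_bailando": round(gain_auth_bailando, 1),  # 50.0
    "authenticity_vs_edge": round(gain_auth_edge, 1),          # 61.5
    "quality_vs_bailando": round(gain_quality_bailando, 1),    # 35.5
}
```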
Notably, although CAFE-Dance performs close to real performances, a slight gap remains in the coherence of the cultural atmosphere. We attribute this mainly to slight rhythmic lag and insufficient pose smoothness during long-horizon motion transitions in some generated sequences, which cause the overall ritual atmosphere to fall slightly below that of real videos.
Overall, the subjective evaluation indicates that CAFE-Dance attains high-quality generation at the technical level and earns expert recognition for Cultural Authenticity and aesthetic perception. This result validates that the proposed framework achieves its design goal of combining interpretability and artistic expressivity in cultural dance generation.
6. Conclusions
This paper investigates the core challenges of generating ethnically authentic dance sequences under weak supervision. To address the bottlenecks of heavy reliance on manual annotation, difficulty in modeling cultural semantics, and difficulty in multi-modal alignment, we propose a new culture-aware dance generation framework, CAFE-Dance, based on Helouwu. The framework systematically integrates zero-manual-label cultural data construction, a cultural attention mechanism, and a tri-modal alignment network, enabling automatic generation of dance sequences with high cultural fidelity, precise musical synchronization, and natural motion smoothness without costly motion capture or manual intervention. Experiments show that the design reduces FID to 0.65 and attains a subjective Ritual Accuracy score of 4.3 (compared with 4.7 for real performances), outperforming current mainstream methods.
Despite the strong results on Helou Dance, this study has limitations, especially the reliance on visual features for cultural semantic representation. To address this, future work will explore language-driven conditioning by incorporating multimodal large language models to enhance cultural attribute semantics and fine-grained generative control. We will also analyze potential biases in the cultural annotation process, particularly systematic biases introduced by automatic labeling, and establish bias detection and mitigation mechanisms. In addition, we plan to expand the dance database across multiple ethnicities and genres, and to develop standardized cultural expression guidelines with a source-community feedback process to ensure cultural compliance and continuous improvement. Furthermore, we aim to establish long-term collaborations with Helou Dance practitioners and cultural preservation organizations in Guangdong Province to co-design evaluation protocols and iteratively refine the system based on practitioner feedback, thereby further enhancing the cultural validity and social robustness of CAFE-Dance. On the deployment side, we will advance efficient on-device inference, including lightweight strategies such as model compression, quantization, and knowledge distillation, to enable low-latency, high-fidelity cultural dance generation and interactive experiences. In parallel, we will address the failure modes identified under severe occlusions and dense multi-person formations by augmenting CAFE-Dance with occlusion-aware 3D reconstruction and dynamic multi-person association modules informed by the failure-case statistics in Appendix C.