CAFE-Dance: A Culture-Aware Generative Framework for Chinese Folk and Ethnic Dance Synthesis via Self-Supervised Cultural Learning

Niu, Bin; Yang, Rui; Zhang, Qiuyu; Zhang, Yani; Fan, Ying

doi:10.3390/bdcc9120307


by Bin Niu ¹,*, Rui Yang ², Qiuyu Zhang ², Yani Zhang ³,* and Ying Fan ⁴

¹ School of Dance, Northwest Normal University, Lanzhou 730070, China

² School of Computer Science and Artificial Intelligence, Lanzhou University of Technology, Lanzhou 730050, China

³ School of Arts, Shandong University, Jinan 250100, China

⁴ School of International Communication and Arts, Hainan University, Haikou 570228, China

\* Authors to whom correspondence should be addressed.

Big Data Cogn. Comput. 2025, 9(12), 307; https://doi.org/10.3390/bdcc9120307

Submission received: 26 October 2025 / Revised: 26 November 2025 / Accepted: 28 November 2025 / Published: 2 December 2025

(This article belongs to the Topic Generative AI and Interdisciplinary Applications)


Abstract

As a vital carrier of human intangible culture, dance plays an important role in cultural transmission through digital generation. However, existing dance generation methods rely heavily on high-precision motion capture and manually annotated datasets, and they fail to effectively model the culturally distinctive movements of Chinese ethnic folk dance, resulting in semantic distortion and cross-modal mismatch. Building on the Chinese traditional ethnic Helou Dance, this paper proposes a culture-aware Chinese ethnic folk dance generation framework, CAFE-Dance, which dispenses with manual annotation and automatically generates dance sequences that achieve high cultural fidelity, precise music synchronization, and natural, fluent motion. To address the high cost and poor scalability of cultural annotation, we introduce a Zero-Manual-Label Cultural Data Construction Module (ZDCM) that performs self-supervised cultural learning from raw dance videos, using cross-modal semantic alignment and a knowledge-base-guided automatic annotation mechanism to construct a high-quality dataset of Chinese ethnic folk dance covering 108 classes of curated cultural attributes without any frame-level manual labels. To address the difficulty of modeling cultural semantics and the weak interpretability, we propose a Culture-Aware Attention Mechanism (CAAM) that incorporates cultural gating and co-attention to adaptively enhance culturally key movements. To address the challenge of aligning the music–motion–culture tri-modalities, we propose a Tri-Modal Alignment Network (TMA-Net) that achieves dynamic coupling and temporal synchronization of tri-modal semantics under weak supervision. Experimental results show that our framework improves Beat Alignment and Cultural Accuracy by 4.0–5.0 percentage points and over 30 percentage points, respectively, compared with the strongest baseline (Music2Dance), and it reveals an intrinsic coupling between cultural embedding density and motion stability. 
The code and the curated Helouwu dataset are publicly available.

Keywords:

Helou dance; Chinese folk and ethnic dance; zero-manual-label cultural data construction; culture-aware dance generation; multimodal alignment

1. Introduction

As an important vehicle of human intangible culture, dance not only carries a group’s historical memory and aesthetic heritage, but also serves as a living conduit for cross-cultural exchange. Chinese ethnic folk dance refers to dance forms that originate and circulate among the people, are shaped by folk culture, and primarily serve self-entertainment; it features distinct ethnic styles and regional characteristics, and it reflects each ethnic group’s unique lifestyles, cultural traditions, and religious beliefs [1,2]. Chinese ethnic folk dance is not only an artistic expression, but also a vehicle of cultural memory and a symbol of identity, conveying emotions, history, and cultural meanings through movement [3,4]. As shown in Figure 1, Chinese ethnic folk dance is classified by ethnicity into Han ethnic dance, Korean ethnic dance, Dai ethnic dance, Mongolian ethnic dance, Uyghur ethnic dance, and other forms, including those of the Dong, Yi, and Tujia, among others [5,6,7,8]. As shown in Figure 1g,h, the Han Helou Dance, which originates in the Lingnan region and is renowned as a “living fossil” of dance, traces its history to the Shang and Zhou periods [8]. With its distinctive bodily signifiers and cultural system, Chinese ethnic folk dance constitutes a key section of the diversity atlas of Chinese civilization [9,10]. In the process of modernization, these treasures of world culture face a crisis of discontinuity in transmission, and their safeguarding urgently requires technological intervention [11].

With the deep integration of digital cultural heritage preservation and cross-modal generation, dance generation evolves from basic motion replay into a core engine for the living transmission of traditional arts. Current solutions fall into four paradigms: motion-unit decoupling (e.g., ChoreoNet) [12], cross-modal diffusion generation (e.g., EDGE’s dance diffusion model) [13], choreographic memory codebooks (e.g., Bailando++) [14], and multi-condition controlled generation (e.g., keyframe optimization in DanceCamAnimator) [15]. Although these methods [12,13,14,15] achieve notable progress on generic dances (e.g., hip-hop), they rely heavily on high-precision motion capture and manual annotation. When facing culturally distinctive movements in the Helou Dance—such as stepping the Dipper pattern (ta-gang-bu-dou) and five-direction prostration—they commonly exhibit semantic distortion and cross-modal mismatch. This limitation constrains faithful reconstruction of fine-grained movements in Chinese ethnic folk dance and weakens applicability to cultural transmission and cross-scene generalization.

Under a minimal-annotation regime, constructing a Chinese ethnic folk dance training set that is both culturally representative and scalable is the core challenge for improving the robustness of generation models. This challenge arises because existing datasets rely mainly on manual annotation or costly capture [16,17], which leads to insufficient coverage of minority dances and poor scalability, making it difficult to support fine-grained modeling of long-tail styles [18]. For example, although the AIST++ dataset [18] covers ten dance categories and reconstructs 3D motion from multi-view videos, its samples mainly focus on modern forms such as street dance and lack culture-rich annotation for Mongolian Andai or Dai Peacock dances, resulting in low pose-trajectory variance and curation costs exceeding several hundred RMB per minute. Therefore, the primary challenge is how to build, under zero-manual-label supervision, a data foundation that is both culturally representative and scalable to provide reliable support for generation models.

Transforming cultural semantics from narrative concepts into a learnable and interpretable attention mechanism is an urgent requirement for high-fidelity generation of ethnic dance [17,19]. The difficulty lies in the fact that traditional methods quantify generic poses but fail to capture cultural differences across ethnic dances, which yields low semantic consistency scores [20]. For example, although the Bailando framework constructs a choreographic memory codebook via VQ-VAE to encode dance units, it shows a markedly low overlap ratio between attention hotspots and key joints when handling Helou Dance actions such as “Invoking the Deity” (qing-shen), exposing a lack of cultural interpretability [21]. Therefore, current models quantify generic poses yet struggle to capture culture-specific differences in ethnic dance, which limits their semantic consistency and interpretability.

Under weak supervision, achieving tri-modal alignment of music, motion, and culture is the key bottleneck that drives progress in ethnic dance generation. The root of this challenge lies in current frameworks that handle music conditioning but overlook the nonlinear mapping of cultural semantics, which leads to unstable phase synchronization rates under low-SNR audio [17,19,22]. For example, although the MoFusion model [15] integrates music and motion within a diffusion framework, it lacks coupling among culture, motion, and music in tests on Yi or Uyghur ethnic dances, revealing insufficient relation learning in cross-modal attention. Consequently, under weak supervision, how to realize dynamic tri-modal alignment among musical rhythm, dance motion, and cultural semantics becomes the core bottleneck that affects generation stability and cultural consistency.

Motivated by the above challenges, and building on the Chinese traditional ethnic Helou Dance, this paper introduces a culture-aware framework designed to generate dance movements that reflect distinct ethnic characteristics. The main contributions of this work are summarized as follows:

It proposes a zero-manual-label cultural data construction method for ethnic dance that, through automatic skeleton extraction and fusion of cultural semantic labels guided by a curated cultural knowledge base, for the first time achieves cultural representativeness and scalability of dance data without any frame-level manual labels, providing a sustainable data foundation for culture-aware dance generation.
It designs a Culture-Aware Attention Mechanism (CAAM) that enables the generation model to adaptively capture ethnic dance features and visualize cultural semantic hotspots, improving cultural interpretability and performance consistency in dance generation.
It builds a music–motion–culture Tri-Modal Alignment Network (TMA-Net) that achieves dynamic coupling and temporal synchronization of tri-modal semantics under weak supervision, enhancing stability and cultural consistency of dance generation in low-annotation scenarios.

2. Related Work

2.1. Minimal-Annotation and Culturally Representative Dataset Construction for Dance Generation

In recent years, research on music-driven dance generation has shifted from manually annotated data toward an automated data-construction paradigm, with prior work focusing primarily on the fundamental tasks of multi-view capture, skeleton reconstruction, and music synchronization. However, existing datasets rely heavily on expensive motion capture systems and manual labels (e.g., the AIST++ and AI Choreographer datasets [18,23]), which provide limited coverage for long-tail styles such as ethnic dances and thus fall short for generative modeling of cultural diversity [1,2]. For example, Ye et al. propose the ChoreoNet framework [12], which learns an implicit mapping from music to dance to improve generation coherence, yet its data still depend on manually segmented annotations; Takano et al. construct a multi-view 3D dance dataset based on AIST++ [23], achieving music–motion time synchronization but exhibiting extreme scarcity of cultural categories and insufficient pose variance. To address these limitations, this paper introduces a zero-manual-label cultural data pipeline module that combines MediaPipe-based automatic skeleton extraction with culture-ontology–guided heuristic labels derived from a curated cultural knowledge base to simultaneously achieve cultural representativeness and data scalability without any frame-level manual labels [8,9].

2.2. Cultural Semantics Modeling and Interpretable Attention in Dance Generation

At the semantic and cultural levels of dance generation, explicit modeling of cultural features and interpretability is becoming a research focus [3,6]. Although existing methods achieve notable advances in motion quality and music synchronization (e.g., the Bailando and EDGE frameworks [13,21]), they still lack explicit modeling of culturally distinctive movements such as those in ethnic dance, and their attention weight distributions fail to explain cultural semantic associations [5,7]. For example, Li et al. build a choreographic memory codebook with VQ-VAE in Bailando to capture the structure of dance units [21], yet the model shows an imbalance in attention regions for Helou Dance postures such as stepping the Dipper pattern (ta-gang-bu-dou), with the hotspot distribution over key joints poorly aligned to the topological structure of ritual actions [7]. Zhou et al. propose EDGE to improve controllability via editable dance generation [13], but it does not achieve style interpretability at the cultural level. To address this gap, we design a Culture-Aware Attention Mechanism that introduces cultural semantic embedding vectors into a multi-head attention architecture, enabling the model to adaptively capture ethnic features and visualize cultural semantic hotspots during motion generation, thereby significantly improving the consistency and interpretability of cultural expression [11,24].

2.3. Tri-Modal Alignment of Music, Motion, and Cultural Semantics Under Weak Supervision

Cross-modal alignment is one of the core problems in music-driven dance generation, and under weak supervision, the tri-modal coordination of music, motion, and cultural semantics still faces significant challenges [4,10]. Although existing work (e.g., MoFusion and the Human Motion Diffusion Model) [15,22] employs diffusion models to realize music–motion conditional generation, it generally overlooks the nonlinear mapping of cultural semantics, resulting in insufficient synchronization between cultural style and rhythmic structure [9]. For example, Zhou et al.’s MoFusion framework improves motion smoothness through multi-task training [15] but exhibits unstable phase synchronization and cultural distortion in ethnic dance tests. To address this bottleneck, this paper constructs a Tri-Modal Alignment Network (TMA-Net) that, under weak supervision, fuses musical features, motion skeletons, and cultural embeddings via a cross-modal attention mechanism to achieve tri-modal semantic alignment and dynamic synchronization, thereby maintaining generation stability and cultural consistency in low-annotation scenarios [3,6].

2.4. Helou Dance

Chinese ethnic folk dance, as an integral component of China’s fine traditional culture, presents an artistic sublimation and concentrated expression of production and livelihood, religious beliefs, and aesthetic sentiments that emerge from specific regions and groups over long historical development [1,2]. Chinese ethnic folk dance exhibits rich and diverse forms and a vast corpus; it not only constitutes the cultural markers of each ethnicity, but also provides a dynamic visual text for understanding the cultural diversity of Chinese society and its historical transformations. Against this broad backdrop, Helou Dance, which originates in the Nanjiang River basin of the Lingnan region, stands as a living legacy and unique specimen of Chinese ritual ethnic dance [5,6,7,8], as shown in Figure 2.

Helou Dance circulates in the area of Liantan Town, Yunan County, Guangdong Province, China. It originates in the harvest celebrations and sacrificial dances of the Wuhu people of the Yue in the Nanjiang River basin during the Qin and Han periods. Its movement vocabulary uses stepping as the base and arm-swinging as the principal motif, emphasizing the rhythmic coordination between firm lower-limb grounding and expansive upper-limb swinging. Typical movements include stepping the Dipper pattern (ta-gang-bu-dou), five-direction prostration, and a procession around the performance space; the musical melody is fixed and lyrical, with a distinct beat. Typical attire includes lotus coronets, red-and-yellow shawls, an ox-head ringed staff, and bronze bells tied with colored ribbons. In group formations, dancers often wear masks and hold rice ears, step to the song, and perform facing the four cardinal directions (east, south, west, north). As an important remnant of ancient rice-farming culture, Helou Dance presents salient Lingnan folk characteristics and ritual aesthetics; however, it currently faces pressures on survival and transmission and therefore urgently requires digital preservation and re-creation [8].

3. Methods

This section presents how CAFE-Dance addresses key issues in cultural dance generation, including multimodal alignment, cultural feature modeling, and weak-supervision optimization. As shown in Figure 3, CAFE-Dance comprises four innovative modules: zero-manual-label cultural data construction, a cultural attention mechanism, a tri-modal alignment network, and a weak-supervision optimization strategy. We start with a system overview, then introduce the design principles and technical implementation of each module in sequence, and finally describe the overall workflow.

3.1. Zero-Manual-Label Cultural Dance Dataset Construction

Existing dance datasets generally rely on manual annotation of cultural labels, which limits data scale and leaves cultural features undercovered. For example, the AIST++ [18] dataset only annotates modern genres (e.g., hip-hop, house, jazz) and lacks cultural semantic distinctions for traditional ethnic dance. This bottleneck arises primarily from the expertise required for cultural annotation and the high cost of manual labeling, which in turn constrains model generalization and cultural diversity.

For this purpose, we propose a Zero-Manual-Label Cultural Data Construction Module (ZDCM), which is embedded in the data preprocessing layer of the framework in Figure 3. Its formal definition is given in Equation (1): it receives the raw dance video stream $V = \{v_{1}, v_{2}, \dots, v_{T}\}$ and outputs a triplet augmented with cultural semantics, $D = (M, A, C)$:

$$D = \Psi(V), \tag{1}$$

where $M$ denotes the motion sequence, $A$ denotes the audio features, and $C$ denotes the cultural label vector. Here, “zero-manual-label” indicates that no frame-level or clip-level manual annotations are required for $V$; instead, ZDCM leverages a curated cultural knowledge base $K$ and self-supervised confidence-based filtering to generate pseudo cultural labels.

ZDCM consists of two core processes: cultural symbol extraction and cross-modal semantic alignment. In the cultural symbol extraction stage, the model automatically identifies cultural visual elements from the raw video, as shown in Equation (2):

$$F_{vis} = \Phi_{ViT}(V) \in \mathbb{R}^{T \times d}, \tag{2}$$

where $\Phi_{ViT}$ denotes a spatiotemporal feature encoder and $F_{vis}$ denotes the extracted visual feature sequence.

Subsequently, to strengthen cultural semantic associations, we introduce cross-modal semantic alignment, as shown in Equation (3):

$$C_{t} = \mathrm{softmax}(f_{t} W_{c} K^{\top}), \tag{3}$$

where $f_{t}$ is the $t$-th frame feature in $F_{vis}$, $W_{c}$ is a learnable projection matrix, and $K$ contains 10 cultural embedding vectors in a 108-dimensional semantic space, each corresponding to one dance category and encoding 108 cultural attributes (attire, props, rituals, etc.). The knowledge-base matrix $K \in \mathbb{R}^{10 \times 108}$ thus serves as a compact representation of curated cultural prototypes, which are further refined during training.

More specifically, given the query vector $q = f_{t} W_{c} \in \mathbb{R}^{108}$, we compute its semantic similarity to each dance category via an attention mechanism, as shown in Equation (4):

$$a = \mathrm{softmax}\left(\frac{q K^{\top}}{\sqrt{108}}\right) \in \mathbb{R}^{10}, \tag{4}$$

and then retrieve the cultural feature $C_{t} \in \mathbb{R}^{108}$ via a weighted combination of the cultural prototypes, as shown in Equation (5):

$$C_{t} = a K \in \mathbb{R}^{108}. \tag{5}$$
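
The retrieval in Equations (4) and (5) can be illustrated with a minimal NumPy sketch. The feature width `d = 768`, the random initialization, and the function names are assumptions made for illustration only; the paper does not specify them.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    z = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=axis, keepdims=True)

def retrieve_cultural_feature(f_t, W_c, K):
    """Eqs. (4)-(5): attend over the cultural prototypes and mix them.

    f_t: (d,) frame feature, W_c: (d, 108) projection, K: (10, 108) prototypes.
    Returns (a, C_t) with a in R^10 and C_t in R^108.
    """
    q = f_t @ W_c                                # query in the 108-dim semantic space
    a = softmax(q @ K.T / np.sqrt(K.shape[1]))   # similarity to each of the 10 categories
    C_t = a @ K                                  # weighted combination of prototypes
    return a, C_t

rng = np.random.default_rng(0)
d = 768                                          # assumed feature width (not given in the paper)
f_t = rng.normal(size=d)
W_c = rng.normal(size=(d, 108)) * 0.01
K = rng.normal(size=(10, 108))
a, C_t = retrieve_cultural_feature(f_t, W_c, K)
```

The attention weights `a` form a distribution over the 10 dance categories, and `C_t` lies in the span of the rows of `K`.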

The optimization uses a composite loss function, as shown in Equation (6):

$$L = L_{CE} + \lambda \|K\|_{F}^{2}, \tag{6}$$

where $L_{CE}$ denotes the cross-entropy classification loss that supervises cultural category prediction, and $\lambda = 0.01$ controls the strength of the Frobenius-norm regularization on $K$ to avoid overfitting the cultural prototypes.
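
As a rough illustration of Equation (6), the sketch below evaluates the loss for a single sample. The function name `zdcm_loss` and the single-sample form are assumptions; a real training loop would average the cross-entropy over a batch.

```python
import numpy as np

def zdcm_loss(logits, target, K, lam=0.01):
    """Cross-entropy for one sample plus lam * ||K||_F^2, following Eq. (6)."""
    z = logits - logits.max()                  # shift for numerical stability
    log_probs = z - np.log(np.exp(z).sum())    # log-softmax over the categories
    ce = -log_probs[target]                    # cross-entropy for the true category
    reg = lam * np.sum(K ** 2)                 # squared Frobenius norm of the prototypes
    return ce + reg

# Uniform logits over 10 categories and an all-ones prototype matrix:
# the loss is log(10) + 0.01 * 1080.
example = zdcm_loss(np.zeros(10), 3, np.ones((10, 108)))
```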

The ZDCM module is designed to enable cultural feature recognition and alignment under zero-manual-label supervision. In practical experiments, we use the module for structured annotation of Helou Dance video data, which significantly reduces annotation cost, and we suppress noisy labels via a confidence filtering mechanism ($\tau = 0.85$, determined by the grid-search procedure described in Appendix A). Results show that the module achieves an average F1 score of 0.85 on the automated cultural feature recognition task for Helou Dance, verifying its high stability and accuracy in low-annotation settings.
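
One plausible reading of the confidence filter is sketched below. It assumes that confidence means the maximum class probability of a pseudo label, which the text does not state explicitly; the function name and the example probabilities are illustrative.

```python
def filter_by_confidence(probs, tau=0.85):
    """Keep samples whose highest class probability reaches tau.

    probs: list of per-sample probability lists.
    Returns (kept_indices, pseudo_labels).
    """
    keep, labels = [], []
    for i, p in enumerate(probs):
        best = max(range(len(p)), key=p.__getitem__)  # index of the most likely class
        if p[best] >= tau:                            # assumed confidence criterion
            keep.append(i)
            labels.append(best)
    return keep, labels

probs = [[0.90, 0.05, 0.05],   # confident, kept
         [0.50, 0.30, 0.20],   # ambiguous, dropped
         [0.10, 0.88, 0.02]]   # confident, kept
kept, pseudo = filter_by_confidence(probs)
```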

3.2. Cultural Attention Mechanism

Conventional multi-head attention mechanisms show pronounced shortcomings in cultural feature modeling. The defect arises mainly from the decoupling between cultural semantics and motion topology, which limits the fidelity of cultural action generation.

To this end, we design a Culture-Aware Attention Mechanism (CAAM) and embed it in the third layer of the spatiotemporal encoder in Figure 3. Its formal definition is given in Equation (7): it receives motion features $F \in \mathbb{R}^{T \times d}$ and a cultural vector $C$, and outputs features recalibrated by cultural weights, $\tilde{F}$:

$$\tilde{F} = \Gamma(F, C), \tag{7}$$

where $\Gamma(\cdot)$ denotes a culture-aware mapping function that models the culture–motion interaction.

The CAAM module contains two core units: a Cultural Gate and culture–motion co-attention. The Cultural Gate performs semantic injection via Sigmoid activation:

$$Q' = Q + \sigma(W_{g} [Q \oplus C]), \tag{8}$$

and the co-attention establishes cross-modal cultural associations:

$$\mathrm{Attn} = \mathrm{softmax}\left(\frac{Q' K^{\top}}{\sqrt{d_{k}}} + M_{cult}\right) V, \tag{9}$$

where the cultural similarity matrix is defined as follows:

$$M_{cult}^{i, j} = \frac{C_{i} \cdot C_{j}}{\|C_{i}\| \, \|C_{j}\|}. \tag{10}$$

In Helou Dance generation experiments, the CAAM module markedly enhances response regions for costume-and-prop features such as the lotus coronet and ox-head staff, as well as ritual actions such as the stepping dance and circumferential procession.
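
Equations (8) to (10) can be traced in a small single-head NumPy sketch. Several details are assumptions rather than statements from the paper: one cultural vector per frame, a gate matrix `W_g` of shape `(d_k + 108, d_k)`, non-zero cultural vectors, and random toy inputs.

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=axis, keepdims=True)

def caam_attention(Q, K_att, V, C, W_g):
    """Single-head culture-aware attention.

    Q, K_att, V: (T, d_k); C: (T, c) per-frame cultural vectors; W_g: (d_k + c, d_k).
    """
    gate = 1.0 / (1.0 + np.exp(-(np.concatenate([Q, C], axis=-1) @ W_g)))
    Q_prime = Q + gate                                      # Eq. (8): cultural gate
    C_unit = C / np.linalg.norm(C, axis=-1, keepdims=True)
    M_cult = C_unit @ C_unit.T                              # Eq. (10): cosine similarity
    scores = Q_prime @ K_att.T / np.sqrt(Q.shape[-1]) + M_cult
    weights = softmax(scores, axis=-1)                      # Eq. (9)
    return weights @ V, weights

rng = np.random.default_rng(0)
T, d_k, c = 6, 16, 108
Q, K_att, V = (rng.normal(size=(T, d_k)) for _ in range(3))
C = rng.normal(size=(T, c))
W_g = rng.normal(size=(d_k + c, d_k)) * 0.1
out, weights = caam_attention(Q, K_att, V, C, W_g)
```

Because `M_cult` is added before the softmax, frame pairs with similar cultural vectors receive a larger share of attention, all else being equal.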

3.3. Tri-Modal Alignment Network

Existing dance generation methods typically focus on the motion–music modalities while ignoring cultural semantic modulation, which leads to insufficient cultural coherence in generated dances. To address this issue, we propose a Tri-Modal Alignment Network (TMA-Net), embedded in the cross-modal fusion layer of Figure 3. It is defined as follows:

$$F_{align} = \Omega(F_{m}, F_{a}, C), \tag{11}$$

and comprises two parts: dynamic modal projection and culture-conditioned contrastive learning. The dynamic projection generates culture-aware embeddings:

$$E_{i} = \tanh(W_{i} [F_{i} \oplus C]), \quad i \in \{m, a\}, \tag{12}$$

and the contrastive loss constrains cross-modal distances:

$$L_{align} = -\log \frac{\exp(s(E_{m}, E_{a}) / \gamma)}{\sum_{k=1}^{K} \exp(s(E_{m}, E_{a}^{k}) / \gamma)}, \tag{13}$$

where $s(\cdot)$ denotes cosine similarity and $\gamma$ is the temperature coefficient.
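
A batch form of the contrastive loss in Equation (13) might look like the following sketch. It assumes that the candidate set consists of all audio embeddings in the same batch and uses a temperature of 0.07; neither choice is given in the paper.

```python
import numpy as np

def alignment_loss(E_m, E_a, gamma=0.07):
    """InfoNCE-style loss: row i of E_m should match row i of E_a.

    E_m, E_a: (B, d) motion and audio embeddings. gamma is an assumed temperature.
    """
    E_m = E_m / np.linalg.norm(E_m, axis=1, keepdims=True)
    E_a = E_a / np.linalg.norm(E_a, axis=1, keepdims=True)
    sim = E_m @ E_a.T / gamma                          # cosine similarity / temperature
    sim = sim - sim.max(axis=1, keepdims=True)         # stabilize the log-sum-exp
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))          # positives sit on the diagonal

rng = np.random.default_rng(0)
E = rng.normal(size=(8, 16))
loss_aligned = alignment_loss(E, E)                          # matched pairs
loss_shuffled = alignment_loss(E, np.roll(E, 1, axis=0))     # mismatched pairs
```

Matched motion and audio embeddings yield a much smaller loss than shuffled pairs, which is the behaviour the contrastive term is meant to encourage.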

3.4. Weakly Supervised Optimization and Ablation Design

Diffusion-based dance generation models (e.g., MoFusion [15]) typically rely on full tri-modal supervision, and optimization becomes unstable when cultural labels are sparse. Similarly, Bailando [14] exhibits mode collapse under few-shot settings. To address this, we introduce a Weakly Supervised Optimization Module (WSOM) integrated at the end of the training framework. The total objective is defined as follows:

$$L_{total} = \Lambda(\hat{M}, C), \tag{14}$$

and the module comprises a cultural adversarial loss and a consistency constraint:

$$L_{adv} = \mathbb{E}[\log D_{c}(M, C)] + \mathbb{E}[\log(1 - D_{c}(G(z), C))], \tag{15}$$

$$L_{consist} = \|\Phi_{ViT}(G(z)) - C\|_{2}. \tag{16}$$

The progressive curriculum weight scheduling is defined as follows:

$$\lambda_{t} = \begin{cases} 0.1, & t < 20\text{k}, \\ 0.5, & 20\text{k} \leq t < 50\text{k}, \\ 1.0, & t \geq 50\text{k}. \end{cases} \tag{17}$$
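
The schedule in Equation (17) is a simple step function of the training iteration. The sketch below only reproduces the schedule; how $\lambda_{t}$ enters the total objective is not spelled out in the text, so the function does not attempt to combine the losses.

```python
def curriculum_weight(step):
    """Piecewise-constant weight from Eq. (17); `step` is the training iteration."""
    if step < 20_000:
        return 0.1
    if step < 50_000:
        return 0.5
    return 1.0

# Weights at a few representative iterations.
schedule = [curriculum_weight(s) for s in (0, 19_999, 20_000, 49_999, 50_000)]
```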

On Helou Dance data, this module effectively mitigates training oscillations caused by sparse cultural labels. Experiments show that adding WSOM reduces FID from 0.71 to 0.65, validating the effectiveness of weak supervision in improving generation stability and cultural consistency. Ablation studies further indicate that, when combined with the Culture-Aware Attention Mechanism (CAAM), WSOM markedly reduces spurious cultural patterns and yields results with stronger ritual expressivity and stylistic consistency.

4. Experimental Setup

To validate the effectiveness of CAFE-Dance, we conduct systematic experiments on an in-house ethnic dance skeleton dataset and compare against multiple baselines and ablation configurations. This section introduces, in sequence, the experimental dataset, the model training environment, the hyperparameters, and the evaluation metrics.

4.1. Datasets

We construct the Helouwu ethnic dance dataset from publicly available online video resources. To enable systematic analysis, we collect 170 raw performance clips totaling approximately 7.5 h, with an average duration of about 2.5 min per video. The videos primarily originate from intangible cultural heritage performances, archival materials released by local cultural institutions, and documentary footage of folk cultural activities (excluding professionally studio-recorded content). Before use, all videos undergo manual quality screening to ensure visual clarity, performance completeness, and content relevance.

Regarding data use, we strictly adhere to academic ethical standards. All videos are sourced from public platforms or institutional websites that permit non-commercial public use and are restricted to this academic research. To protect individual rights, we anonymize video content involving identifiable individuals and exclude materials explicitly marked as non-redistributable or requiring additional authorization.

Based on ZDCM (Equation (1)), we construct the Helouwu ethnic dance dataset. Starting from 170 raw Helouwu clips (about 7.5 h in total), we first resample each video to 30 Hz and apply a fixed-length temporal windowing strategy. Concretely, for each dance video stream $V$, we extract the motion sequence $M \in \mathbb{R}^{T \times 132}$ (33 keypoints × 4 values: $x, y, z$ and visibility) with a sampling interval $\Delta t = 1/30$ s. We then segment each clip into approximately 5 s windows (corresponding to $T = 150$ frames at 30 Hz) with mild overlap, followed by denoising and duration normalization; segments shorter than 5 s are zero-padded and longer ones are center-cropped. In parallel, we extract audio features $A$ (22,050 Hz sampling) for the same temporal windows and the video spatiotemporal representation $F_{vis} = \Phi_{ViT}(V)$ (Equation (2)). Then, following the cross-modal semantic alignment strategy in Figure 3 (Equation (3)), we generate the cultural label vector $C$ from the knowledge base $K$, which encodes 108 cultural attributes (costume, props, rituals, etc.), and filter low-confidence samples with the threshold $\tau = 0.85$. After temporal windowing and confidence filtering, we obtain approximately 2400 motion–audio–culture segments, corresponding to roughly $4.2 \times 10^{5}$ effective video frames and covering the fine-grained cultural elements of Helou Dance. We empirically compared several candidate sampling rates (15, 25, 30, 45, 60 Hz) and window lengths (2, 3, 5, 8, 10 s); 30 Hz and a 5 s window achieved the best trade-off between accuracy, semantic completeness, and efficiency (see Appendix B for a detailed ablation).
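
The fixed-length windowing described above could be implemented along the following lines. The stride of 120 frames (a 20% overlap) and padding at the end of a segment are assumptions; the text only mentions "mild overlap" and zero-padding without giving specifics.

```python
import numpy as np

def fit_length(seg, T=150):
    """Center-crop long segments and zero-pad short ones (at the end) to T frames."""
    n = seg.shape[0]
    if n >= T:
        start = (n - T) // 2
        return seg[start:start + T]
    pad = np.zeros((T - n,) + seg.shape[1:], dtype=seg.dtype)
    return np.concatenate([seg, pad], axis=0)

def sliding_windows(motion, T=150, stride=120):
    """Cut a (frames, 132) sequence into T-frame windows; stride < T gives overlap."""
    starts = range(0, max(motion.shape[0] - T, 0) + 1, stride)
    return np.stack([fit_length(motion[s:s + T], T) for s in starts])

long_clip = np.ones((400, 132))     # about 13 s at 30 Hz
short_clip = np.ones((100, 132))    # shorter than one window
windows_long = sliding_windows(long_clip)
windows_short = sliding_windows(short_clip)
```

With these settings a 400-frame clip yields three overlapping windows, while a 100-frame clip yields one window whose last 50 frames are zeros.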

4.2. Implementation Details

Experiments in this paper use the PyTorch 2.5.1 library and run on a high-performance server with the Ubuntu 22.04 operating system, equipped with an 18-core AMD EPYC 9754 processor (Advanced Micro Devices, Inc., Santa Clara, CA, USA) and three NVIDIA GeForce RTX 5090 GPUs (NVIDIA Corporation, Santa Clara, CA, USA).

The hyperparameters used in our experiments are listed in Table 1; unless otherwise specified, all comparisons and ablations keep identical data splits and training settings. Training uses the Adam optimizer and a StepLR learning-rate scheduler; the single-label multi-class classification task adopts the cross-entropy loss. We evaluate on the validation set every epoch and report final results on the test set.

During training, we further apply a light-weight data augmentation pipeline to improve robustness while preserving the semantic content of Helou Dance movements. For each 5 s motion–audio clip, we randomly sample (i) temporal jitter by speed scaling factors $\{0.8, 1.0, 1.2\}$, (ii) small spatial perturbations on the 3D keypoints via in-plane rotations within $\pm 15^{\circ}$ and isotropic scale jitter in $[0.9, 1.1]$, (iii) brightness and contrast jitter of up to $\pm 10\%$ on the raw video frames used by ZDCM, and (iv) low-variance Gaussian noise ($\sigma = 0.01$) on joint coordinates. In a pilot study with dance experts, these augmentations were confirmed to maintain the semantic correctness of the original Helou Dance actions. Notably, to avoid interference across configurations, all models use the same random split and seed and are compared under an identical training schedule and identical stopping criteria.
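
The spatial parts of this pipeline, items (ii) and (iv), can be sketched as follows. The layout `(T, 33, 4)`, rotation about the coordinate origin, and leaving the visibility channel untouched are assumptions made for illustration; temporal and photometric jitter are omitted.

```python
import numpy as np

def augment_keypoints(M, rng, max_deg=15.0, scale_range=(0.9, 1.1), sigma=0.01):
    """M: (T, 33, 4) array of x, y, z, visibility. Returns an augmented copy."""
    out = M.copy()
    theta = np.deg2rad(rng.uniform(-max_deg, max_deg))
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    out[..., :2] = out[..., :2] @ R.T                  # in-plane rotation of (x, y)
    out[..., :3] *= rng.uniform(*scale_range)          # isotropic scale jitter
    out[..., :3] += rng.normal(0.0, sigma, size=out[..., :3].shape)  # Gaussian noise
    return out                                         # visibility channel is unchanged

rng = np.random.default_rng(0)
M = rng.uniform(size=(150, 33, 4))
aug = augment_keypoints(M, rng)
```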

4.3. Evaluation Metrics

To comprehensively assess the overall performance of the CAFE-Dance framework on ethnic dance generation, we construct a systematic evaluation metric suite across three dimensions (technical, cultural, and overall quality) to ensure that motion naturalness, cultural consistency, and overall generation quality are all quantitatively verified.

At the technical dimension, we use Fréchet Inception Distance (FID), Beat Alignment Score (BAS), and Physical Plausibility (PP). FID measures the similarity between generated and real dances in feature distributions, defined as follows:

$$\mathrm{FID} = \|\mu_{r} - \mu_{g}\|^{2} + \mathrm{Tr}\left(\Sigma_{r} + \Sigma_{g} - 2 (\Sigma_{r} \Sigma_{g})^{1/2}\right). \tag{18}$$

Here, $(\mu_{r}, \Sigma_{r})$ and $(\mu_{g}, \Sigma_{g})$ denote the mean and covariance of the feature distributions for real and generated samples, respectively. A smaller FID indicates that the generated results are closer to the real distribution. BAS evaluates the synchronization between dance motion sequences and musical beats, defined as follows:

$$\mathrm{BAS} = \frac{1}{N} \sum_{t=1}^{N} \delta(|b_{t} - m_{t}| < \epsilon). \tag{19}$$

Here, $b_{t}$ and $m_{t}$ denote the temporal positions of motion and musical beats, and $\epsilon$ is the allowed synchronization tolerance. PP measures the physical plausibility of generated skeletal motions via the smoothness of joint angular velocity changes:

$$\mathrm{PP} = 1 - \frac{1}{N - 1} \sum_{t=2}^{N} |\dot{\theta}_{t} - \dot{\theta}_{t-1}|. \tag{20}$$

Here, $\dot{\theta}_{t}$ denotes the joint angular velocity at frame $t$. A higher PP indicates that the motion sequence better conforms to human kinematics.
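
BAS and PP follow directly from Equations (19) and (20), as the short sketch below shows. It assumes that motion and musical beats are already paired one-to-one, and it uses a tolerance of 0.1 s, which is an illustrative value not taken from the paper.

```python
def beat_alignment_score(motion_beats, music_beats, eps=0.1):
    """Eq. (19): fraction of paired beats closer than eps seconds (eps is assumed)."""
    hits = sum(abs(b - m) < eps for b, m in zip(motion_beats, music_beats))
    return hits / len(motion_beats)

def physical_plausibility(ang_vel):
    """Eq. (20): one minus the mean absolute change in joint angular velocity."""
    diffs = [abs(ang_vel[t] - ang_vel[t - 1]) for t in range(1, len(ang_vel))]
    return 1.0 - sum(diffs) / len(diffs)

# Illustrative values, not data from the paper.
bas = beat_alignment_score([0.5, 1.0, 1.52, 2.3], [0.5, 1.0, 1.5, 2.0])
pp = physical_plausibility([0.0, 0.1, 0.1, 0.3])
```

In the example, three of the four beat pairs fall within the tolerance, and the mean velocity change is 0.1.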

At the cultural dimension, we introduce two metrics, Cultural Feature Accuracy (CFA) and Style Consistency (SC), to quantify the capacity for cultural feature expression in generated dances. CFA measures the proportion of cultural semantic features present in generated dances that are correctly identified by a cultural recognition model:

$$\mathrm{CFA} = \frac{\text{number of correctly identified samples}}{\text{total number of samples}}. \tag{21}$$

This metric reflects the model’s accuracy in capturing cultural features. SC measures the similarity between generated motions and the target cultural style in the embedding space, defined as follows:

$$\mathrm{SC} = \frac{1}{N} \sum_{i=1}^{N} \cos(f_{i}, f_{style}). \tag{22}$$

Here, $f_{i}$ denotes the feature representation of a generated sample, and $f_{style}$ denotes the mean embedding vector of the target cultural style. A higher SC indicates stronger style consistency in the generated dance.
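
Both cultural metrics reduce to a few lines of NumPy. The function names and toy inputs below are illustrative; in practice the predictions would come from the cultural recognition model and the features from its embedding space.

```python
import numpy as np

def cultural_feature_accuracy(pred, target):
    """Eq. (21): share of samples whose predicted cultural label is correct."""
    pred, target = np.asarray(pred), np.asarray(target)
    return float(np.mean(pred == target))

def style_consistency(features, f_style):
    """Eq. (22): mean cosine similarity between sample features and the style mean."""
    features = np.asarray(features, dtype=float)
    f_style = np.asarray(f_style, dtype=float)
    cos = features @ f_style / (np.linalg.norm(features, axis=1) * np.linalg.norm(f_style))
    return float(np.mean(cos))

# Illustrative values, not data from the paper.
cfa = cultural_feature_accuracy([1, 2, 3, 3], [1, 2, 0, 3])
sc = style_consistency([[1.0, 0.0], [0.0, 1.0]], [1.0, 0.0])
```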

At the overall-quality dimension, we design an Overall Quality Score (OQS)to comprehensively evaluate the overall performance of generated dances. OQS jointly considers the weighted contributions of FID, BAS, and CFA, defined as follows:

OQS = α (1 - {FID}_{norm}) + β \cdot BAS + γ \cdot CFA,

(23)

where α, β, and γ are the weighting coefficients, empirically set to (0.4, 0.3, 0.3). A higher OQS indicates a better balance between generation quality and cultural feature expression. For OQS, we first normalize FID to [0, 1] by min–max scaling over all compared methods, and then use (1 - {FID}_{norm}) in the computation.
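The OQS computation in Equation (23), including the min–max normalization of FID over the compared methods, can be sketched as follows. The FID, BAS, and CFA values in the example are made-up placeholders chosen for easy arithmetic and are not results from this paper; the function names are likewise illustrative.

```python
from typing import List, Sequence


def min_max_normalize(values: Sequence[float]) -> List[float]:
    """Scale values to [0, 1]; a constant input maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def overall_quality_score(fid_norm: float, bas: float, cfa: float,
                          alpha: float = 0.4, beta: float = 0.3,
                          gamma: float = 0.3) -> float:
    """Eq. (23): alpha * (1 - FID_norm) + beta * BAS + gamma * CFA."""
    return alpha * (1.0 - fid_norm) + beta * bas + gamma * cfa


if __name__ == "__main__":
    # Placeholder FID values for three hypothetical methods.
    fids = [0.6, 0.8, 1.0]
    fid_norms = min_max_normalize(fids)
    # Placeholder (BAS, CFA) pairs for the same three methods.
    scores = [(0.9, 0.8), (0.8, 0.6), (0.7, 0.5)]
    for fid_norm, (bas, cfa) in zip(fid_norms, scores):
        print(round(overall_quality_score(fid_norm, bas, cfa), 3))
```

Note that min–max scaling makes each method's OQS depend on which other methods are included in the comparison, since the best and worst FID define the scale.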

We compare CAFE-Dance against four representative mainstream dance generation models: Bailando [14], EDGE [13], Dancing to Music [25], and Music2Dance [26]. Among these, Bailando adopts a latent-space mapping strategy based on pose sequences to match dance and music features; EDGE proposes a cross-modal rhythm–motion alignment network that effectively enhances rhythmic synchronization; and Dancing to Music introduces a dual-stream temporal attention mechanism to improve motion fluency and expressivity. By contrast, our CAFE-Dance framework further integrates a cultural attention mechanism and the ZDCM, achieving notable gains in cultural semantic interpretability and cross-modal consistency of motion generation.

5. Experimental Results

Figure 4 presents the CAFE-Dance-generated signature Helou Dance actions “trembling step” and “bell-ringing.” To comprehensively evaluate the proposed CAFE-Dance method on music-driven dance generation, this section designs a systematic validation protocol comprising four parts: first, quantitative and qualitative comparative analysis against mainstream methods to verify overall advantages; second, ablation studies to probe the contributions of key modules; third, evaluation of the ZDCM module for automatic cultural feature recognition; and finally, expert subjective evaluation to assess cultural authenticity and aesthetic quality. All experiments are conducted on the unified Helouwu ethnic dance dataset to ensure comparability and reliability.

5.1. Comparison and Analysis with Existing Methods

This study evaluates the effectiveness of the proposed CAFE-Dance method for the music-driven dance generation task from both quantitative and qualitative perspectives.

5.1.1. Quantitative Results Analysis

As shown in Table 2, this paper compares CAFE-Dance with four mainstream music-driven dance generation methods. All methods are run independently five times on the same test set, and results are reported as mean ± standard deviation. The evaluation metrics include generative quality (FID↓), beat alignment (Beat Score↑), cultural consistency (Cultural Acc.↑), motion smoothness (Smoothness↑), and overall score (Overall↑).

CAFE-Dance achieves the best results on all five metrics, with standard deviations maintained within 0.01–0.03, indicating good stability. For generative quality, CAFE-Dance attains an FID of 0.65 ± 0.01, a relative reduction of about 14.5% compared with the best baseline, Music2Dance-Baseline (0.76 ± 0.03), indicating that the generated distribution is closer to real samples. For beat alignment, the Beat Score reaches 0.91 ± 0.02, an improvement of about 4.6% over Music2Dance-Baseline (0.87 ± 0.02), suggesting more effective modeling of music–dance synchronization.

For cultural consistency, Cultural Acc. increases to 0.83 ± 0.02, an absolute gain of 0.31 and a relative gain of about 59.6% over Music2Dance-Baseline (0.52 ± 0.04), validating the effectiveness of the cultural knowledge modeling module. For motion smoothness and the overall score, CAFE-Dance reaches 0.92 ± 0.01 and 0.83 ± 0.02, representing relative improvements of about 4.5% and 6.4% over the respective best baselines: Bailando (0.88 ± 0.03) for smoothness and Music2Dance-Baseline (0.78 ± 0.02) for overall. Paired t-tests show that these performance differences relative to the baselines are statistically significant (p < 0.01), with large effect sizes (Cohen’s d > 0.8), indicating the robustness of CAFE-Dance’s advantage.
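The relative changes quoted above follow directly from the reported means. The short sketch below recomputes them; the means are copied from the text, while the dictionary layout and function name are illustrative.

```python
def relative_change(value: float, baseline: float) -> float:
    """Signed relative change of value with respect to baseline, in percent."""
    return (value - baseline) / baseline * 100.0


# (CAFE-Dance mean, best-baseline mean) as reported in the text.
REPORTED = {
    "FID": (0.65, 0.76),
    "Beat Score": (0.91, 0.87),
    "Cultural Acc.": (0.83, 0.52),
    "Smoothness": (0.92, 0.88),
    "Overall": (0.83, 0.78),
}

if __name__ == "__main__":
    for name, (ours, baseline) in REPORTED.items():
        print(f"{name}: {relative_change(ours, baseline):+.1f}%")
```

The negative sign for FID reflects that lower is better on that metric, so a change of about -14.5% corresponds to the stated 14.5% reduction.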

Furthermore, to analyze performance dynamics during training, we plot the validation accuracy trend in Figure 5. The figure shows that CAFE-Dance not only converges faster but also reaches a final validation accuracy of 0.811, clearly higher than Bailando and EDGE, further confirming advantages in learning efficiency and generalization. In addition to these aggregate metrics, we further inspect typical failure cases where CAFE-Dance still struggles, such as severe occlusions and complex multi-person formations. A representative breakdown in a dense multi-person Helou Dance scene is visualized in Figure A1, and the corresponding failure categories and complexity-dependent statistics are summarized in Table A4 and Table A5 in Appendix C. These analyses provide a complementary view of the system’s limitations alongside its strengths.

5.1.2. Qualitative Result Analysis

This part aims to verify, via visual comparison, the superiority of CAFE-Dance in visual realism, cultural expressiveness, and rhythmic consistency of generated dance motions.

We select representative music segments and generate dance sequences with CAFE-Dance, Bailando, EDGE, and Dancing to Music for visual comparison. Figure 6 shows a multi-dimensional performance radar chart that intuitively presents each method’s performance across different dimensions.

From Figure 6, CAFE-Dance stands out especially on Cultural Authenticity and Beat Alignment, forming a clear region of advantage. For Diversity and Smoothness, CAFE-Dance also maintains a high level, indicating that it preserves motion diversity without sacrificing coherence and naturalness. In contrast, Bailando and EDGE show evident weaknesses in Cultural Authenticity and Beat Alignment, with particularly poor performance on Cultural Authenticity.

Overall, CAFE-Dance not only leads comprehensively on quantitative metrics but also demonstrates stronger capability in visual quality and cultural expression, validating its effectiveness and practicality for music-driven dance generation.

5.2. Ablation Experiment

This experiment systematically verifies the contribution of each key module in the proposed CAFE-Dance framework to overall performance and investigates the mechanisms by which different components affect cultural expression, rhythmic synchronization, and visual fidelity. Specifically, the goals are to validate the effectiveness of the Culture-Aware Attention Mechanism (CAAM) in improving the accuracy of cultural feature recognition and cross-modal semantic consistency; to assess the impact of the Tri-Modal Alignment Network (TMA-Net) on Beat Alignment; and to analyze how the Weakly Supervised Optimization Module (WSOM) and the Cultural Prior constraint improve the stability of the generative distribution (FID) and the overall score.

The ablation study is conducted on the Helouwu ethnic dance dataset. All experiments use the same training–validation split to ensure comparability and fairness. The evaluation metrics include Cultural Accuracy, Beat Score, FID, and overall score. Each metric is averaged over three runs under identical hyperparameters. All ablation models start from the full framework and remove or replace a single module in turn to objectively evaluate the performance contribution of the corresponding component.

As shown in Figure 7, the full model achieves the best performance on all metrics: Cultural Accuracy reaches 0.83, Beat Score is 0.91, FID drops to 0.65, and the overall score (Overall) is 0.83. When the Culture-Aware Attention Mechanism is removed (w/o CAAM), the cultural feature recognition rate falls markedly to 0.52 (a 37.3% decrease), and the overall performance declines to 0.70, validating the module’s key role in modeling cross-modal semantic relations and cultural weighting. When TMA-Net is removed, rhythmic synchronization decreases significantly (Beat Score from 0.91 to 0.79), indicating that the temporal alignment network plays a central role in maintaining music–dance correspondence. After removing WSOM, FID rises from 0.65 to 0.71, showing that weak-supervision regularization effectively stabilizes training and suppresses spurious cultural patterns. Under the removal of the Cultural Prior, Cultural Accuracy declines to 0.48, and the overall score drops to 0.69, highlighting the importance of cultural constraints in guiding the learning direction of the feature space.

From the overall trend, the full model achieves the best performance on both cultural and rhythmic dimensions (CA = 0.83, BS = 0.91), and its overall score (0.83) is about 3.7% higher than that of the strongest ablated variant (w/o WSOM, Overall = 0.80). When CAAM or the Cultural Prior is absent, the model’s cultural expressiveness drops markedly, indicating that cultural modeling is the primary driver of system performance gains. As shown in Figure 7, the right-to-left performance shift (FID → Cultural Accuracy) reflects a strong coupled dependency between the cultural and generative modalities. Further observation shows that using CAAM in conjunction with WSOM effectively reduces feature distribution shift, enabling a better balance between cultural expression and physical plausibility.

In summary, the ablation results demonstrate the complementary roles of the components in CAFE-Dance with respect to cultural awareness, temporal alignment, and generation stability, validating the effectiveness and necessity of the proposed multimodal culture-aware mechanism and providing a structural basis for subsequent cultural semantic modeling.

5.3. Automatic Cultural Annotation Performance Evaluation

This experiment aims to verify the effectiveness of the proposed ZDCM module in automatically recognizing cultural features of ethnic dance. As an essential component of the CAFE-Dance framework, ZDCM is responsible for learning cultural semantic features—such as costume elements, ritual actions, Music–Dance correlation, and formation choreography—under zero-manual-label conditions. The core goals are: (1) to validate the accuracy and stability of ZDCM on multi-class cultural feature recognition; (2) to evaluate its agreement with expert annotations in unlabeled settings; (3) to investigate recognition disparities across cultural feature categories and identify potential room for improvement.

We use Precision, Recall, and F1-Score to quantitatively evaluate recognition performance for each cultural feature. All experiments follow a unified network configuration and training hyperparameters to ensure reproducibility and fairness.
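For reference, per-category Precision, Recall, and F1-Score, together with their unweighted (macro) mean, can be computed from confusion counts as in the sketch below. The counts are hypothetical, and macro averaging is an assumption about how the overall means are formed; the text does not state the averaging scheme.

```python
from typing import Sequence, Tuple

Metrics = Tuple[float, float, float]


def precision_recall_f1(tp: int, fp: int, fn: int) -> Metrics:
    """Precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2.0 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def macro_average(per_class: Sequence[Metrics]) -> Metrics:
    """Unweighted mean of each metric over categories."""
    n = len(per_class)
    return (sum(m[0] for m in per_class) / n,
            sum(m[1] for m in per_class) / n,
            sum(m[2] for m in per_class) / n)


if __name__ == "__main__":
    # Hypothetical (tp, fp, fn) counts for two feature categories.
    counts = {"costume": (90, 10, 10), "formation": (60, 20, 20)}
    per_class = [precision_recall_f1(*c) for c in counts.values()]
    print(per_class)
    print(macro_average(per_class))
```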

As shown in Figure 8, ZDCM achieves strong recognition performance across four cultural feature categories. In the Costume Elements dimension, it attains the highest scores (Precision = 0.92, F1 = 0.90); its F1 is about 5.9% above the mean F1 of 0.85, indicating stable capture of costume-related visual semantics. For Music–Dance Correlation, the model reaches an F1 of 0.86, reflecting an advantage in modeling cross-modal relations between rhythm and motion. Overall, the mean Precision, Recall, and F1 for ZDCM are 0.87, 0.84, and 0.85, respectively, validating its high robustness and generalization ability under zero-manual-label supervision.

Notably, the model performs slightly lower on the Formation Patterns dimension (F1 = 0.81), mainly because formation changes span long temporal windows and exhibit large inter-sample variation, making global structural cues vulnerable to occlusion and pose ambiguity. The overall trend shows consistently strong performance on low-level visual semantics (e.g., costume and ritual actions) and on mid-level cross-modal features (e.g., Music–Dance rhythmic coupling), indicating that ZDCM effectively learns cultural semantic representations in the absence of manual annotations.

In summary, the automatic cultural annotation results substantiate the reliability and practicality of ZDCM for cultural feature recognition, with performance approaching expert-annotation levels. It markedly reduces labor cost and provides a solid data foundation for subsequent cultural feature modeling and cross-cultural dance generation.

5.4. Computational Complexity Analysis

This experiment systematically evaluates the computational efficiency of CAFE-Dance, focusing on practical deployment metrics such as inference latency, resource footprint, and throughput. Under a unified hardware setup, we compare against Bailando, EDGE, Dancing to Music, and Music2Dance-Baseline, recording five key indicators: single-clip inference time, memory footprint, parameter count, FLOPs, and throughput.

As shown in Figure 9, CAFE-Dance performs best on the runtime efficiency metrics (inference time, throughput, and memory footprint). Specifically, the single-clip inference time is 28.6 ms, a 26.5% reduction relative to the best baseline, EDGE (38.9 ms); the throughput reaches 35 clips/s, a 36.2% improvement over EDGE (25.7 clips/s); and the memory footprint is just 156 MB, 17.5% lower than EDGE (189 MB). These results indicate that the model markedly improves inference efficiency while maintaining high performance.
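The reported throughput figures are consistent with processing one clip at a time, in which case throughput is simply the reciprocal of the single-clip latency. The sketch below checks this relationship and the quoted percentages; the sequential-processing assumption is ours and is not stated in the text.

```python
def throughput_from_latency(latency_ms: float) -> float:
    """Clips per second when clips are processed one at a time (assumption)."""
    return 1000.0 / latency_ms


def percent_change(value: float, reference: float) -> float:
    """Signed change of value relative to reference, in percent."""
    return (value - reference) / reference * 100.0


if __name__ == "__main__":
    # Latencies in milliseconds as reported for CAFE-Dance and EDGE.
    print(round(throughput_from_latency(28.6), 1))
    print(round(throughput_from_latency(38.9), 1))
    # Relative changes of CAFE-Dance with respect to EDGE.
    print(round(percent_change(28.6, 38.9), 1))   # inference time
    print(round(percent_change(35.0, 25.7), 1))   # throughput
    print(round(percent_change(156.0, 189.0), 1))  # memory footprint
```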

In terms of computational complexity, the parameter count (12.4 M, Figure 9) and FLOPs (3.1 G) of CAFE-Dance are slightly higher than those of the baselines, increasing by approximately 10.7% relative to EDGE. However, parameter count does not translate linearly to inference efficiency: Bailando, despite having the smallest parameter count (8.7 M), exhibits lower inference efficiency. This supports the view that CAFE-Dance achieves a better efficiency–performance trade-off through an optimized network architecture (e.g., parallel design and shortened temporal paths). We note that CAFE-Dance still faces challenges in extremely resource-constrained scenarios (e.g., embedded devices).

In summary, CAFE-Dance strikes a favorable balance between computational complexity and generation quality, providing an efficient solution for the practical deployment of music-driven dance generation.

5.5. Cross-Ethnic Generalization Experiment on Chinese Folk and Ethnic Dance

To validate the cross-cultural generalization of the CAFE-Dance framework across different ethnic dance types, we conduct rigorous statistical evaluations on four representative Chinese ethnic-dance datasets (Figure 10). All experiments adopt a fully cross-validated design: each dance type is tested independently five times, results are reported as mean ± standard deviation, and homogeneity-of-variance tests are performed to ensure statistical assumptions are met.

As shown in Table 3, we first conduct a one-way ANOVA, which reveals significant differences across dance types on all metrics (all p < 0.01), demonstrating that dance type has a statistically significant effect on generative performance.

Post hoc pairwise comparisons using Tukey HSD show that, for Cultural Accuracy, Helou Dance (0.83 ± 0.05) and Uyghur (0.82 ± 0.06) do not differ significantly (p = 0.087), but both are significantly higher than Tibetan (0.76 ± 0.08, p < 0.05). This pattern may reflect the unique religious-cultural elements and bodily expressions in Tibetan dance, which pose greater challenges to the model.

For Beat Score, Uyghur performs best (0.93 ± 0.04) and does not differ significantly from Mongolian (0.91 ± 0.05, p = 0.132), but is significantly higher than Tibetan (0.89 ± 0.05, p < 0.05). This suggests that while the model handles rhythmic complexity effectively, learning specific cultural rhythm patterns varies by dance type.

From the perspective of variability, FID exhibits the smallest standard deviations (0.01–0.03), indicating the highest stability in generative quality, whereas Cultural Accuracy shows relatively larger standard deviations (0.05–0.08), reflecting variability in learning cultural features across samples. This pattern is consistent across dance types, implying that capturing cultural expression is more challenging than lower-level visual features.

Notably, Levene’s tests for homogeneity of variances are non-significant across groups (p > 0.1), satisfying the ANOVA assumption and enhancing the reliability of the conclusions. Across the five runs, the standard deviations stay within the ranges reported above (at most 0.08, for Cultural Accuracy), indicating stable results.

In terms of effect size, the partial η^{2} for the effect of dance type on Cultural Accuracy is 0.42 (a large effect), indicating that dance-type differences explain 42% of the variance in Cultural Accuracy. This finding underscores the importance of accounting for dance-type characteristics in cross-cultural dance generation research.
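To make the effect-size statement concrete, the sketch below computes the one-way ANOVA F statistic and η^{2} from per-group samples using only the standard library. In a one-way design with a single factor, partial η^{2} coincides with η^{2} = SS_between / (SS_between + SS_within). The toy data are hypothetical and are not the per-run scores behind Table 3; the p-value is omitted because it requires the F distribution, which the standard library does not provide.

```python
from statistics import mean
from typing import Sequence, Tuple


def one_way_anova(groups: Sequence[Sequence[float]]) -> Tuple[float, float]:
    """Return (F statistic, eta squared) for a one-way between-groups design."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    # Variation of group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    eta_squared = ss_between / (ss_between + ss_within)
    return f_stat, eta_squared


if __name__ == "__main__":
    # Hypothetical scores for three groups (not data from the paper).
    toy_groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
    f_stat, eta_sq = one_way_anova(toy_groups)
    print(f"F = {f_stat:.2f}, eta^2 = {eta_sq:.4f}")
```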

Overall, the statistical results indicate that CAFE-Dance exhibits strong adaptability in cross-cultural generalization. Although there remains room to improve the fine-grained capture of certain cultural elements, the framework provides a reliable technical foundation for generating core ethnic-dance styles.

5.6. Parameter Sensitivity Analysis

This experiment systematically investigates the sensitivity of the proposed CAFE-Dance framework to hyperparameter configurations in the multi-objective optimization. Specifically, we analyze the impact of weight coefficients (α, β, γ) in the Overall Quality Score (OQS) formulation to validate the robustness of our weighting strategy.

The experiment is conducted on the Helouwu ethnic dance dataset with identical model architecture, training epochs, and optimizer settings across all trials. We systematically vary the loss weights while maintaining the constraint α + β + γ = 1.0. Four distinct weight configurations are evaluated, each representing a perturbation from the baseline (0.4, 0.3, 0.3) by adjusting one weight component by ±0.1 while proportionally adjusting the remaining weights. All results are averaged over five independent runs.
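One plausible reading of "proportionally adjusting the remaining weights" is to rescale the untouched coefficients by a common factor so that the three weights still sum to one. The sketch below implements that reading; it is an assumption, and the exact configurations evaluated in Table 4 may have been chosen differently.

```python
from typing import Tuple

Weights = Tuple[float, float, float]


def perturb_weights(weights: Weights, index: int, delta: float) -> Weights:
    """Shift one weight by delta and rescale the others so the sum stays 1."""
    new_value = weights[index] + delta
    if not 0.0 < new_value < 1.0:
        raise ValueError("perturbed weight must stay strictly between 0 and 1")
    others_sum = sum(w for i, w in enumerate(weights) if i != index)
    # Common factor that preserves the ratio between the remaining weights.
    scale = (1.0 - new_value) / others_sum
    a, b, c = (new_value if i == index else w * scale
               for i, w in enumerate(weights))
    return a, b, c


if __name__ == "__main__":
    baseline = (0.4, 0.3, 0.3)
    print(perturb_weights(baseline, 0, +0.1))  # raise alpha to 0.5
    print(perturb_weights(baseline, 2, +0.1))  # raise gamma to 0.4
```

Under this reading, raising α to 0.5 leaves β = γ = 0.25, and raising γ to 0.4 leaves α ≈ 0.343 and β ≈ 0.257.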

As shown in Table 4, the OQS demonstrates remarkable stability across all weight configurations, with a maximum deviation of only 0.01 from the baseline. This indicates that the overall performance is largely insensitive to moderate perturbations in the weight coefficients. The baseline configuration (0.4, 0.3, 0.3) achieves the optimal balance among the three evaluation metrics, yielding the lowest FID (0.65) while maintaining high Beat Score (0.91) and Cultural Accuracy (0.83).

When increasing α to 0.5 (emphasizing generation fidelity), we observe a marginal degradation in Cultural Accuracy (0.82 vs. 0.83), suggesting that excessive focus on distribution matching may slightly compromise cultural expressiveness. Conversely, when elevating γ to 0.4 (prioritizing cultural alignment), Cultural Accuracy improves to 0.84 at the cost of slightly increased FID (0.68 vs. 0.65) and reduced Beat Score (0.90 vs. 0.91). This reveals an inherent trade-off between cultural specificity and other quality dimensions.

The consistent performance across the tested weight configurations (each a ±0.1 perturbation of one coefficient around the baseline) underscores the robustness of our approach, which we attribute to the complementary nature of the Culture-Aware Attention Mechanism (CAAM) and the Tri-Modal Alignment Network (TMA-Net). These components enable the model to maintain cultural authenticity and rhythmic synchronization without over-reliance on any single loss term.

Statistical analysis confirms that the observed OQS variations are not significant (paired t-test, p > 0.05 across all comparisons), further validating the stability of our method. The optimal configuration (α, β, γ) = (0.4, 0.3, 0.3) demonstrates consistent performance in all five independent runs, with an OQS standard deviation of 0.008.

5.7. Expert Subjective Evaluation

This experiment aims to evaluate the subjective performance of the proposed CAFE-Dance framework in terms of the cultural authenticity, ritual atmosphere, and overall perceptual quality of generated dances, with a focus on interpretability and aesthetic consistency at the level of cultural expression.

The subjective evaluation is independently conducted by three dance experts, each with over ten years of experience in ethnic dance teaching and research. The assessment uses a five-point Likert scale ranging from 1 (very poor) to 5 (excellent). The experiment is based on generated samples from the Helouwu ethnic dance dataset. After viewing the real performance and the corresponding samples generated by the three methods (CAFE-Dance, Bailando, and EDGE), each expert assigns independent scores on three criteria. To ensure result consistency, we compute Cohen’s Kappa for inter-rater agreement (κ = 0.82), which indicates a high level of consistency among expert ratings. At this stage, however, the expert panel does not yet include Helou Dance practitioners or cultural preservation organizations from Guangdong Province; incorporating their participatory feedback will be an important direction for future work to further strengthen the cultural validity of the evaluation.
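Cohen’s Kappa is defined for a pair of raters, so with three experts one common option is to average the pairwise values. The sketch below illustrates that option with unweighted kappa on hypothetical Likert ratings; the text does not state how the three raters were combined or whether a weighted variant was used, so both choices here are assumptions.

```python
from collections import Counter
from itertools import combinations
from typing import Sequence


def cohens_kappa(r1: Sequence[int], r2: Sequence[int]) -> float:
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    if len(r1) != len(r2) or not r1:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(r1)
    observed = sum(1 for a, b in zip(r1, r2) if a == b) / n
    c1, c2 = Counter(r1), Counter(r2)
    # Agreement expected by chance from each rater's marginal frequencies.
    expected = sum(c1[k] * c2[k] for k in set(c1) | set(c2)) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1.0 - expected)


def mean_pairwise_kappa(raters: Sequence[Sequence[int]]) -> float:
    """Average Cohen's kappa over all rater pairs (one possible aggregation)."""
    pairs = list(combinations(raters, 2))
    return sum(cohens_kappa(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Hypothetical ratings from three experts on four samples.
    expert_a = [1, 1, 2, 2]
    expert_b = [1, 1, 2, 2]
    expert_c = [1, 1, 2, 1]
    print(mean_pairwise_kappa([expert_a, expert_b, expert_c]))
```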

As shown in Figure 11, CAFE-Dance significantly outperforms existing models on all three subjective metrics. Its Cultural Authenticity score is 4.2, representing improvements of 50% and 61.5% over Bailando (2.8) and EDGE (2.6), respectively; its Ritual Accuracy is 4.3, approaching the level of real performances (4.7), indicating that the generated dance closely matches real dancers in ritualized expression and postural coordination; its Overall Quality reaches 4.2, about a 35% gain over the best baseline, Bailando (3.1). These results substantiate the effectiveness of cultural modeling and cross-modal alignment in enhancing the perceptual quality of dance generation.

Notably, although CAFE-Dance performs close to real performances, there remains a slight gap in cultural atmosphere coherence. We attribute this mainly to slight rhythmic lag and insufficient pose smoothness during long-horizon motion transitions in some generated sequences, which causes the overall ritual atmosphere to fall slightly below that of real videos.

Overall, the subjective evaluation indicates that CAFE-Dance attains high-quality generation at the technical level and earns expert recognition for Cultural Authenticity and aesthetic perception. This result validates that the proposed framework achieves its design goal of combining interpretability and artistic expressivity in cultural dance generation.

6. Conclusions

This paper investigates the core challenges of generating ethnically authentic dance sequences under weak supervision. To address the bottlenecks of heavy reliance on manual annotation, difficulty in modeling cultural semantics, and multi-modal alignment, we propose a new culture-aware dance generation framework, CAFE-Dance, based on Helouwu. The framework systematically integrates zero-manual-label cultural data construction, a cultural attention mechanism, and a tri-modal alignment network, enabling automatic generation of dance sequences with high cultural fidelity, precise musical synchronization, and natural motion smoothness without costly motion capture or manual intervention. Quantitative experiments show that the design reduces FID to 0.65 and attains a subjective “ritual accuracy” score of 4.3 (close to 4.7 for real performances), significantly outperforming current mainstream methods.

Despite the strong results on Helou Dance, this study has limitations, especially the reliance on visual features for cultural semantic representation. To address this, future work will explore language-driven conditioning by incorporating multimodal large language models to enhance cultural attribute semantics and fine-grained generative control. We will also analyze potential biases in the cultural annotation process, particularly systematic biases introduced by automatic labeling, and establish bias detection and mitigation mechanisms. In addition, we plan to expand the dance database across multiple ethnicities and genres, and to develop standardized cultural expression guidelines with a source-community feedback process to ensure cultural compliance and continuous improvement. Furthermore, we aim to establish long-term collaborations with Helou Dance practitioners and cultural preservation organizations in Guangdong Province to co-design evaluation protocols and iteratively refine the system based on practitioner feedback, thereby further enhancing the cultural validity and social robustness of CAFE-Dance. On the deployment side, we will advance efficient on-device inference, including lightweight strategies such as model compression, quantization, and knowledge distillation, to enable low-latency, high-fidelity cultural dance generation and interactive experiences. In parallel, we will address the failure modes identified under severe occlusions and dense multi-person formations by augmenting CAFE-Dance with occlusion-aware 3D reconstruction and dynamic multi-person association modules informed by the failure-case statistics in Appendix C.

Author Contributions

Conceptualization, B.N. and Y.Z.; methodology, B.N., R.Y. and Q.Z.; software, R.Y.; validation, R.Y., Q.Z., and Y.F.; formal analysis, R.Y. and Q.Z.; investigation, R.Y. and Q.Z.; resources, Y.Z. and Y.F.; data curation, R.Y.; writing—original draft preparation, R.Y.; writing—review and editing, B.N., Y.Z., and Y.F.; visualization, R.Y.; supervision, B.N. and Y.Z.; project administration, B.N. and Y.Z.; funding acquisition, B.N. and Y.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Science and Technology Program of Gansu Province, grant number 22JR10KA007, and was supported in part by the National Natural Science Foundation of China under Grant 61862041 and in part by the Natural Science Foundation of Gansu Province under Grant 21JR7RA120.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The code and the curated Helouwu dataset are available at https://github.com/BinNiu-Dance/CAFE-Dance (accessed on 27 November 2025).

Acknowledgments

We are deeply grateful to Haoliang Chen and Zu Yu (School of Dance, Northwest Normal University) for their invaluable support and generous assistance during this research.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Confidence Threshold Determination Method

This experiment aims to verify the scientific soundness and effectiveness of the proposed confidence threshold determination method. We systematically investigate the impact of different confidence thresholds on dance action recognition performance and determine the optimal threshold through rigorous statistical validation, so as to balance Precision and Recall and satisfy the requirements of practical application scenarios.

The experiment is conducted on the dance recognition validation set (n = 34, accounting for 20% of the original data) using a 5-fold cross-validation strategy. We define a grid search space (threshold range 0.50–0.95 with a step size of 0.05), and use the F1-Score as the primary evaluation metric while also recording Precision and Recall. All experiments are independently repeated five times, and we report the mean ± standard deviation to ensure statistical reliability of the results.

As shown in Table A1, Precision monotonically increases and Recall monotonically decreases as the threshold grows from 0.50 to 0.95, while the F1-Score is highest at 0.50 (0.881 ± 0.019) and gradually decreases thereafter. Although 0.50 gives the maximum F1, we select 0.85 as the operational threshold to favor higher Precision and long-term stability in deployment. The F1-Score at 0.85 (0.856 ± 0.012) remains competitive, and a paired t-test shows no statistically significant degradation on the independent test set (p = 0.156).

Table A1. Grid search results for confidence threshold. The selected threshold (0.85) is the value used in the proposed model.

Threshold	Precision	Recall	F1-Score	95% CI
0.50	0.823 ± 0.024	0.947 ± 0.028	0.881 ± 0.019	[0.862, 0.900]
0.55	0.834 ± 0.022	0.931 ± 0.026	0.880 ± 0.018	[0.862, 0.898]
0.60	0.842 ± 0.020	0.918 ± 0.025	0.878 ± 0.017	[0.861, 0.895]
0.65	0.851 ± 0.019	0.903 ± 0.024	0.876 ± 0.016	[0.860, 0.892]
0.70	0.861 ± 0.018	0.887 ± 0.023	0.874 ± 0.015	[0.859, 0.889]
0.75	0.872 ± 0.017	0.868 ± 0.022	0.870 ± 0.014	[0.856, 0.884]
0.80	0.883 ± 0.016	0.846 ± 0.021	0.864 ± 0.013	[0.851, 0.877]
0.85	0.894 ± 0.015	0.821 ± 0.020	0.856 ± 0.012	[0.844, 0.868]
0.90	0.906 ± 0.014	0.793 ± 0.019	0.846 ± 0.011	[0.835, 0.857]
0.95	0.918 ± 0.013	0.762 ± 0.018	0.833 ± 0.010	[0.823, 0.843]
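As a quick consistency check on Table A1, the F1-Score can be recomputed from the mean Precision and Recall in each row as F1 = 2PR / (P + R). The sketch below does this and also reports which threshold maximizes the reported F1. The values are copied from the table; the variable and function names are illustrative. Since the table reports means over repeated runs, the recomputed value is only expected to match the reported F1 approximately.

```python
from typing import List, Tuple

# (threshold, precision, recall, reported F1) copied from Table A1.
TABLE_A1: List[Tuple[float, float, float, float]] = [
    (0.50, 0.823, 0.947, 0.881),
    (0.55, 0.834, 0.931, 0.880),
    (0.60, 0.842, 0.918, 0.878),
    (0.65, 0.851, 0.903, 0.876),
    (0.70, 0.861, 0.887, 0.874),
    (0.75, 0.872, 0.868, 0.870),
    (0.80, 0.883, 0.846, 0.864),
    (0.85, 0.894, 0.821, 0.856),
    (0.90, 0.906, 0.793, 0.846),
    (0.95, 0.918, 0.762, 0.833),
]


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)


def best_f1_threshold(rows: List[Tuple[float, float, float, float]]) -> float:
    """Threshold whose reported F1 is highest."""
    return max(rows, key=lambda row: row[3])[0]


if __name__ == "__main__":
    for threshold, p, r, reported in TABLE_A1:
        print(f"{threshold:.2f}  recomputed={f1_score(p, r):.3f}  reported={reported:.3f}")
    print("F1-maximizing threshold:", best_f1_threshold(TABLE_A1))
```

Moving from 0.50 to 0.85 trades 0.025 of mean F1 for a Precision gain of 0.071, which is the trade-off the selection of 0.85 rests on.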

We select 0.85 as the operating threshold based on the following considerations. First, dance recognition applications require high precision, and the threshold of 0.85 provides a more favorable balance between Precision and Recall. Second, an ANOVA confirms that the differences in F1-Score among different thresholds are statistically significant (F = 12.47, p < 0.001). Most importantly, on an independent test set (n = 42), the threshold of 0.85 shows excellent generalization ability (Precision: 0.887 ± 0.019, Recall: 0.818 ± 0.023, F1-Score: 0.851 ± 0.014) and cross-dataset stability (coefficient of variation CV = 1.6%).
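The coefficient of variation quoted above is the standard deviation divided by the mean. Assuming it refers to the test-set F1-Score (0.851 ± 0.014), which the text does not state explicitly, the value can be reproduced as follows; the function name is illustrative.

```python
def coefficient_of_variation(mean_value: float, std_value: float) -> float:
    """Standard deviation as a percentage of the mean."""
    if mean_value == 0:
        raise ValueError("mean must be non-zero")
    return std_value / mean_value * 100.0


if __name__ == "__main__":
    # Test-set F1-Score at threshold 0.85, as reported: 0.851 +/- 0.014.
    print(f"CV = {coefficient_of_variation(0.851, 0.014):.1f}%")
```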

Although the threshold of 0.85 performs well in terms of overall performance, it still yields relatively low recall on fast-rotation dance actions (0.762 ± 0.031), mainly because these actions exhibit drastic changes in visual appearance, which reduce the model’s confidence. In addition, our threshold determination method assumes that the validation set and test set follow similar distributions; when dealing with new scenarios with large distribution shifts, the threshold may need to be recalibrated.

It is worth noting that, in practical deployment, we observe that a threshold above the F1-Score peak (0.85 rather than 0.50) exhibits better long-term stability. This observation is of practical value for human–computer interaction systems that demand high reliability. Future work will explore adaptive thresholding strategies that dynamically adjust the confidence threshold according to the characteristics of the input data.

Through rigorous grid search and statistical validation, we determine 0.85 as the optimal confidence threshold. This threshold maintains high precision while preserving a reasonable recall rate, effectively balancing the performance requirements of the dance recognition system in real-world applications. These findings verify the effectiveness of using a principled parameter optimization strategy to enhance system reliability and robustness.

Appendix B. Validation of Data Augmentation Strategy

This experiment aims to verify the scientific soundness and effectiveness of the proposed data augmentation strategy, in particular, the choice of sampling rate (30 Hz) and temporal window (5 s). We systematically investigate how different parameter combinations influence model performance, computational efficiency, and semantic completeness of the data, in order to determine the optimal configuration and provide a high-quality data foundation for dance action analysis.

The experiment is conducted on 170 original dance video clips (with a total duration of 7.5 h). We test different sampling rates (15 Hz, 25 Hz, 30 Hz, 45 Hz, 60 Hz) and temporal window sizes (2 s, 3 s, 5 s, 8 s, 10 s). The evaluation metrics include mAP@0.5 (performance), inference time (efficiency), storage requirement (resource usage), and semantic completeness scored by experts. Each configuration is evaluated on the full dataset, and we report the mean ± standard deviation to ensure scientific rigor in decision making.

Regarding sampling rate, as shown in Table A2, the sampling rate of 30 Hz achieves the highest mAP@0.5 (0.847 ± 0.012), significantly outperforming 15 Hz (0.742 ± 0.018, p < 0.001) and 60 Hz (0.821 ± 0.016, p = 0.003). The 30 Hz setting also provides the best trade-off between computational efficiency (37.8 ± 3.2 ms/frame) and performance, with an overall score of 0.85. Notably, when the sampling rate exceeds 30 Hz, performance no longer improves and even decreases slightly (mAP@0.5 drops to 0.834 at 45 Hz), while computation cost increases markedly, indicating that 30 Hz offers the best balance between theoretical considerations and practical constraints.

Table A2. Comparison of sampling rates. The row in bold indicates the optimal sampling rate (30 Hz) which achieves the highest overall score.

Sampling Rate	mAP@0.5	Inference Time (ms)	Storage Req. (GB)	Overall Score
15 Hz	0.742 ± 0.018	23.4 ± 2.1	2.8	0.76
25 Hz	0.798 ± 0.015	31.2 ± 2.8	4.6	0.81
30 Hz	0.847 ± 0.012	37.8 ± 3.2	5.5	0.85
45 Hz	0.834 ± 0.014	56.7 ± 4.1	8.3	0.79
60 Hz	0.821 ± 0.016	75.3 ± 5.2	11.1	0.73
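
As a quick consistency check on Table A2, the following Python sketch selects the best row by mAP@0.5 and by overall score, and derives two quantities that the table implies but does not list: storage per Hz and the inference-time ratio between 60 Hz and 30 Hz. The variable names are illustrative assumptions, and only the table means are used (standard deviations are omitted).

```python
# Rows of Table A2 (means only):
# (rate_hz, mAP@0.5, inference_ms, storage_gb, overall_score)
TABLE_A2 = [
    (15, 0.742, 23.4, 2.8, 0.76),
    (25, 0.798, 31.2, 4.6, 0.81),
    (30, 0.847, 37.8, 5.5, 0.85),
    (45, 0.834, 56.7, 8.3, 0.79),
    (60, 0.821, 75.3, 11.1, 0.73),
]

# Best configuration according to each criterion.
best_by_map = max(TABLE_A2, key=lambda row: row[1])
best_by_score = max(TABLE_A2, key=lambda row: row[4])

# Storage normalised by sampling rate (GB per Hz); derived, not reported.
storage_per_hz = {rate: storage / rate for rate, _, _, storage, _ in TABLE_A2}

# Cost of doubling the rate from 30 Hz to 60 Hz.
by_rate = {row[0]: row for row in TABLE_A2}
time_ratio_60_vs_30 = by_rate[60][2] / by_rate[30][2]
map_change_60_vs_30 = by_rate[60][1] - by_rate[30][1]

print("best by mAP:", best_by_map[0], "Hz")
print("best by overall score:", best_by_score[0], "Hz")
print("storage per Hz:", {k: round(v, 3) for k, v in storage_per_hz.items()})
print("60 Hz vs 30 Hz time ratio:", round(time_ratio_60_vs_30, 2))
print("60 Hz vs 30 Hz mAP change:", round(map_change_60_vs_30, 3))
```

With these values, storage grows almost linearly with the sampling rate (between 0.18 and 0.19 GB per Hz in every row), and moving from 30 Hz to 60 Hz roughly doubles inference time while mAP@0.5 falls by 0.026.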

For temporal window selection, as shown in Table A3, a window of 5 s yields the best semantic completeness (0.89 ± 0.04), significantly outperforming 2 s (0.67 ± 0.08, $p < 0.001$) and 10 s (0.76 ± 0.06, $p = 0.008$). The 5 s window also produces a reasonable number of segments (480 ± 48), avoiding the issues of too few segments (only 240 ± 24 for the 10 s window) or too many segments (1200 ± 120 for the 2 s window). Expert evaluation by dance specialists confirms that 5 s is the minimal sufficient duration to represent a complete dance action unit, which is crucial for preserving the cultural semantics of dance movements.

Table A3. Comparison of temporal windows. The row in bold indicates the optimal temporal window size (5 s) selected for the proposed method.

Window Size	Number of Segments	Semantic Completeness	Computational Efficiency	Overall Score
2 s	1200 ± 120	0.67 ± 0.08	0.92	0.78
3 s	800 ± 80	0.78 ± 0.06	0.87	0.82
5 s	480 ± 48	0.89 ± 0.04	0.81	0.85
8 s	300 ± 30	0.82 ± 0.05	0.73	0.77
10 s	240 ± 24	0.76 ± 0.06	0.68	0.72
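
The segment counts in Table A3 can be related to the window size with a small arithmetic check. The Python sketch below multiplies each window length by its mean segment count and computes the number of frames in one segment at a given sampling rate. The function name `frames_per_segment` is hypothetical, and the reading that the windows are non-overlapping is an assumption, since the source does not state how the segments were cut.

```python
# (window_seconds, mean_number_of_segments) from Table A3.
TABLE_A3 = [(2, 1200), (3, 800), (5, 480), (8, 300), (10, 240)]


def frames_per_segment(rate_hz: int, window_s: int) -> int:
    """Frames in one segment, assuming uniform sampling at rate_hz."""
    return rate_hz * window_s


# Total seconds covered by the mean number of segments for each window,
# assuming the windows do not overlap.
covered_seconds = {window: window * count for window, count in TABLE_A3}

print(covered_seconds)
print("frames per 5 s segment at 30 Hz:", frames_per_segment(30, 5))
```

Every row gives the same product (2400 s), which is what non-overlapping windows over a fixed amount of footage would produce. At the selected configuration (30 Hz, 5 s), one segment holds 150 frames.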

By combining a set of lightweight augmentation techniques (temporal, spatial, color, and noise-based), we expand the original 170 long clips into approximately $14.1\times$ as many effective 5 s motion–audio segments at the training-sample level. The augmented data maintain high fidelity (0.91 ± 0.03) while significantly improving model performance (mAP@0.5 increases from 0.762 to 0.847). These results demonstrate the effectiveness of our data augmentation strategy and provide a high-quality data foundation for subsequent experiments.

Despite the strong performance of the 30 Hz sampling rate and 5 s temporal window on most dance types, limitations remain for extremely fast-tempo dances (e.g., tap dance), where a sampling rate of 30 Hz may not fully capture millisecond-level foot movements. In addition, a 5 s window may be insufficient to represent a complete dance narrative, which constrains the model’s understanding of global structure.

This experiment shows that data augmentation is not merely about increasing data volume but requires careful design tailored to task characteristics. We observe that moderate augmentation (around 14×) is more effective than over-augmentation (>20×), which challenges the conventional notion that “more data is always better.” More importantly, dance actions possess unique temporal characteristics, and their optimal sampling rate and window size differ substantially from those used in daily activity recognition tasks. This observation may serve as a useful reference for computational dance research.

The experiments consistently demonstrate that a sampling rate of 30 Hz and a temporal window of 5 s constitute an optimal configuration, providing a reliable data basis for dance action analysis. The data augmentation strategy significantly improves model performance while preserving semantic completeness, which directly supports the core contribution of this work: constructing a high-quality, large-scale dataset for dance analysis. These findings also offer reproducible standard practices for data processing in related domains.

Appendix C. Analysis of System Failure Cases and Improvement Strategies

This appendix provides an in-depth analysis of system failure cases observed in real-world applications, with a particular focus on challenging scenarios such as severe occlusions and complex multi-person formations. By combining quantitative statistics with qualitative inspection, we identify the main bottlenecks of the current system and derive targeted directions for improvement, so that the limitations of the pipeline are made explicit rather than remaining hidden behind aggregate metrics.

A particularly severe failure case in a complex multi-person ethnic dance scene is shown in Figure A1. In the original RGB frame (top left, Figure A1a), eight dancers are present in a dense circular formation with close interpersonal distances and frequent mutual occlusions. This frame reflects a typical real-world group choreography setting in which dancers often overlap and move synchronously. Ideally, the multi-person detection and pose estimation module should recover a 2D skeleton for each individual. However, in the keypoint overlay view (top right, Figure A1b), only a single front performer is assigned a 2D skeleton, while all other dancers are entirely missed. At the scene level, this means that the multi-target detection and tracking pipeline almost completely breaks down and fails to represent the group choreography as a multi-person motion pattern.

Figure A1. Representative severe failure case in a complex multi-person ethnic dance scene. (a) Original RGB frame with eight dancers in a dense circular formation and strong mutual occlusions. (b) A 2D skeleton overlay: only a single front performer is assigned a skeleton, while all other dancers are completely missed, indicating a scene-level breakdown of multi-person detection and tracking. (c) Reconstructed 3D skeleton of the detected performer, which collapses into an almost one-dimensional structure instead of a plausible 3D human pose. (d) Joint-wise 2D keypoint confidence and visibility scores for the same performer; most joints still have relatively high confidence, highlighting a discrepancy between locally reasonable 2D predictions and globally inconsistent 3D reconstruction and multi-person reasoning. This example illustrates a complete failure of the current pipeline under dense multi-person interactions and motivates the occlusion-aware 3D reconstruction and dynamic multi-person association strategies discussed in our improvement plan.

For the lone detected performer, the reconstructed 3D skeleton (bottom left, Figure A1c) collapses into an almost one-dimensional structure instead of forming a plausible human body in 3D space. At the same time, most 2D keypoints for this performer still exhibit relatively high confidence and visibility scores, as summarized by the bar plot of joint-wise confidence (bottom right, Figure A1d). This contrast shows that, at the level of individual joints, the 2D detector appears to be locally reliable, yet at the level of global multi-view and multi-person reasoning, the system fails to integrate these cues into a stable 3D configuration and a consistent set of trajectories. In other words, the failure is not caused by isolated noisy keypoints but by the lack of robust structural constraints and association mechanisms under dense multi-person interactions.

Beyond this illustrative case, we collected a total of 162 failure cases on the test set, accounting for 28.7% of all evaluated samples. Each case is categorized according to the dominant factor that triggers the failure (occlusion, complex multi-person scenes, lighting variation, motion blur, cultural gesture recognition, or equipment issues) and is assigned a severity level based on the magnitude of performance degradation (Critical, Major, or Minor). The resulting distribution is summarized in Table A4. Occlusion (25.9%) and complex multi-person scenes (21.6%) emerge as the dominant sources of error, indicating that the current pipeline is particularly vulnerable when visual overlap or interaction density increases. Among all 162 cases, 28 (17.3%) are labeled as Critical with more than 50% performance degradation, while 67 (41.4%) are labeled as Major with a degradation between 25% and 50%. This long tail of medium-to-severe failures highlights systematic weaknesses rather than isolated outliers.

Table A4. Failure case categories and severity distribution.

Failure Type	Number of Cases	Proportion
Occlusion	42	25.9%
Complex multi-person scenes	35	21.6%
Lighting variation	28	17.3%
Motion blur	24	14.8%
Cultural gesture recognition	18	11.1%
Equipment issues	15	9.3%
Severity level	Number of cases	Proportion
Critical (>50% performance impact)	28	17.3%
Major (25–50% performance impact)	67	41.4%
Minor (<25% performance impact)	67	41.4%
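
The proportions in Table A4 follow directly from the case counts. The Python sketch below recomputes them and also estimates the size of the evaluated test set from the reported 28.7% failure rate. The helper name `proportions` is hypothetical, and the test-set size is only an approximate back-calculation because 28.7% is itself a rounded figure.

```python
FAILURE_TYPES = {
    "Occlusion": 42,
    "Complex multi-person scenes": 35,
    "Lighting variation": 28,
    "Motion blur": 24,
    "Cultural gesture recognition": 18,
    "Equipment issues": 15,
}

SEVERITY = {"Critical": 28, "Major": 67, "Minor": 67}


def proportions(counts: dict) -> dict:
    """Percentage share of each category, rounded to one decimal place."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


type_share = proportions(FAILURE_TYPES)
severity_share = proportions(SEVERITY)

# Approximate number of evaluated samples implied by 162 failures = 28.7%.
implied_test_samples = round(162 / 0.287)

print(type_share)
print(severity_share)
print("implied evaluated samples (approx.):", implied_test_samples)
```

Both groupings sum to the same 162 cases, so every failure case carries exactly one dominant factor and one severity level.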

To further quantify the relationship between scene complexity and system performance, we group all test sequences into three levels—Simple, Medium, and Complex—according to the number of dancers and the degree of interaction, and measure the successful detection rate, accuracy, processing time, and memory usage of the end-to-end pipeline for each group. The results, summarized in Table A5, show a clear monotonic trend: when moving from Simple (solo dance) to Complex scenes (formations with four or more dancers), the successful detection rate drops from 94.2% to 68.4% and the overall accuracy decreases from 0.887 to 0.523, while both processing time and memory usage roughly double. Correlation analysis reveals a strong negative correlation between scene complexity and accuracy, and a strong positive correlation between complexity and resource consumption, indicating that the scenes that are hardest to recognize are also those that demand the most computation.

Table A5. Performance analysis by scene complexity level.

Complexity Level	Successful Detection Rate	Accuracy (±SE)	Processing Time (ms)	Memory Usage (MB)
Simple (solo dance)	94.2%	0.887 ± 0.045	34.2 ± 3.1	1247 ± 89
Medium (2–3-person interaction)	82.6%	0.742 ± 0.067	52.8 ± 7.4	1856 ± 156
Complex (formations with 4+ dancers)	68.4%	0.523 ± 0.089	78.9 ± 12.3	2634 ± 234
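
The correlation analysis mentioned above can be illustrated on the group means of Table A5. The Python sketch below is a minimal example under two assumptions that are not in the source: the complexity levels are coded as the ordinal values 1, 2, and 3, and Pearson correlation is used. With only three group means the coefficients are illustrative and should not be read as the authors' reported statistics.

```python
from math import sqrt

LEVELS = [1, 2, 3]                  # Simple, Medium, Complex (assumed coding)
ACCURACY = [0.887, 0.742, 0.523]    # Table A5 means
TIME_MS = [34.2, 52.8, 78.9]
MEMORY_MB = [1247, 1856, 2634]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


r_accuracy = pearson(LEVELS, ACCURACY)
r_time = pearson(LEVELS, TIME_MS)
r_memory = pearson(LEVELS, MEMORY_MB)

# Growth factors from Simple to Complex scenes.
time_ratio = TIME_MS[-1] / TIME_MS[0]
memory_ratio = MEMORY_MB[-1] / MEMORY_MB[0]

print("r(complexity, accuracy):", round(r_accuracy, 3))
print("r(complexity, time):", round(r_time, 3))
print("r(complexity, memory):", round(r_memory, 3))
print("time x", round(time_ratio, 2), "| memory x", round(memory_ratio, 2))
```

On these means, accuracy is strongly negatively correlated with complexity, while processing time and memory usage are strongly positively correlated with it and grow by factors of about 2.3 and 2.1, respectively.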

Taken together, the scene-level breakdown illustrated in Figure A1, the failure statistics in Table A4, and the complexity-dependent performance profile in Table A5 form a coherent picture of where the current system struggles. In occlusion-heavy regimes, 2D pose estimation lacks explicit 3D structural priors and visibility reasoning, making it difficult to maintain plausible full-body configurations once large regions are occluded. In complex multi-person scenes, the multi-target association problem becomes combinatorial, and local association errors quickly cascade into identity switches and trajectory fragmentation, as reflected in the complete loss of most dancers in Figure A1. These observations directly motivate the improvement strategies discussed in this work, including occlusion-aware 3D reconstruction that incorporates multi-view geometric constraints and explicit visibility modeling, as well as dynamic multi-person association mechanisms that better exploit formation structure and temporal consistency. Although a full redesign of the pipeline is beyond the scope of this appendix, the failure analysis provides a concrete problem definition and a data-driven basis for these targeted enhancements, forming a closed loop from empirical failure cases to methodological improvements.

References

1. Mao, Q.; Mastnak, W.; Guan, R. Chinese ethnic dance therapy: Cultural anthropology and health science perspectives on Tujia ethnic dances. Front. Psychol. 2025, 16, 1561150.
2. Li, R. Etiquette and dance—An analysis of the cultural phenomenon of the etiquette and custom dance of the ethnic minorities in Southwest China. Mediterr. Archaeol. Archaeom. 2025, 25, 2.
3. Liu, S. The Chinese dance: A mirror of cultural representations. Res. Danc. Educ. 2020, 21, 153–168.
4. Wong, S.L.C. Dancing in the diaspora: Cultural long-distance nationalism and the staging of Chineseness by San Francisco’s Chinese Folk Dance Association. J. Transnatl. Am. Stud. 2010, 2, 126–128.
5. Yan, Q.; Wang, X.; Rosa, R.D.D. Ethnography of Chinese Dance Etiquette Culture and Aesthetic Value. Camb. Open Engag. 2023.
6. Wilcox, E. Revolutionary Bodies: Chinese Dance and the Socialist Legacy; University of California Press: Oakland, CA, USA, 2019.
7. Lei, X. The Application of Ethnic Folk Dance Elements in Choreographic Techniques from a Contemporary Perspective-Exploring the Fusion of Dai Ethnic Folk Dance and Modernity. Pac. Int. J. 2024, 7, 93–97.
8. Ji, M. On the Origin of Helou Dance and Chunniu Dance in Southern Guangdong Province. J. Beijing Danc. Acad. 2011, 2, 72–75.
9. Zhang, Y.; He, X.; Wang, J.; Bai, X.; Ma, M. Exploration and research on the digital protection methods of ethnic dance. In Proceedings of the International Conference on Image, Signal Processing, and Pattern Recognition (ISPP 2024), Guangzhou, China, 8–10 March 2024; Volume 13180, pp. 1718–1725.
10. Zhang, Y. Daily Life, Time, and People in the Field: The Academic Transition of Folk Dance Research. J. Beijing Danc. Acad. 2022, 6, 44–53.
11. Zhang, Y. An Analysis of the Rushan Yangko Dance Becoming an Intangible Cultural Heritage in the Jiaodong Area. J. Guangxi Univ. Natl. (Philos. Soc. Sci. Ed.) 2022, 44, 145–151.
12. Ye, Z.; Wu, H.; Jia, J.; Bu, Y.; Chen, W.; Meng, F.; Wang, Y. ChoreoNet: Towards Music to Dance Synthesis with Choreographic Action Unit. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020.
13. Tseng, J.H.; Castellon, R.; Liu, C.K. EDGE: Editable Dance Generation From Music. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 448–458.
14. Siyao, L.; Yu, W.; Gu, T.; Lin, C.; Wang, Q.; Qian, C.; Loy, C.C.; Liu, Z. Bailando++: 3D Dance GPT With Choreographic Memory. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 14192–14207.
15. Dabral, R.; Mughal, M.H.; Golyanik, V.; Theobalt, C. MoFusion: A Framework for Denoising-Diffusion-Based Motion Synthesis. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 9760–9770.
16. Sun, S.; Tang, Q.; Liu, Y.; Zhang, H.; Song, Q.; Xu, D. YNU-Dance: A Multimodal Ethnic Dance Action Dataset. In Proceedings of the 2024 5th International Conference on Computing, Networks and Internet of Things, online, 24–26 May 2024; pp. 273–281.
17. Gong, K.; Lian, D.; Chang, H.; Guo, C.; Zuo, X.; Jiang, Z.; Wang, X. TM2D: Bimodality Driven 3D Dance Generation via Music-Text Integration. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–3 October 2023; pp. 9908–9918.
18. Tsuchida, S.; Fukayama, S.; Hamasaki, M.; Goto, M. AIST Dance Video Database: Multi-genre, Multi-dancer, and Multi-camera Database for Dance Information Processing. In Proceedings of the 20th International Society for Music Information Retrieval Conference, Delft, The Netherlands, 4–8 November 2019; pp. 501–510.
19. Zhuang, H.W.; Lei, S.; Xiao, L.; Li, W.; Chen, L.; Yang, S.; Wu, Z.; Kang, S.; Meng, H.M. GTN-Bailando: Genre Consistent long-Term 3D Dance Generation Based on Pre-Trained Genre Token Network. In Proceedings of the ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5.
20. Zhou, Z.; Huo, Y.; Huang, G.; Zeng, A.; Chen, X.; Huang, L.; Li, Z. QEAN: Quaternion-enhanced attention network for visual dance generation. Vis. Comput. 2024, 41, 961–973.
21. Siyao, L.; Yu, W.; Gu, T.; Lin, C.; Wang, Q.; Qian, C.; Loy, C.C.; Liu, Z. Bailando: 3D Dance Generation by Actor-Critic GPT with Choreographic Memory. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 11040–11049.
22. Qi, Q.; Zhuo, L.; Zhang, A.; Liao, Y.; Fang, F.; Liu, S.; Yan, S. Diffdance: Cascaded human motion diffusion model for dance generation. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 1374–1382.
23. Li, R.; Yang, S.; Ross, D.A.; Kanazawa, A. AI Choreographer: Music Conditioned 3D Dance Generation with AIST++. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 13381–13392.
24. Bai, J. Modern Consciousness and Li Ethnic Dance. J. Beijing Danc. Acad. 2009, 106–110.
25. Lee, H.Y.; Yang, X.; Liu, M.Y.; Wang, T.C.; Lu, Y.D.; Yang, M.H.; Kautz, J. Dancing to music. Adv. Neural Inf. Process. Syst. 2019, 32.
26. Zhuang, W.; Wang, C.; Chai, J.; Wang, Y.; Shao, M.; Xia, S. Music2dance: Dancenet for music-driven dance generation. ACM Trans. Multimed. Comput. Commun. Appl. (TOMM) 2022, 18, 1–21.

Figure 1. Examples of dancers performing various ethnic dances in traditional costumes: (a) Tibetan dance, (b) Dai ethnic dance, (c) Han ethnic dance (Guzi Yange), (d) Mongolian ethnic dance, (e) Uyghur dance, (f) Korean ethnic dance, (g) Han ethnic dance (Helou Dance), and (h) Han ethnic dance (Helou Dance).

Figure 2. Variations of the Helou Dance in costume, formation, and props. (a) Multi-dancer formation I; (b) rice-ear prop; (c) female attire; (d) multi-dancer formation II; (e) mask prop; (f) ox-head staff with mask.

Figure 3. The proposed CAFE-Dance framework.

Figure 4. Generated Helou Dance movements (Trembling Step and Bell Ringing) using the CAFE-Dance method.

Figure 5. Training performance trends of different methods on the validation set.

Figure 6. This radar chart presents the performance of different methods across five dimensions: Smoothness, Diversity, Beat Alignment, Accuracy, and Cultural Authenticity.

Figure 7. The ablation study heatmap illustrates the impact of different modules on cultural feature recognition, rhythm synchronization, and generation quality.

Figure 8. Performance of the ZDCM module on automatic cultural feature recognition across four feature categories.

Figure 9. Computational complexity comparison of CAFE-Dance and baseline methods.

Figure 10. Visualization of three representative samples from Chinese ethnic dance datasets.

Figure 11. Expert-based subjective evaluation of generated dances across three criteria: Cultural Authenticity, Ritual Accuracy, and Overall Quality.

Table 1. Hyperparameters used for training.

Hyperparameters	Value
Learning rate	$1 \times 10^{- 3}$
Batch size	16
Optimizer	Adam ( $β_{1} = 0.9, β_{2} = 0.999$ )
LR scheduler	StepLR (step_size = 10, $γ = 0.7$)
Epochs	120
Loss function	CrossEntropyLoss
Dropout (fusion MLP)	0.3
Dropout (backbone fusion)	0.5
Attention heads (cross-modal)	8
Attention heads (cultural)	4
Feature dim (backbone)	256
Embedding dim (per-scale)	512
Fusion dim	512
Keypoint dimension	$33 \times 4 = 132$
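
To make the learning-rate settings in Table 1 concrete, the Python sketch below tabulates the step-decay schedule they imply. It assumes that "StepLR" follows the usual step-decay rule (the learning rate is multiplied by γ every `step_size` epochs, as in PyTorch's scheduler of the same name) and that the scheduler is stepped once per epoch. Neither detail is confirmed by the table itself, and the function name `step_lr` is hypothetical.

```python
BASE_LR = 1e-3      # Learning rate (Table 1)
STEP_SIZE = 10      # StepLR step_size
GAMMA = 0.7         # StepLR decay factor
EPOCHS = 120        # Training epochs

# 33 keypoints with 4 values each, as listed in Table 1.
KEYPOINT_DIM = 33 * 4


def step_lr(epoch: int) -> float:
    """Learning rate used during a given 0-indexed epoch under step decay."""
    return BASE_LR * GAMMA ** (epoch // STEP_SIZE)


schedule = [step_lr(epoch) for epoch in range(EPOCHS)]

# Print the learning rate at the start of each decay interval.
for epoch in range(0, EPOCHS, STEP_SIZE):
    print(f"epoch {epoch:3d}: lr = {schedule[epoch]:.3e}")
print("keypoint dimension:", KEYPOINT_DIM)
```

Under these assumptions the learning rate is decayed 11 times over 120 epochs, ending near 2.0 × 10⁻⁵ in the final interval.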

Table 2. Quantitative comparison with mainstream baselines (mean ± standard deviation, $n = 5$). The best results are highlighted in bold. The symbols ↑ and ↓ indicate that higher or lower values denote better performance, respectively.

Method	FID ↓	Beat Score ↑	Cultural Acc. ↑	Smoothness ↑	Overall ↑
Bailando [14]	0.89 ± 0.02	0.76 ± 0.03	0.42 ± 0.05	0.88 ± 0.03	0.75 ± 0.02
EDGE [13]	0.78 ± 0.03	0.82 ± 0.02	0.38 ± 0.04	0.85 ± 0.02	0.72 ± 0.03
Dancing to Music [25]	0.82 ± 0.04	0.85 ± 0.03	0.45 ± 0.03	0.82 ± 0.03	0.75 ± 0.02
Music2Dance [26]	0.76 ± 0.03	0.87 ± 0.02	0.52 ± 0.04	0.86 ± 0.02	0.78 ± 0.02
CAFE-Dance (Ours)	0.65 ± 0.01	0.91 ± 0.02	0.83 ± 0.02	0.92 ± 0.01	0.83 ± 0.02

Note: * $p < 0.05$, ** $p < 0.01$, *** $p < 0.001$ vs. all baselines (paired t-test).
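
The improvements quoted in the abstract can be recovered from the means in Table 2. The Python sketch below identifies the strongest baseline by overall score and computes the gain of CAFE-Dance over it in percentage points. The dictionary layout and variable names are illustrative assumptions, and standard deviations are ignored.

```python
# Means from Table 2 (higher is better except for FID).
BASELINES = {
    "Bailando": {"fid": 0.89, "beat": 0.76, "cultural": 0.42, "smooth": 0.88, "overall": 0.75},
    "EDGE": {"fid": 0.78, "beat": 0.82, "cultural": 0.38, "smooth": 0.85, "overall": 0.72},
    "Dancing to Music": {"fid": 0.82, "beat": 0.85, "cultural": 0.45, "smooth": 0.82, "overall": 0.75},
    "Music2Dance": {"fid": 0.76, "beat": 0.87, "cultural": 0.52, "smooth": 0.86, "overall": 0.78},
}
OURS = {"fid": 0.65, "beat": 0.91, "cultural": 0.83, "smooth": 0.92, "overall": 0.83}

# Strongest baseline, judged by the overall score.
strongest = max(BASELINES, key=lambda name: BASELINES[name]["overall"])
reference = BASELINES[strongest]

# Gains in percentage points for the metrics on a 0-1 scale.
gain_pp = {
    metric: round(100 * (OURS[metric] - reference[metric]), 1)
    for metric in ("beat", "cultural", "smooth", "overall")
}

# FID is lower-is-better, so report the absolute reduction instead.
fid_reduction = round(reference["fid"] - OURS["fid"], 2)

print("strongest baseline:", strongest)
print("gain (percentage points):", gain_pp)
print("FID reduction:", fid_reduction)
```

Against Music2Dance, this gives a gain of 4.0 percentage points in Beat Score and 31.0 percentage points in Cultural Accuracy, in line with the figures stated in the abstract.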

Table 3. Cross-cultural generalization performance of CAFE-Dance on four Chinese ethnic dances (mean ± standard deviation, $n = 5$).

Dance Type	Cultural Accuracy	Beat Score	Overall Quality	FID
Tibetan	$0.76 \pm 0.08$	$0.89 \pm 0.05$	$0.81 \pm 0.06$	$0.71 \pm 0.03$
Uyghur	$0.82 \pm 0.06$	$0.93 \pm 0.04$	$0.85 \pm 0.05$	$0.68 \pm 0.02$
Mongolian	$0.79 \pm 0.07$	$0.91 \pm 0.05$	$0.83 \pm 0.06$	$0.70 \pm 0.03$
Helou Dance	$0.83 \pm 0.05$	$0.91 \pm 0.03$	$0.85 \pm 0.04$	$0.65 \pm 0.01$

Table 4. Parameter sensitivity analysis of weight coefficients on model performance. The symbols ↑ and ↓ indicate that higher or lower values denote better performance, respectively.

$α$	$β$	$γ$	FID ↓	Beat Score ↑	Cultural Acc. ↑	OQS ↑	$Δ$ OQS
0.4	0.3	0.3	0.65	0.91	0.83	0.83	baseline
0.5	0.3	0.2	0.66	0.90	0.82	0.82	$- 0.01$
0.3	0.4	0.3	0.67	0.91	0.81	0.82	$- 0.01$
0.3	0.3	0.4	0.68	0.90	0.84	0.82	$- 0.01$
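
Two regularities in Table 4 can be checked mechanically: each weight configuration sums to one, and the ΔOQS column is the difference from the first (baseline) row. The Python sketch below verifies both. The tuple layout is an assumption made for illustration, and the code does not assume any particular meaning for the three weights.

```python
from math import isclose

# (alpha, beta, gamma, FID, beat_score, cultural_acc, OQS) from Table 4.
TABLE_4 = [
    (0.4, 0.3, 0.3, 0.65, 0.91, 0.83, 0.83),  # baseline configuration
    (0.5, 0.3, 0.2, 0.66, 0.90, 0.82, 0.82),
    (0.3, 0.4, 0.3, 0.67, 0.91, 0.81, 0.82),
    (0.3, 0.3, 0.4, 0.68, 0.90, 0.84, 0.82),
]

# Each row's three weights should sum to 1 (up to floating-point error).
weights_normalised = [isclose(a + b + g, 1.0) for a, b, g, *_ in TABLE_4]

# Change in OQS relative to the baseline row.
baseline_oqs = TABLE_4[0][6]
delta_oqs = [round(row[6] - baseline_oqs, 2) for row in TABLE_4]

# Spread of each metric across the four configurations.
fid_range = round(max(r[3] for r in TABLE_4) - min(r[3] for r in TABLE_4), 2)
cultural_range = round(max(r[5] for r in TABLE_4) - min(r[5] for r in TABLE_4), 2)

print("weights sum to 1:", weights_normalised)
print("delta OQS:", delta_oqs)
print("FID range:", fid_range, "| cultural accuracy range:", cultural_range)
```

Across the four configurations, OQS changes by at most 0.01 and FID and Cultural Accuracy each vary by 0.03, which suggests low sensitivity to the weights within the tested range.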

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Niu, B.; Yang, R.; Zhang, Q.; Zhang, Y.; Fan, Y. CAFE-Dance: A Culture-Aware Generative Framework for Chinese Folk and Ethnic Dance Synthesis via Self-Supervised Cultural Learning. Big Data Cogn. Comput. 2025, 9, 307. https://doi.org/10.3390/bdcc9120307

