1. Introduction
In Korean traditional music, rhythmic structure is organized through recurring cyclic patterns known as jangdan. While the term jangdan primarily refers to specific named rhythmic types (e.g., Jinyangjo, Jungmori, Jajinmori), it simultaneously defines the temporal interval between two successive structural accents, a unit directly analogous to the downbeat period in Western music. In this study, we adopt this perspective: each jangdan cycle corresponds to one downbeat-to-downbeat interval. Consequently, the task of detecting jangdan boundaries is equivalent to conventional downbeat tracking. This conceptual alignment allows state-of-the-art downbeat tracking techniques to be applied to Pansori without requiring a fundamental redefinition of the temporal inference problem, while clearly distinguishing rhythmic pattern classification from temporal segmentation.
The perception and organization of musical time constitute a fundamental research question across auditory science, ethnomusicology, and music information retrieval (MIR). Beats and downbeats serve as critical temporal anchors that guide bodily entrainment, build musical expectation, and support expressive interpretation. Accurate and robust identification of these anchors is therefore essential for downstream applications, including automatic music transcription, performance analysis, computer-assisted pedagogy, digital archiving, and interactive music systems. Moreover, computational modeling of rhythm and timing provides powerful tools for investigating how temporal structure contributes to emotional expression, cultural identity, and stylistic diversity in different musical traditions [1].
While beat and downbeat tracking systems have reached high maturity for Western popular and classical music, most models are trained under assumptions of isochronous meter, relatively stable tempo, and strong percussive reinforcement, characteristics that rarely hold in many non-Western traditions. In repertoires featuring asymmetric meters, flexible micro-timing, polyrhythmic layering, or predominantly vocal articulation (e.g., Carnatic music, Turkish aksak rhythms, Uruguayan Candombe, or Korean Pansori), conventional systems frequently exhibit poor generalization [2,3,4]. Growing evidence suggests that culturally and stylistically specific modeling incorporating domain knowledge about rhythm, tempo plasticity, and instrumentation is often necessary to achieve reliable performance outside the Western canon.
Korean Pansori, a sung narrative art form combining virtuosic vocal expression (sori), dramatic storytelling, and minimal percussion accompaniment by the gosu (drummer), exemplifies these challenges. Pansori performances are structured around jangdan cycles that govern dramatic pacing, emotional contour, and interaction between singer (sorikkun) and drummer (see Table 1). The extreme sparsity of accompaniment, frequent shifts between singing and speech-like delivery (aniri), and highly flexible timing make automatic rhythm analysis particularly difficult. Despite increasing interest in computational ethnomusicology, systematic MIR research on Pansori rhythm remains limited, and existing general-purpose beat/downbeat trackers often fail to capture its essential temporal characteristics.
To address these issues, the present study introduces a jangdan-aware downbeat tracking framework specifically designed for Pansori. By explicitly treating jangdan cycles as downbeat periods, we adapt contemporary deep learning architectures and incorporate rhythm-specific tempo constraints within a Dynamic Bayesian Network (DBN) decoder. The improved reliability of jangdan/downbeat detection then enables novel downstream analyses of two central dimensions of Pansori expressivity: beat-level vocal energy contours and intra-cycle tempo variation. The main contributions of this work are the following:
We formalize the functional equivalence between jangdan cycles and Western downbeat periods, enabling direct application and adaptation of modern downbeat tracking methods to Pansori.
We propose a rhythm-pattern-aware downbeat tracking system that combines Temporal Convolutional Networks (TCNs) and RoFormer transformers (with rotary positional embeddings) with a jangdan-conditioned DBN decoder using culturally-informed tempo priors.
We report significantly superior downbeat tracking performance compared to general-purpose systems, both on full mixture recordings and more challengingly on isolated vocal tracks.
Utilizing the enhanced jangdan detection, we conduct the first systematic computational study of characteristic vocal energy profiles and tempo shaping strategies across major jangdan patterns, offering new quantitative insight into Pansori performance practice.
The remainder of the paper is organized as follows.
Section 2 reviews prior work on beat and downbeat tracking in Western and non-Western musical traditions, with emphasis on existing computational approaches to Pansori. Section 3 describes the dataset, input representation, model architectures, post-processing, and training and inference procedures. Section 4 reports quantitative downbeat tracking results and presents analyses of beat-level vocal energy and tempo variation within jangdan cycles. Section 5 concludes the study and outlines directions for future work.
3. Materials and Methods
3.1. Data
A total of 80,743 s (22.4 h) of professionally produced Pansori recordings were collected from commercial CDs in collaboration with Jeonbuk National University’s Department of Music. All recordings were in .wav format digitized at 44.1 kHz and 16-bit resolution. The corpus covers the principal jangdan rhythmic patterns used in traditional Pansori performance, including Jinyangjo, Jungmori, Jungjungmori, Jajinmori, Hwimori, Eotmori, and Eotjungmori.
Each performance was manually annotated by expert musicians using Sonic Visualiser. Annotators identified downbeat positions and labeled the corresponding jangdan pattern. The final configuration allocates 80% of recordings for training (17.92 h), 10% for validation, and 10% for testing (2.24 h). Durations broken down by rhythm pattern are summarised in Table 2.
3.2. Input Representation
In this study, the input representation uses a log-magnitude spectrogram with a 2048-sample Hann window and a 100 ms hop size at a 22,050 Hz sampling rate, resulting in a 10 Hz frame rate. This setup is tailored to the Pansori genre, whose highly expressive, irregular, and sparse downbeat patterns develop over long timescales (often several seconds), unlike the regular, fast pulses common in Western music. Typical MIR beat/downbeat systems employ finer resolutions (e.g., 10–20 ms hops, or 50–100 Hz frame rates) to align with narrow tolerances (±70 ms), but these can oversample irrelevant micro-timing details in Pansori and add noise that masks broader, narrative-driven rhythms. Instead, the 100 ms hop provides sufficient sampling for slow-changing structures, aligning with MIR practices that use 5–20 Hz frame rates for slowly evolving features in low-tempo music, while improving efficiency and matching the genre's jangdan cycles. For spectrogram extraction, all audio was converted to mono.
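As a concrete check of these parameters, the hop size and frame rate follow directly from the stated sampling rate; the variable names below are illustrative, and the log-magnitude step is sketched rather than taken from the paper's code:

```python
import math

SR = 22050          # sampling rate in Hz
N_FFT = 2048        # Hann window length in samples
HOP_SECONDS = 0.1   # 100 ms hop

hop_length = int(round(SR * HOP_SECONDS))   # 2205 samples
frame_rate = SR / hop_length                # 10.0 Hz, as stated in the text

# Frames covering one 60 s training segment at this frame rate:
frames_per_minute = math.ceil(60 * frame_rate)

# Log-magnitude compression (applied per STFT bin) is typically of the form:
log_mag = lambda m: math.log1p(m)           # log(1 + |X|), one common choice
```

With librosa, the equivalent call would be along the lines of `librosa.stft(y, n_fft=2048, hop_length=2205, window="hann")`, though the exact extraction code is not specified in the text.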
3.3. Model Architectures and Postprocessing
Two model families were evaluated for downbeat tracking: a Temporal Convolutional Network (TCN) [5] and a RoFormer [6]. Both architectures share a three-layer convolutional front-end using 3 × 3 kernels with ELU activations. The TCN employs 16 filters in each convolutional layer, whereas the RoFormer uses 64. A 1 × 3 max-pooling operation reduces frequency resolution and produces a sequence of feature vectors suitable for temporal modeling.
The TCN architecture, shown in Figure 1, comprises ten dilated one-dimensional convolutional layers with exponentially increasing dilation factors from 2^0 to 2^9. The offline variant uses symmetric padding to provide non-causal context, whereas the online variant employs causal convolutions with look-ahead cropping to ensure real-time feasibility. This structure enables the model to integrate both short-range rhythmic cues and longer-range cyclical dependencies typical of jangdan patterns.
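The long-range context such a dilation schedule affords can be quantified by the receptive field of the stack. The kernel length below is an assumption for illustration (the text specifies only the exponential dilation growth), and the frame rate comes from the input representation:

```python
# Receptive field of a stack of dilated 1-D convolutions:
# each layer adds (kernel_size - 1) * dilation frames of context.
KERNEL_SIZE = 5                             # assumed kernel length, not from the paper
FRAME_RATE = 10.0                           # Hz, from the 100 ms hop

dilations = [2 ** i for i in range(10)]     # 1, 2, 4, ..., 512 (doubling per layer)
receptive_field = 1 + (KERNEL_SIZE - 1) * sum(dilations)   # in frames
context_seconds = receptive_field / FRAME_RATE
```

Even at the coarse 10 Hz frame rate, the doubling schedule yields several minutes of temporal context, which is what allows the model to span the slow, multi-second jangdan cycles.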
The RoFormer model, shown in Figure 2, employs a six-layer transformer encoder with a hidden size of 64 and four attention heads. Rotary positional embeddings (RoPE) are used to encode temporal relationships in a way that accommodates varying tempi, and a feed-forward expansion factor of four increases representational capacity. As with the TCN, the online RoFormer applies causal masking, whereas the offline version operates without temporal restrictions.
Frame-level downbeat activation probabilities produced by both models were refined using a rhythm-aware Dynamic Bayesian Network (DBN) based on the implementation in the madmom library [20]. The DBN incorporates jangdan-specific tempo constraints between 5 and 60 DBPM and models the latent rhythmic state using bar position, tempo, and rhythm type. Tempo transitions follow a Gaussian prior, while bar positions evolve cyclically according to the tempo. Observation likelihoods are conditioned on the model activations and vary depending on whether the DBN state corresponds to a downbeat, beat, or non-beat position. Offline predictions were obtained through Viterbi decoding; online predictions used a fixed-lag smoothing window to approximate real-time performance.
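The core idea of tempo-constrained decoding can be illustrated with a greatly simplified dynamic program: pick downbeat frames that maximize accumulated activation while keeping every inter-downbeat gap inside a jangdan-specific range. The real DBN additionally models tempo transitions, intra-bar beat positions, and rhythm type; this sketch keeps only the constraint mechanism:

```python
def decode_downbeats(act, min_gap, max_gap):
    """Pick downbeat frames maximizing summed activation, with every
    inter-downbeat gap in [min_gap, max_gap] frames (min_gap >= 1)."""
    n = len(act)
    best = [float("-inf")] * n   # best[i]: best score ending with a downbeat at i
    prev = [-1] * n
    for i in range(n):
        best[i] = act[i]         # a path may start at frame i
        for j in range(max(0, i - max_gap), i - min_gap + 1):
            if best[j] + act[i] > best[i]:
                best[i] = best[j] + act[i]
                prev[i] = j
    # Backtrack from the best-scoring final downbeat.
    i = max(range(n), key=lambda k: best[k])
    path = []
    while i >= 0:
        path.append(i)
        i = prev[i]
    return path[::-1]
```

For a 10 Hz frame rate, the 5–60 DBPM prior translates to gaps between 10 frames (1 s) and 120 frames (12 s), which is how implausibly fast or slow downbeat sequences are pruned during decoding.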
3.4. Training and Inference
All models were trained using binary cross-entropy loss optimized with the Adam algorithm. Training segments were one minute in length to expose the models to diverse tempo fluctuations and expressive phrasing across jangdan. Early stopping based on validation loss prevented overfitting. During inference, predicted downbeat activations were constrained to the allowed tempo range and decoded with the jangdan-aware DBN. Evaluation was conducted on entire recordings rather than the 1 min chunks used during training.
4. Experiments and Results
4.1. Downbeat Tracking
4.1.1. Experimental Setup
During training, Pansori recordings were segmented into 1 min chunks to expose models to a variety of rhythmic transitions within and across jangdan patterns. For evaluation, the models were tested on full-length recordings to assess their robustness to tempo modulations and metric shifts.
To evaluate tracking performance, we employed the F-measure. The F-measure is defined as the harmonic mean of precision and recall, where an estimated beat is considered a true positive if it falls within a predefined temporal tolerance window of a reference beat [21]. While conventional beat-tracking evaluations in Western music typically use a ±70 ms window, our study adopts a wider tolerance of ±1.5 s to better reflect the expressive timing characteristics of Pansori. Accordingly, a predicted downbeat was deemed correct if it occurred within this window of the ground-truth annotation. The metrics were computed using the mir_eval Python library (Python 3.8.12, mir_eval 0.8.2).
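The tolerance-window criterion can be made concrete with a small sketch; the greedy two-pointer matching below is a simplification of mir_eval's one-to-one event matching, not its exact algorithm:

```python
def f_measure(reference, estimated, tolerance=1.5):
    """F-measure with a +/- tolerance window (seconds), simplified matching."""
    ref, est = sorted(reference), sorted(estimated)
    hits, i, j = 0, 0, 0
    while i < len(ref) and j < len(est):
        if abs(ref[i] - est[j]) <= tolerance:
            hits += 1          # match this pair, consume both events
            i += 1
            j += 1
        elif est[j] < ref[i]:
            j += 1             # estimate too early to match anything later
        else:
            i += 1             # reference has no estimate within the window
    if not ref or not est or hits == 0:
        return 0.0
    precision = hits / len(est)
    recall = hits / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Widening `tolerance` from 0.07 s to 1.5 s is exactly the knob discussed above: with expressive Pansori timing, many predictions fall outside a ±70 ms window yet still mark the correct jangdan boundary.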
All experiments were run on an Ubuntu Linux server (x86_64 architecture, single CPU socket, 1 NUMA node) equipped with an Intel Core i9-9900K CPU (8 cores/16 threads, 3.6–5.0 GHz), 64 GB RAM, and an NVIDIA GeForce RTX 4080 GPU. Models were implemented in Python with PyTorch 1.7.1; audio preprocessing, feature extraction, baseline beat/downbeat tracking, and evaluation pipelines relied on librosa and madmom.
We compared our models against the RNN-DBN model from madmom [20], using it as a baseline after adapting its decoder to match jangdan-specific tempo constraints. For each rhythm pattern, we estimated beat BPMs by assuming three to four beats per bar and derived the corresponding minimum and maximum downbeat BPM (DBPM) ranges. These tempo priors were integrated into the DBN decoder to constrain state transitions during inference.
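The conversion from beat tempo to downbeat tempo is a single division; the beat-tempo values in the usage example are illustrative placeholders, not the paper's actual priors:

```python
def dbpm_range(beat_bpm_min, beat_bpm_max, beats_per_bar):
    """One downbeat per bar, so DBPM = beat BPM / beats per bar."""
    return beat_bpm_min / beats_per_bar, beat_bpm_max / beats_per_bar

# e.g., a hypothetical pattern counted in 4 with beat tempos of 80-160 BPM:
lo, hi = dbpm_range(80, 160, 4)
```

Ranges derived this way per jangdan are what populate the 5–60 DBPM interval used by the decoder.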
4.1.2. Results
Table 3 shows the F-measures obtained by each model across the different jangdan patterns. While the madmom baseline performed reasonably well on faster rhythms such as Jajinmori and Jungjungmori (F = 0.724), it struggled with slower, expressive cycles such as Jinyangjo (F = 0.257) and Eotjungmori (F = 0.478). In contrast, the Offline TCN achieved F-measures above 0.96 for all patterns, and the Offline RoFormer peaked at 0.991 on Jungjungmori.
The Offline RoFormer consistently outperformed all other models, particularly on slow, expressive patterns such as Jinyangjo (0.990) and Jungjungmori (0.991). The Offline TCN also performed well, achieving 0.989 on Jungjungmori with a simpler architecture. Online models, though slightly less accurate due to their causal constraints, remained competitive with their offline counterparts.
Table 4 reports results from exactly the same setup, but evaluated on isolated vocals only.
Several jangdan patterns, including Hwimori, Eotmori, and Eotjungmori, were intentionally excluded from training and used exclusively for evaluation due to data availability constraints. This experimental design was adopted to explicitly assess cross-pattern generalization: we aimed to examine whether models trained on a limited subset of jangdan could transfer to unseen rhythmic structures encountered in real performance settings.
While we do not claim that such generalization will always hold, the empirical results indicate that the proposed framework maintains stable performance on these unseen patterns. This suggests that the model learns rhythm-invariant acoustic cues beyond memorizing pattern-specific templates. At the same time, we acknowledge that tempo priors embedded in the DBN decoder play a significant role in this transferability. By constraining decoding with plausible tempo ranges, the system avoids implausible predictions in rubato passages, which may partially account for the observed robustness. We emphasize that these findings reflect empirical observations on the available dataset. Further validation on larger and more diverse corpora will be required to confirm the generality of this behavior.
While evaluating multiple tolerance windows can provide additional insight into temporal accuracy, reporting such metrics for all model variants would substantially increase result complexity without proportional analytical benefit. We therefore conduct tolerance-window analysis only on the best-performing model. As shown in Table 5, the F-measure decreases monotonically as the tolerance narrows, indicating that the model captures coarse temporal structure rather than millisecond-level precision. The sharp drop at ±70 ms reflects substantial expressive deviation in Pansori timing, making frame-accurate alignment unrealistic. Faster patterns such as Hwimori and Eotjungmori retain higher scores under strict tolerances, whereas slower patterns such as Jinyangjo degrade rapidly due to stronger rubato. These results justify our use of wider tolerance windows and highlight fundamental differences between Pansori and Western metrical timing, shaped by the underlying jangdan structure.
4.2. Beat Level Energy Analysis
To analyze beat-level energy dynamics, we implemented A-weighted energy labeling. The frame-wise RMS energy was calculated using the STFT and weighted by a perceptual A-weighting filter. The beat-level energies were then aggregated using Gaussian smoothing. A pause threshold was defined using energy percentiles. Beats below this threshold were labeled as pauses; the others were binned into five energy levels (very low to very high) using log-scaled percentile edges. The per-beat distributions were computed by histogramming the frame energies within each beat. A pseudocode for the steps involved in this analysis is presented in Algorithm 1.
Algorithm 1: A-weighted Beat Energy Labeling
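The labeling steps can be sketched in plain Python. The A-weighting curve below is the standard IEC 61672 formula; the pause percentile and the five log-spaced bin edges are illustrative assumptions, since the exact thresholds are not given in the text:

```python
import math

def a_weight_db(f):
    """Standard A-weighting gain in dB at frequency f (Hz), IEC 61672."""
    ra = (12194.0 ** 2 * f ** 4) / (
        (f ** 2 + 20.6 ** 2)
        * math.sqrt((f ** 2 + 107.7 ** 2) * (f ** 2 + 737.9 ** 2))
        * (f ** 2 + 12194.0 ** 2)
    )
    return 20.0 * math.log10(ra) + 2.0   # +2.0 dB normalizes to 0 dB at 1 kHz

def label_beats(beat_energies, pause_percentile=10):
    """Label per-beat energies as 'pause' or one of five levels VL..VH.
    Percentile choices are illustrative, not the paper's exact values."""
    ordered = sorted(beat_energies)
    idx = min(len(ordered) - 1, int(pause_percentile / 100.0 * len(ordered)))
    pause_thr = ordered[idx]
    active = [e for e in beat_energies if e > pause_thr]
    # Five bins with log-scaled edges over the non-pause energy range.
    lo, hi = math.log(min(active)), math.log(max(active))
    edges = [lo + k * (hi - lo) / 5 for k in range(1, 5)]
    names = ["VL", "L", "M", "H", "VH"]
    labels = []
    for e in beat_energies:
        if e <= pause_thr:
            labels.append("pause")
        else:
            labels.append(names[sum(math.log(e) > edge for edge in edges)])
    return labels
```

In the full pipeline, `beat_energies` would come from A-weighted, Gaussian-smoothed frame RMS values aggregated per beat, as described above.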
Figure 3 shows the distribution of vocal energy labels across relative beat positions within jangdan units. Each stacked bar represents the average proportion of pause, VL, L, M, H, and VH labels at each beat position, where position 0 corresponds to the downbeat, i.e., the jangdan boundary. The analysis is conducted on annotated jangdan sections whose distribution across rhythmic patterns is detailed in Table 6.
Dominant labels are annotated above the bars, revealing characteristic energy contours over the course of each jangdan cycle and highlighting contrasts between low-energy and accented beats. These contours reflect singers’ interpretive choices in aligning vocal intensity with narrative content and emotional intent within each jangdan. For example, prominent high-energy accents at specific beat positions in faster cycles such as Jajinmori and Hwimori often coincide with dramatic climaxes or emphatic textual delivery, whereas the broader low-to-mid energy distribution in slower cycles such as Jinyangjo supports sustained sorrowful or majestic expression. These patterns illustrate how performers systematically modulate vocal dynamics across a downbeat period (equivalently, across a jangdan) to enhance storytelling and affective impact.
The subdivision of each downbeat (jangdan) into beats is performed by uniformly dividing each jangdan interval into 18 parts for Jinyangjo and 12 parts for all other patterns, following established cultural priors.
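This subdivision rule reduces to uniform division of each downbeat interval; the helper below is a minimal sketch of that step (function and variable names are illustrative):

```python
SUBDIVISIONS = {"Jinyangjo": 18}   # all other jangdan patterns default to 12

def beat_times(downbeat_start, downbeat_end, pattern):
    """Uniformly subdivide one jangdan interval into its beat positions."""
    n = SUBDIVISIONS.get(pattern, 12)
    step = (downbeat_end - downbeat_start) / n
    return [downbeat_start + k * step for k in range(n)]
```

Applied between every pair of consecutive tracked downbeats, this yields the relative beat positions (0 to n - 1) used in Figure 3, with position 0 on the downbeat itself.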
This analysis was motivated by the observation that each jangdan exhibits a distinct pattern of lyrical alignment, in which syllables often correspond to specific beat positions with characteristic energy levels. Studying the overlap between vocal syllables and beat-level energy distributions may help to uncover the latent rhythmic structures or phrase boundaries inherent to each pattern. A detailed investigation of this alignment is left for future work.
4.3. Tempo Variation Across Downbeat Sections
Tempo analysis of Pansori performances is critical for understanding the dynamic interplay between jangdan rhythmic cycles and narrative progression. The jangdan patterns, such as Jinyangjo, Jajinmori, Jungmori, and Hwimori, exhibit distinct tempo characteristics that reflect the emotional and structural nuances of the music. These cycles, ranging from slow and meditative (e.g., Jinyangjo, typically 5–20 downbeats per minute) to fast and energetic (e.g., Hwimori, up to 60 downbeats per minute), introduce significant variability in tempo across downbeat sections, posing unique challenges for automated tracking systems.
To analyze tempo variations, our framework extracts downbeat-aligned tempo estimates from annotated Pansori recordings. We employ a multi-stage approach: first, we track downbeats using the best-performing downbeat tracking model; we then estimate tempo from the inter-downbeat intervals, each of which we treat as one jangdan, over the duration of the recording. Tempo was computed as the reciprocal of the median interval between consecutive downbeats within each section, expressed in downbeats per minute (DBPM). This approach accounts for the asymmetric and expressive timing inherent in Pansori, where microtiming deviations and vocal ornamentation can obscure the regular pulse structures that a simple beats-per-minute measurement would assume.
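The median-interval tempo estimate described above amounts to a one-line computation over the tracked downbeat timestamps:

```python
from statistics import median

def section_dbpm(downbeat_times):
    """Section tempo in downbeats per minute from downbeat timestamps (s),
    using the median inter-downbeat interval for robustness to rubato."""
    intervals = [b - a for a, b in zip(downbeat_times, downbeat_times[1:])]
    return 60.0 / median(intervals)
```

Using the median rather than the mean keeps a single stretched or compressed cycle (e.g., around a dramatic pause) from skewing the section's tempo estimate.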
The tempo distributions in Figure 4 present examples drawn from selected representative segments of the dataset, illustrating characteristic intra-cycle tempo fluctuations within individual jangdan patterns. These examples are intended as qualitative exemplars rather than statistical norms. Pansori performers exhibit substantial variability in tempo shaping depending on narrative context, emotional intent, and individual interpretative style; therefore, the displayed distributions should not be interpreted as population-level averages.
The proposed jangdan-aware Dynamic Bayesian Network (DBN) further refines tempo estimation by incorporating rhythmic priors specific to each jangdan type. By explicitly modeling expected tempo ranges and transition probabilities between cycle sections, the DBN mitigates estimation errors caused by irregular subdivisions and drummer interjections (chuimsae). To avoid overgeneralization, we clarify that the samples shown in Figure 4 correspond to specific selected passages rather than aggregated statistics across the full corpus. They serve as illustrative case studies of typical expressive tempo behavior within individual performances and of how the presented method can be applied.
This tempo analysis enables several downstream applications. First, it supports precise alignment of lyric boundary events (buchimsae) by correlating tempo shifts with narrative transitions. Second, it facilitates the development of interactive performance systems, such as an artificial gosu that adapts to real-time tempo variation. Finally, it contributes to musicological research by providing quantitative evidence of the expressive role of jangdan tempo in Pansori, supporting digital archiving and pedagogical applications. Although such a system is not implemented in the present study, the reported gains establish its technical feasibility. This direction is further supported by prior work such as JukeDrummer [22], which demonstrates that beat-embedding representations can successfully drive adaptive virtual percussion agents. Similar beat-aware accompaniment and generation systems have been explored in related literature, confirming the viability of rhythm-conditioned interactive music systems.
Finally, the statistically consistent tempo contours and intra-cycle variation patterns extracted across trials provide quantitative evidence of the expressive role of jangdan tempo in Pansori. While these applications are left for future work, the present results establish a validated analytical foundation for musicological research, as well as for digital archiving and pedagogical tools.
5. Conclusions
This study introduced a rhythm-aware downbeat tracking framework tailored for Korean Pansori music, effectively addressing its expressive timing, sparse accompaniment, and diverse jangdan cycles. By integrating Temporal Convolutional Networks (TCNs), RoFormer-based transformers, and a jangdan-aware DBN, our approach achieved state-of-the-art performance, with the offline RoFormer model significantly outperforming general-purpose trackers even when those baselines were given the same jangdan-aware post-processing. The A-weighted energy labeling method further revealed distinct intensity patterns, deepening insights into Pansori's rhythmic and expressive dynamics. In particular, the observed tempo and energy variations across jangdan cycles highlight singers' interpretive strategies for aligning rhythmic structure with narrative content and emotional expression. This framework advances music information retrieval (MIR) for non-Western traditions, offering a robust foundation for musicological analysis, digital archiving, and interactive performance systems. By leveraging culturally informed priors, our work underscores the potential of MIR to preserve and promote intangible cultural heritage such as Pansori, paving the way for further innovations in culturally sensitive music processing.