Global Structure-Aware Drum Transcription Based on Self-Attention Mechanisms

This paper describes an automatic drum transcription (ADT) method that directly estimates a tatum-level drum score from a music signal, in contrast to most conventional ADT methods that estimate the frame-level onset probabilities of drums. To estimate a tatum-level score, we propose a deep transcription model that consists of a frame-level encoder for extracting the latent features from a music signal and a tatum-level decoder for estimating a drum score from the latent features pooled at the tatum level. To capture the global repetitive structure of drum scores, which is difficult to learn with a recurrent neural network (RNN), we introduce a self-attention mechanism with tatum-synchronous positional encoding into the decoder. To mitigate the difficulty of training the self-attention-based model from an insufficient amount of paired data and improve the musical naturalness of the estimated scores, we propose a regularized training method that uses a global structure-aware masked language (score) model with a self-attention mechanism pretrained from an extensive collection of drum scores. Experimental results showed that the proposed regularized model outperformed the conventional RNN-based model in terms of the tatum-level error rate and the frame-level F-measure, even when the amount of paired data was so limited that the non-regularized model underperformed the RNN-based model.


Introduction
Automatic drum transcription (ADT) is one of the most important sub-tasks in automatic music transcription (AMT) because the drum part forms the rhythmic backbone of popular music. In this study, we deal with the three main instruments of the basic drum kit: bass drum (BD), snare drum (SD), and hi-hats (HH). Since these drums produce unpitched impulsive sounds, only the onset times are of interest in ADT. The standard approach to ADT is to estimate the activations (gains) or onset probabilities of each drum from a music spectrogram at the frame level and then determine the onset frames using an optimal path search algorithm based on some cost function [1]. Although the ultimate goal of ADT is to estimate a human-readable symbolic drum score, few studies have attempted to estimate the onset times of drums quantized at the tatum level. In this paper, the "tatum" is defined as a tick position on the sixteenth-note-level grid (four times finer than the "beat" on the quarter-note-level grid) and the tatum times are assumed to be estimated in advance [2].
Nonnegative matrix factorization (NMF) and deep learning have been used for frame-level ADT [3]. The time-frequency spectrogram of a percussive part, which can be separated from a music spectrogram [4,5], has a low-rank structure because it is composed of repeated drum sounds with varying gains. This has motivated the use of NMF or its convolutional variants for ADT [6][7][8][9][10], where the basis spectra or spectrogram templates of drums are prepared and their frame-level activations are estimated in a semi-supervised manner. The NMF-based approach is a physically reasonable choice, but the supervised DNN-based approach has recently gained much attention because of its superior performance. Comprehensive experimental comparisons of DNN- and NMF-based ADT methods have been reported in [3]. Convolutional neural networks (CNNs), for example, have been used for extracting local time-frequency features from an input spectrogram [11][12][13][14][15]. Recurrent neural networks (RNNs) are expected to learn the temporal dynamics inherent in music and have successfully been used, often in combination with CNNs, for estimating the smooth onset probabilities of drum sounds at the frame level [16][17][18]. This approach, however, cannot learn musically meaningful drum patterns in the symbolic domain, and the tatum-level quantization of the estimated onset probabilities in an independent post-processing step often yields musically unnatural drum notes.
To solve this problem, Ishizuka et al. [19] attempted to use the encoder-decoder architecture [20,21] for frame-to-tatum ADT. The model consisted of a CNN-based frame-level encoder for extracting the latent features from a drum-part spectrogram and an RNN-based tatum-level decoder for estimating the onset probabilities of drums from the latent features pooled at the tatum level. This was inspired by the end-to-end approach to automatic speech recognition (ASR), where the encoder acted as an acoustic model to extract the latent features from speech signals and the decoder acted as a language model to estimate the grammatically coherent word sequences [22]. Unlike ASR models, the model used the temporal pooling and had no attention mechanism that connected the frame-level encoder to the tatum-level decoder, i.e., that aligned the frame-level acoustic features with the tatum-level drum notes, because the tatum times were given. Although the tatum-level decoder was capable of learning musically meaningful drum patterns and favored musically natural drum notes as its output, the performance of ADT was limited by the amount of paired data of music signals and drum scores.
Transfer learning [23][24][25] is a way of using external non-paired drum scores for improving the generalization capability of the encoder-decoder model. For example, the encoder-decoder model could be trained in a regularized manner such that the output score was close to the ground-truth drum score and at the same time was preferred by a language model pretrained from an extensive collection of drum scores [19]. More specifically, a repetition-aware bi-gram model and a gated recurrent unit (GRU) model were used as language models for evaluating the probability (musical naturalness) of a drum score. Assuming drum patterns were repeated with an interval of four beats as was often the case with the 4/4 time signature, the bi-gram model predicted the onset activations at each tatum by referring to those at the tatum four beats ago. The GRU model worked better than the bi-gram model because it had no assumption about the time signature and could learn the sequential dependency of tatum-level onset activations. Although the grammatical knowledge learned by the GRU model was expected to be transferred into the RNN-based decoder, such RNN-based models still could not learn the repetitive structure of drum patterns on the global time scale.
To overcome this limitation, in this paper we propose a global structure-aware frame-to-tatum ADT method based on an encoder-decoder model with a self-attention mechanism and transfer learning (Fig. 1), inspired by the success in sequence-to-sequence tasks such as machine translation and ASR. More specifically, our model involves a tatum-level decoder with a self-attention mechanism, where the architecture of the decoder is similar to that of the encoder of the transformer [26], because the input and output dimensions of the decoder are the same. To consider the temporal regularity of tatums for the self-attention computation, we propose a new type of positional encoding synchronized with the tatum times. Our model is trained in a regularized manner such that the model output (drum score) is preferred by a masked language model (MLM) with a self-attention mechanism that evaluates the pseudo-probability of the drum notes at each tatum based on both the forward and backward contexts. We experimentally validate the effectiveness of the self-attention mechanism used in the decoder and/or the language model and that of the tatum-synchronous positional encoding. We also investigate the computational efficiency of the proposed ADT method and compare it with that of the conventional RNN-based ADT method.
In Section 2 of this paper, we introduce related work on ADT and language modeling. Section 3 describes the proposed method, and Section 4 reports the experimental results. We conclude in Section 5 with a brief summary and mention of future work.

Related Work
This section reviews related work on ADT (Section 2.1), global structure-aware language models (Section 2.2), and evaluation metrics for transcribed musical scores (Section 2.3).

Automatic Drum Transcription (ADT)
Some studies have attempted to use knowledge learned from an extensive collection of unpaired drum scores to improve ADT. A language model can be trained from such data in an unsupervised manner and used to encourage a transcription model to estimate a musically natural drum pattern. Thompson et al. [27] used a template-based language model for classifying audio signals into a limited number of drum patterns with a support vector machine (SVM). Wu et al. [28] proposed a framework of knowledge distillation [29], one way to achieve transfer learning [30], in which an NMF-based teacher model was applied to a DNN-based student model.
Language models have been actively used in the field of ASR. In a classical ASR system consisting of independently trained acoustic and language models, the language model is used in combination with the acoustic model in the decoding stage to generate syntactically and semantically natural word sequences. The implementation of the decoder, however, is highly complicated. In an end-to-end ASR system with no clear distinction between acoustic and language models, only paired data can be used for training an integrated model. Transfer learning is a promising way of making effective use of a language model trained from a huge amount of unpaired text data [23]. For example, a pretrained language model is used for softening the target word distribution of paired speech data such that not only ground-truth transcriptions but also their semantically coherent variations are taken into account as target data in the supervised training [25].
A frame-level language model has been used in AMT. Raczyński et al. [31] used a deep belief network for modeling transitions of chord symbols and improved the chord recognition method based on NMF.
Sigtia et al. [32] used a language model for estimating the most likely chord sequence from the chord posterior probabilities estimated by an RNN-based chord recognition system. As pointed out in [33,34], however, language models can more effectively be formulated at the tatum level for learning musically meaningful structures. Korzeniowski et al. [35] used an N-gram model as a symbolic language model and improved a DNN-based chord recognition system. Korzeniowski et al. [36] used an RNN-based symbolic language model together with a duration model, based on the idea that a frame-level language model can only smooth the onset probabilities of chord symbols. Ycart et al. [37] investigated the predictive power of long short-term memory (LSTM) networks and demonstrated that an LSTM working at the level of sixteenth-note timesteps could express musical structures such as note transitions.
A tatum-level language model has also been used in ADT. Ueda et al. [10] proposed a Bayesian approach using a DNN-based language model as a prior of drum scores. Ishizuka et al. [19] proposed a regularized training method with an RNN-based pretrained language model to output musically natural drum patterns. However, these tatum-level language models cannot learn global structures, although the drum parts exhibit repetitive structure in music signals.

Global Structure-Aware Language Model
The attention mechanism is a core technology for global structure-aware sequence-to-sequence learning. In the standard encoder-decoder architecture, the encoder extracts latent features from an input sequence and the decoder recursively generates a variable number of output symbols one by one while referring to the whole latent features with attention weights [38,39]. In general, the encoder and decoder are implemented as RNNs to consider the sequential dependency of input and output symbols. Instead, the self-attention mechanism that can extract global structure-aware latent features from a single sequence can be incorporated into the encoder and decoder, leading to a non-autoregressive model called the transformer [26] suitable for parallel computation in the training phase.
To represent the ordinal information of input symbols, positional encoding vectors in addition to the input sequence are fed to the self-attention mechanism. Absolute positional encoding vectors are used in a non-recursive sequence-to-sequence model based entirely on CNNs [40]. Predefined trigonometric functions with different frequencies were proposed for representing absolute position information [26]. Sine and cosine functions are expected to implicitly learn the positional relationships of symbols based on the hypothesis that there exists a linear transformation that arbitrarily changes the phase of the trigonometric functions. There are some studies on relative position embeddings [41][42][43].
Recently, various methods for pretraining global structure-aware language models have been proposed. Embeddings from language models (ELMo) [44] is a feature-based pretraining model that combines forward and backward RNNs at the final layer to use bidirectional contexts. However, the forward and backward inferences are separated, and the computation is time-consuming because of the recursive learning process. Generative pretrained transformer (GPT) [45] and bidirectional encoder representations from transformers (BERT) [46] are pretrained models based on fine-tuning. GPT is a variant of the transformer trained by preventing the self-attention mechanism from referring to future information. However, the inference is limited to a single direction. BERT can jointly learn bidirectional contexts, and thus the masked language model (MLM) obtained by BERT is categorized as a bidirectional language model. Because the exact perplexity of such a bidirectional model is difficult to calculate, the pseudo-perplexity of inferred word sequences is computed as described in [47].
In music generation, music transformer [48] uses a relative attention mechanism to learn long-term structure along with a new algorithm to reduce the memory requirement. Pop music transformer [49] adopts transformer-XL to leverage longer-range information along with a new data representation that expresses the rhythmic and harmonic structure of music. Transformer variational autoencoder [50] enables the joint learning of local representation and global structure based on hierarchical modeling. Harmony transformer [51] improves chord recognition by integrating chord segmentation with a non-autoregressive decoding method in the framework of musical harmony analysis. In ADT, however, very few studies have focused on learning long-term dependencies, even though drum patterns exhibit a distinctively repetitive structure.

Evaluation Metrics for AMT
Most work in AMT has conducted frame-level evaluation of the detected onset times of target musical instruments. Poliner et al. [52] proposed two comprehensive metrics for frame-level piano transcription. The first is the accuracy rate defined according to Dixon's work [53]. The second is the frame-level transcription error score inspired by the evaluation metric used in multiparty speech activity detection.
Some recent studies have focused on tatum-and symbol-level evaluations. Nishikimi et al. [54] used the tatum-level error rate based on the Levenshtein distance in automatic singing transcription. Nakamura et al. [55] conducted symbol-level evaluation for a piano transcription system consisting of multi-pitch detection and rhythm quantization. In the multi-pitch detection stage, an acoustic model estimates note events represented by pitches, onset and offset times (in seconds), and velocities. In the rhythm quantization stage, a metrical HMM with Gaussian noise (noisy metrical HMM) quantizes the note events on a tatum grid, followed by note-value and hand-part estimation. McLeod et al. [56] proposed a quantitative metric called MV2H for both multipitch detection and musical analysis. Similar to Nakamura's work, this metric aims to evaluate a complete musical score with instrument parts, a time signature and metrical structure, note values, and harmonic information. The MV2H metric is based on the principle that a single error should be penalized once in the evaluation phase. In the context of ADT, in contrast, tatum-and symbol-level metrics have scarcely been investigated.

Proposed Method
Our goal is to estimate a drum score Ŷ ∈ {0, 1}^{M×N} from the mel spectrogram of a target musical piece X ∈ ℝ₊^{F×T}, where M is the number of drum instruments (BD, SD, and HH, i.e., M = 3), N is the number of tatums, F is the number of frequency bins, and T is the number of time frames. We assume that all onset times are located on the tatum-level grid and that the tatum times B = {b_n}_{n=1}^{N}, where 1 ≤ b_n ≤ T and b_n < b_{n+1}, are estimated in advance.
In Section 3.1, we explain the configuration of the encoder-decoder-based transcription model. Section 3.2 describes the masked language model as a bidirectional language model, along with the bi-gram- and GRU-based language models as unidirectional language models. The regularization method is explained in Section 3.3.

Transcription Models
The transcription model is used for estimating the tatum-level onset probabilities φ ∈ [0, 1]^{M×N}, where φ_{m,n} represents the probability that drum m has an onset at tatum n. The drum score Ŷ is obtained by binarizing φ with a threshold δ ∈ [0, 1].
The encoder of the transcription model is implemented with a CNN. The mel spectrogram X is converted to frame-level latent features F ∈ ℝ^{D_F×T}, where D_F is the feature dimension. The frame-level latent features F are then summarized into tatum-level latent features G ∈ ℝ^{D_F×N} through a max-pooling layer referring to the tatum times B as follows:

G_{:,n} = max{ F_{:,t} | (b_{n−1} + b_n)/2 ≤ t < (b_n + b_{n+1})/2 },

where b_0 = b_1 and b_{N+1} = b_N are introduced for brevity.
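A minimal NumPy sketch of this tatum-level max-pooling, assuming each tatum gathers the frames between the midpoints of the neighboring tatum times (the function and variable names are ours):

```python
import numpy as np

def tatum_max_pool(F, tatums):
    """Pool frame-level features F (D_F x T) into tatum-level features G (D_F x N).

    `tatums` holds the frame indices b_1..b_N; the padding b_0 = b_1 and
    b_{N+1} = b_N mirrors the boundary convention used in the paper.
    """
    b = np.concatenate([[tatums[0]], tatums, [tatums[-1]]])
    N = len(tatums)
    G = np.empty((F.shape[0], N))
    for n in range(1, N + 1):
        lo = (b[n - 1] + b[n]) // 2                 # midpoint with previous tatum
        hi = max((b[n] + b[n + 1]) // 2, lo + 1)    # midpoint with next tatum; >= 1 frame
        G[:, n - 1] = F[:, lo:hi].max(axis=1)       # max over the tatum's frame span
    return G
```

Unlike average pooling, max pooling keeps the sharp onset evidence of a drum hit wherever it falls within the tatum's span.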
The decoder of the transcription model is implemented with a bidirectional GRU (BiGRU) or a self-attention mechanism (SelfAtt) followed by a fully connected layer. The intermediate features G are directly converted to the onset probabilities at the tatum level. In the self-attention-based decoder, the onset probabilities are estimated without recursive computation. To learn the sequential dependency and global structure of drum scores, positional encodings E ∈ ℝ^{D_F×N} are added to the latent features G to obtain extended latent features Z = G + E ∈ ℝ^{D_F×N}. The standard positional encodings proposed in [26] are given by

E_{2d,n} = sin(n / 10000^{2d/D_F}), E_{2d+1,n} = cos(n / 10000^{2d/D_F}).

In this paper, we propose tatum-synchronous positional encodings (denoted SyncPE), in which the periods of the trigonometric functions increase linearly with the feature index:

E_{d,n} = sin(nπ / (⌊d/2⌋ + 1)) (d even), E_{d,n} = cos(nπ / (⌊d/2⌋ + 1)) (d odd),

where ⌊·⌋ represents the floor function. As shown in Fig. 2, non-linear stripe patterns appear in the encodings proposed in [26] because the period of the trigonometric functions increases exponentially with respect to the latent feature index, whereas the proposed tatum-synchronous encodings exhibit linear stripe patterns. As shown in Fig. 3, the extended features Z are converted to the onset probabilities φ through a stack of L self-attention mechanisms with I heads [26] and layer normalization (Pre-Norm) [57], which was proposed for the simple and stable training of transformer models [58,59]. For each head i (1 ≤ i ≤ I), the query, key, and value matrices Q_i, K_i, V_i ∈ ℝ^{D_k×N} are given by

Q_i = W_i^Q Z, K_i = W_i^K Z, V_i = W_i^V Z,

where D_k is the feature dimension of each head (D_k = D_F/I in this paper, as in [26]) and W_i^Q, W_i^K, W_i^V ∈ ℝ^{D_k×D_F} are learnable projection matrices. Let α_i ∈ ℝ^{N×N} be a self-attention matrix consisting of the degrees of self-relevance of the extended latent features Z, which is given by

α_i = softmax(Q_i^⊤ K_i / √D_k).

The head outputs V_i α_i^⊤ are concatenated into features H ∈ ℝ^{D_F×N}. The extended latent features Z and the extracted features H with dropout (p = 0.1) [60] are then fed into a feed-forward network (FFN) with a rectified linear unit (ReLU), and the onset probabilities are obtained as

φ = σ(W H′ + b),

where H′ ∈ ℝ^{D_F×N} denotes the FFN output, σ(·) is a sigmoid function, W ∈ ℝ^{M×D_F} is a weight matrix, and b ∈ ℝ^M is a bias vector.
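The tatum-synchronous encodings can be sketched in NumPy as follows. This is an illustrative reading, not a verbatim reproduction of the paper's equation: we assume the period of each sine/cosine pair grows linearly with the feature index as ⌊d/2⌋ + 1, which produces the linear stripe patterns described in the text.

```python
import numpy as np

def sync_positional_encoding(N, D):
    """Tatum-synchronous positional encoding (a sketch).

    Returns E of shape (D, N). Even rows use sine, odd rows cosine, and the
    period of row d is proportional to floor(d / 2) + 1 (linear in d), unlike
    the exponentially growing periods of the standard encoding [26].
    """
    E = np.zeros((D, N))
    n = np.arange(N)
    for d in range(D):
        phase = n * np.pi / ((d // 2) + 1)
        E[d] = np.sin(phase) if d % 2 == 0 else np.cos(phase)
    return E
```

Because the periods are tied to integer multiples of the tatum index rather than to absolute time, the encoding repeats in lockstep with the tatum grid.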

Language Models
The language model is used for estimating the generative probability (musical naturalness) of an arbitrary existing drum score Ỹ.¹ In this study, we use unidirectional language models, i.e., the repetition-aware bi-gram model and the GRU-based model proposed in [19], and a masked language model (MLM), a bidirectional language model proposed for pretraining in BERT [46] (Fig. 4).
The unidirectional language model is trained beforehand in an unsupervised manner such that the following negative log-likelihood L^uni_lang(Ỹ) is minimized:

L^uni_lang(Ỹ) = −Σ_{n=1}^{N} log p(Ỹ_{:,n} | Ỹ_{:,1:n−1}),

where "i:j" represents a set of indices from i to j and ":" represents all possible indices. In the repetition-aware bi-gram model (top figure in Fig. 4), which assumes that target musical pieces have the 4/4 time signature, the repetitive structure of a drum score is formulated as follows:

p(Ỹ_{m,n} | Ỹ_{m,1:n−1}) = π_{Ỹ_{m,n−16}, Ỹ_{m,n}},

where π_{A,B} (A, B ∈ {0, 1}) represents the transition probability from A to B and the interval of 16 tatums corresponds to four beats. Note that this model assumes the independence of the M drums. In the GRU model (middle figure in Fig. 4), p(Ỹ_{:,n} | Ỹ_{:,1:n−1}) is directly calculated using an RNN. The MLM is capable of learning the global structure of drum scores (bottom figure in Fig. 4). In the training phase, the drum activations at randomly selected 15% of the tatums in Ỹ are masked and the MLM is trained such that those activations are predicted as accurately as possible. The loss function L^bi_lang(Ỹ) to be minimized is given by

p̂(Ỹ_{:,n}) = p(Ỹ_{:,n} | Ỹ_{:,1:n−1}, Ỹ_{:,n+1:N}),
L^bi_lang(Ỹ) = −Σ_{n∈M̃} log p̂(Ỹ_{:,n}),

where M̃ denotes the set of masked tatums.

¹ For brevity, we assume that only one drum score is used as training data. In practice, a sufficient number of drum scores are used.
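As an illustration of the repetition-aware bi-gram model, the following NumPy sketch scores a drum score by referring to the activations 16 tatums (four beats, assuming 4 tatums per beat) earlier. The function name, the skipping of the first 16 tatums, and the sharing of one transition table across all drums are our simplifications:

```python
import numpy as np

def bigram_nll(Y, pi, lag=16):
    """Negative log-likelihood under a repetition-aware bi-gram model (a sketch).

    Y:  (M, N) binary drum score.
    pi: 2x2 table; pi[a][b] is the probability of activation b at tatum n given
        activation a at tatum n - lag (one measure earlier in 4/4).
    """
    M, N = Y.shape
    nll = 0.0
    for m in range(M):
        for n in range(lag, N):
            a, b = Y[m, n - lag], Y[m, n]
            nll -= np.log(pi[a][b])
    return nll
```

A perfectly measure-periodic score is scored highly (low NLL) when the self-transition probabilities pi[0][0] and pi[1][1] are large.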

Regularized Training
To consider the musical naturalness of the estimated score Ŷ obtained by binarizing φ, we use the language-model-based regularized training method [19] that minimizes

L_*(φ) = L_tran(φ|Y) + γ L^*_lang(Ŷ),

where Y is a ground-truth score, γ > 0 is a weighting factor, the symbol * denotes "uni" or "bi", and L_tran(φ|Y) is the modified negative log-likelihood given by

L_tran(φ|Y) = −Σ_{m=1}^{M} Σ_{n=1}^{N} { β_m Y_{m,n} log φ_{m,n} + (1 − Y_{m,n}) log(1 − φ_{m,n}) },

where β_m > 0 is a weighting factor compensating for the imbalance between the numbers of onset and non-onset tatums.
To use backpropagation for optimizing the transcription model, the binary score Ŷ should be obtained from the soft representation φ in a differentiable manner instead of simply binarizing φ with a threshold. We thus use a differentiable sampler called the Gumbel-sigmoid trick [61] as follows:

Ŷ_{m,n} = σ( (log φ_{m,n} + g^{(1)}_{m,n} − log(1 − φ_{m,n}) − g^{(2)}_{m,n}) / τ ),
g^{(k)}_{m,n} = −log(−log u^{(k)}_{m,n}), u^{(k)}_{m,n} ∼ Uniform(0, 1),

where k = 1, 2 and τ > 0 is a temperature (τ = 0.2 in this paper). Note that the pretrained language model is used as a fixed regularizer in the training phase and is not used in the prediction phase.
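A minimal NumPy sketch of the Gumbel-sigmoid sampler described above; the function name and the fixed random seed are ours:

```python
import numpy as np

def gumbel_sigmoid(phi, tau=0.2, rng=None):
    """Differentiable approximate binarization of onset probabilities (a sketch).

    Two Gumbel noises g1, g2 perturb the logit of phi, and the temperature tau
    controls how close the output is to a hard 0/1 sample.
    """
    rng = rng or np.random.default_rng(0)
    g1 = -np.log(-np.log(rng.uniform(size=phi.shape)))  # Gumbel(0, 1) noise
    g2 = -np.log(-np.log(rng.uniform(size=phi.shape)))
    logit = np.log(phi) + g1 - np.log1p(-phi) - g2
    return 1.0 / (1.0 + np.exp(-logit / tau))
```

As tau decreases toward 0 the outputs concentrate near 0 and 1, approximating hard binarization while remaining differentiable with respect to phi.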

Evaluation
This section reports the comparative experiments conducted for evaluating the proposed ADT method and investigates the effectiveness of the self-attention mechanism and that of the MLM-based regularized training.

Evaluation Data
We used the Slakh2100-split2 (Slakh) [62] and the RWC Popular Music Database (RWC) [63] for evaluation because these datasets include ground-truth beat times. The Slakh dataset contains 2100 musical pieces whose audio signals were synthesized from the Lakh MIDI dataset [64] using professional-grade virtual instruments, and the RWC dataset contains 100 Japanese popular songs. All music signals were sampled at 44.1 kHz. The onset times of BD, SD, and HH (M = 3) were extracted as ground-truth data from the synchronized MIDI files provided for these datasets. To make ground-truth drum scores, each onset time was quantized to the closest tatum time (the justification of this approach is discussed in Section 4.4). Only musical pieces whose drum onset times and tatum times had been annotated correctly as ground-truth data were used for evaluation.
For the Slakh dataset, we used 2010 pieces, which were split into 1449, 358, and 203 pieces as training, validation, and test data, respectively. For the RWC dataset, we used 65 songs for 10-fold cross validation, where 15% of the training data in each fold was used as validation data. Since we aim to validate the effectiveness of the language model-based regularized training for self-attention-based transcription models on a single dataset (Slakh or RWC), investigation of the cross-corpus generalization capability (portability) of language models is beyond the scope of this paper and left as future work.
For each music signal, a drum signal was separated with Spleeter [5] and the tatum times were estimated with madmom [2] or given as oracle data. The spectrogram of a music or drum signal was obtained using short-time Fourier transform (STFT) with a Hann window of 2048 points (46 ms) and a shifting interval of 441 points (10 ms). We used mel spectrograms as input features because they have successfully been used for onset detection [11] and CNN-based ADT [14]. The mel spectrogram was computed using a mel filter bank with 80 bands from 20 Hz to 20,000 Hz and normalized so that the maximum volume was 0 dB. A stack of music and drum mel spectrograms was fed into a transcription model.
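The front-end analysis can be sketched as follows with the stated STFT settings; the mel filter bank (80 bands, 20 Hz to 20 kHz) and the dB normalization are omitted for brevity, and the function name is ours:

```python
import numpy as np

def stft_frames(x, win=2048, hop=441):
    """STFT magnitude with the paper's settings: a 2048-point Hann window
    (46 ms at 44.1 kHz) and a 441-sample hop (10 ms). The mel filtering
    would be applied on top of the returned magnitudes."""
    window = np.hanning(win)
    n_frames = 1 + (len(x) - win) // hop
    S = np.empty((win // 2 + 1, n_frames))
    for t in range(n_frames):
        frame = x[t * hop : t * hop + win] * window
        S[:, t] = np.abs(np.fft.rfft(frame))
    return S
```

One second of 44.1 kHz audio thus yields 96 frames of 1025 frequency bins before mel compression to 80 bands.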

Model Configurations
The configurations of the two transcription models (CNN-BiGRU and CNN-SelfAtt(-SyncPE) described in Section 3.1) are shown in Figure 5 and Table 1. The encoders were the same CNN consisting of four convolutional layers with a kernel size of 3 × 3, and the decoders were based on the BiGRU and the multi-head self-attention mechanism. The influential hyperparameters were automatically determined with a Bayesian optimizer called Optuna [65] for the validation data in the Slakh dataset or with 10-fold cross validation in the RWC dataset under the condition that D_FFN = 4D_F. As a result, CNN-SelfAtt had about twice as many parameters as CNN-BiGRU. In the training phase, the batch size was set to 10, and the maximum sequence length was set to 256 for CNN-SelfAtt. We used the AdamW optimizer [66] with an initial learning rate of 10⁻³. The learning rate of CNN-SelfAtt was changed according to [26], and warmup_steps was set to 4000. To prevent over-fitting, we used weight regularization (λ = 10⁻⁴), dropout just before all the fully connected layers of CNN-BiGRU (p = 0.2) and at each layer of CNN-SelfAtt (p = 0.1), and tatum-level SpecAugment [67] for the RWC dataset, where 15% of all tatums were masked in the training phase. The weights of the convolutional and BiGRU layers were initialized based on [68], the fully connected layers were initialized by sampling from Uniform(0, 1), and the biases were initialized to 0. In the testing phase, the average of the ten parameter sets before and after the epoch that achieved the smallest loss for the validation data was used in CNN-SelfAtt. The threshold for φ was set to δ = 0.2. The configurations of the three language models (bi-gram, GRU, and MLM(-SyncPE) described in Section 3.2) are shown in Table 2. Each model was trained with 512 external drum scores of Japanese popular songs and Beatles songs. To investigate the impact of the data size on predictive performance, each model was also trained using only 51 randomly selected scores.
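The learning-rate schedule of [26] with warmup_steps = 4000 can be sketched as follows; the model dimension of 256 is an illustrative placeholder, since the actual latent dimension was determined with Optuna:

```python
def transformer_lr(step, d_model=256, warmup=4000):
    """Warmup-then-decay schedule from [26] (a sketch).

    The rate grows linearly for the first `warmup` steps and then decays
    proportionally to the inverse square root of the step number.
    """
    step = max(step, 1)  # guard against step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```

The schedule peaks exactly at step = warmup, where the two terms inside min() coincide.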
The influential hyperparameters of the neural language models, i.e., the number of layers and the dimension of hidden states in the GRU model and I, L, D_F, and D_FFN in the MLM, were automatically determined with Optuna [65] via 3-fold cross validation on the 512 scores under the condition that D_FFN = 4D_F. As a result, the MLM had about twice as many parameters as the GRU model. The bi-gram model was defined by only two parameters, π_{0,1} and π_{1,1}.

Evaluation Metrics
The performance of ADT was evaluated at the frame and tatum levels. The frame-level F-measure F is defined as the harmonic mean of the precision rate P and the recall rate R:

P = N_C / N_E, R = N_C / N_G, F = 2PR / (P + R),

where N_E, N_G, and N_C are the numbers of estimated onset times, ground-truth onset times, and correctly estimated onset times, respectively, and the error tolerance was set to 50 ms. Note that F = 100% means perfect transcription. For the tatum-level evaluation, we propose a tatum-level error rate (TER) based on the Levenshtein distance. Note that all the estimated drum scores were concatenated and then the frame-level F-measure and the TER were computed for the whole dataset. As shown in Fig. 6, the TER between a ground-truth score Y_{1:N} ∈ {0, 1}^{M×N} with N tatums and an estimated score Ŷ_{1:N̂} with N̂ tatums is computed by dynamic programming, where the substitution cost S(Y_n, Ŷ_n̂) is the Manhattan distance between the ground-truth activations Y_n at tatum n and the estimated activations Ŷ_n̂ at tatum n̂. Note that N̂ might differ from N when the tatum times were estimated with madmom and that TER(Y_{1:N}, Ŷ_{1:N̂}) = 0 does not mean perfect transcription, as discussed in Section 4.4. Comprehensive note-level evaluation measures have been proposed for AMT [55,56] but were not used in our experiment because ADT focuses on only the onset times of note events.
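A dynamic-programming sketch of the Levenshtein-based TER follows. The substitution cost is the Manhattan distance between tatum-level activation vectors, as in the text; charging M per inserted or deleted tatum and returning the unnormalized distance are our assumptions for illustration:

```python
import numpy as np

def tatum_error_rate(Y, Yhat):
    """Levenshtein distance between two tatum-level drum scores (a sketch).

    Y:    (M, N)  ground-truth binary score.
    Yhat: (M, Nh) estimated binary score (Nh may differ from N).
    Substituting a tatum costs the Manhattan distance between its activation
    vectors; inserting/deleting a tatum costs M (one per drum) by assumption.
    """
    M, N = Y.shape
    _, Nh = Yhat.shape
    D = np.zeros((N + 1, Nh + 1))
    D[:, 0] = np.arange(N + 1) * M
    D[0, :] = np.arange(Nh + 1) * M
    for n in range(1, N + 1):
        for h in range(1, Nh + 1):
            sub = np.abs(Y[:, n - 1] - Yhat[:, h - 1]).sum()
            D[n, h] = min(D[n - 1, h] + M,      # delete a ground-truth tatum
                          D[n, h - 1] + M,      # insert an estimated tatum
                          D[n - 1, h - 1] + sub)  # substitute
    return D[N, Nh]
```

With equal lengths and no tatum misalignment, the distance reduces to the number of wrongly estimated activations.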

Justification of Tatum-Level Drum Transcription
We validated the appropriateness of tatum-level ADT because some actual onset times cannot be detected in principle under the assumption that the onset times of each drum are exclusively located on the sixteenth-note-level grid. As shown in Fig. 7, such undetectable onsets are categorized into two possibly overlapping groups: conflict and far. If multiple onset times are close to the same tatum time, only one of them can be detected, i.e., the other onset times are undetectable and categorized into the conflict group. Onset times that are not within 50 ms of the closest tatum time are categorized into the far group. In tatum-level ADT, the onset times of these groups remain undetected even when TER(Y_{1:N}, Ŷ_{1:N̂}) = 0. Table 3 shows the ratio of undetectable onset times in each group to the total number of actual onset times when the estimated or ground-truth beat times were used for quantization. The total ratio of undetectable onset times was sufficiently low. This justifies the sixteenth-note-level quantization of onset times, at least for the majority of typical popular songs used in our experiment. Note that our model cannot deal with triplet notes.
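The categorization of undetectable onsets can be sketched as follows; the function name and the tie-breaking rule (keeping the first onset snapped to each tatum) are our choices:

```python
import numpy as np

def undetectable_onsets(onsets, tatum_times, tol=0.05):
    """Classify actual onset times (in seconds) as 'conflict' or 'far' (a sketch).

    Each onset snaps to its closest tatum time. Onsets farther than `tol`
    (50 ms) from every tatum are 'far'; when several onsets snap to the same
    tatum, all but one are 'conflict'. The groups may overlap, as in the paper.
    """
    tatum_times = np.asarray(tatum_times)
    conflict, far, used = [], [], set()
    for t in onsets:
        i = int(np.argmin(np.abs(tatum_times - t)))
        if abs(tatum_times[i] - t) > tol:
            far.append(t)
        if i in used:
            conflict.append(t)
        used.add(i)
    return conflict, far
```

Dividing the lengths of the returned lists by the total onset count gives the per-group ratios reported in Table 3.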
Since the tatum times are assumed to be given, we evaluated the beat tracking performance of madmom [2] in terms of the F-measure defined in Section 4.3. The mir_eval library was used for computing P, R, and F. The F-measure for the 203 test pieces in the Slakh dataset was 92.5% and that for the 65 songs in the RWC dataset was 96.4%.

Evaluation of Language Modeling
We evaluated the three language models (bi-gram, GRU, and MLM(-SyncPE) described in Section 3.2) in terms of the perplexities for the 203 pieces in the Slakh dataset and the 65 songs in the RWC dataset. The perplexity for a drum score Ỹ is defined as the exponential of the per-tatum negative log-likelihood:

PPL_*(Ỹ) = exp( L^*_lang(Ỹ) / N ),

where "*" denotes "uni" or "bi" and L^uni_lang(Ỹ) and L^bi_lang(Ỹ) are the loss functions defined in Section 3.2. Since L^bi_lang(Ỹ) based on the MLM does not exactly give the likelihood of Ỹ because of its bidirectional nature, unlike L^uni_lang(Ỹ) based on the autoregressive bi-gram or GRU model, PPL_bi(Ỹ) can only roughly be compared with PPL_uni(Ỹ).
As shown in Table 4, the predictive capability of the GRU model was significantly better than that of the bi-gram model because the bi-gram model is based on the strong assumption that drum patterns are repeated with the 4/4 time signature. The MLM using the proposed positional encodings, denoted by MLM-SyncPE, slightly outperformed the MLM using the conventional encodings. The larger the training dataset, the lower (better) the perplexity. The pseudo-perplexity calculated by the MLM was close to 1, meaning that the MLM accurately predicted the activations at masked tatums from the forward and backward contexts.

Evaluation of Drum Transcription
We evaluated the two transcription models (CNN-BiGRU and CNN-SelfAtt-SyncPE described in Section 3.1) that were trained with and without the regularization mechanism based on each of the three language models (bi-gram, GRU, and MLM-SyncPE described in Section 3.2). For comparison, we tested the conventional frame-level ADT method based on a CNN-BiGRU model [18] that had the same architecture as our tatum-level CNN-BiGRU model (Fig. 5) except that the max-pooling layers were not used. It was trained such that the following frame-level cross entropy was minimized:

L_frame(φ′|Y′) = −Σ_{m=1}^{M} Σ_{t=1}^{T} { β′ Y′_{m,t} log φ′_{m,t} + (1 − Y′_{m,t}) log(1 − φ′_{m,t}) },

where φ′ ∈ [0, 1]^{M×T} and Y′ ∈ {0, 1}^{M×T} are the estimated onset probabilities and the ground-truth binary activations, respectively, and β′ > 0 is a weighting factor. For each drum m, a frame t was picked as an onset if

φ′_{m,t} = max(φ′_{m,t−w_1:t+w_2}), φ′_{m,t} ≥ mean(φ′_{m,t−w_3:t+w_4}) + δ̃, t − t_prev > w_5,

where δ̃ is a threshold, w_{1:5} are interval parameters, and t_prev is the previous onset frame. These were set to δ̃ = 0.2, w_1 = w_3 = w_5 = 2, and w_2 = w_4 = 0, as in [18]. The weighting factors β_m, γ, and β′ were optimized for the validation data in the Slakh dataset and by 10-fold cross validation in the RWC dataset, as shown in Table 5. To measure the tatum-level transcription performance, the estimated frame-level onset times were quantized at the tatum level with reference to the estimated or ground-truth tatum times.
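The frame-level peak-picking rule attributed to [18] can be sketched as follows; the exact window boundary conventions are assumptions on our part:

```python
import numpy as np

def pick_onsets(phi, delta=0.2, w1=2, w2=0, w3=2, w4=0, w5=2):
    """Peak picking over a 1-D onset-probability curve (a sketch of [18]).

    Frame t is an onset if phi[t] is the maximum over [t - w1, t + w2],
    exceeds the local mean over [t - w3, t + w4] by delta, and lies more
    than w5 frames after the previously picked onset.
    """
    onsets, t_prev = [], -np.inf
    for t in range(len(phi)):
        lo1, hi1 = max(0, t - w1), t + w2 + 1
        lo3, hi3 = max(0, t - w3), t + w4 + 1
        if (phi[t] == phi[lo1:hi1].max()
                and phi[t] >= phi[lo3:hi3].mean() + delta
                and t - t_prev > w5):
            onsets.append(t)
            t_prev = t
    return onsets
```

With w2 = w4 = 0, both windows look only backward in time, so the rule can run online with a two-frame latency.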
As shown in Table 6, CNN-SelfAtt-SyncPE worked best for the Slakh dataset, and CNN-BiGRU with the MLM-SyncPE-based regularization worked best for the RWC dataset, in terms of both the F-measure and the TER. This suggests that a sufficient amount of paired data is required to draw out the full potential of the self-attention mechanism. CNN-SelfAtt-SyncPE tended to yield higher or comparable F-measures for BD and SD in the RWC dataset and lower F-measures for HH in both datasets. This was presumably because percussive sounds in the high frequency range were not cleanly separated by Spleeter (some noise remained), and because the self-attention-based model tended to detect other instruments similar to HH, such as maracas. The MLM-SyncPE-based regularization yielded only a small improvement for the Slakh dataset because even the non-regularized model worked well on this synthesized dataset with limited acoustic and timbral variety. In contrast, it yielded a significant improvement over the bi-gram- and GRU-based regularization for the RWC dataset, but required a much longer training time because every iteration costs O(C × N), where C represents the batch size. Note that if enough memory is available, the MLM-based regularization can be computed in parallel in O(1). CNN-BiGRU with the MLM-based regularization required the longest training time despite achieving the highest performance. The frame-level CNN-BiGRU model [18] required a much longer training time than our frame-to-tatum model.

Investigation of Self-Attention Mechanism
We further investigated the behaviors of the self-attention mechanisms used in the transcription and language models. To validate the effectiveness of the proposed tatum-synchronous positional encodings (SyncPE), we compared two versions of the proposed transcription model, denoted by CNN-SelfAtt and CNN-SelfAtt-SyncPE, in terms of the F-measure and TER. As shown in Table 7, CNN-SelfAtt-SyncPE always outperformed CNN-SelfAtt by a large margin. To investigate the impact of the data size used for training the transcription models (CNN-BiGRU and CNN-SelfAtt-SyncPE), we compared the performances obtained by using 1/32, 1/16, 1/4, 1/2, and all of the training data in the Slakh or RWC dataset. As shown in Fig. 8, CNN-SelfAtt-SyncPE was severely affected by the data size, and CNN-BiGRU worked better than CNN-SelfAtt-SyncPE when only a small amount of paired data was available. To investigate the impact of the data size used for training the language models (bi-gram, GRU, and MLM-SyncPE), we compared the performances obtained by CNN-SelfAtt-SyncPE regularized with a language model pretrained on 512 or 51 external drum scores. As shown in Table 8, the language models pretrained on the larger number of drum scores achieved higher performance. The effect of the MLM-SyncPE-based regularization depended heavily on the data size, whereas the bi-gram model was scarcely affected by it.
We confirmed that CNN-SelfAtt-SyncPE with the MLM-SyncPE-based regularization learned the global structures of drum scores through its attention matrices and yielded globally coherent drum scores. Fig. 9 shows examples of attention matrices. In Slakh-Track01930, both the global structure of drums and the repetitive structure of BD were learned successfully. In RWC-MDB-P-2001 No. 25, the repetitive structures of SD and HH were captured. These examples demonstrate that the attention matrices at each layer and head can capture different structural characteristics of drums. Fig. 10 shows examples of estimated drum scores. In RWC-MDB-P-2001 No. 25, the MLM-SyncPE-based regularization improved PPL_bi (a measure of musical unnaturalness) and encouraged CNN-SelfAtt-SyncPE to learn the repetitive structures of BD, SD, and HH. In RWC-MDB-P-2001 No. 40, in contrast, although the MLM-SyncPE-based regularization also improved PPL_bi, it yielded an oversimplified score for HH.

Conclusion
In this paper, we described a global structure-aware frame-to-tatum ADT method based on self-attention mechanisms. The transcription model consists of a frame-level convolutional encoder for extracting the latent features of music signals and a tatum-level self-attention-based decoder for considering musically meaningful global structure, and is trained in a regularized manner based on a pretrained MLM. Experimental results showed that the proposed regularized model outperformed the conventional RNN-based model in terms of the tatum-level error rate and the frame-level F-measure, even when only a limited amount of paired data was available such that the non-regularized model underperformed the RNN-based model. The attention matrices revealed that the self-attention-based model could learn the global and repetitive structure of drums at each layer and head. In future work, we plan to deal with more sophisticated and/or non-regular drum patterns (e.g., fill-ins) played using various kinds of percussive instruments (e.g., cymbals and toms). Considering that beat and downbeat times are closely related to drum patterns, it would be beneficial to integrate beat tracking into ADT in a multi-task learning framework.