
Multitrack Music Transcription Based on Joint Learning of Onset and Frame Streams

Department of Information and Computer Science, Keio University, Yokohama-shi 223-8522, Japan
* Author to whom correspondence should be addressed.
Signals 2026, 7(1), 12; https://doi.org/10.3390/signals7010012
Submission received: 2 December 2025 / Revised: 10 January 2026 / Accepted: 20 January 2026 / Published: 2 February 2026

Abstract

Multitrack music transcription is the task of converting music recordings into symbolic music representations that are assigned to individual instruments. This task requires the simultaneous transcription of note onset and offset events for individual instruments. In addition, the limited size of many transcription datasets makes multitrack music transcription challenging. Thus, even state-of-the-art transcription systems are inadequate for applications requiring high accuracy. In this paper, we propose a framework that jointly transcribes onsets and frames for multiple instruments by integrating a deep learning architecture based on U-Net with an architecture based on Perceiver, a variant of the Transformer architecture. The proposed framework effectively detects the pitches of different instruments by employing the multi-layer combined frequency and periodicity (ML-CFP) representation, which comprises multilayered frequency-domain and quefrency-domain features, as the input data representation. Our experiments demonstrate that the proposed multitrack music transcription system outperforms existing systems on five transcription datasets, including low-resource datasets. Furthermore, we evaluate the proposed system in terms of instrument type and show that it provides high-quality transcription results for the predominant instruments.

1. Introduction

Automatic music transcription (AMT), which aims to create symbolic music representations from music recordings, is a fundamental task in the field of music signal processing. A successful AMT system has the potential to enable a wide range of interactions between humans and music, e.g., automatic instrument tutoring, automatic music accompaniment, and music content visualization [1]. Multitrack music transcription is a core task in AMT; the goal of this task is to estimate note attributes such as pitch, onset, and offset and group them into streams, where each stream corresponds to one instrument. However, several factors make multitrack AMT highly challenging. First, note attributes need to be transcribed simultaneously from a mixture of multiple music sources with different timbres, volumes, and recording environments. Second, polyphonic music signals have complex spectral patterns due to the interference of different signal components. Third, annotated datasets for multitrack AMT are scarce due to the difficulty of data collection; furthermore, the available datasets are typically highly imbalanced in terms of instruments and can be incorrectly labeled.
Early AMT studies typically limited their scope to the estimation of framewise pitch events and employed transcription systems that are applicable and adaptable to a wide range of instruments. However, the performance of these systems is clearly lower than that of human experts [2]. Deep learning techniques have had a significant impact on AMT, as well as various other tasks in the fields of image processing, speech processing, and natural language processing. AMT systems based on deep learning typically jointly predict framewise pitch events and note onset events. Most deep learning-based AMT studies have focused on transcribing solo piano recordings with abundant annotated data to achieve high performance [3,4,5,6,7]. In addition, several studies have transcribed music recordings containing multiple instruments without distinguishing the different instruments [8,9,10,11,12,13,14]. Although recent studies have begun to address multitrack AMT [15,16,17,18], even state-of-the-art AMT systems have not reached the level of performance provided by human experts.
The progress of AMT with deep learning has largely been driven by the adoption of model architectures developed in the fields of image processing and natural language processing [19]. Representative examples include deep learning architectures based on a hybrid of convolutional neural networks (CNNs) and recurrent neural networks (RNNs) for object recognition [20], U-Net for image segmentation [21], Transformer for machine translation [6], T5 for sequence-to-sequence tasks [22], and Perceiver for arbitrary modalities [18]. CNN-based architectures, including U-Net [23], have been used to extract time–frequency features from music signals. Furthermore, RNNs are effective networks for capturing temporal continuity and contextual information in music. Transformer-based architectures, including T5 [24] and Perceiver [25], enable AMT models to capture long-term dependencies in the temporal and spectral domains. Another key factor in learning note attributes from music recordings is the input data representation. To capture frequency-domain features based on human pitch perception, log-Mel spectrograms [3,16] and constant-Q spectrograms [5,20] are widely used as input data representations for the AMT task. Other examples include the combined frequency and periodicity (CFP) [15] and harmonic constant-Q transform (HCQT) [10]. CFP enables effective pitch tracking by linking frequency-domain and quefrency-domain features [26]. In contrast, HCQT captures harmonic relationships by aligning harmonics-related frequencies [27]. Utilizing an input data representation with multiple features, such as CFP and HCQT, can improve the AMT performance [28].
In this paper, we propose a multitrack AMT framework that integrates a residual U-Net (ResUnet) with a hierarchical Perceiver. ResUnet facilitates the propagation of time–frequency information from an input data representation without degradation. The hierarchical Perceiver is based on Perceiver TF [18], which contains three Transformer-style modules to extract spectral, temporal, and note attribute information. The integration of these architectures enables the proposed framework to effectively convert an input data representation into a piano roll representation. The proposed framework employs the multi-layer CFP (ML-CFP) [29], which comprises multilayered frequency-domain and quefrency-domain features, as the input data representation. Because ML-CFP contains feature layers organized by pitch height, it can emphasize pitch saliency with high reliability even for stacked harmonics. The proposed AMT model is trained jointly to predict the onsets and frames for each stream.
We evaluate the proposed AMT system in comparison with MT3 [16], which is a state-of-the-art multitrack AMT system. This experiment uses a piano music transcription dataset (i.e., the MAESTRO dataset [30]), a guitar music transcription dataset (i.e., the GuitarSet dataset [31]), and three multitrack AMT datasets (i.e., the MusicNet [32], URMP [33], and Slakh [34] datasets) containing 11, 13, and 34 instrument classes, respectively. The results demonstrate that the proposed AMT system outperforms the baseline system. We also perform ablation studies to investigate the effectiveness of the model architecture and the input data representation. Furthermore, we evaluate the proposed AMT system in terms of instrument type to show the differences in AMT performance for the different resources of each instrument.
The main contributions of this paper are summarized as follows:
  • The proposed framework integrates the ResUnet and hierarchical Perceiver architectures into a multitrack AMT system.
  • ML-CFP is employed as the input data representation for the deep learning architecture.
  • The proposed AMT system achieves state-of-the-art performance compared with an existing multitrack AMT system on five AMT datasets of different sizes, styles, and instrument types.
  • An evaluation of the proposed AMT system in terms of instrument type shows that it provides high-quality transcription results for the predominant instruments.
The rest of this paper is organized as follows: Section 2 reviews the related work on multitrack AMT. Section 3 presents the proposed multitrack AMT system. Section 4 evaluates the performance of the proposed system. Finally, Section 5 concludes the paper.

2. Related Work

Many existing AMT systems are designed to convert an input data representation into a piano roll representation with the same time resolution. One example is the piano transcription system [20] based on a convolutional recurrent neural network comprising an acoustic model that identifies the pitches in a frame of audio and a music language model that captures the temporal structure of music. In particular, the deep learning model [3], which focuses on both framewise pitch events and note onset events, has had a strong influence on AMT research. For example, state-of-the-art piano transcription systems [5,6,7] predict frames, onsets, offsets, and velocities separately. As a similar example of multitrack AMT, the multi-instrument AMT system [15] predicts onsets and frames based on the idea that AMT is an image segmentation task on time–frequency images. In addition, Perceiver TF [18] jointly predicts onsets and frames based on the augmented Perceiver architecture [25], which introduces a hierarchical extension with an additional Transformer layer to model temporal coherence. The clustering-based AMT system [35] estimates the piano rolls of arbitrary instrument parts using a joint spectrogram and pitchgram clustering method for multitrack AMT of unknown instruments. In this work, we employ a piano roll-based multitrack AMT system that converts ML-CFP into a piano roll representation.
Different from piano roll-based AMT systems, some AMT systems have been designed to convert an input data representation into a sequence of symbolic tokens inspired by the MIDI specification. For example, the piano transcription system [22] predicts note tokens, velocity tokens, time tokens, and end-of-sequence (EOS) tokens based on T5 [24]. MT3 [16] predicts additional symbolic tokens such as instrument tokens and end-of-tie-section tokens to handle the multitrack AMT task. Several multitrack AMT frameworks derived from MT3 have also been developed [36,37,38].
Various data augmentation strategies have been investigated to address the lack of annotated datasets for multitrack AMT. Perceiver TF [18] adopts the random-mixing augmentation technique, which aims to separate each instrument stem from the input audio mixture. MT3 [16] incorporates intra-stem augmentation that selectively mutes stems within a multitrack recording in the Slakh dataset to generate several variations. The AMT framework [36], which is derived from MT3, combines recordings of real monophonic music to create artificial and musically incoherent mixtures. MR-MT3 [37] applies the token shuffling technique, which shuffles the sequence of symbolic tokens while preserving the same transcription. YourMT3+ [38] incorporates cross-dataset stem augmentation that creates a new mixture of stems from multiple datasets, as well as intra-stem augmentation. In addition, several self-supervised and semi-supervised models have been developed to leverage the huge amount of available unlabeled music recordings. ReconVAT [9] uses spectrogram reconstruction and virtual adversarial training (VAT) to improve transcription accuracy with unlabeled data and a limited amount of labeled data. NoteEM [17] uses synthetic data and unaligned supervision to train a transcription model and align the scores to their corresponding performances simultaneously through the Expectation Maximization (EM) algorithm. DiffRoll [39] is pre-trained on unpaired datasets where only piano rolls are available to generate realistic-looking piano rolls from pure Gaussian noise conditioned on spectrograms. The annotation-free AMT system [40] exploits adversarial domain confusion using scalable synthetic and unannotated real audio. We demonstrate that the proposed system achieves outstanding performance without relying on data augmentation strategies or additional unlabeled data.

3. Transcription System

The goal of multitrack AMT is to predict streamwise note attributes from music recordings. First, the proposed multitrack AMT system transforms a given music recording into a data representation for a deep learning model. Then, the deep learning model converts the data representation into two types of representations with the same temporal resolution, i.e., onset streams and frame streams. During inference, the system predicts a sequence of note events containing pitch, onset, offset, and instrument class from the onset and frame streams.
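For concreteness, a predicted note event can be represented by a simple record; the field names below are illustrative and not taken from the authors' implementation.

```python
from dataclasses import dataclass

@dataclass
class NoteEvent:
    pitch: int         # MIDI note number (0-127)
    onset: float       # onset time in seconds
    offset: float      # offset time in seconds
    instrument: str    # instrument class label, e.g., "Acoustic Piano"
```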

3.1. Data Representation

The proposed system employs a data representation based on ML-CFP [29], a multi-channel feature representation that captures the frequency, periodicity, and harmonicity information of music signals for reliable discrimination of stacked harmonics. The key idea of ML-CFP is to divide the pitch range of a music signal into several parts and construct a single frequency and periodicity representation for each part, which allows the proposed model to effectively constrain the harmonicity by selecting feature layers according to the target pitch height. Let U, V ∈ ℝ^(N×M) be the sound pressure level (SPL)-based magnitude spectrogram and the cepstrogram, respectively, derived from the zero-padded short-time Fourier transform (STFT) of a music signal normalized to the range [−1, 1], where N and M denote the number of frequency or quefrency bins after zero padding and the number of time frames, respectively. Let ML-CFP have L layers; then the feature representations Z_f^(l), Z_q^(l), Z ∈ ℝ^(K×M) (l = 1, 2, …, L) with K log-frequency bins are derived from U and V:

$$Z_f^{(l)} := M_f\!\left( F W_p^{(l)} V \right),$$

$$Z_q^{(l)} := M_q\!\left( \frac{N}{2\alpha[l]+1} \, F^{-1} W_u^{(l)} P_u^{(l)}\!\left( F W_p^{(l)} V \right) \right),$$

$$Z := M_f(U),$$

where F ∈ ℂ^(N×N) is an N-point discrete Fourier transform matrix, α ∈ ℤ_{≥0}^L is a vector of scaling factors, and P_u^(l)(·) (l = 1, 2, …, L) is a peak-picking operator. W_p^(l), W_u^(l) ∈ ℝ^(N×N) (l = 1, 2, …, L) are a low-pass lifter and a band-pass filter that remove the quefrency and frequency components outside the target pitch range of the l-th layer, respectively. M_f(·) and M_q(·) are operators that map the frequency and quefrency domains, respectively, to a log-frequency domain with δ bins per semitone, ranging from a quarter tone below C-1 (≈7.94 Hz) to a quarter tone above G9 (≈12,911 Hz) (i.e., K = 128δ). In short, Z_f^(l) and Z_q^(l) are the multi-layer frequency and multi-layer periodicity representations that capture the frequency and periodicity components in the target pitch range of a music signal, respectively, and Z is the frequency representation based on SPL. We concatenate the transposed Z_f^(l), Z_q^(l) (l = 1, 2, …, L), and Z along the channel axis to obtain a data representation of size M × K × (2L + 1) for the deep learning model. To train the deep learning model efficiently, the data representation is rescaled to the range [0, 1] for each channel.
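To make the above construction concrete, the following is a minimal, single-layer sketch of CFP-style frequency/periodicity features using NumPy and SciPy. It is not the full ML-CFP of [29]: the layer-wise band-pass filters, peak picking, SPL scaling, and the log-frequency mappings M_f/M_q are omitted, and the window size, FFT length, and lifter cutoff are illustrative assumptions.

```python
import numpy as np
from scipy.signal import stft

def cfp_features(x, fs=44100, win_s=0.24, hop_s=0.02, nfft=16384):
    """Simplified single-layer CFP-style features (illustrative only)."""
    nperseg = int(win_s * fs)
    _, _, X = stft(x, fs=fs, nperseg=nperseg, noverlap=nperseg - int(hop_s * fs),
                   nfft=nfft, boundary=None)
    U = np.abs(X)                                   # magnitude spectrogram, (nfft/2+1, M)
    V = np.fft.irfft(np.log1p(U), n=nfft, axis=0)   # cepstrogram via inverse DFT of the log-spectrum
    # low-pass lifter: keep quefrencies below 12.5 ms (periods of pitches above ~80 Hz)
    quefrency = np.arange(nfft) / fs
    V_lift = V * (quefrency[:, None] < 0.0125)
    # frequency representation: DFT of the liftered cepstrum (real part kept as an approximation)
    Z_f = np.maximum(np.fft.rfft(V_lift, n=nfft, axis=0).real, 0.0)
    # periodicity representation: back to the quefrency domain (band-pass and peak picking omitted)
    Z_q = np.maximum(np.fft.irfft(Z_f, n=nfft, axis=0), 0.0)
    return U, Z_f, Z_q
```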

3.2. Model

Figure 1 illustrates the overall architecture of the proposed multitrack AMT model. First, an ML-CFP segment with T time frames is abstracted into a C-channel time–frequency feature map through a 3 × 3 convolution layer. This feature map is then passed to a ResUnet model consisting of four encoder blocks, one bottleneck block, four decoder blocks, and four skip connections. The output of ResUnet is passed to three hierarchical Perceiver blocks containing spectral cross-attention, temporal Transformer, and latent Transformer modules with a learnable latent array L_0 ∈ ℝ^(2S×D), where S denotes the number of instrument classes, including the “others” class, and D is the channel dimension of the latent array. The projection block shown in Figure 2 aggregates the output of the hierarchical Perceiver blocks into onset and frame streams of size T × S × P, where P (= 128) is the number of MIDI notes. Inspired by the Onsets and Frames model [3], the predicted onset streams are used as additional input to predict the frame streams.
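The following is a minimal TensorFlow sketch of a projection block of this kind. It assumes that the first S slots of the 2S-slot latent array carry onset information and the last S carry frame information, and that the onset predictions are concatenated to the frame branch with the gradient stopped (as in Onsets and Frames); these wiring details are assumptions, not the authors' exact design.

```python
import tensorflow as tf

class ProjectionBlock(tf.keras.layers.Layer):
    """Maps the Perceiver latent array (B, T, 2S, D) to onset/frame streams (B, T, S, P)."""

    def __init__(self, num_instruments, num_pitches=128):
        super().__init__()
        self.num_instruments = num_instruments
        self.onset_head = tf.keras.layers.Dense(num_pitches, activation="sigmoid")
        self.frame_head = tf.keras.layers.Dense(num_pitches, activation="sigmoid")

    def call(self, latent):
        s = self.num_instruments
        onset_latent, frame_latent = latent[:, :, :s, :], latent[:, :, s:, :]
        onsets = self.onset_head(onset_latent)                          # (B, T, S, P)
        # feed the (gradient-stopped) onset predictions into the frame branch
        frame_in = tf.concat([frame_latent, tf.stop_gradient(onsets)], axis=-1)
        frames = self.frame_head(frame_in)                              # (B, T, S, P)
        return onsets, frames
```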
ResUnet [41] is a neural network developed for image semantic segmentation. This network facilitates information propagation without degradation through residual connections within residual blocks and skip connections between the encoder and decoder, which enables the design of deeper networks. The basic unit of ResUnet is a residual block containing two 3 × 3 convolution layers with batch normalization and ReLU activation. To prevent overfitting, dropout is inserted between the convolution layers in the residual block [42]. Figure 3 shows the contraction block used in the encoder and the connection and expansion blocks used in the decoder. The contraction and expansion blocks have symmetric structures; the contraction block downsamples the time–frequency feature map by a factor of two, while the expansion block upsamples it by a factor of two. The transposed convolution layer in the expansion block effectively increases the resolution of the feature map [23]. The connection block combines the upsampled feature map from the expansion block with the downsampled feature map from the contraction block. Note that the output size remains the same as the input size for all residual and connection blocks.
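A minimal Keras sketch of the residual block described above; the exact ordering of batch normalization, ReLU, and dropout, and the 1 × 1 shortcut projection when the channel count changes, are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, channels, dropout_rate=0.1):
    """Two 3x3 convolutions with batch normalization, ReLU, and dropout, plus a residual connection."""
    shortcut = x
    h = layers.Conv2D(channels, 3, padding="same")(x)
    h = layers.BatchNormalization()(h)
    h = layers.ReLU()(h)
    h = layers.Dropout(dropout_rate)(h)        # dropout between the two convolutions [42]
    h = layers.Conv2D(channels, 3, padding="same")(h)
    h = layers.BatchNormalization()(h)
    if shortcut.shape[-1] != channels:         # project the shortcut if the channel count changes
        shortcut = layers.Conv2D(channels, 1, padding="same")(shortcut)
    return layers.ReLU()(layers.Add()([h, shortcut]))
```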
The hierarchical Perceiver block is based on Perceiver TF [18]. Algorithm 1 summarizes the process of a hierarchical Perceiver block. The spectral cross-attention module shown in Figure 4a projects the spectral information from ResUnet into the latent array L_h. Note that the initial latent array L_0 is expanded to a shape of B × T × 2S × D before it is passed to the spectral cross-attention module. The temporal Transformer module exchanges temporal information in the latent array L_h. The latent Transformer module exchanges onset, pitch, and instrument information in the latent array L_h. Both modules are based on the Transformer architecture shown in Figure 4b. By applying the spectral cross-attention, temporal Transformer, and latent Transformer modules sequentially, the proposed multitrack AMT model can learn the global dependencies of the features in the time–frequency domain. The proposed model employs a multi-head attention mechanism [43] with eight attention heads and 32-dimensional queries, keys, and values in all modules.
Algorithm 1 Hierarchical Perceiver Block.
Input: ResUnet output S with a shape of B × T × K × C; h-th latent array L_h with a shape of B × T × 2S × D, where B is the batch size.
Output: (h+1)-th latent array L_(h+1) with a shape of B × T × 2S × D.
  1: Reshape S into (B × T) × K × C and L_h into (B × T) × 2S × D.
  2: Apply the spectral cross-attention module to S and L_h.
  3: Rearrange the above output to (B × 2S) × T × D.
  4: Apply the temporal Transformer module.
  5: Rearrange the above output to (B × T) × 2S × D.
  6: Apply the latent Transformer module.
  7: Reshape the above output into B × T × 2S × D.
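A minimal TensorFlow sketch of Algorithm 1 follows. The residual connections, layer-normalization placement, and the omission of feed-forward sublayers are simplifying assumptions; only the attention pattern and the reshapes follow the algorithm.

```python
import tensorflow as tf

class HierarchicalPerceiverBlock(tf.keras.layers.Layer):
    """One block: spectral cross-attention, then temporal and latent Transformers (Algorithm 1)."""

    def __init__(self, num_heads=8, key_dim=32):
        super().__init__()
        self.spectral_attn = tf.keras.layers.MultiHeadAttention(num_heads, key_dim)
        self.temporal_attn = tf.keras.layers.MultiHeadAttention(num_heads, key_dim)
        self.latent_attn = tf.keras.layers.MultiHeadAttention(num_heads, key_dim)
        self.norm1 = tf.keras.layers.LayerNormalization()
        self.norm2 = tf.keras.layers.LayerNormalization()
        self.norm3 = tf.keras.layers.LayerNormalization()

    def call(self, spec, latent):
        # spec: (B, T, K, C) from ResUnet; latent: (B, T, 2S, D)
        b, t = tf.shape(spec)[0], tf.shape(spec)[1]
        k, c = spec.shape[2], spec.shape[3]
        two_s, d = latent.shape[2], latent.shape[3]
        # steps 1-2: merge batch and time, attend from the latent array over the K frequency bins
        s = tf.reshape(spec, [b * t, k, c])
        l = tf.reshape(latent, [b * t, two_s, d])
        l = self.norm1(l + self.spectral_attn(query=l, value=s, key=s))
        # steps 3-4: rearrange to (B*2S, T, D) and exchange information along time
        l = tf.reshape(l, [b, t, two_s, d])
        l = tf.reshape(tf.transpose(l, [0, 2, 1, 3]), [b * two_s, t, d])
        l = self.norm2(l + self.temporal_attn(query=l, value=l, key=l))
        # steps 5-6: rearrange to (B*T, 2S, D) and exchange information across latent slots
        l = tf.reshape(l, [b, two_s, t, d])
        l = tf.reshape(tf.transpose(l, [0, 2, 1, 3]), [b * t, two_s, d])
        l = self.norm3(l + self.latent_attn(query=l, value=l, key=l))
        # step 7: restore the (B, T, 2S, D) shape
        return tf.reshape(l, [b, t, two_s, d])
```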
The proposed multitrack AMT model is trained using a loss function that sums the binary cross-entropy losses for the onset and frame streams. We assign a positive weight ω to the loss for the onset streams to effectively detect note onsets, since most of the framewise onset label values are 0. While several previous studies have employed onset label smoothing [4,15] to suppress the effects of misalignment between music recordings and labels, we use standard binary onset stream and frame stream labels in the proposed system.
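A minimal sketch of this loss, assuming sigmoid-activated onset/frame predictions; the reduction over the stream dimensions (mean, via Keras defaults) is an assumption.

```python
import tensorflow as tf

def transcription_loss(onset_true, onset_pred, frame_true, frame_pred, omega=10.0):
    """Weighted sum of binary cross-entropies over the onset and frame streams."""
    bce = tf.keras.losses.BinaryCrossentropy()
    return omega * bce(onset_true, onset_pred) + bce(frame_true, frame_pred)
```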

3.3. Inference

During the inference process, we input an ML-CFP representation transformed from a music recording into the trained multitrack AMT model to compute the onset and frame streams, which represent the framewise onset and pitch activation probabilities for each instrument class, respectively. Since the distribution of the onset and frame streams can vary across instrument classes, we apply note detection processing based on the procedure described in Wu et al. [15] to predict a sequence of note events.
The note detection processing consists of five steps: frame reconstruction, global normalization, instrument selection, local normalization, and note inference. The frame reconstruction step shifts the window by an integer step size t_step (1 ≤ t_step ≤ T) to partition the ML-CFP representation into segments with T time frames. Then, the onset and frame streams computed from the segments are reconstructed into piano roll representations with the same temporal resolution as the ML-CFP representation by concatenating them while averaging the overlapping parts. The global normalization step standardizes the onset and frame streams across all instrument classes except the “others” class. The instrument selection step considers an instrument class s to be absent in the recording if the average of the standard deviations of the onset and frame streams for class s is less than the threshold θ_ins. The local normalization step standardizes the onset and frame streams for each selected instrument class. Finally, the note inference step estimates the onsets and durations of the notes from the standardized onset and frame streams. A note onset is detected at a peak position in the onset streams with a height of at least θ_on and a prominence of at least θ_peak, where the minimum time distance between peaks is set to η. The note offset is then determined by searching for the activated positions above the threshold θ_fr in the frame streams corresponding to the detected onset position and locating the position just before a silence interval of at least ξ is detected. If the search range reaches a position corresponding to an adjacent peak in the onset streams, the position just before this peak is selected as the note offset.
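The following sketch illustrates the note inference step for a single (instrument, pitch) activation curve using SciPy's peak picking. The frame rate (50 fps, from the 20 ms hop size in Section 4.2) and the handling of short gaps are assumptions; the other steps (reconstruction, normalization, and instrument selection) are omitted.

```python
import numpy as np
from scipy.signal import find_peaks

FPS = 50  # frames per second, assuming the 20 ms hop size from Section 4.2

def infer_notes(onset, frame, theta_on=6.0, theta_fr=2.5, theta_peak=5.0, eta=0.06, xi=0.08):
    """Detect (onset_time, offset_time) pairs from standardized 1-D onset/frame curves."""
    notes = []
    peaks, _ = find_peaks(onset, height=theta_on, prominence=theta_peak,
                          distance=max(1, int(round(eta * FPS))))
    max_gap = max(1, int(round(xi * FPS)))
    for i, p in enumerate(peaks):
        limit = peaks[i + 1] if i + 1 < len(peaks) else len(frame)  # stop at the adjacent onset peak
        off, gap = p, 0
        for t in range(p, limit):
            if frame[t] >= theta_fr:
                off, gap = t, 0
            else:
                gap += 1
                if gap >= max_gap:  # a silence interval of at least xi ends the note
                    break
        notes.append((p / FPS, (off + 1) / FPS))
    return notes
```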

4. Experiments

4.1. Datasets

To evaluate the performance of the proposed multitrack AMT system, we use the MAESTRO, Slakh, MusicNet, GuitarSet, and URMP datasets, which vary in terms of size and instrument type. Table 1 compares the properties of these five datasets. The MAESTRO dataset [30] consists of piano performances captured with fine alignment between note labels and audio waveforms. The MIDI data specifying note information includes key strike velocities and sustain/sostenuto/una corda pedal positions. Our experiments use the train/validation/test split from V3.0.0 (https://magenta.tensorflow.org/datasets/maestro (accessed on 19 January 2026)). The Slakh dataset [34] consists of high-quality renderings of instrument mixtures and corresponding stems generated from the Lakh MIDI dataset [44] using professional-grade sample-based virtual instruments. Many of these instruments have built-in effects, e.g., reverb, EQ, and compression. In addition, all files in the Slakh dataset contain drums, which are not the focus of this work. Our experiments use the Slakh-redux dataset (https://zenodo.org/records/4599666 (accessed on 19 January 2026)), which omits the duplicate files included in the original Slakh. Note that we map the Slakh-specific instrument classes to MIDI program numbers in the same manner as described in Gardner et al. [16]. The MusicNet dataset [32] consists of freely licensed classical music recordings together with MIDI scores aligned to the recordings using dynamic time warping. Since the standard train/test split for MusicNet does not include a validation set, we use the train/validation/test split shown in Table 2. The GuitarSet dataset [31] consists of high-quality guitar recordings from a hexaphonic pickup with time-aligned annotations. To create the GuitarSet dataset, six guitarists each play two versions (comping and soloing) of the same 30 lead sheets, which are generated from a combination of five styles (Rock, Singer-Songwriter, Bossa Nova, Jazz, and Funk), three progressions (12-bar blues, Autumn Leaves, and Pachelbel’s Canon), and two tempi (slow and fast). Since there is no official train/validation/test split for the GuitarSet dataset, we use the last progression in the Jazz style for validation and the last progression in the Funk style for test. The URMP dataset [33] consists of simple multi-instrument classical music pieces assembled from coordinated but separately recorded performances of individual tracks. This dataset includes both video recordings and MIDI scores, which are not used in our experiments. Since there is no official train/validation/test split for the URMP dataset, we use the split shown in Table 2.

4.2. Settings

The proposed multitrack AMT system transforms all music recordings into ML-CFP representations according to the parameter setting suggested by Matsunaga et al. [29]; for example, the window size and hop size of the STFT are 240 ms and 20 ms, respectively. The proposed model takes an ML-CFP segment of size 256 × 256 × 9 (i.e., T = 256, K = 256, and L = 4) as input, which corresponds to a 5.12-s audio clip. The input channel dimension in ResUnet and the channel dimension of the latent array are set to C = 64 and D = 128, respectively. All dropout layers in ResUnet and the hierarchical Perceiver have a dropout rate of 0.1. We implement the proposed model using TensorFlow. For training, we use the AdamW optimizer [45] with an initial learning rate of 1 × 10⁻⁵, a weight decay rate of 0.1, and a learning rate scheduler with a 5-epoch warmup to 1 × 10⁻⁴. The loss weight for the onset streams is set to ω = 10. The training and validation processes run for 120 epochs with 8000 and 2000 steps per epoch, respectively. A mini-batch for each training or validation step is formed by 4 randomly selected ML-CFP segments. We train the model on an NVIDIA Quadro RTX 8000 GPU with 48 GB VRAM. Training on all five datasets takes around 12 days.
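A minimal sketch of the optimizer setup described above, assuming TensorFlow ≥ 2.11 (where tf.keras.optimizers.AdamW is available); the warmup is linear and the learning rate is held constant afterwards, which is an assumption about the unspecified post-warmup behaviour.

```python
import tensorflow as tf

class WarmupSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
    """Linear warmup from 1e-5 to 1e-4 over 5 epochs x 8000 steps, then constant."""

    def __init__(self, lr_init=1e-5, lr_peak=1e-4, warmup_steps=5 * 8000):
        super().__init__()
        self.lr_init, self.lr_peak, self.warmup_steps = lr_init, lr_peak, warmup_steps

    def __call__(self, step):
        frac = tf.minimum(tf.cast(step, tf.float32) / self.warmup_steps, 1.0)
        return self.lr_init + frac * (self.lr_peak - self.lr_init)

optimizer = tf.keras.optimizers.AdamW(learning_rate=WarmupSchedule(), weight_decay=0.1)
```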
Table 3 lists the parameter settings used for the inference process described in Section 3.3. All parameters are tuned empirically on the validation set to maximize overall AMT performance. The average inference time is approximately 0.02 times the length of the original recording.

4.3. Baselines

We compare the proposed system based on joint learning of onset and frame streams (referred to as OaFS) with MT3 [16], which is a state-of-the-art multitrack AMT system. MT3 is based on T5 [24], which is a generic Transformer architecture for sequence-to-sequence tasks. We replicate MT3 using the Colab notebook (https://github.com/magenta/mt3/blob/main/mt3/colab/music_transcription_with_transformers.ipynb (accessed on 19 January 2026)) provided by the authors. This baseline model is trained on the Cerberus4 dataset [16,46], which is derived from the Slakh dataset, in addition to the MAESTRO, Slakh, MusicNet, GuitarSet, and URMP datasets. Note that the training process for the baseline system on the Slakh dataset chooses 10 random subsets of at least four instruments from each of the MIDI files, expanding the number of training samples by a factor of 10; OaFS does not employ such a data augmentation strategy. In addition, to improve readability in the piano roll representation, OaFS ignores audio effects, e.g., the sustain pedal, pitch wheel, and modulation wheel. This setting influences the differences in the transcription results between OaFS and MT3 (see Section 4.5).
To investigate the effectiveness of the model architecture and input data representation employed in the OaFS system, we consider the following two scenarios:
  • OaFS-ResUnet: Using an AMT model based on only ResUnet.
  • OaFS-CFP: Using CFP [26] as the input data representation.
OaFS-ResUnet is an OaFS system that does not use hierarchical Perceiver blocks. This system aggregates the feature map from ResUnet into onset and frame streams of size T × P × S through a 1 × 1 convolution layer with a sigmoid activation function and a 1 × 2 max pooling layer. Similar to the projection block shown in Figure 2, the predicted onset streams are used as additional input to predict the frame streams. OaFS-CFP sets the parameters of CFP according to the multi-instrument AMT system [15]. Since the multi-instrument AMT system predicts onsets and frames based on a U-Net architecture that employs CFP as the input data representation, OaFS-CFP can be considered an extension of this system. We also train the OaFS model on each dataset to investigate the contribution of the dataset mixing; this scenario is referred to as OaFS-single. Table 4 shows the number of trainable parameters and the number of instrument classes, excluding the “others” class, for each of the compared systems.
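A minimal sketch of this projection head: a 1 × 1 convolution with a sigmoid activation followed by 1 × 2 max pooling along the frequency axis (K = 256 → P = 128). Feeding the onset map into the frame branch mirrors the projection block; the exact concatenation point is an assumption.

```python
import tensorflow as tf

class ResUnetHead(tf.keras.layers.Layer):
    """Maps the ResUnet feature map (B, T, K, C) to onset/frame streams (B, T, P, S)."""

    def __init__(self, num_instruments):
        super().__init__()
        self.onset_conv = tf.keras.layers.Conv2D(num_instruments, 1, activation="sigmoid")
        self.frame_conv = tf.keras.layers.Conv2D(num_instruments, 1, activation="sigmoid")
        self.pool = tf.keras.layers.MaxPool2D(pool_size=(1, 2))  # halve the frequency axis

    def call(self, feat):
        onsets_k = self.onset_conv(feat)                                    # (B, T, K, S)
        frames_k = self.frame_conv(tf.concat([feat, onsets_k], axis=-1))    # onset-conditioned
        return self.pool(onsets_k), self.pool(frames_k)                     # (B, T, P, S) each
```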

4.4. Evaluation Metrics

We evaluate the multitrack AMT systems using five evaluation metrics: Frame F1, Onset F1, Multi-Instrument Frame F1, Multi-Instrument Onset F1, and Instrument Detection F1. The Frame F1 score and Onset F1 score represent the correctness of pitches for each frame and the correctness of both pitch and onset for each note, respectively. According to the evaluation metric commonly used in AMT, a tolerance of 50 ms is used for the onset evaluation. The Multi-Instrument Frame F1 score and Multi-Instrument Onset F1 score are the Frame F1 score and Onset F1 score with an additional requirement for the correctness of instrument classes. The Instrument Detection F1 score, which is an evaluation metric introduced in Tan et al. [37], represents the correctness of the instrument classes contained in a music piece. For a music piece with a set of predicted instrument classes S p and a set of ground-truth instrument classes S g , the Instrument Detection F1 score is calculated from the precision and recall:
$$\mathrm{Precision} = \frac{|S_p \cap S_g|}{|S_p|},$$

$$\mathrm{Recall} = \frac{|S_p \cap S_g|}{|S_g|},$$

$$\mathrm{F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.$$
All five metrics are computed by counting the number of true positives, false positives, and false negatives over all music pieces in the test set of each dataset. Our evaluation ignores note offsets because they can be ambiguous [1]; instead, the Frame F1 score provides insight into the prediction performance in terms of note duration. Note that while unpitched instruments are outside the scope of this work, our evaluation includes drums for completeness; however, we ignore drums when computing the Frame F1 scores and Multi-Instrument Frame F1 scores, as the concept of drum duration is ambiguous.
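A direct implementation of the Instrument Detection F1 over sets of instrument labels; returning 0 for empty or disjoint sets is an assumed convention.

```python
def instrument_detection_f1(predicted, ground_truth):
    """Instrument Detection F1 for one music piece from predicted/true instrument-class sets."""
    pred, true = set(predicted), set(ground_truth)
    overlap = len(pred & true)
    if overlap == 0 or not pred or not true:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(true)
    return 2 * precision * recall / (precision + recall)
```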

4.5. Results

Table 5 compares the AMT performance of five systems (i.e., MT3, OaFS, OaFS-ResUnet, OaFS-CFP, and OaFS-single) on five datasets (i.e., MAESTRO, Slakh, MusicNet, GuitarSet, and URMP). The results show that OaFS outperforms MT3 in most evaluation metrics on each dataset. This is noteworthy given that MT3 uses a larger model and is trained on larger datasets compared with OaFS. In particular, OaFS achieves significantly higher Instrument Detection F1 scores than MT3 on all datasets, which indicates that the instrument selection step in the inference process works effectively. OaFS also obtains high Frame F1 scores but relatively low Onset F1 scores across all datasets. This is probably because OaFS predicts the presence or absence of pitch and onset events for each frame, while MT3 directly predicts the pitch, onset, and offset for each note. Note that the large difference in the Frame F1 scores between OaFS and MT3 on the MAESTRO dataset primarily occurs because MT3 includes the sustain pedal when predicting note events.
The results shown in Table 5 demonstrate that OaFS achieves significantly better performance than OaFS-ResUnet on the Slakh dataset, which indicates that the hierarchical Perceiver improves the performance of OaFS on complex multitrack AMT datasets. In contrast, OaFS-ResUnet shows comparable performance with OaFS on the other four datasets. In particular, OaFS-ResUnet tends to outperform OaFS on the MAESTRO and URMP datasets. Thus, OaFS-ResUnet remains competitive on simple AMT datasets. In addition, OaFS outperforms OaFS-CFP in most evaluation metrics on each dataset, which suggests that ML-CFP is an effective feature representation for multitrack AMT. In contrast, OaFS underperforms OaFS-single in most evaluation metrics on the MAESTRO and Slakh datasets because OaFS-single is individually optimized for each dataset. The extremely low performance of OaFS-single on the URMP dataset is due to overfitting to the training set. These results indicate that OaFS improves its generalizability and provides strong performance even on low-resource datasets by training the model on multiple datasets.

4.6. Illustration

Figure 5 shows the ground truth, MT3 output, and OaFS output for three music pieces selected from the MAESTRO, Slakh, and URMP datasets, where the different colors represent different instrument classes. Note that drums are omitted in the illustration for the second piece to avoid confusion between the pitched and unpitched instruments. The results for the first piece demonstrate that OaFS captures the global context of the ground-truth piano roll without the sustain pedal. The results for the third piece show that OaFS effectively recognizes the instrument class for each note. Notably, incorrect instrument switching in the main melody is observed in the MT3 output but not in the OaFS output. In contrast, the results for the second piece reveal several errors, including non-detection of pitches, false detection of harmonics and subharmonics of true pitches, and incorrect instrument classification. These findings indicate that OaFS struggles with note detection and instrument recognition when multiple instruments are played simultaneously.

4.7. Instrument-Wise Evaluation

To evaluate the AMT performance for individual instruments, Table 6 shows the Frame F1 and Onset F1 scores for each instrument class on the Slakh dataset, where the instrument classes are categorized according to related instrument families. The results show that OaFS outperforms MT3 for most instrument classes in both evaluation metrics. Comparing the performance of OaFS across different instruments, we can see that OaFS achieves relatively high AMT performance for the majority instrument classes, such as acoustic piano, acoustic guitar, and electric bass; however, OaFS obtains low AMT performance for the minority instrument classes. In particular, OaFS fails to recognize the 12 instrument classes with a small number of notes in the Slakh Dataset; these instruments are not detected during the instrument selection step in the inference process. (Since the test set of the Slakh dataset does not include contrabass, the number of unrecognized instrument classes that should be recognized is 11.) In addition, OaFS-single achieves the best performance for most instrument classes in both evaluation metrics. Notably, OaFS-single recognizes several instrument classes that are not recognized by OaFS, such as the orchestral harp, trombone, baritone sax, and oboe. This is probably because the number of minority instrument samples increases relatively when using only the Slakh dataset rather than multiple datasets. These results indicate that the minority instrument samples should be expanded to improve the overall AMT performance across different instruments.

4.8. Instrument Family-Wise Evaluation

Since accurately assigning an instrument class label to each note in a music piece is generally a non-trivial task, instruments are often grouped into related instrument families. To investigate the effect of instrument label granularity, we arrange the 34 instrument classes shown in Table 6 into the 13 MIDI instrument families shown in Table 7 following Gardner et al. [16]. Table 7 shows the Frame F1 and Onset F1 scores for each MIDI instrument family on the Slakh dataset, where “All” represents the Multi-Instrument Frame F1 and Multi-Instrument Onset F1 scores based on the MIDI instrument families. The results show that OaFS outperforms MT3 for all instrument families except drums in both evaluation metrics. By comparing the Multi-Instrument Frame F1 and Multi-Instrument Onset F1 scores based on the MIDI instrument families shown in Table 7 with those based on the instrument classes shown in Table 5, we reveal improvements of 9.8 % and 4.4 % for MT3 and 5.8 % and 1.4 % for OaFS, respectively. The performance differences for OaFS are smaller than those for MT3, which indicates that OaFS is robust to the granularity of instrument labels.

5. Discussion and Conclusions

We have proposed a multitrack AMT system that integrates the ResUnet and hierarchical Perceiver architectures and employs ML-CFP as the input data representation for the deep learning architecture to predict onsets and frames for each instrument class. Our experiments show that the proposed system advances the state of the art in multitrack AMT, notably in terms of instrument discrimination. In addition, the results demonstrate that ResUnet, the hierarchical Perceiver, and ML-CFP contribute to improving AMT performance. In particular, the experimental results indicate that piano roll-based AMT systems, e.g., OaFS, tend to obtain higher Frame F1 scores but lower Onset F1 scores compared with token-based AMT systems, e.g., MT3. A higher Frame F1 score increases the accuracy of the piano roll based on the transcription result; in contrast, a higher Onset F1 score improves the perceptual quality of the audio generated from the transcription result. This implies that the prioritization of the Frame F1 score or Onset F1 score depends on the application.
The illustration of the transcription results obtained through the proposed system indicates that the number of simultaneous pitches contained in a music piece influences the AMT performance. This result reveals that simultaneous pitches across multiple instruments lead to non-detection or false detection of notes and incorrect instrument classification. In addition, the instrument-wise evaluation shows the differences in AMT performance due to the number of instrument samples. In particular, the AMT performance for minority instrument classes is significantly lower than that for majority instrument classes, primarily due to extreme imbalance among instrument classes. These findings highlight the need for further training on various combinations of simultaneous pitches and different instruments. In other words, the proposed system requires large-scale multitrack AMT datasets with evenly distributed simultaneous pitches and instrument classes to improve AMT performance. However, it is highly challenging to create such datasets from real music recordings. Datasets created using music generation models, e.g., the CocoChorales [47] and AAM [48] datasets, have the potential to address this issue. Furthermore, data augmentation strategies and unlabeled data could mitigate the imbalance in the number of instrument samples. Thus, future work should focus on creating a multitrack AMT dataset with evenly distributed simultaneous pitches and instrument classes based on music generation models, developing data augmentation strategies for AMT datasets with imbalanced instrument classes, and extending the proposed system to utilize unlabeled data.
Our work ignores audio effects such as the sustain pedal and pitch wheel. Since the sustain pedal and pitch wheel alter note duration and pitch, respectively, these effects should be detected for applications requiring high accuracy. To our knowledge, no AMT system addresses both of these effects. However, the high-resolution piano transcription system [4] enables sustain pedal detection by separately learning note attributes and sustain pedal and then combining them into a unified model. Such a unified method could serve as an effective approach for designing a multitrack AMT system with audio effect detection mechanisms.
The limitations of the proposed system include poor AMT performance on low-quality music recordings that contain audio degradation effects such as noise, clipping, and filtering, as well as the high computational cost of predicting note attributes from music recordings. In particular, the computational cost of ML-CFP, whose runtime is approximately 0.6 times the length of the original recording [29], must be reduced to achieve real-time multitrack AMT.

Author Contributions

Conceptualization, T.M. and H.S.; methodology, T.M.; software, T.M.; validation, T.M.; formal analysis, T.M.; investigation, T.M.; resources, H.S.; data curation, T.M.; writing—original draft preparation, T.M.; writing—review and editing, T.M. and H.S.; visualization, T.M.; supervision, H.S.; project administration, T.M. and H.S.; funding acquisition, T.M. and H.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by JST SPRING, Grant Number JPMJSP2123.

Data Availability Statement

The MAESTRO, Slakh, MusicNet, GuitarSet, and URMP datasets used in this study are publicly available at https://magenta.tensorflow.org/datasets/maestro, https://zenodo.org/records/4599666, https://zenodo.org/records/5120004, https://zenodo.org/records/3371780, and https://labsites.rochester.edu/air/projects/URMP.html, respectively (accessed on 19 January 2026). The original data presented in the study are openly available at https://github.com/TomokiMatsunaga/OaFS (accessed on 19 January 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Benetos, E.; Dixon, S.; Duan, Z.; Ewert, S. Automatic Music Transcription: An Overview. IEEE Signal Process. Mag. 2019, 36, 20–30. [Google Scholar] [CrossRef]
  2. Benetos, E.; Dixon, S.; Giannoulis, D.; Kirchhoff, H.; Klapuri, A. Automatic music transcription: Challenges and future directions. J. Intell. Inf. Syst. 2013, 41, 407–434. [Google Scholar] [CrossRef]
  3. Hawthorne, C.; Elsen, E.; Song, J.; Roberts, A.; Simon, I.; Raffel, C.; Engel, J.; Oore, S.; Eck, D. Onsets and Frames: Dual-Objective Piano Transcription. In Proceedings of the 19th International Society for Music Information Retrieval Conference, Paris, France, 23–27 September 2018; pp. 50–57. [Google Scholar] [CrossRef]
  4. Kong, Q.; Li, B.; Song, X.; Wan, Y.; Wang, Y. High-Resolution Piano Transcription With Pedals by Regressing Onset and Offset Times. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 3707–3717. [Google Scholar] [CrossRef]
  5. Wei, W.; Li, P.; Yu, Y.; Li, W. HPPNet: Modeling the Harmonic Structure and Pitch Invariance in Piano Transcription. In Proceedings of the 23rd International Society for Music Information Retrieval Conference (ISMIR), Bengaluru, India, 4–8 December 2022; pp. 709–716. [Google Scholar] [CrossRef]
  6. Toyama, K.; Akama, T.; Ikemiya, Y.; Takida, Y.; Liao, W.H.; Mitsufuji, Y. Automatic Piano Transcription With Hierarchical Frequency-Time Transformer. In Proceedings of the 24th Conference of the International Society for Music Information Retrieval (ISMIR), Milan, Italy, 5–9 November 2023; pp. 215–222. [Google Scholar] [CrossRef]
  7. Wang, Q.; Liu, M.; Bao, C.; Jia, M. Harmonic-Aware Frequency and Time Attention for Automatic Piano Transcription. IEEE/ACM Trans. Audio Speech Lang. Process. 2024, 32, 3492–3506. [Google Scholar] [CrossRef]
  8. Elowsson, A. Polyphonic pitch tracking with deep layered learning. J. Acoust. Soc. Am. 2020, 148, 446–468. [Google Scholar] [CrossRef] [PubMed]
  9. Cheuk, K.W.; Herremans, D.; Su, L. ReconVAT: A Semi-Supervised Automatic Music Transcription Framework for Low-Resource Real-World Data. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual, 20–24 October 2021; pp. 3918–3926. [Google Scholar] [CrossRef]
  10. Bittner, R.M.; Bosch, J.J.; Rubinstein, D.; Meseguer-Brocal, G.; Ewert, S. A Lightweight Instrument-Agnostic Model for Polyphonic Note Transcription and Multipitch Estimation. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022; pp. 781–785. [Google Scholar] [CrossRef]
  11. Wu, Y.; Zhao, J.; Yu, Y.; Li, W. MFAE: Masked frame-level autoencoder with hybrid-supervision for low-resource music transcription. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), Brisbane, Australia, 10–14 July 2023; pp. 1109–1114. [Google Scholar] [CrossRef]
  12. Wei, H.; Yuan, J.; Zhang, R.; Chen, Y.; Wang, G. JEPOO: Highly Accurate Joint Estimation of Pitch, Onset and Offset for Music Information Retrieval. In Proceedings of the 32nd International Joint Conference on Artificial Intelligence (IJCAI), Macao, 19–25 August 2023; pp. 4892–4902. [Google Scholar] [CrossRef]
  13. Cwitkowitz, F.; Cheuk, K.W.; Choi, W.; Martínez-Ramírez, M.A.; Toyama, K.; Liao, W.H.; Mitsufuji, Y. Timbre-Trap: A Low-Resource Framework for Instrument-Agnostic Music Transcription. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 1291–1295. [Google Scholar] [CrossRef]
  14. Wu, Y.; Wei, W.; Li, D.; Li, M.; Yu, Y.; Gao, Y.; Li, W. Harmonic Frequency-Separable Transformer for Instrument-Agnostic Music Transcription. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), Niagara Falls, ON, Canada, 15–19 July 2024; pp. 1–6. [Google Scholar] [CrossRef]
  15. Wu, Y.T.; Chen, B.; Su, L. Multi-Instrument Automatic Music Transcription With Self-Attention-Based Instance Segmentation. IEEE/ACM Trans. Audio Speech Lang. Process. 2020, 28, 2796–2809. [Google Scholar] [CrossRef]
  16. Gardner, J.; Simon, I.; Manilow, E.; Hawthorne, C.; Engel, J. MT3: Multi-task multitrack music transcription. In Proceedings of the 10th International Conference on Learning Representations (ICLR), Virtual, 25–29 April 2022. [Google Scholar]
  17. Maman, B.; Bermano, A.H. Unaligned Supervision for Automatic Music Transcription in The Wild. In Proceedings of the International Conference on Machine Learning (ICML), Baltimore, MA, USA, 17–23 July 2022; pp. 14918–14934. [Google Scholar]
  18. Lu, W.T.; Wang, J.C.; Hung, Y.N. Multitrack Music Transcription with a Time-Frequency Perceiver. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–9 June 2023; pp. 1–5. [Google Scholar] [CrossRef]
  19. Purwins, H.; Li, B.; Virtanen, T.; Schlüter, J.; Chang, S.Y.; Sainath, T. Deep Learning for Audio Signal Processing. IEEE J. Sel. Top. Signal Process. 2019, 13, 206–219. [Google Scholar] [CrossRef]
  20. Sigtia, S.; Benetos, E.; Dixon, S. An End-to-End Neural Network for Polyphonic Piano Music Transcription. IEEE/ACM Trans. Audio Speech Lang. Process. 2016, 24, 927–939. [Google Scholar] [CrossRef]
  21. Wu, Y.T.; Chen, B.; Su, L. Polyphonic Music Transcription with Semantic Segmentation. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 166–170. [Google Scholar] [CrossRef]
  22. Hawthorne, C.; Simon, I.; Swavely, R.; Manilow, E.; Engel, J. Sequence-to-Sequence Piano Transcription with Transformers. In Proceedings of the 22nd International Society for Music Information Retrieval Conference (ISMIR), Online, 7–12 November 2021; pp. 246–253. [Google Scholar] [CrossRef]
  23. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the 18th International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), Munich, Germany, 5–9 October 2015; pp. 234–241. [Google Scholar]
  24. Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res. 2020, 21, 1–67. [Google Scholar]
  25. Jaegle, A.; Gimeno, F.; Brock, A.; Vinyals, O.; Zisserman, A.; Carreira, J. Perceiver: General Perception with Iterative Attention. In Proceedings of the 38th International Conference on Machine Learning (ICML), Virtual, 18–24 July 2021; Volume 139, pp. 4651–4664. [Google Scholar]
  26. Su, L.; Yang, Y.H. Combining Spectral and Temporal Representations for Multipitch Estimation of Polyphonic Music. IEEE/ACM Trans. Audio Speech Lang. Process. 2015, 23, 1600–1612. [Google Scholar] [CrossRef]
  27. Bittner, R.M.; McFee, B.; Salamon, J.; Li, P.; Bello, J.P. Deep Salience Representations for F0 Estimation in Polyphonic Music. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), Suzhou, China, 23–27 October 2017; pp. 63–70. [Google Scholar] [CrossRef]
  28. Wu, Y.T.; Chen, B.; Su, L. Automatic Music Transcription Leveraging Generalized Cepstral Features and Deep Learning. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 401–405. [Google Scholar] [CrossRef]
  29. Matsunaga, T.; Saito, H. Multi-Layer Combined Frequency and Periodicity Representations for Multi-Pitch Estimation of Multi-Instrument Music. IEEE/ACM Trans. Audio Speech Lang. Process. 2024, 32, 3171–3184. [Google Scholar] [CrossRef]
  30. Hawthorne, C.; Stasyuk, A.; Roberts, A.; Simon, I.; Huang, C.Z.A.; Dieleman, S.; Elsen, E.; Engel, J.; Eck, D. Enabling factorized piano music modeling and generation with the MAESTRO dataset. In Proceedings of the 7th International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  31. Xi, Q.; Bittner, R.; Pauwels, J.; Ye, X.; Bello, J.P. GuitarSet: A Dataset for Guitar Transcription. In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), Paris, France, 23–27 September 2018; pp. 453–460. [Google Scholar] [CrossRef]
  32. Thickstun, J.; Harchaoui, Z.; Kakade, S. Learning Features of Music from Scratch. In Proceedings of the 5th International Conference on Learning Representations (ICLR), Toulon, France, 24–26 April 2017. [Google Scholar]
  33. Li, B.; Liu, X.; Dinesh, K.; Duan, Z.; Sharma, G. Creating a Multitrack Classical Music Performance Dataset for Multimodal Music Analysis: Challenges, Insights, and Applications. IEEE Trans. Multimed. 2019, 21, 522–535. [Google Scholar] [CrossRef]
  34. Manilow, E.; Wichern, G.; Seetharaman, P.; Le Roux, J. Cutting Music Source Separation Some Slakh: A Dataset to Study the Impact of Training Data Quality and Quantity. In Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, (WASPAA), New Paltz, NY, USA, 20–23 October 2019; pp. 45–49. [Google Scholar] [CrossRef]
  35. Tanaka, K.; Nakatsuka, T.; Nishikimi, R.; Yoshii, K.; Morishima, S. Multi-instrument music transcription based on deep spherical clustering of spectrograms and pitchgrams. In Proceedings of the 21st International Society for Music Information Retrieval Conference (ISMIR), Montreal, QC, Canada, 11–16 October 2020; pp. 327–334. [Google Scholar] [CrossRef]
  36. Simon, I.; Gardner, J.; Hawthorne, C.; Manilow, E.; Engel, J. Scaling Polyphonic Transcription with Mixtures of Monophonic Transcriptions. In Proceedings of the 23rd International Society for Music Information Retrieval Conference (ISMIR), Bangalore, India, 4–8 December 2022; pp. 44–51. [Google Scholar] [CrossRef]
  37. Tan, H.H.; Cheuk, K.W.; Cho, T.; Liao, W.H.; Mitsufuji, Y. MR-MT3: Memory Retaining Multi-Track Music Transcription to Mitigate Instrument Leakage. arXiv 2024, arXiv:2403.10024. [Google Scholar]
  38. Chang, S.; Benetos, E.; Kirchhoff, H.; Dixon, S. YourMT3+: Multi-Instrument Music Transcription with Enhanced Transformer Architectures and Cross-Dataset STEM Augmentation. In Proceedings of the IEEE 34th International Workshop on Machine Learning for Signal Processing (MLSP), London, UK, 22–24 September 2024; pp. 1–6. [Google Scholar] [CrossRef]
  39. Cheuk, K.W.; Sawata, R.; Uesaka, T.; Murata, N.; Takahashi, N.; Takahashi, S.; Herremans, D.; Mitsufuji, Y. Diffroll: Diffusion-Based Generative Music Transcription with Unsupervised Pretraining Capability. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar] [CrossRef]
  40. Sato, G.; Akama, T. Annotation-Free Automatic Music Transcription with Scalable Synthetic Data and Adversarial Domain Confusion. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), Niagara Falls, ON, Canada, 15–19 July 2024; pp. 1–6. [Google Scholar] [CrossRef]
  41. Zhang, Z.; Liu, Q.; Wang, Y. Road Extraction by Deep Residual U-Net. IEEE Geosci. Remote Sens. Lett. 2018, 15, 749–753. [Google Scholar] [CrossRef]
  42. Zagoruyko, S.; Komodakis, N. Wide Residual Networks. In Proceedings of the British Machine Vision Conference (BMVC), York, UK, 19–22 September 2016; pp. 87.1–87.12. [Google Scholar] [CrossRef]
  43. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS), Long Beach, CA, USA, 4–9 December 2017; pp. 6000–6010. [Google Scholar]
  44. Raffel, C. Learning-Based Methods for Comparing Sequences, with Applications to Audio-to-MIDI Alignment and Matching. Ph.D. Thesis, Columbia University, New York, NY, USA, 2016. [Google Scholar]
  45. Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. In Proceedings of the 7th International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  46. Manilow, E.; Seetharaman, P.; Pardo, B. Simultaneous Separation and Transcription of Mixtures with Multiple Polyphonic and Percussive Instruments. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Virtual, 4–8 May 2020; pp. 771–775. [Google Scholar] [CrossRef]
  47. Wu, Y.; Gardner, J.; Manilow, E.; Simon, I.; Hawthorne, C.; Engel, J. The Chamber Ensemble Generator: Limitless High-Quality MIR Data via Generative Modeling. arXiv 2022, arXiv:2209.14458. [Google Scholar] [CrossRef]
  48. Ostermann, F.; Vatolkin, I.; Ebeling, M. AAM: A dataset of Artificial Audio Multitracks for diverse music information retrieval tasks. EURASIP J. Audio Speech Music. Process. 2023, 2023, 13. [Google Scholar] [CrossRef]
Figure 1. Architecture of the proposed multitrack AMT model based on ResUnet and the hierarchical Perceiver.
Figure 2. Projection block.
Figure 3. (a) Contraction block, (b) connection block, and (c) expansion block.
Figure 4. (a) Spectral cross-attention and (b) Transformer.
Figure 5. Illustration of the ground truth (first row), MT3 output (second row), and OaFS output (third row) for the first 30 s of ‘MIDI-Unprocessed_01_R1_2006_01-09_ORIG_MID–AUDIO_01_R1_2006_02_Track02_wav.wav’ in the MAESTRO dataset (first column), the first 30 s of ‘Track01876/mix.wav’ in the Slakh dataset (second column), and the first 30 s of ‘AuMix_01_Jupiter_vn_vc.wav’ in the URMP dataset (third column).
Table 1. Properties of the datasets used in the experiments.
Dataset | Total Length (h:m:s) | Pieces (Train/Validation/Test) | Number of Instruments | Avg. Number of Instruments
MAESTRO | 199:13:52 | 962/137/177 | 1 | 1
Slakh | 118:15:36 | 1289/270/151 | 34 | 8.3
MusicNet | 34:07:49 | 300/15/15 | 11 | 2.0
GuitarSet | 3:02:48 | 312/24/24 | 1 | 1
URMP | 1:20:12 | 26/9/9 | 13 | 2.7
Table 2. Validation/test sets for the MusicNet and URMP datasets.
Dataset | Piece IDs for Validation | Piece IDs for Test
MusicNet | 1733, 1765, 1790, 1818, 2160, 2198, 2289, 2300, 2308, 2315, 2336, 2466, 2477, 2504, 2611 | 1729, 1776, 1813, 1893, 2118, 2186, 2296, 2431, 2432, 2487, 2497, 2501, 2507, 2537, 2621
URMP | 3, 8, 9, 11, 17, 21, 29, 37, 43 | 1, 2, 12, 13, 24, 25, 31, 38, 39
Table 3. Parameters used in inference.
Parameter | Value
t_step: step size | 240
θ_ins: instrument selection threshold | 0.99
θ_on: onset stream threshold | 6.0
θ_fr: frame stream threshold | 2.5
θ_peak: peak prominence threshold | 5.0
η: peak distance threshold | 0.06 s
ξ: silence interval threshold | 0.08 s
Table 4. Details of compared systems.
System | Number of Trainable Parameters | Number of Instrument Classes Excluding “Others”
MT3 | 93.7 M | 34
OaFS | 64.9 M | 34
OaFS-ResUnet | 62.6 M | 34
OaFS-CFP | 64.9 M | 34
OaFS-single | 64.9 M | Dataset-dependent
Table 5. Transcription results of different systems on different datasets.
System | MAESTRO | Slakh | MusicNet | GuitarSet | URMP
Frame F1 (%)
  MT3 | 55.30 | 77.48 | 68.18 | 86.98 | 79.70
  OaFS | 76.10 | 81.64 | 72.61 | 90.10 | 90.35
  OaFS-ResUnet | 77.59 | 80.23 | 73.67 | 89.35 | 90.22
  OaFS-CFP | 75.32 | 80.87 | 71.73 | 89.75 | 89.64
  OaFS-single | 79.98 | 82.55 | 72.41 | 89.47 | 73.26
Onset F1 (%)
  MT3 | 95.16 | 75.02 | 47.34 | 88.32 | 75.49
  OaFS | 94.17 | 71.16 | 49.98 | 90.06 | 78.53
  OaFS-ResUnet | 94.59 | 66.21 | 50.63 | 89.27 | 78.72
  OaFS-CFP | 93.68 | 69.35 | 48.50 | 89.43 | 74.03
  OaFS-single | 96.03 | 73.92 | 47.85 | 88.56 | 68.52
Multi-Instrument Frame F1 (%)
  MT3 | 55.30 | 48.49 | 59.29 | 86.98 | 61.29
  OaFS | 76.10 | 63.76 | 62.13 | 90.10 | 71.71
  OaFS-ResUnet | 77.59 | 59.04 | 62.36 | 89.35 | 74.87
  OaFS-CFP | 75.32 | 61.84 | 59.66 | 89.75 | 74.54
  OaFS-single | 79.98 | 67.42 | 64.85 | 89.47 | 62.88
Multi-Instrument Onset F1 (%)
  MT3 | 95.16 | 60.74 | 41.05 | 88.32 | 62.98
  OaFS | 94.17 | 66.02 | 45.44 | 90.06 | 69.75
  OaFS-ResUnet | 94.59 | 59.03 | 44.99 | 89.27 | 71.98
  OaFS-CFP | 93.68 | 63.66 | 43.57 | 89.43 | 68.68
  OaFS-single | 96.03 | 69.97 | 43.65 | 88.56 | 61.61
Instrument Detection F1 (%)
  MT3 | 99.72 | 68.26 | 74.51 | 100 | 75.41
  OaFS | 100 | 84.88 | 97.30 | 100 | 88.89
  OaFS-ResUnet | 100 | 80.65 | 96.10 | 100 | 91.67
  OaFS-CFP | 100 | 83.33 | 94.87 | 100 | 93.62
  OaFS-single | 100 | 85.39 | 98.67 | 100 | 93.33
The bold values represent the best for each evaluation metric on each dataset.
Table 6. Transcription results for different instruments on the Slakh dataset.
Class | Total Number of Notes | Frame F1 (%): MT3 / OaFS / OaFS-single | Onset F1 (%): MT3 / OaFS / OaFS-single
Acoustic Piano | 974,080 | 56.90 / 65.90 / 70.05 | 66.08 / 71.37 / 76.59
Electric Piano | 739,257 | 56.12 / 65.46 / 70.35 | 60.54 / 66.72 / 70.66
Chromatic Percussion | 98,537 | 24.76 / 41.06 / 45.49 | 32.14 / 47.21 / 50.47
Organ | 183,932 | 33.07 / 57.73 / 62.91 | 31.47 / 45.23 / 53.50
Acoustic Guitar | 1,330,661 | 65.65 / 69.99 / 73.54 | 68.58 / 69.88 / 74.32
Clean Electric Guitar | 922,457 | 52.12 / 59.08 / 64.50 | 55.55 / 61.56 / 66.97
Distorted Electric Guitar | 359,611 | 46.65 / 55.16 / 61.41 | 51.42 / 54.34 / 64.27
Acoustic Bass | 19,004 | 4.49 / 86.20 / 87.83 | 4.98 / 78.03 / 85.01
Electric Bass | 745,094 | 75.97 / 86.79 / 89.17 | 75.37 / 86.62 / 89.87
Violin | 14,157 | 7.55 / 30.44 / 0 | 12.09 / 32.71 / 0
Viola | 5889 | 0 / 0 / 0 | 0 / 0 / 0
Cello | 5983 | 0 / 0 / 0 | 0 / 0 / 0
Contrabass | 1323 | 0 / 0 / – | – / – / –
Orchestral Harp | 12,033 | 0 / 0 / 61.01 | 0 / 0 / 69.67
Timpani | 3143 | 0 / 0 / 0 | 0 / 0 / 0
String Ensemble | 328,624 | 30.50 / 62.60 / 64.62 | 25.13 / 41.88 / 47.62
Synth Strings | 77,269 | 17.53 / 14.13 / 11.64 | 10.09 / 10.44 / 9.66
Choir and Voice | 217,777 | 49.86 / 71.23 / 74.59 | 40.33 / 55.74 / 62.07
Trumpet | 27,279 | 8.45 / 37.34 / 29.57 | 7.57 / 29.20 / 20.73
Trombone | 19,670 | 12.29 / 0 / 5.42 | 9.14 / 0 / 12.30
Tuba | 3089 | 28.93 / 0 / 0 | 50.20 / 0 / 0
French Horn | 11,243 | 10.46 / 0 / 0 | 10.94 / 0 / 0
Brass Section | 71,023 | 29.26 / 42.34 / 38.11 | 39.97 / 43.84 / 33.93
Soprano/Alto Sax | 67,373 | 9.11 / 31.08 / 50.60 | 10.42 / 36.29 / 46.22
Tenor Sax | 31,967 | 9.34 / 25.23 / 22.64 | 10.81 / 26.31 / 19.14
Baritone Sax | 6443 | 1.10 / 0 / 37.56 | 2.12 / 0 / 41.42
Oboe | 16,626 | 2.15 / 0 / 44.36 | 6.55 / 0 / 25.85
English Horn | 1232 | 0 / 0 / 0 | 0 / 0 / 0
Bassoon | 1651 | 0 / 0 / 0 | 0 / 0 / 0
Clarinet | 19,695 | 31.90 / 61.75 / 55.07 | 29.70 / 43.93 / 37.58
Pipe | 136,601 | 32.39 / 57.15 / 64.52 | 35.65 / 47.55 / 54.94
Synth Lead | 117,685 | 26.28 / 43.38 / 50.92 | 28.99 / 50.03 / 54.66
Synth Pad | 137,921 | 30.74 / 50.85 / 50.35 | 18.83 / 26.65 / 26.85
Drums | 3,159,216 | – / – / – | 74.16 / 71.23 / 72.85
The bold values represent the best for each instrument class in each evaluation metric.
Table 7. Transcription results for different instrument families on the Slakh dataset.
Family | Frame F1 (%): MT3 / OaFS | Onset F1 (%): MT3 / OaFS
Piano | 63.12 / 70.36 | 69.99 / 72.10
Chromatic Percussion | 24.76 / 41.06 | 32.14 / 47.21
Organ | 33.07 / 57.73 | 31.47 / 45.23
Guitar | 65.80 / 68.98 | 66.14 / 66.35
Bass | 78.96 / 87.74 | 79.46 / 86.47
Strings | 4.22 / 19.26 | 5.50 / 25.11
Ensemble | 57.00 / 73.30 | 46.54 / 49.59
Brass | 23.41 / 33.16 | 28.37 / 33.93
Reed | 18.10 / 39.41 | 21.64 / 31.18
Pipe | 32.39 / 57.15 | 35.65 / 47.55
Synth Lead | 26.28 / 43.38 | 28.99 / 50.03
Synth Pad | 30.74 / 50.85 | 18.83 / 26.65
Drums | – / – | 74.16 / 71.23
All | 58.24 / 69.58 | 65.18 / 67.39
The bold values represent the best for each instrument family in each evaluation metric.
