Multitrack Music Transcription Based on Joint Learning of Onset and Frame Streams
Abstract
1. Introduction
- The proposed framework integrates the ResUnet and hierarchical Perceiver architectures into a multitrack AMT system.
- ML-CFP is employed as the input data representation for the deep learning architecture (a sketch of the underlying CFP computation follows this list).
- The proposed AMT system outperforms an existing state-of-the-art multitrack AMT system on five AMT datasets that differ in size, musical style, and instrumentation.
- Instrument-wise evaluation confirms that the proposed AMT system delivers high-quality transcription for the predominant instruments.
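Since ML-CFP builds on the combined frequency and periodicity (CFP) representation [26], the following is a minimal NumPy/librosa sketch of the CFP idea: a power-scaled spectrogram (frequency stream) paired with a generalized cepstrum (periodicity stream). All constants here (gamma exponents, cutoff bins, frame parameters) are illustrative assumptions; the authors' ML-CFP [29] extends this with multiple layers rather than the single pass shown.

```python
# Illustrative CFP-style representation; parameters are assumptions,
# not the settings used in the paper.
import numpy as np
import librosa

def cfp_sketch(audio, n_fft=2048, hop=160, gammas=(0.24, 0.6), cutoff=30):
    # Frequency stream: power-scaled magnitude spectrogram.
    spec = np.abs(librosa.stft(audio, n_fft=n_fft, hop_length=hop)) ** 2
    z_freq = spec ** gammas[0]
    z_freq[:cutoff, :] = 0.0  # crude high-pass: drop very low frequency bins

    # Periodicity stream: generalized cepstrum (inverse FFT of the scaled
    # spectrum), which peaks at the fundamental period (time lag).
    ceps = np.fft.irfft(z_freq, axis=0)[: n_fft // 2 + 1, :]
    z_time = np.maximum(ceps, 0.0) ** gammas[1]
    z_time[:cutoff, :] = 0.0  # suppress very short lags

    return z_freq, z_time  # each of shape (n_fft // 2 + 1, num_frames)

y, sr = librosa.load(librosa.example("trumpet"), sr=16000)
z_freq, z_time = cfp_sketch(y)
```

The appeal of this family of representations for AMT is that harmonics (frequency stream) and subharmonics (periodicity stream) disagree everywhere except at the true fundamental, which suppresses octave errors.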
2. Related Work
3. Transcription System
3.1. Data Representation
3.2. Model
| Algorithm 1 Hierarchical Perceiver Block. |
| Input: ResUnet output with a shape of (…); the h-th latent array with a shape of (…), where B is the batch size. |
| Output: the (h + 1)-th latent array with a shape of (…). |
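To make the block concrete, here is a minimal PyTorch sketch of a Perceiver-style block in the spirit of Algorithm 1: the h-th latent array cross-attends to the ResUnet output and is then refined by self-attention to produce the (h + 1)-th latent array. The pre-norm layout, head count, and MLP width are assumptions, not the authors' exact configuration.

```python
# A minimal Perceiver-style block sketch; dimensions are illustrative.
import torch
import torch.nn as nn

class PerceiverBlockSketch(nn.Module):
    def __init__(self, dim=128, n_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.norm_s = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                                 nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, latents, features):
        # latents:  (B, N_h, dim) -- the h-th latent array
        # features: (B, T, dim)   -- flattened ResUnet output
        q, kv = self.norm_q(latents), self.norm_kv(features)
        latents = latents + self.cross_attn(q, kv, kv, need_weights=False)[0]
        s = self.norm_s(latents)
        latents = latents + self.self_attn(s, s, s, need_weights=False)[0]
        return latents + self.mlp(latents)  # the (h + 1)-th latent array

block = PerceiverBlockSketch()
latents = torch.randn(2, 64, 128)    # B=2, N_h=64 latent vectors
features = torch.randn(2, 512, 128)  # B=2, T=512 ResUnet time steps
next_latents = block(latents, features)  # shape (2, 64, 128)
```

A hierarchical variant would stack several such blocks, each passing its output latent array to the next; the fixed latent size keeps attention cost linear in the input length.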
3.3. Inference
4. Experiments
4.1. Datasets
4.2. Settings
4.3. Baselines
- OaFS-ResUnet: an ablated variant of the proposed model that uses only the ResUnet, without the hierarchical Perceiver.
- OaFS-CFP: the proposed model trained with CFP [26] instead of ML-CFP as the input data representation.
4.4. Evaluation Metrics
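The result tables below report frame-level and note-level (onset) F1 scores. As a reference point, note-level onset F1 is conventionally computed with mir_eval using a 50 ms onset tolerance and ignoring offsets; the sketch below uses those conventional settings, which may differ from the paper's exact configuration, and the note data are made up for illustration.

```python
# Conventional note-level onset F1 with mir_eval (illustrative data).
import numpy as np
import mir_eval

ref_intervals = np.array([[0.50, 1.00], [1.20, 1.80]])  # (onset, offset) in s
ref_pitches = np.array([440.0, 523.25])                  # pitches in Hz
est_intervals = np.array([[0.52, 0.95], [1.21, 1.70]])
est_pitches = np.array([440.0, 523.25])

p, r, f, _ = mir_eval.transcription.precision_recall_f1_overlap(
    ref_intervals, ref_pitches, est_intervals, est_pitches,
    onset_tolerance=0.05, offset_ratio=None)  # onset-only matching
print(f"Onset precision={p:.3f}, recall={r:.3f}, F1={f:.3f}")
```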
4.5. Results
4.6. Illustration
4.7. Instrument-Wise Evaluation
4.8. Instrument Family-Wise Evaluation
5. Discussion and Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Benetos, E.; Dixon, S.; Duan, Z.; Ewert, S. Automatic Music Transcription: An Overview. IEEE Signal Process. Mag. 2019, 36, 20–30.
- Benetos, E.; Dixon, S.; Giannoulis, D.; Kirchhoff, H.; Klapuri, A. Automatic music transcription: Challenges and future directions. J. Intell. Inf. Syst. 2013, 41, 407–434.
- Hawthorne, C.; Elsen, E.; Song, J.; Roberts, A.; Simon, I.; Raffel, C.; Engel, J.; Oore, S.; Eck, D. Onsets and Frames: Dual-Objective Piano Transcription. In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), Paris, France, 23–27 September 2018; pp. 50–57.
- Kong, Q.; Li, B.; Song, X.; Wan, Y.; Wang, Y. High-Resolution Piano Transcription With Pedals by Regressing Onset and Offset Times. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 3707–3717.
- Wei, W.; Li, P.; Yu, Y.; Li, W. HPPNet: Modeling the Harmonic Structure and Pitch Invariance in Piano Transcription. In Proceedings of the 23rd International Society for Music Information Retrieval Conference (ISMIR), Bengaluru, India, 4–8 December 2022; pp. 709–716.
- Toyama, K.; Akama, T.; Ikemiya, Y.; Takida, Y.; Liao, W.H.; Mitsufuji, Y. Automatic Piano Transcription With Hierarchical Frequency-Time Transformer. In Proceedings of the 24th International Society for Music Information Retrieval Conference (ISMIR), Milan, Italy, 5–9 November 2023; pp. 215–222.
- Wang, Q.; Liu, M.; Bao, C.; Jia, M. Harmonic-Aware Frequency and Time Attention for Automatic Piano Transcription. IEEE/ACM Trans. Audio Speech Lang. Process. 2024, 32, 3492–3506.
- Elowsson, A. Polyphonic pitch tracking with deep layered learning. J. Acoust. Soc. Am. 2020, 148, 446–468.
- Cheuk, K.W.; Herremans, D.; Su, L. ReconVAT: A Semi-Supervised Automatic Music Transcription Framework for Low-Resource Real-World Data. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual, 20–24 October 2021; pp. 3918–3926.
- Bittner, R.M.; Bosch, J.J.; Rubinstein, D.; Meseguer-Brocal, G.; Ewert, S. A Lightweight Instrument-Agnostic Model for Polyphonic Note Transcription and Multipitch Estimation. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022; pp. 781–785.
- Wu, Y.; Zhao, J.; Yu, Y.; Li, W. MFAE: Masked frame-level autoencoder with hybrid-supervision for low-resource music transcription. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), Brisbane, Australia, 10–14 July 2023; pp. 1109–1114.
- Wei, H.; Yuan, J.; Zhang, R.; Chen, Y.; Wang, G. JEPOO: Highly Accurate Joint Estimation of Pitch, Onset and Offset for Music Information Retrieval. In Proceedings of the 32nd International Joint Conference on Artificial Intelligence (IJCAI), Macao, 19–25 August 2023; pp. 4892–4902.
- Cwitkowitz, F.; Cheuk, K.W.; Choi, W.; Martínez-Ramírez, M.A.; Toyama, K.; Liao, W.H.; Mitsufuji, Y. Timbre-Trap: A Low-Resource Framework for Instrument-Agnostic Music Transcription. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 1291–1295.
- Wu, Y.; Wei, W.; Li, D.; Li, M.; Yu, Y.; Gao, Y.; Li, W. Harmonic Frequency-Separable Transformer for Instrument-Agnostic Music Transcription. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), Niagara Falls, ON, Canada, 15–19 July 2024; pp. 1–6.
- Wu, Y.T.; Chen, B.; Su, L. Multi-Instrument Automatic Music Transcription With Self-Attention-Based Instance Segmentation. IEEE/ACM Trans. Audio Speech Lang. Process. 2020, 28, 2796–2809.
- Gardner, J.; Simon, I.; Manilow, E.; Hawthorne, C.; Engel, J. MT3: Multi-task multitrack music transcription. In Proceedings of the 10th International Conference on Learning Representations (ICLR), Virtual, 25–29 April 2022.
- Maman, B.; Bermano, A.H. Unaligned Supervision for Automatic Music Transcription in The Wild. In Proceedings of the International Conference on Machine Learning (ICML), Baltimore, MD, USA, 17–23 July 2022; pp. 14918–14934.
- Lu, W.T.; Wang, J.C.; Hung, Y.N. Multitrack Music Transcription with a Time-Frequency Perceiver. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5.
- Purwins, H.; Li, B.; Virtanen, T.; Schlüter, J.; Chang, S.Y.; Sainath, T. Deep Learning for Audio Signal Processing. IEEE J. Sel. Top. Signal Process. 2019, 13, 206–219.
- Sigtia, S.; Benetos, E.; Dixon, S. An End-to-End Neural Network for Polyphonic Piano Music Transcription. IEEE/ACM Trans. Audio Speech Lang. Process. 2016, 24, 927–939.
- Wu, Y.T.; Chen, B.; Su, L. Polyphonic Music Transcription with Semantic Segmentation. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 166–170.
- Hawthorne, C.; Simon, I.; Swavely, R.; Manilow, E.; Engel, J. Sequence-to-Sequence Piano Transcription with Transformers. In Proceedings of the 22nd International Society for Music Information Retrieval Conference (ISMIR), Online, 7–12 November 2021; pp. 246–253.
- Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the 18th International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), Munich, Germany, 5–9 October 2015; pp. 234–241.
- Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res. 2020, 21, 1–67.
- Jaegle, A.; Gimeno, F.; Brock, A.; Vinyals, O.; Zisserman, A.; Carreira, J. Perceiver: General Perception with Iterative Attention. In Proceedings of the 38th International Conference on Machine Learning (ICML), Virtual, 18–24 July 2021; Volume 139, pp. 4651–4664.
- Su, L.; Yang, Y.H. Combining Spectral and Temporal Representations for Multipitch Estimation of Polyphonic Music. IEEE/ACM Trans. Audio Speech Lang. Process. 2015, 23, 1600–1612.
- Bittner, R.M.; McFee, B.; Salamon, J.; Li, P.; Bello, J.P. Deep Salience Representations for F0 Estimation in Polyphonic Music. In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR), Suzhou, China, 23–27 October 2017; pp. 63–70.
- Wu, Y.T.; Chen, B.; Su, L. Automatic Music Transcription Leveraging Generalized Cepstral Features and Deep Learning. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 401–405.
- Matsunaga, T.; Saito, H. Multi-Layer Combined Frequency and Periodicity Representations for Multi-Pitch Estimation of Multi-Instrument Music. IEEE/ACM Trans. Audio Speech Lang. Process. 2024, 32, 3171–3184.
- Hawthorne, C.; Stasyuk, A.; Roberts, A.; Simon, I.; Huang, C.Z.A.; Dieleman, S.; Elsen, E.; Engel, J.; Eck, D. Enabling factorized piano music modeling and generation with the MAESTRO dataset. In Proceedings of the 7th International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019.
- Xi, Q.; Bittner, R.; Pauwels, J.; Ye, X.; Bello, J.P. GuitarSet: A Dataset for Guitar Transcription. In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), Paris, France, 23–27 September 2018; pp. 453–460.
- Thickstun, J.; Harchaoui, Z.; Kakade, S. Learning Features of Music from Scratch. In Proceedings of the 5th International Conference on Learning Representations (ICLR), Toulon, France, 24–26 April 2017.
- Li, B.; Liu, X.; Dinesh, K.; Duan, Z.; Sharma, G. Creating a Multitrack Classical Music Performance Dataset for Multimodal Music Analysis: Challenges, Insights, and Applications. IEEE Trans. Multimed. 2019, 21, 522–535.
- Manilow, E.; Wichern, G.; Seetharaman, P.; Le Roux, J. Cutting Music Source Separation Some Slakh: A Dataset to Study the Impact of Training Data Quality and Quantity. In Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, USA, 20–23 October 2019; pp. 45–49.
- Tanaka, K.; Nakatsuka, T.; Nishikimi, R.; Yoshii, K.; Morishima, S. Multi-instrument music transcription based on deep spherical clustering of spectrograms and pitchgrams. In Proceedings of the 21st International Society for Music Information Retrieval Conference (ISMIR), Montreal, QC, Canada, 11–16 October 2020; pp. 327–334.
- Simon, I.; Gardner, J.; Hawthorne, C.; Manilow, E.; Engel, J. Scaling Polyphonic Transcription with Mixtures of Monophonic Transcriptions. In Proceedings of the 23rd International Society for Music Information Retrieval Conference (ISMIR), Bengaluru, India, 4–8 December 2022; pp. 44–51.
- Tan, H.H.; Cheuk, K.W.; Cho, T.; Liao, W.H.; Mitsufuji, Y. MR-MT3: Memory Retaining Multi-Track Music Transcription to Mitigate Instrument Leakage. arXiv 2024, arXiv:2403.10024.
- Chang, S.; Benetos, E.; Kirchhoff, H.; Dixon, S. YourMT3+: Multi-Instrument Music Transcription with Enhanced Transformer Architectures and Cross-Dataset STEM Augmentation. In Proceedings of the IEEE 34th International Workshop on Machine Learning for Signal Processing (MLSP), London, UK, 22–24 September 2024; pp. 1–6.
- Cheuk, K.W.; Sawata, R.; Uesaka, T.; Murata, N.; Takahashi, N.; Takahashi, S.; Herremans, D.; Mitsufuji, Y. DiffRoll: Diffusion-Based Generative Music Transcription with Unsupervised Pretraining Capability. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5.
- Sato, G.; Akama, T. Annotation-Free Automatic Music Transcription with Scalable Synthetic Data and Adversarial Domain Confusion. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), Niagara Falls, ON, Canada, 15–19 July 2024; pp. 1–6.
- Zhang, Z.; Liu, Q.; Wang, Y. Road Extraction by Deep Residual U-Net. IEEE Geosci. Remote Sens. Lett. 2018, 15, 749–753.
- Zagoruyko, S.; Komodakis, N. Wide Residual Networks. In Proceedings of the British Machine Vision Conference (BMVC), York, UK, 19–22 September 2016; pp. 87.1–87.12.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS), Long Beach, CA, USA, 4–9 December 2017; pp. 6000–6010.
- Raffel, C. Learning-Based Methods for Comparing Sequences, with Applications to Audio-to-MIDI Alignment and Matching. Ph.D. Thesis, Columbia University, New York, NY, USA, 2016.
- Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. In Proceedings of the 7th International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019.
- Manilow, E.; Seetharaman, P.; Pardo, B. Simultaneous Separation and Transcription of Mixtures with Multiple Polyphonic and Percussive Instruments. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Virtual, 4–8 May 2020; pp. 771–775.
- Wu, Y.; Gardner, J.; Manilow, E.; Simon, I.; Hawthorne, C.; Engel, J. The Chamber Ensemble Generator: Limitless High-Quality MIR Data via Generative Modeling. arXiv 2022, arXiv:2209.14458.
- Ostermann, F.; Vatolkin, I.; Ebeling, M. AAM: A dataset of Artificial Audio Multitracks for diverse music information retrieval tasks. EURASIP J. Audio Speech Music Process. 2023, 2023, 13.
| Dataset | Total Length (h:m:s) | Number of Pieces in Train/Validation/Test Sets | Number of Instruments | Avg. Number of Instruments |
|---|---|---|---|---|
| MAESTRO | | | 1 | 1 |
| Slakh | | | 34 | |
| MusicNet | | | 11 | |
| GuitarSet | | | 1 | 1 |
| URMP | | | 13 | |
| Dataset | Piece IDs for Validation | Piece IDs for Test |
|---|---|---|
| MusicNet | 1733, 1765, 1790, 1818, 2160, 2198, 2289, 2300, 2308, 2315, 2336, 2466, 2477, 2504, 2611 | 1729, 1776, 1813, 1893, 2118, 2186, 2296, 2431, 2432, 2487, 2497, 2501, 2507, 2537, 2621 |
| URMP | 3, 8, 9, 11, 17, 21, 29, 37, 43 | 1, 2, 12, 13, 24, 25, 31, 38, 39 |
| Parameter | Value |
|---|---|
| Step size | 240 |
| Instrument selection threshold | |
| Onset stream threshold | |
| Frame stream threshold | |
| Peak prominence threshold | |
| Peak distance threshold (s) | |
| Silence interval threshold (s) | |
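Given the decoding parameters above, the following is a hedged sketch of how note events could be extracted from the onset and frame streams: onsets come from thresholded, prominence-filtered peaks in the onset stream, and offsets from the first sufficiently long gap in frame activity. The function, its defaults, and the exact offset rule are illustrative assumptions, not the authors' implementation (the step size presumably controls segment hopping during inference and is not modeled here).

```python
# Illustrative onset/frame-stream decoding; thresholds are placeholders.
import numpy as np
from scipy.signal import find_peaks

def decode_notes(onset_stream, frame_stream, fps=100.0,
                 onset_thr=0.5, frame_thr=0.5,
                 prominence=0.1, min_distance_s=0.05, min_silence_s=0.05):
    """onset_stream, frame_stream: (T, P) activations for one instrument."""
    notes = []
    for pitch in range(onset_stream.shape[1]):
        # Onsets: peaks above the onset threshold, filtered by prominence
        # and a minimum inter-peak distance.
        peaks, _ = find_peaks(onset_stream[:, pitch],
                              height=onset_thr, prominence=prominence,
                              distance=max(1, int(min_distance_s * fps)))
        active = frame_stream[:, pitch] >= frame_thr
        for t0 in peaks:
            # Offset: first run of inactive frames longer than the silence
            # interval threshold after the onset.
            t, gap = t0, 0
            while t + 1 < len(active) and gap <= int(min_silence_s * fps):
                t += 1
                gap = gap + 1 if not active[t] else 0
            notes.append((t0 / fps, t / fps, pitch))
    return notes  # list of (onset_sec, offset_sec, pitch_index)
```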
| System | Number of Trainable Parameters (M) | Number of Instrument Classes Excluding “Others” |
|---|---|---|
| MT3 | | 34 |
| OaFS | | 34 |
| OaFS-ResUnet | | 34 |
| OaFS-CFP | | 34 |
| OaFS-single | | Dataset-dependent |
| Metric | System | MAESTRO | Slakh | MusicNet | GuitarSet | URMP |
|---|---|---|---|---|---|---|
| Frame F1 (%) | MT3 | | | | | |
| | OaFS | | | | | |
| | OaFS-ResUnet | | | | | |
| | OaFS-CFP | | | | | |
| | OaFS-single | | | | | |
| Onset F1 (%) | MT3 | | | | | |
| | OaFS | | | | | |
| | OaFS-ResUnet | | | | | |
| | OaFS-CFP | | | | | |
| | OaFS-single | | | | | |
| Multi-Instrument Frame F1 (%) | MT3 | | | | | |
| | OaFS | | | | | |
| | OaFS-ResUnet | | | | | |
| | OaFS-CFP | | | | | |
| | OaFS-single | | | | | |
| Multi-Instrument Onset F1 (%) | MT3 | | | | | |
| | OaFS | | | | | |
| | OaFS-ResUnet | | | | | |
| | OaFS-CFP | | | | | |
| | OaFS-single | | | | | |
| Instrument Detection F1 (%) | MT3 | | | | | |
| | OaFS | | | | | |
| | OaFS-ResUnet | | | | | |
| | OaFS-CFP | | | | | |
| | OaFS-single | | | | | |
| Class | Total Number of Notes | Frame F1 (%), MT3 | Frame F1 (%), OaFS | Frame F1 (%), OaFS-single | Onset F1 (%), MT3 | Onset F1 (%), OaFS | Onset F1 (%), OaFS-single |
|---|---|---|---|---|---|---|---|
| Acoustic Piano | | | | | | | |
| Electric Piano | | | | | | | |
| Chromatic Percussion | | | | | | | |
| Organ | | | | | | | |
| Acoustic Guitar | | | | | | | |
| Clean Electric Guitar | | | | | | | |
| Distorted Electric Guitar | | | | | | | |
| Acoustic Bass | | | | | | | |
| Electric Bass | | | | | | | |
| Violin | 0 | 0 | | | | | |
| Viola | 0 | 0 | 0 | 0 | 0 | 0 | |
| Cello | 0 | 0 | 0 | 0 | 0 | 0 | |
| Contrabass | 0 | – | – | 0 | – | – | |
| Orchestral Harp | 0 | 0 | 0 | 0 | | | |
| Timpani | 0 | 0 | 0 | 0 | 0 | 0 | |
| String Ensemble | | | | | | | |
| Synth Strings | | | | | | | |
| Choir and Voice | | | | | | | |
| Trumpet | | | | | | | |
| Trombone | 0 | 0 | | | | | |
| Tuba | 3089 | 0 | 0 | 0 | 0 | | |
| French Horn | 0 | 0 | 0 | 0 | | | |
| Brass Section | | | | | | | |
| Soprano/Alto Sax | | | | | | | |
| Tenor Sax | | | | | | | |
| Baritone Sax | 0 | 0 | | | | | |
| Oboe | 0 | 0 | | | | | |
| English Horn | 0 | 0 | 0 | 0 | 0 | 0 | |
| Bassoon | 0 | 0 | 0 | 0 | 0 | 0 | |
| Clarinet | | | | | | | |
| Pipe | | | | | | | |
| Synth Lead | | | | | | | |
| Synth Pad | | | | | | | |
| Drums | – | – | – | | | | |
| Family | Frame F1 (%), MT3 | Frame F1 (%), OaFS | Onset F1 (%), MT3 | Onset F1 (%), OaFS |
|---|---|---|---|---|
| Piano | | | | |
| Chromatic Percussion | | | | |
| Organ | | | | |
| Guitar | | | | |
| Bass | | | | |
| Strings | | | | |
| Ensemble | | | | |
| Brass | | | | |
| Reed | | | | |
| Pipe | | | | |
| Synth Lead | | | | |
| Synth Pad | | | | |
| Drums | – | – | | |
| All | | | | |