PF2N: Periodicity–Frequency Fusion Network for Multi-Instrument Music Transcription
Abstract
1. Introduction
- Neural Frequency–Periodicity Feature Fusion: Inspired by traditional combined frequency and periodicity (CFP) methods [28], the PF2N reformulates spectral–cepstral alignment as a learnable pattern extraction mechanism.
- Compact and Modular Architecture: The proposed model is lightweight in both parameter count and computational cost, and it is a modular neural component that can be integrated into various baseline architectures (a minimal illustrative sketch follows this list).
- Fair Comparisons with Benchmarks: Previous studies trained on differing combinations of datasets yet reported results on separate benchmarks, making direct comparison difficult. To ensure fairness, we train and evaluate each model independently on each individual dataset.
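To make the plug-in idea above concrete, the following is a minimal sketch, assuming a PyTorch setting, of how a CFP-style learnable fusion block with separate frequency (spectral) and periodicity (cepstral) branches could be inserted into a baseline encoder. The class name `PF2NFusion`, the layer sizes, the residual connection, and the `use_frequency`/`use_periodicity` flags are illustrative assumptions and do not reproduce the exact architecture described in Section 3.

```python
# Illustrative sketch only: a CFP-style periodicity-frequency fusion block.
# Names and layer sizes are assumptions, not the authors' implementation.
import torch
import torch.nn as nn


class PF2NFusion(nn.Module):
    """Fuses a frequency-domain feature map with a periodicity (cepstral) one.

    Both inputs are assumed to share a pitch-aligned axis, shape
    (batch, channels, time, bins), as in CFP-style front ends. The use_*
    flags mirror the P/F ablation variants discussed in Section 4.3.
    """

    def __init__(self, in_channels: int, hidden_channels: int = 32,
                 use_frequency: bool = True, use_periodicity: bool = True):
        super().__init__()

        def branch():
            return nn.Sequential(
                nn.Conv2d(in_channels, hidden_channels, kernel_size=3, padding=1),
                nn.ReLU(),
            )

        self.freq_branch = branch() if use_frequency else None
        self.per_branch = branch() if use_periodicity else None
        n_active = int(use_frequency) + int(use_periodicity)
        # A 1x1 conv mixes the active branches back to the input width, so the
        # block can be inserted into a baseline encoder without reshaping.
        self.mix = (nn.Conv2d(n_active * hidden_channels, in_channels, kernel_size=1)
                    if n_active > 0 else None)

    def forward(self, spec: torch.Tensor, ceps: torch.Tensor) -> torch.Tensor:
        feats = []
        if self.freq_branch is not None:
            feats.append(self.freq_branch(spec))   # spectral pattern extraction
        if self.per_branch is not None:
            feats.append(self.per_branch(ceps))    # cepstral pattern extraction
        if not feats:                              # "None" variant: identity
            return spec
        # Residual connection keeps the baseline's original spectral input intact.
        return spec + self.mix(torch.cat(feats, dim=1))


if __name__ == "__main__":
    block = PF2NFusion(in_channels=1)
    spec = torch.randn(2, 1, 100, 128)   # (batch, channel, time, pitch bins)
    ceps = torch.randn(2, 1, 100, 128)
    print(block(spec, ceps).shape)       # torch.Size([2, 1, 100, 128])
```

Keeping the block's output at the same shape as its spectral input is what makes it drop-in compatible with different baselines, mirroring the modularity claim above.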
2. Related Works
2.1. Multi-Instrument Music Transcription
2.2. Combined Frequency and Periodicity
3. Proposed Method: PF2N
3.1. PF2N Overview
3.2. Joint Feature Extractor
3.3. Feature Fusion
3.4. Baseline Differences and PF2N Adaptability
4. Experiment
4.1. Settings
4.1.1. Dataset
4.1.2. Preprocessing
4.1.3. Evaluation Metrics
4.2. Main Results
4.3. Ablation Studies
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
1. Benetos, E.; Dixon, S.; Duan, Z.; Ewert, S. Automatic Music Transcription: An Overview. IEEE Signal Process. Mag. 2019, 36, 20–30.
2. Huang, C.A.; Vaswani, A.; Uszkoreit, J.; Simon, I.; Hawthorne, C.; Shazeer, N.; Dai, A.M.; Hoffman, M.D.; Dinculescu, M.; Eck, D. Music Transformer: Generating Music with Long-Term Structure. In Proceedings of the 7th International Conference on Learning Representations, ICLR, New Orleans, LA, USA, 6–9 May 2019.
3. Dong, H.W.; Chen, K.; Dubnov, S.; McAuley, J.; Berg-Kirkpatrick, T. Multitrack Music Transformer. In Proceedings of the 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023.
4. Armentano, M.G.; De Noni, W.A.; Cardoso, H.F. Genre classification of symbolic pieces of music. J. Intell. Inf. Syst. 2017, 48, 579–599.
5. Tsai, T.; Ji, K. Composer style classification of piano sheet music images using language model pretraining. In Proceedings of the 21st International Society for Music Information Retrieval Conference, ISMIR, Online, 7–12 November 2021; pp. 176–183.
6. Simonetta, F.; Chacón, C.E.C.; Ntalampiras, S.; Widmer, G. A Convolutional Approach to Melody Line Identification in Symbolic Scores. In Proceedings of the 20th International Society for Music Information Retrieval Conference, ISMIR, Delft, The Netherlands, 4–8 November 2019; pp. 924–931.
7. Rafii, Z.; Liutkus, A.; Stöter, F.R.; Mimilakis, S.I.; FitzGerald, D.; Pardo, B. An overview of lead and accompaniment separation in music. IEEE/ACM Trans. Audio Speech Lang. Process. 2018, 26, 1307–1335.
8. Poliner, G.E.; Ellis, D.P. A discriminative model for polyphonic piano transcription. EURASIP J. Adv. Signal Process. 2006, 2007, 1–9.
9. Costantini, G.; Perfetti, R.; Todisco, M. Event based transcription system for polyphonic piano music. Signal Process. 2009, 89, 1798–1811.
10. O’Hanlon, K.; Plumbley, M.D. Polyphonic piano transcription using non-negative Matrix Factorisation with group sparsity. In Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, 4–9 May 2014; pp. 3112–3116.
11. Gao, L.; Su, L.; Yang, Y.H.; Lee, T. Polyphonic piano note transcription with non-negative matrix factorization of differential spectrogram. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 291–295.
12. Wu, H.; Marmoret, A.; Cohen, J.E. Semi-supervised convolutive NMF for automatic piano transcription. arXiv 2022, arXiv:2202.04989.
13. Marolt, M. Transcription of polyphonic piano music with neural networks. In Proceedings of the 2000 10th Mediterranean Electrotechnical Conference, Information Technology and Electrotechnology for the Mediterranean Countries (MeleCon 2000), Lemesos, Cyprus, 29–31 May 2000; Volume 2, pp. 512–515.
14. Van Herwaarden, S.; Grachten, M.; De Haas, W.B. Predicting expressive dynamics in piano performances using neural networks. In Proceedings of the 15th International Society for Music Information Retrieval Conference, ISMIR, Taipei, Taiwan, 27–31 October 2014; pp. 45–52.
15. Sigtia, S.; Benetos, E.; Dixon, S. An End-to-End Neural Network for Polyphonic Piano Music Transcription. IEEE/ACM Trans. Audio Speech Lang. Process. 2016, 24, 927–939.
16. Hawthorne, C.; Elsen, E.; Song, J.; Roberts, A.; Simon, I.; Raffel, C.; Engel, J.H.; Oore, S.; Eck, D. Onsets and Frames: Dual-Objective Piano Transcription. In Proceedings of the 19th International Society for Music Information Retrieval Conference, ISMIR, Paris, France, 23–27 September 2018; pp. 50–57.
17. Kong, Q.; Li, B.; Song, X.; Wan, Y.; Wang, Y. High-resolution piano transcription with pedals by regressing onset and offset times. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 3707–3717.
18. Cheuk, K.W.; Herremans, D.; Su, L. ReconVAT: A Semi-Supervised Automatic Music Transcription Framework for Low-Resource Real-World Data. In Proceedings of the MM ’21: ACM Multimedia Conference, Virtual Event, 20–24 October 2021; pp. 3918–3926.
19. Manilow, E.; Seetharaman, P.; Pardo, B. Simultaneous Separation and Transcription of Mixtures with Multiple Polyphonic and Percussive Instruments. In Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 771–775.
20. Tanaka, K.; Nakatsuka, T.; Nishikimi, R.; Yoshii, K.; Morishima, S. Multi-Instrument Music Transcription Based on Deep Spherical Clustering of Spectrograms and Pitchgrams. In Proceedings of the 21st International Society for Music Information Retrieval Conference, ISMIR, Montreal, QC, Canada, 11–16 October 2020; pp. 327–334.
21. Lin, L.; Xia, G.; Kong, Q.; Jiang, J. A unified model for zero-shot music source separation, transcription and synthesis. In Proceedings of the 22nd International Society for Music Information Retrieval Conference, ISMIR, Online, 7–12 November 2021; pp. 381–388.
22. Wu, Y.T.; Chen, B.; Su, L. Multi-Instrument Automatic Music Transcription With Self-Attention-Based Instance Segmentation. IEEE/ACM Trans. Audio Speech Lang. Process. 2020, 28, 2796–2809.
23. Cheuk, K.W.; Choi, K.; Kong, Q.; Li, B.; Won, M.; Hung, A.; Wang, J.; Herremans, D. Jointist: Joint Learning for Multi-instrument Transcription and Its Applications. arXiv 2022, arXiv:2206.10805.
24. Gardner, J.; Simon, I.; Manilow, E.; Hawthorne, C.; Engel, J.H. MT3: Multi-Task Multitrack Music Transcription. In Proceedings of the Tenth International Conference on Learning Representations, ICLR, Virtual Event, 25–29 April 2022.
25. Chang, S.; Dixon, S.; Benetos, E. YourMT3: A toolkit for training multi-task and multi-track music transcription model for everyone. In Proceedings of the Digital Music Research Network One-day Workshop (DMRN+17), London, UK, 20 December 2022.
26. Lu, W.T.; Wang, J.; Hung, Y. Multitrack Music Transcription with a Time-Frequency Perceiver. In Proceedings of the 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5.
27. Chang, S.; Benetos, E.; Kirchhoff, H.; Dixon, S. YourMT3+: Multi-Instrument Music Transcription with Enhanced Transformer Architectures and Cross-Dataset STEM Augmentation. In Proceedings of the 34th IEEE International Workshop on Machine Learning for Signal Processing, MLSP 2024, London, UK, 22–25 September 2024; pp. 1–6.
28. Su, L.; Yang, Y.H. Combining Spectral and Temporal Representations for Multipitch Estimation of Polyphonic Music. IEEE/ACM Trans. Audio Speech Lang. Process. 2015, 23, 1600–1612.
29. Manilow, E.; Wichern, G.; Seetharaman, P.; Le Roux, J. Cutting Music Source Separation Some Slakh: A Dataset to Study the Impact of Training Data Quality and Quantity. In Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, USA, 20–23 October 2019.
30. Thickstun, J.; Harchaoui, Z.; Kakade, S.M. Learning Features of Music From Scratch. In Proceedings of the 5th International Conference on Learning Representations, ICLR, Toulon, France, 24–26 April 2017.
31. Hawthorne, C.; Stasyuk, A.; Roberts, A.; Simon, I.; Huang, C.A.; Dieleman, S.; Elsen, E.; Engel, J.H.; Eck, D. Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset. In Proceedings of the 7th International Conference on Learning Representations, ICLR, New Orleans, LA, USA, 6–9 May 2019.
32. Su, L.; Chuang, T.; Yang, Y. Exploiting Frequency, Periodicity and Harmonicity Using Advanced Time-Frequency Concentration Techniques for Multipitch Estimation of Choir and Symphony. In Proceedings of the 17th International Society for Music Information Retrieval Conference, ISMIR, New York, NY, USA, 7–11 August 2016; pp. 393–399.
33. Su, L. Vocal Melody Extraction Using Patch-Based CNN. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 371–375.
34. Wu, Y.T.; Chen, B.; Su, L. Automatic Music Transcription Leveraging Generalized Cepstral Features and Deep Learning. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 401–405.
35. Matsunaga, T.; Saito, H. Multi-Layer Combined Frequency and Periodicity Representations for Multi-Pitch Estimation of Multi-Instrument Music. IEEE/ACM Trans. Audio Speech Lang. Process. 2024, 32, 3171–3184.
Transcription F1 scores (%) on Slakh2100, MusicNet, and MAESTRO. Values in parentheses give the change from adding PF2N to the corresponding baseline.

| Model | Slakh2100 | MusicNet | MAESTRO |
|---|---|---|---|
| **Frame F1 (%)** | | | |
| Melodyne | 47.0 | 13.0 | 41.0 |
| ReconVAT [18] | - | 48.0 | - |
| MT3 [24] | 78.0 | 60.0 | 88.0 |
| YMT3 * [25] | 66.0 | 57.5 | 82.7 |
| YMT3 w/ PF2N | 66.4 (+0.4) | 58.8 (+1.3) | 83.6 (+0.9) |
| YPTF * [27] | 82.8 | 63.7 | 90.6 |
| YPTF w/ PF2N | 83.4 (+0.6) | 64.5 (+0.8) | 90.8 (+0.2) |
| **Onset F1 (%)** | | | |
| Melodyne | 30.0 | 4.0 | 52.0 |
| ReconVAT [18] | - | 29.0 | - |
| Jointist [23] | 58.4 | - | - |
| MT3 [24] | 76.0 | 39.0 | 96.0 |
| Perceiver TF [26] | 80.8 | - | 96.7 |
| YMT3 * [25] | 53.8 | 37.5 | 88.1 |
| YMT3 w/ PF2N | 57.1 (+3.3) | 39.6 (+2.1) | 88.8 (+0.7) |
| YPTF * [27] | 80.9 | 46.1 | 96.9 |
| YPTF w/ PF2N | 81.6 (+0.7) | 47.5 (+1.4) | 96.9 (+0.0) |
| **Onset + Offset F1 (%)** | | | |
| Melodyne | 10.0 | 1.0 | 6.0 |
| ReconVAT [18] | - | 11.0 | - |
| Jointist [23] | 26.3 | - | - |
| MT3 [24] | 57.0 | 21.0 | 84.0 |
| YMT3 * [25] | 33.4 | 21.2 | 69.1 |
| YMT3 w/ PF2N | 37.2 (+3.8) | 23.0 (+1.8) | 72.8 (+3.7) |
| YPTF * [27] | 65.6 | 29.0 | 87.4 |
| YPTF w/ PF2N | 66.7 (+1.1) | 30.7 (+1.7) | 87.7 (+0.3) |
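For reference, note-level onset and onset+offset F1 scores of the kind reported above are conventionally computed with `mir_eval.transcription`; the snippet below is a minimal sketch of that convention (a 50 ms onset tolerance and a 20% offset tolerance are the common defaults) and is not necessarily the exact evaluation script used in this paper. The toy note lists are made-up inputs for illustration.

```python
# Minimal sketch of the standard note-level F1 computation with mir_eval.
# Not necessarily the evaluation script used in the paper.
import numpy as np
import mir_eval

# Reference and estimated notes: intervals are (onset, offset) in seconds,
# pitches are in Hz. These values are toy examples.
ref_intervals = np.array([[0.0, 0.5], [0.5, 1.0]])
ref_pitches = np.array([440.0, 660.0])
est_intervals = np.array([[0.01, 0.48], [0.52, 0.95]])
est_pitches = np.array([440.0, 660.0])

# Onset F1: 50 ms onset tolerance, offsets ignored (offset_ratio=None).
_, _, onset_f1, _ = mir_eval.transcription.precision_recall_f1_overlap(
    ref_intervals, ref_pitches, est_intervals, est_pitches,
    onset_tolerance=0.05, offset_ratio=None)

# Onset+Offset F1: offsets must also match within 20% of the note duration
# (or 50 ms, whichever is larger).
_, _, onoff_f1, _ = mir_eval.transcription.precision_recall_f1_overlap(
    ref_intervals, ref_pitches, est_intervals, est_pitches,
    onset_tolerance=0.05, offset_ratio=0.2)

print(f"Onset F1: {onset_f1:.3f}, Onset+Offset F1: {onoff_f1:.3f}")
```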
Per-instrument scores (%) on Slakh2100 (Piano through C.perc.) and MusicNet (columns marked MN). Values in parentheses give the change from adding PF2N to the corresponding baseline.

| Model | #Param | Piano | Bass | Drums | Guitar | Strings | Brass | Organ | Pipe | Reed | S.lead | S.pad | C.perc. | Piano (MN) | Strings (MN) | Winds (MN) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| YMT3 | 44.9 M | 51.6 | 64.6 | 62.3 | 45.8 | 34.8 | 22.3 | 19.4 | 26.1 | 28.7 | 25.4 | 16.0 | 18.5 | 40.7 | 28.5 | 30.4 |
| YMT3 w/ PF2N | 45.2 M (+0.3 M) | 57.6 (+6.0) | 65.4 (+0.8) | 64.0 (+1.7) | 53.0 (+7.2) | 40.9 (+6.1) | 24.8 (+2.5) | 26.4 (+7.0) | 32.7 (+6.6) | 35.6 (+6.9) | 30.9 (+5.5) | 15.9 (−0.1) | 26.9 (+8.4) | 44.4 (+3.7) | 30.8 (+2.3) | 32.7 (+2.3) |
| YPTF | 96.4 M | 83.9 | 92.2 | 84.0 | 78.3 | 70.1 | 69.2 | 66.8 | 67.9 | 75.3 | 79.4 | 40.4 | 66.1 | 53.3 | 40.7 | 46.0 |
| YPTF w/ PF2N | 97.5 M (+1.1 M) | 84.9 (+1.0) | 92.5 (+0.3) | 83.5 (+0.5) | 79.5 (+0.8) | 70.5 (+0.4) | 73.7 (+4.5) | 67.7 (+0.9) | 67.8 (+0.1) | 77.2 (+1.9) | 81.1 (+1.7) | 40.1 (−0.3) | 66.3 (+0.2) | 54.5 (+1.2) | 41.6 (+0.9) | 49.0 (+3.0) |
A second per-instrument comparison in the same layout (drums not scored):

| Model | #Param | Piano | Bass | Drums | Guitar | Strings | Brass | Organ | Pipe | Reed | S.lead | S.pad | C.perc. | Piano (MN) | Strings (MN) | Winds (MN) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| YMT3 | 44.9 M | 27.2 | 48.3 | - | 28.8 | 20.7 | 14.4 | 10.8 | 16.3 | 18.3 | 15.9 | 0.1 | 0.1 | 23.2 | 15.8 | 15.8 |
| YMT3 w/ PF2N | 45.2 M (+0.3 M) | 32.0 (+4.8) | 49.6 (+1.3) | - | 35.7 (+6.9) | 24.1 (+3.4) | 16.5 (+2.1) | 16.1 (+5.3) | 20.8 (+4.5) | 24.2 (+5.9) | 21.4 (+5.5) | 0.1 (+0.0) | 0.1 (+0.0) | 25.5 (+2.3) | 17.5 (+1.7) | 18.4 (+2.6) |
| YPTF | 96.4 M | 61.8 | 83.8 | - | 63.6 | 55.1 | 62.3 | 55.4 | 57.2 | 65.9 | 69.0 | 27.1 | 36.4 | 32.6 | 25.6 | 32.3 |
| YPTF w/ PF2N | 97.5 M (+1.1 M) | 63.0 (+1.2) | 84.3 (+0.5) | - | 65.0 (+1.4) | 56.0 (+0.9) | 63.9 (+1.6) | 56.5 (+1.1) | 56.6 (−0.6) | 68.2 (+2.3) | 72.1 (+3.1) | 25.8 (−1.3) | 36.3 (−0.1) | 34.0 (+1.4) | 26.5 (+0.9) | 36.3 (+4.0) |
Ablation of the periodicity (P) and frequency (F) branches:

| Branches | Frame F1 (%) | Onset F1 (%) | Onset + Offset F1 (%) |
|---|---|---|---|
| P + F | 64.5 | 47.5 | 30.7 |
| P only | 63.9 | 46.2 | 29.5 |
| F only | 63.7 | 46.0 | 29.2 |
| None | 63.7 | 46.1 | 29.0 |
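Purely as an illustration of what the four rows correspond to, the hypothetical `PF2NFusion` sketch given after the contribution list could be configured for each ablation variant as follows (assumed names, not the authors' code):

```python
# Illustrative only: branch configurations mirroring the ablation rows above.
# Assumes the PF2NFusion sketch shown earlier is saved as pf2n_sketch.py.
from pf2n_sketch import PF2NFusion

full   = PF2NFusion(in_channels=1)                                   # P + F
p_only = PF2NFusion(in_channels=1, use_frequency=False)              # P only
f_only = PF2NFusion(in_channels=1, use_periodicity=False)            # F only
none   = PF2NFusion(in_channels=1, use_frequency=False,
                    use_periodicity=False)                           # None (identity)
```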