DAFE-MSGAT: Dual-Attention Feature Extraction and Multi-Scale Graph Attention Network for Polyphonic Piano Transcription
Abstract
1. Introduction
- We design DAFE-MSGAT, a network that effectively captures time-frequency features and exploits the interdependencies between notes, achieving strong performance in polyphonic piano transcription.
- We propose a dual-attention feature extraction module (DAFE) that enhances the frequency and temporal characteristics of the audio signal, overcoming the limitations of plain convolutional neural networks in capturing fine spectral and temporal detail (a simplified sketch of such a block is given after this list).
- We introduce the graph attention mechanism (GAT) into AMT for the first time and design an MSGAT that models the implicit interdependencies between notes, enriching and diversifying the feature representation (an illustrative GAT layer is also sketched below).
- Experimental results show that the proposed model performs well on public datasets, accurately identifying note onset and offset times, and is competitive with existing methods on both frame-level and note-level metrics.
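To make the dual-attention idea concrete, the following is a minimal sketch of a block that applies separate frequency and temporal attention to a spectrogram-like feature map. The module name, channel sizes, and pooling choices are illustrative assumptions for this sketch, not the authors' implementation.

```python
# Minimal sketch of a dual (frequency + temporal) attention block over a
# spectrogram-like feature map of shape (batch, channels, time, freq).
# Layer sizes and the module name are illustrative assumptions only.
import torch
import torch.nn as nn


class DualAttentionBlock(nn.Module):
    def __init__(self, channels: int = 48):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        # Frequency attention: pool over time, weight each frequency bin.
        self.freq_att = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )
        # Temporal attention: pool over frequency, weight each time frame.
        self.time_att = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T, F)
        h = self.conv(x)
        f_weights = self.freq_att(h.mean(dim=2))   # (B, C, F)
        t_weights = self.time_att(h.mean(dim=3))   # (B, C, T)
        h = h * f_weights.unsqueeze(2)             # emphasize frequency bins
        h = h * t_weights.unsqueeze(3)             # emphasize time frames
        return h + x                               # residual connection


if __name__ == "__main__":
    feat = torch.randn(2, 48, 100, 229)            # e.g., 100 frames x 229 mel bins
    print(DualAttentionBlock(48)(feat).shape)      # torch.Size([2, 48, 100, 229])
```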
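Likewise, the GAT-based note modeling can be illustrated with a generic single-head graph attention layer over 88 piano-pitch nodes, in the spirit of Veličković et al.'s GAT. The harmonic adjacency used here (linking pitches an octave or a fifth apart) and all names are assumptions for illustration; the paper's MSGAT additionally operates at multiple scales, which this single-layer sketch does not reproduce.

```python
# Generic single-head graph attention layer over 88 piano-pitch nodes.
# The harmonic adjacency below is an illustrative assumption, not the
# paper's graph construction.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GraphAttentionLayer(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)
        self.a = nn.Linear(2 * out_dim, 1, bias=False)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (N, in_dim) node features, adj: (N, N) 0/1 adjacency with self-loops
        h = self.W(x)                                        # (N, out_dim)
        n = h.size(0)
        pairs = torch.cat(
            [h.unsqueeze(1).expand(n, n, -1), h.unsqueeze(0).expand(n, n, -1)],
            dim=-1,
        )                                                    # (N, N, 2*out_dim)
        e = F.leaky_relu(self.a(pairs).squeeze(-1), 0.2)     # raw attention scores
        e = e.masked_fill(adj == 0, float("-inf"))           # keep only graph edges
        alpha = torch.softmax(e, dim=-1)                     # normalize over neighbors
        return F.elu(alpha @ h)                              # aggregate neighbor features


def harmonic_adjacency(n_pitches: int = 88) -> torch.Tensor:
    """Toy adjacency linking pitches 0, 7, or 12 semitones apart (incl. self-loops)."""
    idx = torch.arange(n_pitches)
    diff = (idx.unsqueeze(0) - idx.unsqueeze(1)).abs()
    return ((diff == 0) | (diff == 7) | (diff == 12)).float()


if __name__ == "__main__":
    nodes = torch.randn(88, 64)                        # one feature vector per key
    out = GraphAttentionLayer(64, 64)(nodes, harmonic_adjacency())
    print(out.shape)                                   # torch.Size([88, 64])
```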
2. Related Work
2.1. Frame-Level and Note-Level Transcription
2.2. Deep Learning Methods for AMT
3. The Proposed Method
3.1. Architecture Overview
3.2. Dual-Attention Feature Extraction Module
3.3. Multi-Scale Graph Attention Network
3.4. Loss Function
4. Experiments
4.1. Datasets and Evaluation Metrics
4.2. Implementation Details
4.3. Comparison with the SOTA
4.4. Transcription Results
4.5. Ablation Study
4.5.1. Effectiveness of the DAFE
4.5.2. Availability of the MSGAT
4.5.3. Validity of the Focal Loss
4.6. Effects of Different Threshold Values
4.7. Error Analysis
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Benetos, E.; Dixon, S.; Duan, Z.; Ewert, S. Automatic music transcription: An overview. IEEE Signal Process. Mag. 2018, 36, 20–30.
- Raphael, C. Automatic Transcription of Piano Music. In Proceedings of the 3rd International Conference on Music Information Retrieval (ISMIR 2002), Paris, France, 13–17 October 2002.
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
- Bello Correa, J.P. Towards the Automated Analysis of Simple Polyphonic Music: A Knowledge-Based Approach. Ph.D. Thesis, Queen Mary University of London, London, UK, 2003.
- Goto, M. A real-time music-scene-description system: Predominant-F0 estimation for detecting melody and bass lines in real-world audio signals. Speech Commun. 2004, 43, 311–329.
- Klapuri, A.; Davy, M. Signal Processing Methods for Music Transcription; Springer: New York, NY, USA, 2007.
- Nam, J.; Ngiam, J.; Lee, H.; Slaney, M. A Classification-Based Polyphonic Piano Transcription Approach Using Learned Feature Representations. In Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR 2011), Miami, FL, USA, 24–28 October 2011; pp. 175–180.
- Böck, S.; Schedl, M. Polyphonic piano note transcription with recurrent neural networks. In Proceedings of the 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Kyoto, Japan, 25–30 March 2012; IEEE: New York, NY, USA, 2012; pp. 121–124.
- Sigtia, S.; Benetos, E.; Dixon, S. An end-to-end neural network for polyphonic piano music transcription. IEEE/ACM Trans. Audio Speech Lang. Process. 2016, 24, 927–939.
- Kelz, R.; Dorfer, M.; Korzeniowski, F.; Böck, S.; Arzt, A.; Widmer, G. On the potential of simple framewise approaches to piano transcription. arXiv 2016, arXiv:1612.05153.
- Kelz, R.; Böck, S.; Widmer, G. Multitask learning for polyphonic piano transcription, a case study. In Proceedings of the 2019 International Workshop on Multilayer Music Representation and Processing (MMRP), Milan, Italy, 23–24 January 2019; IEEE: New York, NY, USA, 2019; pp. 85–91.
- Benetos, E.; Dixon, S.; Giannoulis, D.; Kirchhoff, H.; Klapuri, A. Automatic music transcription: Challenges and future directions. J. Intell. Inf. Syst. 2013, 41, 407–434.
- Gardner, J.; Simon, I.; Manilow, E.; Hawthorne, C.; Engel, J. MT3: Multi-task multitrack music transcription. arXiv 2021, arXiv:2111.03017.
- Duan, Z.; Temperley, D. Note-level Music Transcription by Maximum Likelihood Sampling. In Proceedings of the 15th International Society for Music Information Retrieval Conference (ISMIR 2014), Taipei, Taiwan, 27–31 October 2014; pp. 181–186.
- Fernandez, A. Onsets and Velocities: Affordable Real-Time Piano Transcription Using Convolutional Neural Networks. In Proceedings of the 2023 31st European Signal Processing Conference (EUSIPCO), Helsinki, Finland, 4–8 September 2023; IEEE: New York, NY, USA, 2023; pp. 151–155.
- Meng, Z.; Chen, W. Automatic music transcription based on convolutional neural network, constant Q transform and MFCC. J. Phys. Conf. Ser. 2020, 1651, 012192.
- Aljamea, H.H.; Mattar, E.A. Automatic music transcription using CNN neural networks on segmented audio. In Proceedings of the 4th Smart Cities Symposium (SCS 2021), Online, 21–23 November 2021; IET: London, UK, 2021; pp. 333–337.
- Benetos, E.; Dixon, S.; Giannoulis, D.; Kirchhoff, H.; Klapuri, A. Automatic music transcription: Breaking the glass ceiling. In Proceedings of the 13th International Society for Music Information Retrieval Conference (ISMIR 2012), Porto, Portugal, 8–12 October 2012.
- Emiya, V.; Badeau, R.; David, B. Multipitch estimation of piano sounds using a new probabilistic spectral smoothness principle. IEEE Trans. Audio Speech Lang. Process. 2009, 18, 1643–1654.
- Duan, Z.; Pardo, B.; Zhang, C. Multiple fundamental frequency estimation by modeling spectral peaks and non-peak regions. IEEE Trans. Audio Speech Lang. Process. 2010, 18, 2121–2133.
- Smaragdis, P.; Brown, J.C. Non-negative matrix factorization for polyphonic music transcription. In Proceedings of the 2003 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, NY, USA, 19–22 October 2003; IEEE: New York, NY, USA, 2003; pp. 177–180.
- Vincent, E.; Bertin, N.; Badeau, R. Adaptive harmonic spectral decomposition for multiple pitch estimation. IEEE Trans. Audio Speech Lang. Process. 2009, 18, 528–537.
- Su, L.; Yang, Y.H. Combining spectral and temporal representations for multipitch estimation of polyphonic music. IEEE/ACM Trans. Audio Speech Lang. Process. 2015, 23, 1600–1612.
- Sigtia, S.; Benetos, E.; Cherla, S.; Weyde, T.; Garcez, A.; Dixon, S. An RNN-based music language model for improving automatic music transcription. In Proceedings of the 15th International Society for Music Information Retrieval Conference (ISMIR 2014), Taipei, Taiwan, 27–31 October 2014; pp. 53–58.
- Sturm, B.L.; Santos, J.F.; Ben-Tal, O.; Korshunova, I. Music transcription modelling and composition using deep learning. arXiv 2016, arXiv:1604.08723.
- Hawthorne, C.; Stasyuk, A.; Roberts, A.; Simon, I.; Huang, C.Z.A.; Dieleman, S.; Elsen, E.; Engel, J.; Eck, D. Enabling factorized piano music modeling and generation with the MAESTRO dataset. arXiv 2018, arXiv:1810.12247.
- Hawthorne, C.; Simon, I.; Swavely, R.; Manilow, E.; Engel, J. Sequence-to-sequence piano transcription with transformers. arXiv 2021, arXiv:2107.09142.
- Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv 2014, arXiv:1409.0473.
- Raguraman, P.; Mohan, R.; Vijayan, M. Librosa based assessment tool for music information retrieval systems. In Proceedings of the 2019 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), San Jose, CA, USA, 28–30 March 2019; IEEE: New York, NY, USA, 2019; pp. 109–114.
- Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988.
- Zhang, W.; Chen, Z.; Yin, F. Multi-pitch estimation of polyphonic music based on pseudo two-dimensional spectrum. IEEE/ACM Trans. Audio Speech Lang. Process. 2020, 28, 2095–2108.
- Hawthorne, C.; Elsen, E.; Song, J.; Roberts, A.; Simon, I.; Raffel, C.; Engel, J.; Oore, S.; Eck, D. Onsets and frames: Dual-objective piano transcription. arXiv 2017, arXiv:1710.11153.
- Yacouby, R.; Axman, D. Probabilistic extension of precision, recall, and F1 score for more thorough evaluation of classification models. In Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems, Online, 2020; pp. 79–91.
- Klapuri, A. Multiple fundamental frequency estimation by summing harmonic amplitudes. In Proceedings of the 7th International Conference on Music Information Retrieval (ISMIR 2006), Victoria, BC, Canada, 8–12 October 2006; pp. 216–221.
- Benetos, E.; Weyde, T. Multiple-F0 estimation and note tracking for MIREX 2015 using a sound state-based spectrogram factorization model. In Proceedings of the 11th Annual Music Information Retrieval eXchange (MIREX’15), Malaga, Spain, 26–30 October 2015; pp. 1–2.
- Kong, Q.; Li, B.; Song, X.; Wan, Y.; Wang, Y. High-resolution piano transcription with pedals by regressing onset and offset times. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 3707–3717.
- Wei, W.; Li, P.; Yu, Y.; Li, W. HPPNet: Modeling the harmonic structure and pitch invariance in piano transcription. arXiv 2022, arXiv:2208.14339.
- Xiao, Z.; Chen, X.; Zhou, L. Polyphonic piano transcription based on graph convolutional network. Signal Process. 2023, 212, 109134.
| Method | Year | Note P | Note R | Note F1 | Frame P | Frame R | Frame F1 | Note w/Offset P | Note w/Offset R | Note w/Offset F1 |
|---|---|---|---|---|---|---|---|---|---|---|
| SHA | 2006 | 64.75% | 72.21% | 68.53% | 53.27% | 71.65% | 62.38% | 36.32% | 41.65% | 39.44% |
| SPNRM | 2010 | 70.47% | 68.29% | 69.38% | 67.83% | 59.81% | 63.52% | 42.62% | 40.27% | 41.34% |
| S3F | 2015 | 76.78% | 67.24% | 72.06% | 78.33% | 52.26% | 65.74% | 47.78% | 41.26% | 44.53% |
| CBLSTM | 2017 | 83.92% | 80.16% | 81.78% | 88.23% | 70.39% | 77.84% | 51.09% | 48.77% | 50.34% |
| S2S | 2021 | 84.91% | 83.75% | 84.34% | 91.18% | 76.33% | 83.75% | 57.28% | 53.76% | 55.52% |
| HRT | 2021 | 86.25% | 83.82% | 85.03% | 87.64% | 75.97% | 81.85% | 56.64% | 52.28% | 54.46% |
| CR-GCN | 2023 | 84.30% | 84.65% | 84.48% | 90.42% | 77.92% | 83.51% | 55.17% | 54.82% | 54.98% |
| OURS | — | 84.14% | 83.94% | 84.36% | 84.06% | 73.21% | 79.63% | 55.76% | 51.41% | 52.64% |
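To read the table columns, "Note" scores count a prediction as correct when its pitch matches and its onset falls within a small tolerance of the reference onset, "Note w/Offset" additionally requires the offset to match, and "Frame" scores compare binary piano rolls frame by frame. The sketch below shows the conventional way these metrics are computed (e.g., with the mir_eval toolkit, onset tolerance 50 ms, offset tolerance 20% of note duration); the authors' exact evaluation script and tolerances are not reproduced here, so treat this as the standard convention rather than their code.

```python
# Conventional note-level and frame-level scoring for piano transcription,
# sketched with the mir_eval toolkit. Tolerances follow common practice;
# the paper's evaluation settings may differ.
import numpy as np
import mir_eval


def note_scores(ref_intervals, ref_pitches, est_intervals, est_pitches):
    """Intervals are (n, 2) arrays of onset/offset times in seconds; pitches in Hz."""
    # "Note": pitch and onset must match; offsets ignored (offset_ratio=None).
    note = mir_eval.transcription.precision_recall_f1_overlap(
        ref_intervals, ref_pitches, est_intervals, est_pitches,
        onset_tolerance=0.05, offset_ratio=None)
    # "Note w/Offset": offsets must also match within the tolerance.
    note_off = mir_eval.transcription.precision_recall_f1_overlap(
        ref_intervals, ref_pitches, est_intervals, est_pitches,
        onset_tolerance=0.05, offset_ratio=0.2, offset_min_tolerance=0.05)
    return note[:3], note_off[:3]      # (precision, recall, F1) tuples


def frame_scores(ref_roll: np.ndarray, est_roll: np.ndarray, eps: float = 1e-9):
    """'Frame': binary piano rolls of shape (frames, 88) compared cell by cell."""
    tp = np.logical_and(ref_roll, est_roll).sum()
    precision = tp / (est_roll.sum() + eps)
    recall = tp / (ref_roll.sum() + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    return precision, recall, f1
```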
| Method | Year | Note P | Note R | Note F1 | Frame P | Frame R | Frame F1 | Note w/Offset P | Note w/Offset R | Note w/Offset F1 |
|---|---|---|---|---|---|---|---|---|---|---|
| CBLSTM | 2017 | 97.42% | 92.37% | 94.84% | 91.13% | 88.76% | 89.19% | 81.84% | 77.66% | 79.67% |
| S2S | 2021 | 98.11% | 95.89% | 96.95% | 91.61% | 93.29% | 92.43% | 84.57% | 83.23% | 83.82% |
| HRT | 2021 | 98.43% | 94.81% | 96.61% | 88.91% | 90.28% | 89.51% | 83.81% | 80.70% | 82.26% |
| HPPNet | 2022 | 98.31% | 96.18% | 97.21% | 92.36% | 93.46% | 92.86% | 85.36% | 83.54% | 84.41% |
| CR-GCN | 2023 | 97.38% | 96.21% | 96.88% | 91.49% | 94.03% | 92.77% | 82.41% | 84.22% | 83.18% |
| OURS | — | 97.24% | 95.92% | 96.52% | 91.06% | 95.21% | 92.79% | 84.76% | 83.32% | 82.21% |
| DAFE | MSGAT | Focal Loss | Note P | Note R | Note F1 | Frame P | Frame R | Frame F1 |
|---|---|---|---|---|---|---|---|---|
| √ | √ |  | 94.36% | 93.76% | 94.14% | 89.83% | 90.16% | 87.65% |
| √ | √ |  | 91.38% | 89.06% | 88.87% | 85.47% | 87.92% | 85.86% |
| √ | √ |  | 63.21% | 62.15% | 62.75% | 50.52% | 57.28% | 52.26% |
| √ | √ | √ | 97.24% | 95.92% | 96.52% | 91.06% | 95.21% | 92.79% |
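For reference, the "Focal Loss" component in the ablation follows Lin et al. (cited above): it down-weights easy negatives, which dominate piano-roll targets since most time-pitch cells are silent. A minimal binary focal loss of that form is sketched below; the alpha and gamma values shown are common defaults, not necessarily the paper's hyperparameters.

```python
# Binary focal loss (after Lin et al.) for sparse piano-roll targets.
# alpha and gamma are common defaults; the paper's settings may differ.
import torch


def focal_loss(logits: torch.Tensor, targets: torch.Tensor,
               alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """logits, targets: (batch, frames, 88); targets are 0/1."""
    prob = torch.sigmoid(logits)
    ce = torch.nn.functional.binary_cross_entropy_with_logits(
        logits, targets, reduction="none")
    p_t = prob * targets + (1.0 - prob) * (1.0 - targets)        # prob of the true class
    alpha_t = alpha * targets + (1.0 - alpha) * (1.0 - targets)
    # (1 - p_t)^gamma down-weights well-classified (mostly silent) cells.
    return (alpha_t * (1.0 - p_t) ** gamma * ce).mean()


if __name__ == "__main__":
    logits = torch.randn(2, 100, 88)
    targets = (torch.rand(2, 100, 88) < 0.03).float()   # active notes are rare
    print(focal_loss(logits, targets).item())
```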
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).