SpectTrans: Joint Spectral–Temporal Modeling for Polyphonic Piano Transcription via Spectral Gating Networks
Abstract
1. Introduction
2. Related Works
2.1. Automatic Piano Transcription
2.2. Spectral Transformers
3. Methods
3.1. Acoustics Spectral Gating Network
3.2. Acoustics Spectral Attention
3.3. SpectTrans
4. Results
4.1. Datasets and Evaluation Metrics
4.2. Experiment Setup
4.3. Evaluation Results of Piano Transcription
4.4. Evaluation Results of Pedal Piano Transcription
4.5. Ablation Studies
4.6. Computational Efficiency
4.7. Visualization of Frame Activations
5. Error Analysis and Limitation
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| AMT | Automatic Music Transcription |
| MIR | Music Information Retrieval |
| SGN | Spectral Gating Network |
| A-SGN | Acoustic Spectral Gating Network |
| MHSA | Multi-Head Self-Attention |
| FFT/IFFT | Fast Fourier Transform/Inverse FFT |
| CNN | Convolutional Neural Network |
| RNN | Recurrent Neural Network |
| MLM | Music Language Model |
| FNO | Fourier Neural Operator |



| Methods | Frame P (%) | Frame R (%) | Frame F1 (%) | Note P (%) | Note R (%) | Note F1 (%) | Note w/ Offset P (%) | Note w/ Offset R (%) | Note w/ Offset F1 (%) |
|---|---|---|---|---|---|---|---|---|---|
| Onsets & frames | - | - | - | 92.6 | 98.3 | 95.3 | 78.2 | 83.0 | 80.5 |
| Adversarial | - | - | - | 93.2 | 98.1 | 95.6 | 79.3 | 83.5 | 81.3 |
| S2S | - | - | - | - | - | 96.0 | - | - | 83.5 |
|  | - | - | - | 94.9 | 97.2 | 96.0 | 83.7 | 85.7 | 84.7 |
| HRT (baseline) | 87.4 | 90.1 | 82.7 | 98.3 | 97.4 | 96.1 | 92.7 | 88.9 | 89.6 |
| SpectTrans | 88.1 | 90.5 | 89.2 | 98.0 | 95.3 | 96.6 | 93.0 | 88.0 | 90.3 |

| Methods | Frame P (%) | Frame R (%) | Frame F1 (%) | Note P (%) | Note R (%) | Note F1 (%) | Note w/ Offset P (%) | Note w/ Offset R (%) | Note w/ Offset F1 (%) |
|---|---|---|---|---|---|---|---|---|---|
| HRT (baseline) | 85.2 | 80.2 | 82.5 | 78.9 | 87.4 | 82.8 | 54.0 | 59.9 | 56.2 |
| SpectTrans | 85.3 | 79.7 | 82.7 | 77.6 | 87.3 | 81.6 | 54.7 | 60.3 | 56.7 |

| Methods | Frame P (%) | Frame R (%) | Frame F1 (%) | Event P (%) | Event R (%) | Event F1 (%) | Event w/ Offset P (%) | Event w/ Offset R (%) | Event w/ Offset F1 (%) |
|---|---|---|---|---|---|---|---|---|---|
| HRT | 94.76 | 94.72 | 94.69 | 90.70 | 92.56 | 91.46 | 80.13 | 78.61 | 79.07 |
| SpectTrans | 95.04 | 94.91 | 94.91 | 97.42 | 92.37 | 94.80 | 81.84 | 77.66 | 79.67 |

| Methods | Parameters | FLOPs | Frame P (%) | Frame R (%) | Frame F1 (%) |
|---|---|---|---|---|---|
| Standard attention | 125.37 | 233.72 | - | - | - |
| FN | 106.47 | 214.83 | 87.9 | 87.8 | 89.3 |
| DCT | 109.51 | 231.82 | 88.3 | 89.6 | 89.1 |
| SpectTrans | 106.46 | 214.82 | 88.1 | 90.5 | 89.2 |
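
The P, R and F1 columns in the tables above are precision, recall and F1 score. As a minimal sketch of how such scores can be derived from match counts, the Python below computes them for one piece and then averages them over several pieces. The function names and the example counts are hypothetical and are not taken from the paper; this excerpt does not state whether the reported scores are averaged per piece or pooled over the whole test set. The sketch also illustrates that, under per-piece averaging, the averaged F1 generally differs from the harmonic mean of the averaged P and R.

```python
from typing import List, Tuple


def prf(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall and F1 from true-positive, false-positive
    and false-negative counts. Undefined ratios are reported as 0.0."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def per_piece_average(
    counts: List[Tuple[int, int, int]]
) -> Tuple[float, float, float]:
    """Average P, R and F1 over pieces, scoring each piece on its own counts."""
    if not counts:
        raise ValueError("at least one piece is required")
    scores = [prf(*c) for c in counts]
    n = len(scores)
    return (
        sum(s[0] for s in scores) / n,
        sum(s[1] for s in scores) / n,
        sum(s[2] for s in scores) / n,
    )


# Hypothetical (tp, fp, fn) counts for two pieces; illustrative only.
pieces = [(90, 10, 30), (50, 50, 0)]
avg_p, avg_r, avg_f1 = per_piece_average(pieces)

# Harmonic mean of the averaged P and R, for comparison with avg_f1.
harmonic_of_means = 2 * avg_p * avg_r / (avg_p + avg_r)

print(f"avg P={avg_p:.3f}  avg R={avg_r:.3f}  avg F1={avg_f1:.3f}")
print(f"harmonic mean of avg P and avg R={harmonic_of_means:.3f}")
```

For these made-up counts the two F1 figures differ, so a reported F1 cannot always be recomputed from the reported P and R alone.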
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Cao, R.; Liang, Y.; Feng, L.; Li, Y. SpectTrans: Joint Spectral–Temporal Modeling for Polyphonic Piano Transcription via Spectral Gating Networks. Electronics 2026, 15, 665. https://doi.org/10.3390/electronics15030665

