RNN-Based F0 Estimation Method with Attention Mechanism
Abstract
1. Introduction
2. Preliminaries: State-of-the-Art F0 Estimation Methods
2.1. Regression Models
- Inspired by techniques for extracting F0 contours from noisy signals, this approach trains an RNN to regress a sinusoidal representation of the F0 from waveform segments [22]. The aim is to approximate the smooth underlying sinusoid that represents the perceived F0, even when the signal is degraded. The method is well suited to F0 tracking in real-world conditions where robustness to noise and temporal continuity are crucial, such as singing voice analysis [23] or speech enhancement [24].
- This method employs a sequence-to-one regression architecture [25], where a single-valued target variable is produced using multiple time-series features. It uses LSTM layers with “last” output mode to capture temporal dependencies across the input, followed by a dense layer to produce a continuous scalar output. As a common baseline for sequence regression tasks, it provides a clear benchmark for evaluating more advanced models.
- This regression-based approach [26] estimates the F0 from short speech segments using a deep NN: each segment of the signal is processed as a separate input to an LSTM network, which produces a single continuous F0 value. Unlike classification methods, it avoids quantisation errors and yields smooth F0 trajectories, making it suitable for applications that require high-precision F0 tracking, such as expressive speech analysis or music transcription.
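The sequence-to-one pattern described above (recurrent layer, last time step, dense layer, scalar output) can be sketched in plain Python. This is a minimal illustration with randomly initialised weights, not the implementation from [25] or [26]; the helper names (`init_params`, `lstm_last_output`, `sequence_to_one`), the gate ordering, and the toy sizes are assumptions made for the example.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_params(input_size, hidden_size, seed=0):
    """Create small random weights (hypothetical initialisation)."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "hidden_size": hidden_size,
        "W": mat(4 * hidden_size, input_size),   # input weights, four gates stacked
        "U": mat(4 * hidden_size, hidden_size),  # recurrent weights, four gates stacked
        "b": [0.0] * (4 * hidden_size),          # gate biases
        "w_out": [rng.uniform(-0.1, 0.1) for _ in range(hidden_size)],
        "b_out": 0.0,
    }


def lstm_last_output(sequence, params):
    """Run one LSTM layer over the sequence and return only the final hidden state."""
    n = params["hidden_size"]
    h = [0.0] * n
    c = [0.0] * n
    for x in sequence:
        # Stacked pre-activations; assumed gate order: input, forget, candidate, output.
        z = [
            sum(w * xi for w, xi in zip(params["W"][r], x))
            + sum(u * hi for u, hi in zip(params["U"][r], h))
            + params["b"][r]
            for r in range(4 * n)
        ]
        i = [sigmoid(v) for v in z[0:n]]
        f = [sigmoid(v) for v in z[n:2 * n]]
        g = [math.tanh(v) for v in z[2 * n:3 * n]]
        o = [sigmoid(v) for v in z[3 * n:4 * n]]
        c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
        h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
    return h


def sequence_to_one(sequence, params):
    """Dense layer on the last hidden state, giving one continuous value."""
    h_last = lstm_last_output(sequence, params)
    return sum(w * hj for w, hj in zip(params["w_out"], h_last)) + params["b_out"]


# Toy input: 20 time steps with 3 features each (hypothetical values).
params = init_params(input_size=3, hidden_size=8)
segment = [[0.1 * t, 0.0, -0.1 * t] for t in range(20)]
f0_estimate = sequence_to_one(segment, params)
```

With untrained weights the returned scalar is meaningless as an F0 value; the point is only the data flow, in which a whole sequence is reduced to a single continuous output.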
2.2. Classification Models
- Building on traditional signal processing techniques, this method first computes the Pseudo Wigner-Ville Distribution (PWVD) [27] of the input speech signal to generate a high-resolution time-frequency representation. The resulting spectro-temporal features are then fed into an LSTM-based classification network to estimate the F0 class. This hybrid model combines the interpretability and precision of PWVD in the time–frequency domain with the temporal modelling capabilities of LSTMs, making it resilient in noisy or complex acoustic environments [28].
- Inspired by the work of Wang et al. [29], this method uses a multi-tier RNN framework in which each tier predicts the F0 at a different temporal resolution. Feedback connections between tiers improve the consistency of F0 contours across time scales. The model outputs quantised F0 classes, making it especially suitable for text-to-speech synthesis, where smooth, quantised F0 transitions are essential for natural prosody.
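Both classification models above predict an F0 class rather than a continuous value. The following sketch shows one way a continuous F0 can be mapped to a class index and back, and how large the resulting quantisation error is. The frequency range (50 to 500 Hz), the number of bins (128), and the function names are hypothetical choices for illustration; they are not taken from [28] or [29].

```python
import math

# Hypothetical quantisation grid: 128 log-spaced bins between 50 Hz and 500 Hz.
F_MIN, F_MAX, N_BINS = 50.0, 500.0, 128


def bin_centres(f_min=F_MIN, f_max=F_MAX, n_bins=N_BINS):
    """Return log-spaced bin centre frequencies in Hz."""
    step = math.log(f_max / f_min) / (n_bins - 1)
    return [f_min * math.exp(k * step) for k in range(n_bins)]


def f0_to_class(f0_hz, centres):
    """Index of the bin whose centre is closest to f0_hz on a log-frequency scale."""
    return min(range(len(centres)), key=lambda k: abs(math.log(f0_hz / centres[k])))


def class_to_f0(index, centres):
    """Map a class index back to its bin centre frequency in Hz."""
    return centres[index]


centres = bin_centres()
f0 = 123.4                          # example F0 in Hz
k = f0_to_class(f0, centres)        # quantised class label
f0_quantised = class_to_f0(k, centres)
error_cents = 1200.0 * math.log2(f0_quantised / f0)  # quantisation error in cents
```

On a log-spaced grid the quantisation error is bounded by half the bin spacing, which is the error a regression model avoids by predicting the F0 directly.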
2.3. Baseline Models
- The pYIN method [6] is a mature, high-performance algorithm rooted in classical DSP and provides a crucial baseline for F0 estimation. It comprises two steps: first, it generates multiple pitch candidates with associated probabilities, based on a probability distribution over the YIN threshold; second, a hidden Markov model with Viterbi decoding processes these probabilistic candidates, yielding a more robust and accurate final pitch track.
- The Convolutional Representation for Pitch Estimation (CREPE) [10] is a state-of-the-art deep learning model that treats F0 estimation as a classification task over 360 log-spaced pitch bins and operates directly on the raw audio waveform. It uses a deep, six-layer CNN to learn hierarchical spectral features automatically, eliminating the need for handcrafted feature extraction. CREPE achieves high pitch-detection accuracy and is robust across diverse acoustic conditions and signals, including speech and music. Its performance and end-to-end design make it an essential baseline, particularly for illustrating the trade-off between efficiency and accuracy.
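The second pYIN step described above, decoding a pitch track from per-frame candidate probabilities, can be illustrated with a generic Viterbi decoder. The sketch below is deliberately simplified and is not the pYIN implementation from [6]: the three pitch states, the transition matrix, and the observation probabilities are hypothetical values chosen to show how a temporal model can suppress an isolated octave error.

```python
import math


def viterbi(obs_probs, trans_probs, init_probs):
    """Most likely state sequence given per-frame state probabilities.

    obs_probs[t][s]   : probability of state s at frame t
    trans_probs[p][s] : probability of moving from state p to state s
    init_probs[s]     : prior probability of starting in state s
    """
    n_states = len(init_probs)

    def log(p):
        return math.log(max(p, 1e-12))  # guard against log(0)

    score = [log(init_probs[s]) + log(obs_probs[0][s]) for s in range(n_states)]
    backpointers = []
    for t in range(1, len(obs_probs)):
        new_score, pointers = [], []
        for s in range(n_states):
            best_prev = max(
                range(n_states),
                key=lambda p: score[p] + log(trans_probs[p][s]),
            )
            new_score.append(
                score[best_prev] + log(trans_probs[best_prev][s]) + log(obs_probs[t][s])
            )
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)

    # Trace the best path backwards from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Hypothetical pitch states in Hz; state 2 is the octave above state 0.
pitch_states = [100.0, 105.0, 200.0]
init = [1 / 3, 1 / 3, 1 / 3]
trans = [
    [0.60, 0.35, 0.05],  # small pitch changes are likely, octave jumps are not
    [0.35, 0.60, 0.05],
    [0.05, 0.05, 0.90],
]
obs = [
    [0.80, 0.15, 0.05],
    [0.70, 0.20, 0.10],
    [0.40, 0.10, 0.50],  # isolated frame where the octave candidate looks best
    [0.75, 0.20, 0.05],
]

framewise = [max(range(3), key=lambda s: frame[s]) for frame in obs]
decoded = viterbi(obs, trans, init)
pitch_track = [pitch_states[s] for s in decoded]
```

In this toy case the frame-wise maximum picks the octave candidate in the third frame, whereas the decoded path stays on the 100 Hz state because the transition model makes a one-frame octave jump unlikely.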
3. RNN-Based F0 Estimation Method with Attention Mechanism
3.1. The Architecture
- A preprocessed signal in the form of a 20 ms frame serves as the input to the algorithm.
- The sequence input layer is the entry point for the signal; it feeds the sequential data into the rest of the network, typically by mapping the input to a vector representation.
- The recurrent (LSTM) layer with 50 hidden units captures temporal dependencies by maintaining a “memory” of past information in the sequence.
- The attention mechanism with 5 heads introduces 1300 parameters and helps the network process long input sequences more efficiently, because the LSTM alone no longer has to summarise the whole sequence in a fixed-length vector. In addition, the signal frames are processed in batches for better performance.
- Two fully connected layers follow, each with two neurons; every neuron of the previous layer is connected to every neuron of the current layer.
- The output layer is a single layer with one neuron and a softmax activation; it produces the class probabilities used for the F0 estimation and represents the model’s output.
- The F0 (fundamental frequency), the lowest frequency of a periodic waveform, is the final output of the network.
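To make the role of the attention step in the list above more concrete, the following sketch pools a sequence of LSTM hidden states into one context vector using scaled dot-product attention. It is a single-head simplification without learned projections, so it does not reproduce the 5-head layer or its parameter count; the function names, the choice of the last hidden state as the query, and the toy values are assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, query):
    """Weighted sum of hidden states with weights softmax(query . h_t / sqrt(d))."""
    d = len(query)
    scores = [
        sum(q * h for q, h in zip(query, h_t)) / math.sqrt(d)
        for h_t in hidden_states
    ]
    weights = softmax(scores)
    context = [
        sum(w * h_t[j] for w, h_t in zip(weights, hidden_states))
        for j in range(d)
    ]
    return context, weights


# Hypothetical LSTM outputs for four time steps with hidden size 2.
hidden_states = [[0.1, 0.0], [0.9, 0.2], [0.8, 0.1], [0.0, 0.1]]
query = hidden_states[-1]  # assumption: the last hidden state acts as the query
context, weights = attention_pool(hidden_states, query)
```

Instead of relying only on the last hidden state, the context vector draws on every time step, weighted by relevance to the query, which is the sense in which attention relieves the LSTM of the fixed-length summarisation burden.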
3.2. Models’ Settings
4. Experiment
4.1. Training Dataset and Settings
4.2. Evaluation Metrics
- RMSE measures the average magnitude of the error between predicted and actual values. It provides an absolute measure of fit and is especially useful when large errors are undesirable, because squaring penalises them more heavily. Since RMSE has the same unit as the output variable, it is easy to interpret and well suited to evaluating continuous F0 estimation methods. For regression models, RMSE applies directly. For classification models, the NN output probabilities over the F0 bins are first converted to a continuous F0 value by taking the probability-weighted average of the bin centre frequencies, which is then compared with the ground-truth continuous F0. This setup allows a direct comparison of accuracy and efficiency across regression-based and classification-based methods, highlighting the strengths and limitations of each approach to F0 tracking.
- FLOPs are a basic measure of a machine learning model’s computational complexity: the total number of floating-point operations (additions, subtractions, multiplications, and divisions) needed for a single pass of the model. Fewer FLOPs indicate a more efficient approach. The total number of FLOPs in an RNN or CNN is calculated as the sum of the FLOPs for the forward and backward passes.
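The RMSE procedure for classification models described above can be sketched as follows: each frame's class probabilities are collapsed into a continuous F0 by a probability-weighted average of the bin centres, and the result is compared with the reference F0. The function names, the three bin centres, and the probability values are hypothetical and serve only to illustrate the computation.

```python
import math


def expected_f0(probs, centres):
    """Probability-weighted average of bin centre frequencies for one frame."""
    total = sum(probs)
    if total <= 0.0:
        raise ValueError("class probabilities must sum to a positive value")
    return sum(p * c for p, c in zip(probs, centres)) / total


def rmse(predicted, reference):
    """Root mean squared error between two equally long sequences."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical example: three F0 bins and two frames of network output.
centres = [100.0, 110.0, 120.0]      # bin centre frequencies in Hz
frame_probs = [
    [0.1, 0.8, 0.1],                 # mass centred on the 110 Hz bin
    [0.0, 0.5, 0.5],                 # mass split between 110 Hz and 120 Hz
]
reference = [110.0, 112.0]           # ground-truth continuous F0 in Hz

predicted = [expected_f0(p, centres) for p in frame_probs]
error = rmse(predicted, reference)
```

For a regression model the `expected_f0` step is skipped and `rmse` is applied to the network outputs directly, which is why the same metric can be reported for both model families.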
4.3. Performance Results and Discussion
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Rabiner, L.R. On the use of autocorrelation analysis for pitch detection. IEEE Trans. Acoust. Speech Signal Process. 1977, 25, 24–33.
- Hansen, J.H.L.; Hasan, T. Speaker recognition from speech: A review of the past and present. IEEE Signal Process. Mag. 2015, 32, 74–99.
- Drugman, T.; Kane, J.; Raitio, T.; Gobl, C. Prediction of creaky voice from contextual factors. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 26–31 May 2013; pp. 7967–7971.
- Schroeder, M.R. Period histogram and product spectrum: New methods for fundamental-frequency measurement. J. Acoust. Soc. Am. 1968, 43, 829–834.
- Ross, M.; Shaffer, H.; Cohen, A.; Freudberg, R.; Manley, H. Average magnitude difference function pitch extractor. IEEE Trans. Acoust. Speech Signal Process. 1974, 22, 353–362.
- Mauch, M.; Dixon, S. pYIN: A fundamental frequency estimator using probabilistic threshold distributions. In Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, 4–9 May 2014; pp. 659–663.
- de Cheveigné, A.; Kawahara, H. YIN, a fundamental frequency estimator for speech and music. J. Acoust. Soc. Am. 2002, 111, 1917–1930.
- LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324.
- Petropoulos, F.; Apiletti, D.; Assimakopoulos, V.; Babai, M.Z.; Barrow, D.K.; Taieb, S.B.; Bergmeir, C.; Bessa, R.J.; Bijak, J.; Boylan, J.E.; et al. Forecasting: Theory and practice. Int. J. Forecast. 2022, 38, 705–871.
- Kim, J.W.; Salamon, J.; Li, P.; Bello, J.P. CREPE: A Convolutional Representation for Pitch Estimation. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 161–165.
- Rumelhart, D.E.; Hinton, G.E.; Williams, R.J. Learning representations by back-propagating errors. Nature 1986, 323, 533–536.
- Park, S.; Jeong, Y.; Kim, M.S.; Kim, H.S. Linear Prediction-based Dereverberation with Very Deep Convolutional Neural Networks for Reverberant Speech Recognition. In Proceedings of the 2018 International Conference on Electronics, Information, and Communication (ICEIC), Honolulu, HI, USA, 24–27 January 2018; pp. 310–311.
- Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780.
- Cho, K.; Merriënboer, B.V.; Bahdanau, D.; Bengio, Y. On the properties of neural machine translation: Encoder-decoder approaches. In Proceedings of the SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, Doha, Qatar, 25 October 2014; pp. 103–111.
- Subramani, K.; Valin, J.M.; Buthe, J.; Smaragdis, P.; Goodwin, M. Noise-Robust DSP-Assisted Neural Pitch Estimation with Very Low Complexity. In Proceedings of the 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 1–5.
- Bahdanau, D.; Cho, K.; Bengio, Y. Neural Machine Translation by Jointly Learning to Align and Translate. In Proceedings of the International Conference on Learning Representations (ICLR 2015), San Diego, CA, USA, 7–9 May 2015; pp. 1–15.
- Michelsanti, D.; Tan, Z.H.; Zhang, S.X.; Xu, Y.; Yu, M.; Yu, D.; Jensen, J. An overview of deep-learning-based audio-visual speech enhancement and separation. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 1368–1396.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems—31st Conference on Neural Information Processing Systems (NIPS2017), Long Beach, CA, USA, 4–9 December 2017; Volume 30, pp. 1–11.
- Zhang, J.; Tang, J.; Dai, L. RNN-BLSTM Based Multi-Pitch Estimation. In Proceedings of the Interspeech 2016, San Francisco, CA, USA, 8–12 September 2016; pp. 1785–1789.
- Panayotov, V.; Chen, G.; Povey, D.; Khudanpur, S. Librispeech: An ASR corpus based on public domain audio books. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, QLD, Australia, 19–24 April 2015; pp. 5206–5210.
- Mozilla. Common Voice. Available online: https://commonvoice.mozilla.org/ (accessed on 9 May 2025).
- Kato, A.; Kinnunen, T. Waveform to Single Sinusoid Regression to Estimate the F0 Contour from Noisy Speech Using Recurrent Deep Neural Networks. In Proceedings of the Interspeech 2018, Hyderabad, India, 2–6 September 2018; pp. 327–331.
- Ikemiya, Y.; Yoshii, K.; Itoyama, K. Singing voice analysis and editing based on mutually dependent F0 estimation and source separation. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, QLD, Australia, 19–24 April 2015; pp. 574–578.
- Hou, Z.; Lei, T.; Hu, Q.; Cao, Z.; Lu, J. SNR-Progressive Model with Harmonic Compensation for Low-SNR Speech Enhancement. IEEE Signal Process. Lett. 2025, 32, 476–480.
- MathWorks. Sequence-to-One Regression Using Deep Learning. MathWorks Documentation. Available online: https://www.mathworks.com/help/deeplearning/ug/sequence-to-one-regression-using-deep-learning.html (accessed on 9 May 2025).
- Xu, S.; Shimodaira, H. Direct F0 estimation with neural-network-based regression. In Proceedings of the Interspeech 2019, Graz, Austria, 15–19 September 2019; pp. 1995–1999.
- Boashash, B. Time-Frequency Signal Analysis and Processing—A Comprehensive Reference, 2nd ed.; Academic Press: Oxford, UK, 2016.
- Liu, Y.; Wu, P.; Black, A.W.; Anumanchipalli, G.K. A Fast and Accurate Pitch Estimation Algorithm Based on the Pseudo Wigner-Ville Distribution. In Proceedings of the 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5.
- Wang, X.; Takaki, S.; Yamagishi, J. An RNN-based quantized F0 model with multi-tier feedback links for text-to-speech synthesis. In Proceedings of the Interspeech 2017, Stockholm, Sweden, 20–24 August 2017; pp. 1059–1063.
- Cunningham, P.; Cord, M.; Delany, S.J. Supervised learning. In Machine Learning Techniques for Multimedia: Case Studies on Organization and Retrieval; Springer: Berlin/Heidelberg, Germany, 2008; pp. 21–49.
- Jadoul, Y.; Thompson, B.; de Boer, B. Introducing Parselmouth: A Python interface to Praat. J. Phon. 2018, 71, 1–15.
- Boersma, P.; Weenink, D. Praat: Doing Phonetics by Computer [Computer Program]. 2021. Available online: http://www.praat.org/ (accessed on 9 May 2025).
- Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016; p. 800.

| Hyperparameter | Value |
|---|---|
| Optimiser | Adam |
| Initial learning rate | 0.01 |
| Learning rate schedule | Constant |
| Number of epochs | 100 |
| Batch size | 128 frames |
| Loss function | Cross-entropy (continuous F0 weighted from class probabilities) |
| Gradient clipping | Threshold = 1 (MATLAB default) |

| Method | LibriSpeech RMSE | LibriSpeech FLOPs | Common Voice RMSE | Common Voice FLOPs |
|---|---|---|---|---|
| Waveform-to-sinusoid regression | 0.0025 | 116,160 | 0.0028 | 116,800 |
| Sequence-to-one regression | 0.0023 | 80,800 | 0.0026 | 81,200 |
| Direct F0 estimation with neural regression | 0.0027 | 80,800 | 0.0029 | 81,000 |
| PWVD w/LSTM | 0.0026 | 32,800 | 0.0029 | 33,000 |
| Quantised F0 estimation w/MTF | 0.0029 | 29,280 | 0.0032 | 29,400 |
| RNN with attention mechanism | 0.0031 | 20,400 | 0.0035 | 20,600 |
| CREPE | 0.0026 | 131,300 | 0.0027 | 133,800 |
| pYIN | 0.0036 | — | 0.0038 | — |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Jandera, A.; Muzelak, M.; Skovranek, T. RNN-Based F0 Estimation Method with Attention Mechanism. Information 2025, 16, 1089. https://doi.org/10.3390/info16121089

