RNN-Based F0 Estimation Method with Attention Mechanism

Jandera, Ales; Muzelak, Martin; Skovranek, Tomas

doi:10.3390/info16121089


by Ales Jandera, Martin Muzelak and Tomas Skovranek *

Faculty of BERG, Technical University of Kosice, Nemcovej 3, 04200 Kosice, Slovakia

* Author to whom correspondence should be addressed.

Information 2025, 16(12), 1089; https://doi.org/10.3390/info16121089

Submission received: 12 November 2025 / Revised: 28 November 2025 / Accepted: 5 December 2025 / Published: 7 December 2025

(This article belongs to the Special Issue Signal Processing and Machine Learning, 2nd Edition)


Abstract

Fundamental frequency estimation, also known as F0 estimation, is a crucial task in speech processing and analysis, with significant applications in areas such as speech recognition, speaker identification, and emotion detection. Traditional algorithms, while effective, often encounter challenges in real-time environments due to computational limitations. Recent advances in deep learning, especially in the use of recurrent neural networks (RNNs), have opened new opportunities for enhancing F0 estimation accuracy and efficiency. This paper introduces a novel RNN-based F0 estimation method with an attention mechanism and evaluates its performance against selected state-of-the-art F0 estimation approaches, including standard baseline methods, as well as neural-network-based regression and classification models. By integrating attention mechanisms, the model eliminates the necessity for post-processing steps and enables a more efficient seq2scal estimation process. While the self-attention mechanism used in Transformers captures all pairwise temporal dependencies at a quadratic computational cost, the proposed method’s implementation of the attention mechanism enables it to selectively focus on the most relevant acoustic cues for F0 prediction, enhancing robustness without increasing the model’s complexity. Experimental results using the LibriSpeech and Common Voice datasets demonstrate superior computational efficiency of the proposed method compared to current state-of-the-art RNN-based seq2seq models, while maintaining comparable estimation accuracy. Furthermore, the proposed “RNN-based F0 estimation method with an attention mechanism” achieves the lowest computational complexity among all compared models, while maintaining high accuracy, making it suitable for low-latency, resource-limited deployments and competitive even with standard baseline methods, such as pYIN or CREPE. 
Finally, the performance of the developed RNN-based F0 estimation method with attention mechanism in terms of RMSE and FLOPs demonstrates the potential of attention mechanisms and sequence modelling in achieving high accuracy alongside lightweight F0 estimation suitable for modern speech processing applications, which aligns with the growing trend towards deploying intelligent systems on resource-constrained devices.

Keywords:

attention mechanism; F0 estimation; fundamental frequency; pitch-lag; recurrent neural network; speech processing


1. Introduction

Pitch-lag estimation, also known as the fundamental frequency estimation or simply F0 estimation (as used throughout this paper), is a crucial task in speech processing, with applications in areas including speech recognition [1], speaker identification [2], and emotion detection [3]. Traditional approaches often depend on methods such as autocorrelation [1], harmonic product spectrum [4], and dynamic programming algorithms like the Viterbi algorithm [5]. For example, the probabilistic YIN—pYIN [6] is one of the most well-recognised advanced digital signal processing (DSP) algorithms for F0 estimation, representing an improvement over the original YIN algorithm [7]. Operating in the time domain, it calculates a cumulative mean normalised difference function to detect periodicity. Importantly, pYIN introduces probabilistic thresholds and employs a Viterbi algorithm for post-processing to produce a temporally smoothed F0 contour and voicing decision. While effective, the traditional DSP methods can be computationally expensive, particularly in real-time applications.
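The time-domain periodicity measure mentioned above can be illustrated with a minimal sketch of the YIN core: the difference function, its cumulative mean normalisation, and an absolute threshold. This is not the pYIN implementation used in the experiments; pYIN replaces the single threshold with a distribution of thresholds and adds Viterbi smoothing. The function names, the fixed threshold of 0.1, and the 50–500 Hz search range are assumptions made for illustration.

```python
import math

def cmndf(frame, max_lag):
    """Cumulative mean normalised difference function d'(tau) of YIN."""
    n = len(frame) - max_lag  # samples compared at every lag
    # Difference function d(tau) = sum_j (x[j] - x[j + tau])^2
    d = [sum((frame[j] - frame[j + tau]) ** 2 for j in range(n))
         for tau in range(max_lag + 1)]
    out = [1.0]  # d'(0) is defined as 1
    running = 0.0
    for tau in range(1, max_lag + 1):
        running += d[tau]
        out.append(d[tau] * tau / running if running > 0 else 1.0)
    return out

def estimate_f0(frame, sample_rate, f_min=50.0, f_max=500.0, threshold=0.1):
    """Return an F0 estimate in Hz from the first dip of d' below threshold."""
    lag_min = int(sample_rate / f_max)
    lag_max = int(sample_rate / f_min)
    dprime = cmndf(frame, lag_max)
    for tau in range(lag_min, lag_max + 1):
        if dprime[tau] < threshold:
            # Walk down to the bottom of this dip before converting to Hz.
            while tau + 1 <= lag_max and dprime[tau + 1] < dprime[tau]:
                tau += 1
            return sample_rate / tau
    # Fallback (assumption): use the global minimum if no dip is deep enough.
    best = min(range(lag_min, lag_max + 1), key=lambda t: dprime[t])
    return sample_rate / best
```

The frame must be longer than the largest lag searched, so a 50 Hz lower bound at an 8 kHz sampling rate needs more than 160 samples per frame.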

Recent advances in deep learning have led to significant improvements in F0 estimation, with neural networks (NN) demonstrating strong performance in this task. Convolutional Neural Networks (CNNs) have been successfully applied to F0 estimation [8,9], as seen in methods like Convolutional Representation for Pitch Estimation (CREPE) [10], which extracts features directly from raw waveforms without the need for handcrafted feature engineering. However, as a drawback, CNN-based models often require large datasets and substantial computational resources for training and inference. Recurrent Neural Networks (RNNs) offer an alternative training approach [11], particularly for tasks involving sequential data processing [12]. Compared to CNNs, RNN-based methods such as Long Short-Term Memory (LSTM) [13] and Gated Recurrent Unit (GRU) [14] architectures have demonstrated success in modelling temporal dependencies in speech signals. RNNs are especially effective for F0 estimation in speech because they maintain an internal memory through a hidden state shared across time steps. This allows them to utilise information from previous frames to inform predictions for the current frame, thus capturing the continuous and dynamic nature of the F0 contour. Specifically for F0, RNNs can learn how pitch varies over time, which helps produce smoother and more realistic F0 trajectories, as well as enhances robustness against noise and missing F0 measurements by incorporating context from surrounding speech segments [15].

Attention mechanisms [16] were proposed, in part, to address limitations of RNNs, such as the difficulty of capturing long-range dependencies and the lack of parallelisation. They enhance NN models by dynamically weighting input features, allowing the model to “concentrate” on the most relevant information. They have been widely adopted in natural language processing and are increasingly used for speech-related tasks, enhancing the interpretability and performance of NN models [17]. Purely attention-based models, like Transformers [18], have achieved state-of-the-art performance in many sequence-to-sequence (seq2seq) tasks, but their self-attention layers have quadratic complexity with respect to sequence length, resulting in significant memory and computational demands. RNNs have limitations of their own, but they, particularly simpler variants like GRUs or LSTMs, offer linear complexity with respect to sequence length. Therefore, they are generally more memory-efficient and have significantly fewer parameters than attention-only counterparts that achieve similar performance on moderately long sequences. To enhance RNN models, Multi-Pitch estimation [19], for example, combines RNNs with Viterbi decoding to track F0 contours across time. However, reliance on Viterbi decoding raises computational complexity, which limits the efficiency of these models in real-time scenarios.

This paper introduces a novel sequence-to-scalar (seq2scal) method for F0 estimation using an attention-enhanced RNN, employing an attention mechanism to directly predict F0 frequency values from speech signals, eliminating the need for post-processing steps and reducing computational demands while maintaining high accuracy. Unlike Transformer-style self-attention, which captures all pairwise temporal dependencies at the cost of quadratic complexity, making it less suitable for low-latency applications, the attention mechanism in the proposed method selectively emphasises only the most relevant acoustic cues for F0 prediction, enhancing robustness without increasing model complexity. Experimental results on the LibriSpeech [20] and Common Voice [21] datasets show that the proposed method delivers strong performance and is considerably less demanding on computational resources compared to state-of-the-art deep learning approaches. This aligns with the growing trend towards deploying intelligent systems on resource-constrained devices (e.g., edge devices, embedded systems, mobile platforms), where computational power, memory, and energy consumption are critical limiting factors.

2. Preliminaries: State-of-the-Art F0 Estimation Methods

To provide a comprehensive comparison of the results, this section briefly describes various state-of-the-art F0 estimation techniques that serve as benchmarks for the proposed classification model. The first three models employ a regression approach, the next two are based on a classification approach, and the final two are well-established baseline techniques.

2.1. Regression Models

Waveform-to-sinusoid regression

Inspired by techniques for extracting F0 contours from noisy signals, this approach trains an RNN to regress sinusoidal representations of F0 frequency from waveform segments [22]. The aim is to approximate the smooth, underlying sinusoid that represents perceived F0, even when signal degradation is present. This method is well-suited for F0 tracking in real-world conditions where robustness to noise and temporal continuity are crucial, such as in singing voice analysis [23] or speech enhancement [24].

Sequence-to-one regression

This method employs a sequence-to-one regression architecture [25], where a single-valued target variable is produced using multiple time-series features. It uses LSTM layers with “last” output mode to capture temporal dependencies across the input, followed by a dense layer to produce a continuous scalar output. As a common baseline for sequence regression tasks, it provides a clear benchmark for evaluating more advanced models.

Direct F0 estimation with neural regression

To estimate the F0 frequency, this regression-based approach [26] utilises short speech segments and a deep NN, where each segment of the signal is processed as a separate input to an LSTM network, which produces a single continuous F0 value. Unlike classification methods, this approach avoids quantisation errors and offers smooth F0 trajectories, making it suitable for applications that require high-precision F0 tracking, such as expressive speech analysis or music transcription.

2.2. Classification Models

Pseudo Wigner-Ville distribution in combination with LSTM—PWVD

Building on traditional signal processing techniques, this method first computes the Pseudo Wigner-Ville Distribution (PWVD) [27] of the input speech signal to generate a high-resolution time-frequency representation. The resulting spectro-temporal features are then fed into an LSTM-based classification network to estimate the F0 class. This hybrid model combines the interpretability and precision of PWVD in the time–frequency domain with the temporal modelling capabilities of LSTMs, making it resilient in noisy or complex acoustic environments [28].

Quantized F0 estimation with multi-tier feedback

Inspired by the work of Wang et al. [29], this method introduces a multi-tier RNN framework where each tier predicts F0 frequency at different temporal resolution. Feedback connections between tiers improve the consistency of F0 contours across various time scales. The model outputs quantised F0 classes, making it especially suitable for text-to-speech synthesis, where smooth, quantised F0 transitions are essential for natural prosody.

2.3. Baseline Models

The probabilistic YIN—pYIN

The pYIN method [6] represents a mature, high-performance algorithm rooted in classical DSP, providing a crucial baseline for F0 estimation. The method comprises two steps: first, it generates multiple pitch candidates with associated probabilities based on a probabilistic distribution of the YIN threshold, followed by a hidden Markov model with Viterbi decoding to process these probabilistic candidates, resulting in a more robust and accurate final pitch track.

Convolutional Representation for Pitch Estimation—CREPE

The Convolutional Representation for Pitch Estimation [10] is a leading state-of-the-art deep learning model that treats F0 estimation as a classification task over 360 log-spaced pitch bins, operating directly on the raw audio waveform. The approach utilises a deep, six-layer CNN architecture to autonomously learn hierarchical spectral features, eliminating the need for traditional handcrafted feature extraction. Due to its architecture, CREPE achieves high accuracy in pitch detection and demonstrates robustness across diverse acoustic conditions and signals, including speech and music. Its superior performance and end-to-end design make it an essential state-of-the-art baseline model in the field, particularly for illustrating the trade-off between efficiency and accuracy.

3. RNN-Based F0 Estimation Method with Attention Mechanism

Unlike traditional RNNs, which typically use the seq2seq architecture and rely on additional algorithms (e.g., Viterbi) to obtain a final result in the form of a single scalar value, the proposed “RNN-based F0 estimation method with attention mechanism” employs an RNN architecture for the direct estimation of the F0 from sequential data (seq2scal). The input can be in the form of a raw speech signal or features, such as spectrogram representations or mel-frequency cepstral coefficients. To achieve this, an LSTM network is used, incorporating an attention mechanism instead of decoders. Therefore, the model outputs a single F0 value as a softmax combination of the weighted sum of the internal states and the weights provided by the attention mechanism.

3.1. The Architecture

The proposed method takes advantage of both RNNs and attention mechanisms. RNNs excel at F0 estimation by utilising their internal memory to process sequential speech data and capture F0 variations over time. When combined with attention mechanisms, RNNs can dynamically focus on the most relevant parts of the input sequence, enhancing accuracy and robustness by selectively emphasising important acoustic cues. This results in smoother and more precise F0 contour predictions, even across longer speech segments.

In the Transformer, the self-attention mechanism captures all pairwise temporal dependencies across the entire input sequence by computing attention weights between every token (time step) and every other token in the sequence (via the query and key matrices), thereby establishing a global receptive field. This comprehensive dependency modelling enables the network to gather information from arbitrarily distant time steps, a highly valued capability in tasks requiring extensive context understanding, such as machine translation, but it comes at a significant cost. The calculation of these relationships results in a quadratic complexity of $O(T^{2})$ with respect to the sequence length $T$, which, while acceptable for achieving maximum accuracy in offline tasks, becomes prohibitive for applications requiring low-latency processing or those dealing with very long acoustic sequences.

The core function of the attention mechanism in the proposed method is to generate an attention weight distribution that selectively emphasises the most pertinent acoustic cues within the input feature, using additive attention restricted to a small pre-defined temporal window (frame). By focusing on the cues with high predictive power, in the form of distinct spectral envelopes, the model gains considerable robustness against noisy or irrelevant features, including unvoiced segments or background noise. This selective approach keeps the computational complexity at $O(T)$, i.e., linear in the input sequence length $T$, ensuring low latency and making the mechanism well suited to real-time applications where model size and throughput are limited.
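To make the linear cost concrete, the following sketch shows a generic single-head additive attention pooling step that collapses T recurrent hidden states into one context vector in a single pass over the sequence. It is an illustration only, not the authors' implementation: the projection matrix `W`, the scoring vector `v`, the attention size `A`, and the random inputs are all hypothetical, and the multi-head structure of the proposed model is omitted.

```python
import math
import random

def additive_attention_pool(hidden_states, W, v):
    """Collapse T hidden states (each of size U) into one context vector.

    score_t = v . tanh(W h_t), alpha = softmax(score),
    context = sum_t alpha_t * h_t
    """
    scores = []
    for h in hidden_states:  # one pass over the T time steps -> O(T)
        proj = [math.tanh(sum(w * x for w, x in zip(row, h))) for row in W]
        scores.append(sum(vi * pi for vi, pi in zip(v, proj)))
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    size = len(hidden_states[0])
    context = [sum(a * h[k] for a, h in zip(alphas, hidden_states))
               for k in range(size)]
    return context, alphas

# Hypothetical sizes: T time steps, U hidden units, A attention units.
random.seed(0)
T, U, A = 20, 50, 8
states = [[random.uniform(-1, 1) for _ in range(U)] for _ in range(T)]
W = [[random.gauss(0, 0.1) for _ in range(U)] for _ in range(A)]
v = [random.gauss(0, 0.1) for _ in range(A)]
context, alphas = additive_attention_pool(states, W, v)
```

Each time step is scored independently of the others, so the work grows in proportion to T, in contrast to the T × T score matrix required by self-attention.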

This method results in approximately a 10% increase in the total number of trainable parameters compared to a standalone RNN. The proposed mechanism can be understood as a cumulative memory, and this type of architecture can be considered a seq2scal kind of RNN.

The architecture scheme of the proposed model, as presented in Figure 1, consists of the following:

Preprocessed signal in the form of a 20 ms frame serves as the input into the algorithm.
Sequence input layer containing one layer that acts as the entry point for the signal is responsible for feeding the sequential data into the rest of the network, typically by mapping the input to a vector representation.
Recurrent (LSTM) layer with 50 hidden units serves to capture temporal dependencies by maintaining a “memory” of past information in the sequence.
Attention mechanism with 5 heads introduces 1300 parameters and helps the network to process long input sequences more efficiently by reducing the fixed-length summarisation burden from the LSTM alone. Moreover, for better performance, the signal frames are processed in bulk.
Fully connected layers composed of two layers, each of which has two neurons, where every neuron from the previous layer is connected to every neuron in the current layer.
Output layer consists of one layer with one neuron, a softmax activation that produces class probabilities used for the F0 estimation, which represents the model’s output.
F0 (fundamental frequency), which is the lowest frequency of a periodic waveform, is the final output of the network.

3.2. Models’ Settings

The RNN model is trained using a supervised learning approach [30], providing the network with pairs of input speech signals and their corresponding ground-truth F0 frequencies. During the training process, the RNN refines its parameters (weights and biases) utilising the Adam optimiser, a stochastic gradient-based technique that modifies the learning rate for each parameter to achieve robust convergence. Training was carried out for up to 100 epochs, providing the NN sufficient iterations to minimise the loss function while avoiding excessive overfitting. To control gradient instability, a gradient threshold of 1 was set, thus preventing exploding gradients during back-propagation. The initial learning rate was set to 0.01, and a constant-learning-rate-schedule was employed, whereby the learning rate was reduced by a factor of 0.2 after 50 epochs to allow finer convergence as training advanced. The training process monitored a custom metric to evaluate performance, and verbose output was disabled to focus on essential results. The loss function involved the discrepancy between predicted and actual F0 frequencies, based on cross-entropy. The output represents the predicted F0 frequency for each time frame of the input speech signal.

4. Experiment

Several advanced NN techniques for F0 estimation, encompassing both regression and classification methodologies as detailed in Section 2, were implemented, and their performance was assessed based on Root Mean Square Error and Floating-Point Operations. The regression-based models include the “Direct F0 estimation method”, “Sequence-to-one regression”, and the “Waveform-to-sinusoid regression model”, while the classification-based models are represented by the “Pseudo Wigner-Ville distribution in combination with LSTM”, “Quantised F0 estimation with multi-tier feedback”, and the proposed “RNN-based F0 estimation method with attention mechanism”, thus encompassing three methods from each approach, regression and classification.

4.1. Training Dataset and Settings

Two datasets were used for the experiments; the first, the LibriSpeech Automatic Speech Recognition Corpus [20], contains approximately 1000 h of read English speech data, comprising recordings from various speakers with different accents, ages, and genders. The data is derived from read audiobooks from the LibriVox project, and has been carefully segmented and aligned. The second dataset, Common Voice [21], is an open-source, multi-language collection of voices contributed by volunteers, representing an ongoing project by Mozilla that contains thousands of hours of validated speech in over a hundred languages.

The LibriSpeech served as a source of a prefiltered (“clean”) signal, while the Common Voice dataset, by its nature, represented a signal that contains noise. From both datasets, LibriSpeech and Common Voice, the English portions of voice signals were preprocessed using careful segmentation into two short, manageable audio clips of 11 s each (composed of 5.5-s-long audio clips from each dataset), representing a range of ages and genders. These two clips were processed using a standard framing method, with a frame length of 20 ms and a 10 ms hop size (50% overlap), followed by labelling each frame with its ground-truth F0 value, estimated using the Parselmouth library [31,32].
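The framing step described above can be sketched as follows. The function name is hypothetical, the 16 kHz sampling rate in the example is an assumption, and trailing samples that do not fill a complete frame are simply dropped, which may differ from the preprocessing actually used.

```python
def frame_signal(samples, sample_rate, frame_ms=20.0, hop_ms=10.0):
    """Split a signal into overlapping frames (20 ms frames, 10 ms hop)."""
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop_len = int(round(sample_rate * hop_ms / 1000.0))
    frames = []
    start = 0
    # Assumption: an incomplete final frame is discarded, not padded.
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop_len
    return frames

# An 11 s clip at an assumed 16 kHz sampling rate (silence as a placeholder).
clip = [0.0] * (11 * 16000)
frames = frame_signal(clip, 16000)
```

Under these assumptions each frame holds 320 samples and an 11 s clip yields 1099 frames.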

Each audio clip was divided into three subsets: an 80% training set, a 10% development (validation) set to tune hyperparameters and prevent overfitting, and a 10% holdout test set to provide an unbiased evaluation of the models’ generalisation capabilities. These subsets were used for training all the compared models.
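A minimal sketch of such an 80/10/10 partition is given below. Whether the frames were shuffled before splitting is not stated, so the contiguous split and the function name are assumptions.

```python
def split_frames(frames, train_frac=0.8, dev_frac=0.1):
    """Split frames into contiguous training, development and test subsets."""
    n = len(frames)
    n_train = int(n * train_frac)
    n_dev = int(n * dev_frac)
    train = frames[:n_train]
    dev = frames[n_train:n_train + n_dev]
    test = frames[n_train + n_dev:]  # remainder forms the holdout test set
    return train, dev, test
```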

All the models used in the experiments employed a common hyperparameter configuration (see Table 1). They were trained for up to 100 epochs using the Adam optimiser with an initial learning rate of 0.01 and a mini-batch size of 128 frames, to balance convergence speed and computational efficiency. A gradient threshold of 1, managed internally by MATLAB (https://www.mathworks.com/products/matlab.html, accessed on 9 May 2025), was applied to all models to stabilise training and prevent exploding gradients in recurrent layers.

A constant-learning-rate scheduling strategy was employed across all neural network architectures evaluated in this study, consistently using the default MATLAB Deep Learning Toolbox solver settings. This ensured a standardised training configuration and prevented external learning rate adjustments from affecting the performance comparison. By maintaining a stable learning rate throughout training, the models were able to converge reliably without introducing additional scheduling-related variance across experiments.

Additionally, gradient clipping was implicitly managed during back-propagation by MATLAB’s dlnetwork framework, preventing exploding gradients, which often occur in recurrent architectures due to long-range temporal dependencies in speech sequences. This built-in stability control contributed to smoother convergence behaviour and helped maintain numerical stability across all tested models without manual threshold tuning.

4.2. Evaluation Metrics

The experiment utilises two performance criteria, Root Mean Squared Error for accuracy assessment and Floating-Point Operations to evaluate the complexity of the models.

Root Mean Squared Error (RMSE)

RMSE measures the average size of the error between predicted and actual values; thus, it provides an absolute measure of fit and is especially useful when significant errors are undesirable, as it penalises them more heavily due to squaring. Since RMSE shares the same unit as the output variable, it is easy to interpret and useful for evaluating continuous F0 estimation methods. While in the case of regression models, the use of RMSE is straightforward, assessing the performance of classification methods using the RMSE criterion involves using the NN output probabilities for F0 bins to compute a continuous F0 value. This is done by taking the weighted average of the bin centre frequencies, which is then compared to the ground-truth continuous F0. This experimental setup allows for a direct comparison of accuracy and efficiency across the tested methods, including those based on regression and those based on classification techniques, highlighting the strengths and limitations of each approach in the context of F0 tracking.
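The conversion of class probabilities to a continuous F0 value and the subsequent RMSE computation can be sketched as follows. The function names and the numbers in the example are illustrative assumptions, not values from the experiments.

```python
import math

def expected_f0(probabilities, bin_centres_hz):
    """Probability-weighted average of the bin centre frequencies."""
    total = sum(probabilities)
    weighted = sum(p * c for p, c in zip(probabilities, bin_centres_hz))
    return weighted / total

def rmse(predicted, reference):
    """Root mean squared error between two equally long sequences."""
    squared = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared) / len(squared))

# Illustrative values: three neighbouring 5 Hz bins and two frames.
f0_frame = expected_f0([0.2, 0.6, 0.2], [100.0, 105.0, 110.0])
error = rmse([f0_frame, 200.0], [100.0, 204.0])
```

Because the weighted average can fall between bin centres, a classification model evaluated this way is not limited to the resolution of its quantisation grid.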

Floating-Point Operations (FLOPs)

FLOPs represent a basic measure of a machine learning model’s computational complexity, indicating the total number of floating-point calculations, such as additions, subtractions, multiplications, and divisions, needed for a single pass of the model, where fewer FLOPs indicate a more efficient approach. The total number of FLOPs in an RNN and CNN is calculated as the sum of FLOPs for both the forward and backward passes.

The number of FLOPs in an RNN for each pass (forward or backward) through the network can be calculated as [33]:

$$\mathrm{FLOPs}_{\mathrm{RNN}} = T \cdot B \cdot (2 \cdot U^{2} + U \cdot I + 2 \cdot U) + T \cdot B \cdot (O \cdot U + O), \tag{1}$$

where T is the length of the input sequence, B the number of sequences in a batch, and U the number of hidden units in the RNN. The parameters I and O refer to the sizes of the input and output vectors, respectively.
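Equation (1) can be evaluated directly, as in the sketch below. Only U = 50 is taken from Section 3.1; the remaining argument values in the example are hypothetical and do not reproduce the figures reported in Table 2.

```python
def flops_rnn(T, B, U, I, O):
    """FLOPs of one pass through an RNN according to Equation (1)."""
    recurrent = T * B * (2 * U ** 2 + U * I + 2 * U)  # recurrent layer term
    output = T * B * (O * U + O)                      # output layer term
    return recurrent + output

# Hypothetical setting: one time step, one sequence, scalar input and output.
single_step = flops_rnn(T=1, B=1, U=50, I=1, O=1)
```

With these values the recurrent term contributes 5150 operations and the output term 51, giving 5201 in total; the count scales linearly with both T and B.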

In the case of a CNN, the number of FLOPs for a single pass is dominated by the convolution operation:

$$\mathrm{FLOPs}_{\mathrm{CNN}} = K_{h} \cdot K_{w} \cdot C_{in} \cdot O_{h} \cdot O_{w} \cdot C_{out}, \tag{2}$$

where $K_{h}$ and $K_{w}$ are the height and width of the kernel (filter size), $C_{in}$ is the number of input channels (the depth of the input feature map), $C_{out}$ is the number of output channels (number of filters/kernels), and $O_{h}$ and $O_{w}$ are the output feature map height and width, respectively.
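Equation (2) applies to a single convolutional layer and can be evaluated as in the following sketch; summing over layers is an assumption about how the per-layer counts are aggregated. The layer dimensions in the example are hypothetical and do not correspond to the CREPE architecture.

```python
def flops_conv_layer(k_h, k_w, c_in, o_h, o_w, c_out):
    """FLOPs of one convolutional layer according to Equation (2)."""
    return k_h * k_w * c_in * o_h * o_w * c_out

def flops_cnn(layers):
    """Sum Equation (2) over a list of (k_h, k_w, c_in, o_h, o_w, c_out)."""
    return sum(flops_conv_layer(*layer) for layer in layers)

# Hypothetical layer: 3x3 kernel, 1 input channel, 10x10 output, 16 filters.
example = flops_conv_layer(3, 3, 1, 10, 10, 16)
```

For this hypothetical layer the count is 14,400 operations.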

4.3. Performance Results and Discussion

To validate the generalisation of the NN-based models, two additional 11-s-long audio clips, one from each dataset (LibriSpeech and Common Voice), were preprocessed using only data not employed in training. Frames with a length of 20 ms and a hop size of 10 ms (50% overlap) were generated, with each frame labelled with its ground-truth F0 value, estimated using the Parselmouth library [31].

In the case of classification methods, to effectively analyse and process the F0 frequency of the examined voice signal, converting the F0 value into discrete classes was required, a process known as quantisation. The recommended frequency range of 50–500 Hz, with a 5 Hz step, resulted in 91 classes. During training, continuous F0 values were mapped to their nearest class, while during inference, the model output a class ID, which was then dequantised back into a specific frequency. This systematic approach ensured robust and manageable F0 analysis.
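The quantisation and dequantisation steps can be sketched as follows. The constant and function names are hypothetical, and clipping out-of-range values to the nearest boundary class is an assumption; the handling of unvoiced frames is not covered.

```python
F_MIN_HZ = 50.0
F_MAX_HZ = 500.0
STEP_HZ = 5.0
N_CLASSES = int((F_MAX_HZ - F_MIN_HZ) / STEP_HZ) + 1  # 91 classes

def quantise(f0_hz):
    """Map a continuous F0 value to the ID of its nearest class."""
    # Assumption: values outside 50-500 Hz are clipped to the boundary class.
    clipped = min(max(f0_hz, F_MIN_HZ), F_MAX_HZ)
    return int(round((clipped - F_MIN_HZ) / STEP_HZ))

def dequantise(class_id):
    """Map a class ID back to its centre frequency in Hz."""
    return F_MIN_HZ + class_id * STEP_HZ
```

For example, an F0 of 123 Hz maps to class 15, which dequantises to 125 Hz; within the covered range, the round trip introduces an error of at most half a step, i.e., 2.5 Hz.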

The performance of the F0 estimation methods used in the experiment, in terms of accuracy (RMSE) and computational complexity (FLOPs), is shown in Table 2, which lists the total number of FLOPs for the entire audio clips and the average RMSE values for all 20 ms frames within the 11-s-long audio clips. The first three methods in Table 2, “Waveform-to-sinusoid regression”, “Sequence-to-one regression”, and “Direct F0 estimation with neural regression”, represent regression-based NN-approaches, which, as expected, offer high prediction accuracy at the expense of increased computational effort. Among these three methods, as well as across all compared approaches, the “Sequence-to-one regression” model achieved the lowest RMSE with both datasets (LibriSpeech and Common Voice), requiring a number of FLOPs comparable to that of the “Direct F0 estimation with neural regression” model.

Unlike the regression-based approaches, the latter three models, which are based on the classification approach, the “Pseudo Wigner-Ville distribution in combination with LSTM”, the “Quantised F0 estimation with multi-tier feedback (MTF)”, and the proposed “RNN-based F0 estimation method with attention mechanism”, require about four times less computational effort (in terms of FLOPs), while the RMSE remains comparable to the regression-based models. The proposed “RNN-based F0 estimation method with attention mechanism” achieved the best results in terms of FLOPs, with competitive RMSE performance compared to other classification-based methods, and even when compared to the regression-based techniques used in the experiment, making it the method with the lowest computational complexity among all evaluated approaches.

To provide a comprehensive comparison, the performance of the selected state-of-the-art deep learning models, including the one proposed in this paper, was benchmarked against two essential baselines: pYIN, which represents the classical digital signal processing method, and CREPE, a robust CNN benchmark. The results, as summarised in Table 2, reveal critical trade-offs between accuracy (RMSE) and computational efficiency (FLOPs). CREPE achieved an RMSE of 0.0026 and 0.0027 on the LibriSpeech and Common Voice datasets, respectively, improving accuracy by approximately 17–23% compared to the proposed “RNN-based F0 estimation method with attention mechanism”; however, this enhancement comes at a significant computational cost in the form of more than six times more FLOPs. The computational demand of pYIN, although not measured in FLOPs due to its non-neural-network architecture, is by design approximately ten times lower than that of the proposed method; however, this comes at the expense of the lowest precision among all compared models.

5. Conclusions

RNNs offer a solid foundation for modelling temporal dynamics in sequential data such as speech signals, making them well-suited for F0 estimation tasks. This paper introduces a novel approach, “RNN-based F0 estimation with an attention mechanism”, which directly maps input sequences to scalar F0 values (seq2scal). To evaluate its performance, several state-of-the-art deep learning approaches were implemented and utilised for F0 estimation, including methods based on regression and classification architectures.

Based on experimental results using both the LibriSpeech and Common Voice datasets, the model demonstrated superior computational efficiency compared to current state-of-the-art RNN-based seq2seq models, while maintaining comparable estimation accuracy. By integrating attention mechanisms, the model was able to focus on the most relevant parts of the input signal, eliminating the necessity for post-processing steps and enabling a more efficient seq2scal estimation process. Furthermore, the proposed “RNN-based F0 estimation method with an attention mechanism” achieved the lowest computational complexity among all compared models while maintaining high accuracy, making it suitable for low-latency, resource-constrained deployments and competitive even with standard baseline methods such as pYIN or CREPE. This makes it not only highly accurate but also the most efficient model tested, confirming that attention-based classification methods can provide robust F0 estimation performance with minimal computational overhead, making them suitable for real-time or resource-limited speech processing applications.

Author Contributions

Conceptualization, A.J. and T.S.; methodology, A.J. and T.S.; software, A.J.; validation, A.J., M.M. and T.S.; formal analysis, A.J. and T.S.; investigation, A.J.; resources, T.S.; data curation, A.J. and M.M.; writing—original draft, A.J. and T.S.; writing—review and editing, A.J., M.M. and T.S.; visualization, A.J. and M.M.; supervision, T.S.; project administration, T.S.; funding acquisition, T.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Army Research Office under grant no. W911NF-22-1-0264, by the Slovak Research and Development Agency under the contracts no. APVV-22-0508 and no. APVV-18-0526, by the Slovak Grant Agency for Science under grant no. VEGA 1/0674/23, and by the Cultural and Educational Grant Agency of the Ministry of Education, Research, Development and Youth of the Slovak Republic under grant no. KEGA 006TUKE-4/2024.

Data Availability Statement

The data used in the frame of this work, including the LibriSpeech Automatic Speech Recognition Corpus [20] and the Common Voice dataset [21], are publicly available at https://www.openslr.org/12 (accessed on 9 May 2025) and https://commonvoice.mozilla.org/ (accessed on 9 May 2025), respectively.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Rabiner, L.R. On the use of autocorrelation analysis for pitch detection. IEEE Trans. Acoust. Speech Signal Process. 1977, 25, 24–33.
2. Hansen, J.H.L.; Hasan, T. Speaker recognition from speech: A review of the past and present. IEEE Signal Process. Mag. 2015, 32, 74–99.
3. Drugman, T.; Kane, J.; Raitio, T.; Gobl, C. Prediction of creaky voice from contextual factors. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 26–31 May 2013; pp. 7967–7971.
4. Schroeder, M.R. Period histogram and product spectrum: New methods for fundamental-frequency measurement. J. Acoust. Soc. Am. 1968, 43, 829–834.
5. Ross, M.; Shaffer, H.; Cohen, A.; Freudberg, R.; Manley, H. Average magnitude difference function pitch extractor. IEEE Trans. Acoust. Speech Signal Process. 1974, 22, 353–362.
6. Mauch, M.; Dixon, S. pYIN: A fundamental frequency estimator using probabilistic threshold distributions. In Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, 4–9 May 2014; pp. 659–663.
7. de Cheveigné, A.; Kawahara, H. YIN, a fundamental frequency estimator for speech and music. J. Acoust. Soc. Am. 2002, 111, 1917–1930.
8. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324.
9. Petropoulos, F.; Apiletti, D.; Assimakopoulos, V.; Babai, M.Z.; Barrow, D.K.; Taieb, S.B.; Bergmeir, C.; Bessa, R.J.; Bijak, J.; Boylan, J.E.; et al. Forecasting: Theory and practice. Int. J. Forecast. 2022, 38, 705–871.
10. Kim, J.W.; Salamon, J.; Li, P.; Bello, J.P. CREPE: A Convolutional Representation for Pitch Estimation. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 161–165.
11. Rumelhart, D.E.; Hinton, G.E.; Williams, R.J. Learning representations by back-propagating errors. Nature 1986, 323, 533–536.
12. Park, S.; Jeong, Y.; Kim, M.S.; Kim, H.S. Linear Prediction-based Dereverberation with Very Deep Convolutional Neural Networks for Reverberant Speech Recognition. In Proceedings of the 2018 International Conference on Electronics, Information, and Communication (ICEIC), Honolulu, HI, USA, 24–27 January 2018; pp. 310–311.
13. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780.
14. Cho, K.; Merriënboer, B.V.; Bahdanau, D.; Bengio, Y. On the properties of neural machine translation: Encoder-decoder approaches. In Proceedings of the SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, Doha, Qatar, 25 October 2014; pp. 103–111.
15. Subramani, K.; Valin, J.M.; Buthe, J.; Smaragdis, P.; Goodwin, M. Noise-Robust DSP-Assisted Neural Pitch Estimation with Very Low Complexity. In Proceedings of the 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 1–5.
16. Bahdanau, D.; Cho, K.; Bengio, Y. Neural Machine Translation by Jointly Learning to Align and Translate. In Proceedings of the International Conference on Learning Representations (ICLR 2015), San Diego, CA, USA, 7–9 May 2015; pp. 1–15.
17. Michelsanti, D.; Tan, Z.H.; Zhang, S.X.; Xu, Y.; Yu, M.; Yu, D.; Jensen, J. An overview of deep-learning-based audio-visual speech enhancement and separation. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 1368–1396.
18. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, 31st Conference on Neural Information Processing Systems (NIPS2017), Long Beach, CA, USA, 4–9 December 2017; Volume 30, pp. 1–11.
19. Zhang, J.; Tang, J.; Dai, L. RNN-BLSTM Based Multi-Pitch Estimation. In Proceedings of the Interspeech 2016, San Francisco, CA, USA, 8–12 September 2016; pp. 1785–1789.
20. Panayotov, V.; Chen, G.; Povey, D.; Khudanpur, S. Librispeech: An ASR corpus based on public domain audio books. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, QLD, Australia, 19–24 April 2015; pp. 5206–5210.
21. Mozilla. Common Voice. Available online: https://commonvoice.mozilla.org/ (accessed on 9 May 2025).
22. Kato, A.; Kinnunen, T. Waveform to Single Sinusoid Regression to Estimate the F0 Contour from Noisy Speech Using Recurrent Deep Neural Networks. In Proceedings of the Interspeech 2018, Hyderabad, India, 2–6 September 2018; pp. 327–331.
23. Ikemiya, Y.; Yoshii, K.; Itoyama, K. Singing voice analysis and editing based on mutually dependent F0 estimation and source separation. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, QLD, Australia, 19–24 April 2015; pp. 574–578.
24. Hou, Z.; Lei, T.; Hu, Q.; Cao, Z.; Lu, J. SNR-Progressive Model with Harmonic Compensation for Low-SNR Speech Enhancement. IEEE Signal Process. Lett. 2025, 32, 476–480.
25. MathWorks. Sequence-to-One Regression Using Deep Learning. MathWorks Documentation. Available online: https://www.mathworks.com/help/deeplearning/ug/sequence-to-one-regression-using-deep-learning.html (accessed on 9 May 2025).
26. Xu, S.; Shimodaira, H. Direct F0 estimation with neural-network-based regression. In Proceedings of the Interspeech 2019, Graz, Austria, 15–19 September 2019; pp. 1995–1999.
27. Boashash, B. Time-Frequency Signal Analysis and Processing: A Comprehensive Reference, 2nd ed.; Academic Press: Oxford, UK, 2016.
28. Liu, Y.; Wu, P.; Black, A.W.; Anumanchipalli, G.K. A Fast and Accurate Pitch Estimation Algorithm Based on the Pseudo Wigner-Ville Distribution. In Proceedings of the 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5.
29. Wang, X.; Takaki, S.; Yamagishi, J. An RNN-based quantized F0 model with multi-tier feedback links for text-to-speech synthesis. In Proceedings of the Interspeech 2017, Stockholm, Sweden, 20–24 August 2017; pp. 1059–1063.
30. Cunningham, P.; Cord, M.; Delany, S.J. Supervised learning. In Machine Learning Techniques for Multimedia: Case Studies on Organization and Retrieval; Springer: Berlin/Heidelberg, Germany, 2008; pp. 21–49.
31. Jadoul, Y.; Thompson, B.; de Boer, B. Introducing Parselmouth: A Python interface to Praat. J. Phon. 2018, 71, 1–15.
32. Boersma, P.; Weenink, D. Praat: Doing Phonetics by Computer [Computer Program]. 2021. Available online: http://www.praat.org/ (accessed on 9 May 2025).
33. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016; p. 800.

Figure 1. Architecture scheme.

Table 1. Hyperparameters’ configuration employed for training all neural-network-based models.

Hyperparameter	Value
Optimiser	Adam
Initial learning rate	0.01
Learning rate schedule	Constant
Number of epochs	100
Batch size	128 frames
Loss function	Cross-entropy (continuous F0 weighted from class probabilities)
Gradient clipping	Threshold = 1 (MATLAB default)
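The loss entry in Table 1 mentions a continuous F0 value weighted from class probabilities. One possible reading, sketched below, is that the network outputs a probability for each F0 class, the continuous estimate is the probability-weighted mean of the class centre frequencies, and training uses cross-entropy against the target class. The bin centres, the logits, and the function names are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_f0(probs, bin_centers_hz):
    """Probability-weighted mean of the class centre frequencies."""
    return sum(p * f for p, f in zip(probs, bin_centers_hz))


def cross_entropy(probs, target_index):
    """Cross-entropy loss for a single frame with a one-hot target."""
    return -math.log(probs[target_index])


# Hypothetical F0 classes and network outputs (illustrative values only).
bin_centers_hz = [100.0, 150.0, 200.0, 250.0]
logits = [0.0, 2.0, 2.0, 0.0]

probs = softmax(logits)
f0_hz = expected_f0(probs, bin_centers_hz)  # close to 175 Hz by symmetry
loss = cross_entropy(probs, target_index=1)
```

With this reading, the estimate can fall between class centres, so the output resolution is not limited to the class grid even though training is posed as classification.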

Table 2. Comparison of the F0 estimation models, where bold numbers represent the best results.

Method	RMSE (LibriSpeech)	FLOPs (LibriSpeech)	RMSE (Common Voice)	FLOPs (Common Voice)
Waveform-to-sinusoid regression	0.0025	116,160	0.0028	116,800
Sequence-to-one regression	0.0023	80,800	0.0026	81,200
Direct F0 estimation with neural regression	0.0027	80,800	0.0029	81,000
PWVD w/LSTM	0.0026	32,800	0.0029	33,000
Quantised F0 estimation w/MTF	0.0029	29,280	0.0032	29,400
RNN with attention mechanism	0.0031	20,400	0.0035	20,600
CREPE	0.0026	131,300	0.0027	133,800
pYIN	0.0036	—	0.0038	—
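The trade-off in Table 2 can be quantified with a short calculation. The sketch below copies the LibriSpeech columns from the table and computes how many FLOPs the attention-based model saves relative to each other model, together with the RMSE gap to the most accurate model. The variable names are illustrative; pYIN has no FLOPs figure in the table, so it is stored as `None` and skipped in the FLOPs comparison.

```python
# LibriSpeech results from Table 2: method -> (RMSE, FLOPs or None).
librispeech = {
    "Waveform-to-sinusoid regression": (0.0025, 116160),
    "Sequence-to-one regression": (0.0023, 80800),
    "Direct F0 estimation with neural regression": (0.0027, 80800),
    "PWVD w/LSTM": (0.0026, 32800),
    "Quantised F0 estimation w/MTF": (0.0029, 29280),
    "RNN with attention mechanism": (0.0031, 20400),
    "CREPE": (0.0026, 131300),
    "pYIN": (0.0036, None),
}

proposed = "RNN with attention mechanism"
proposed_rmse, proposed_flops = librispeech[proposed]

# Relative FLOPs saving of the proposed model versus every other model.
flops_saving = {
    name: 1.0 - proposed_flops / flops
    for name, (_, flops) in librispeech.items()
    if name != proposed and flops is not None
}

# Method with the fewest FLOPs and method with the lowest RMSE.
cheapest = min(
    (name for name, (_, flops) in librispeech.items() if flops is not None),
    key=lambda name: librispeech[name][1],
)
most_accurate = min(librispeech, key=lambda name: librispeech[name][0])
rmse_gap = proposed_rmse - librispeech[most_accurate][0]

for name, saving in sorted(flops_saving.items(), key=lambda item: item[1]):
    print(f"{name}: {saving:.1%} fewer FLOPs with the proposed model")
print(f"Lowest RMSE: {most_accurate}, gap to proposed model: {rmse_gap:.4f}")
```

On these numbers the proposed model needs roughly 30% fewer FLOPs than the next cheapest model (quantised F0 estimation w/MTF) and roughly 84% fewer than CREPE, at the cost of an RMSE that is 0.0008 higher than that of sequence-to-one regression.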

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

