Smart-Median: A New Real-Time Algorithm for Smoothing Singing Pitch Contours

Abstract: Pitch detection is usually one of the fundamental steps in audio signal processing. However, it is common for pitch detectors to estimate a portion of the fundamental frequencies incorrectly, especially in real-time environments and when applied to singing. Therefore, the estimated pitch contour usually has errors. To remove these errors, a contour smoother algorithm should be employed. However, because none of the current contour-smoother algorithms has been explicitly designed to be applied to contours generated from singing, they are often unsuitable for this purpose. Therefore, this article aims to introduce a new smoother algorithm that rectifies this. The proposed smoother algorithm is compared with 15 other smoother algorithms over approximately 2700 pitch contours. Four metrics were used for the comparison. According to all the metrics, the proposed algorithm could smooth the contours more accurately than other algorithms. A distinct conclusion is that smoother algorithms should be designed according to the contour type and the result's final applications.


Introduction
Estimating the fundamental frequency is usually one of the main steps in audio signal processing algorithms. However, it is common for pitch detector algorithms to make some incorrect estimations, resulting in a pitch contour that is not smooth and includes errors. These errors are often due to doubling or halving estimates of the true pitch value, and are therefore impulsive in appearance rather than random [1][2][3][4][5]. Furthermore, incorrect pitch estimation often happens in real-time pitch detection, especially when the sound source is a human voice [5]. Therefore, a contour-smoother algorithm is necessary to filter the incorrectly estimated F0 before further analysis.
Generally, contour smoothers can be divided into two categories: (1) contour smoothing to show the data trend; and (2) contour smoothing to remove errors, noise, and outlier points.
There are several algorithms for showing contour trend, such as polynomial [6], spline [7,8], Gaussian [9], Locally Weighted Scatterplot Smoothing (LOWESS) [10,11], and seasonal decomposition [12]. One of the applications of trend detection using pitch contours is to find the similarity between melodies [13][14][15][16]. Other contour smoothers, such as moving average [17] and Median filter, function by attenuating or removing outliers in the contour [5]. None of these contour-smoother algorithms was explicitly designed for smoothing pitch contours; they can be used for any contour from any data series. They have been applied to the smoothing of pitch contours, such as in the study by Kasi and Zahorian [18] that used the Median filter. However, there are certain adjusted versions of these algorithms for smoothing estimated pitches; for example, Okada et al. [19] and Jlassi et al. [20] introduced pitch contour algorithms based on the Median filter. In the following, some of these adjusted algorithms are discussed.
Zhao et al. [2] introduced a pitch smoothing method for the Mandarin language based on autocorrelation and cepstral F0 detection approaches. They first used two pitch estimation techniques to estimate two separate pitch contours, and then both were smoothed. Finally, combining the two estimated pitch contours created the smoothed contour. Generally, their approach was very similar to the idea of this paper, moving through a pitch contour to identify noisy estimates by comparing each point to its previous and succeeding points, and finally editing out the noise. However, their approach involved altering some correct parts of the data, which impacted peaks that were not incorrect. Moreover, in their evaluation, they only checked the error reduction capability of their algorithm for removing octave-doubling and sharp rises in estimated F0s. It would have been preferable to compare their smoothed contours with ground truth, to show how well their algorithm could adjust the estimated contour to make it similar to that of the ground truth.
Liu et al. [21] introduced a pitch-contour-smoother algorithm for Mandarin tone recognition. They used several thresholds for finding half, double, and triple errors by comparing each point with its previous point. Then, an incorrect frequency was doubled, halved, or divided by three, according to the type of error detected. They indicated that experiments should determine the threshold values, but did not provide any guidelines for selecting or adjusting these. In addition, the threshold values they used were not revealed. Therefore, it is unclear how one could change the thresholds to optimize the result. In addition, they tested their algorithm only on isolated Mandarin syllables, although realistically they should also have tried their approach on continuously spoken language. Moreover, they did not compare the accuracy of their algorithm with other contour-smoother algorithms to show how well their method performed compared to others.
The smoothing approach presented by Jlassi et al. [20] was designed for spoken English. Their smoothing system was based on the moving average filter. However, they only calculated the average of the two immediately previous F0 points for those points that showed more than a 30 Hz difference from their previous and following points. They compared their algorithm with the Median filter and Exponentially Weighted Moving Average (EWMA), and found improved accuracy using their approach. However, the dataset [22] used in their study was small (15 people reading a phonetically balanced text, "The North Wind Story"). Their results would have been much more convincing if they had evaluated their algorithm with a more extensive dataset generated by various pitch-detector algorithms. Moreover, several metrics could have been employed to measure how well they smoothed the errors. Furthermore, their algorithm considered a difference of more than 30 Hz from both the immediately previous and following points as an error; therefore, it was unable to identify and smooth any errors existing over more than one point on the contour.
Ferro and Tamburini [1] introduced another pitch-smoother technique for spoken English, based on Deep Neural Networks (DNN) and implemented explicitly as a Recurrent Neural Network (RNN). However, they did not provide a comparison between the improvement offered by their approach and that of any other method. In addition, comparison of their datasets and the mixture of datasets suggests that their DNN architecture may not work well with a new dataset.
As exemplified above, many pitch detection algorithms have been designed for and tested on speech. However, although both speech and singing are produced with the same human vocal system, because of the differences between speaking and singing, separate studies are required for pitch analysis of singing [23]. In addition, in real-time environments, the smoother algorithm should alter the contour with a reasonable delay, mainly based on previous data because future data is unavailable.
We believe that the smoother algorithm should be based on the features and applications of the contour, similar to the approach taken by Ferro [3] on smoothing contours generated from speech. In other words, expected error types in the pitch contours for the specific data type should be identified. Then, an investigation for a targeted contour-smoother algorithm to solve these errors should be made. In addition, the applications of the smoothed contour should also be considered. For example, when a highly accurate estimate of the F0 value at each point is required, the smoother algorithm should not change any data except those points identified as incorrectly estimated. Moreover, in real-time environments the smoother algorithm should not have a long delay. This paper, therefore, introduces a new contour-smoother algorithm based on the features and applications of pitch contours that are derived only from singing. For this purpose, after describing the methodology applied, several typical contour-smoother algorithms are described. Then, the proposed algorithm is explained in Section 4, followed by the results and discussion. Finally, a conclusion and suggestions for future work are provided in Section 7.

Dataset
The VocalSet dataset [24] was used to evaluate the algorithms' accuracy. This dataset includes more than 10 h of recordings of 20 (11 males and 9 females) professional singers. VocalSet includes a complete set of vowels and a diverse set of voices that exhibit many different vocal techniques, singing in contexts of scales, arpeggios, long tones, and melodic excerpts. For this study, a portion of VocalSet was selected; the scales and arpeggios sung across the vowels in loud slow and fast performances. The total number of files used from VocalSet was 511.

Ground Truth
In order to evaluate the accuracy of each of the smoother algorithms, ground truth pitch contours were required for comparison with the smoothed pitch contours. In other words, in this study, the best smoothing algorithm was considered the one that produced contours most similar to the ground truth. Following the studies by Faghih and Timoney [4,5], a reliable offline pitch detector algorithm called PYin [24] was used to produce the ground truth. The pitch contours estimated by PYin were saved in several CSV files with two columns, time in seconds and F0. These were all plotted to check the accuracy of the pitch contours estimated by PYin. Those that included irrational jumps were considered incorrect and deleted. After removing those contours, the number of ground truth files remaining was 447.

Pitch Detection Algorithms to Generate Pitch Contours
To evaluate the proposed smoother algorithm, we used a similar approach as [1], employing several pitch contours with different random error (unsmoothed) points. As Faghih and Timoney's [5] study discussed, six real-time pitch detection algorithms with different estimated contours were employed to obtain the required contours. The pitch detector algorithms were Yin [25], spectral YIN or YIN Fast Fourier transform (YinFFT), Fast comb spectral model (FComb), Multi-comb spectral filtering (Mcomb), Schmitt trigger, and the spectral auto-correlation function (specacf). The implementation for these algorithms came from a Python library, Aubio (https://aubio.org/manual/latest/cli.html#aubiopitch, accessed on 10 June 2021) [26], a well-known library for music information retrieval. Since the focus of this paper is on smoothing pitch contours, descriptions of these algorithms are not provided in this paper but can be found in [5,27]. The reason for selecting these real-time pitch estimators was that, based on the study by Faghih and Timoney [5], none of them can estimate F0s without error in singing signals. In addition, the accuracy of these algorithms varies, which helped us evaluate the contour-smoother algorithms in different situations.
In addition, to compare the accuracy of the algorithms in conditions where the pitch contours included no or only a few errors, an offline pitch-detector algorithm provided in the Praat tool [28] based on the Boersma algorithm [29] was used. According to Faghih and Timoney's studies [4,5], the Praat and Pyin accuracies tend to be similar.
The settings used for pitch detection for women's voices were 44,100 for sample rate, 1024 for window size, and 512 for hop size. The related settings for men's voices were 44,100, 2048, and 1024 for sample rate, window size, and hop size, respectively. Therefore, the distance between two consecutive points in a pitch contour for women's voices was 11.61 milliseconds, and for men's voices was 23.22 milliseconds.
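The frame spacings quoted above follow directly from dividing the hop size by the sample rate. The short check below illustrates this arithmetic; the function name is chosen here for illustration only.

```python
SAMPLE_RATE = 44_100  # Hz, as stated in the text


def frame_spacing_ms(hop_size, sample_rate=SAMPLE_RATE):
    """Time between two consecutive pitch estimates, in milliseconds."""
    return 1000.0 * hop_size / sample_rate


women_ms = frame_spacing_ms(512)   # hop size used for women's voices
men_ms = frame_spacing_ms(1024)    # hop size used for men's voices
print(round(women_ms, 2), round(men_ms, 2))
```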
As shown in Figure 1, the contours generated by the different pitch detectors exhibited various errors. Therefore, the total number of contours used to evaluate the smoother algorithms was 2682 (corresponding to the six pitch detectors run on each of the 447 wav files).
All the provided files, such as the dataset and codes, are available in a GitHub repository at https://github.com/BehnamFaghihMusicTech/Smart-Median, accessed on 6 July 2022.

Evaluation Method
Several evaluation metrics were used to compare the accuracy of the smoothing algorithms. The metrics used for the evaluations were R-squared ($R^2$), Root-Mean-Square Error (RMSE), Mean-Absolute-Error (MAE), and F0 Frame Error (FFE). A well-known Python library called Sklearn [30] was used for the metrics, except for the FFE metric that was created by this paper's authors. These metrics are explained in the following subsections.
The formula for $R^2$ is as follows (1) [31]:

$$R^2 = 1 - \frac{\sum_{i=1}^{N} (GT_i - SM_i)^2}{\sum_{i=1}^{N} \left(GT_i - \mathrm{mean}(GT)\right)^2} \tag{1}$$

where $N$ is the total number of frames, $GT$ is the ground truth contour, $SM$ is the smoothed contour, and $\mathrm{mean}(GT) = \frac{1}{N}\sum_{i=1}^{N} GT_i$. In the best case, when all the points in the ground truth contour and the smoothed contour are identical, $R^2$ is equal to 1; otherwise, $R^2$ is less than 1. A value closer to 1 means more similarity between the two contours.
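As a minimal sketch of this metric in plain Python (the study itself used Sklearn; the function below is only an illustrative stand-in and assumes the ground truth contour is not constant):

```python
def r_squared(gt, sm):
    """R-squared between a ground truth contour gt and a smoothed contour sm."""
    mean_gt = sum(gt) / len(gt)
    ss_res = sum((g - s) ** 2 for g, s in zip(gt, sm))  # residual sum of squares
    ss_tot = sum((g - mean_gt) ** 2 for g in gt)        # total sum of squares
    return 1.0 - ss_res / ss_tot


gt = [220.0, 222.0, 224.0, 226.0]
print(r_squared(gt, gt))                            # identical contours
print(r_squared(gt, [220.0, 222.0, 448.0, 226.0]))  # one octave-doubling error
```

A single octave error is enough to push the value far below 1, and it can even become negative.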

Root-Mean-Square Error (RMSE)
This metric is calculated according to the following Formula (2):

$$RMSE = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (GT_i - SM_i)^2} \tag{2}$$

In the best case, when the two contours have precisely the same values, the RMSE is 0; otherwise, it is greater than 0. Values closer to 0 mean more similarity between the two contours.
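A minimal plain-Python sketch of RMSE, written here for illustration rather than taken from the paper's code, could look as follows:

```python
import math


def rmse(gt, sm):
    """Root-mean-square error between ground truth gt and smoothed contour sm."""
    squared = [(g - s) ** 2 for g, s in zip(gt, sm)]
    return math.sqrt(sum(squared) / len(squared))


print(rmse([220.0, 222.0, 224.0], [220.0, 222.0, 224.0]))  # identical contours
print(rmse([0.0, 0.0, 0.0, 0.0], [2.0, 2.0, 2.0, 2.0]))    # constant offset of 2
```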

Mean-Absolute-Error (MAE)
Equation (3) shows how to calculate this metric:

$$MAE = \frac{1}{N}\sum_{i=1}^{N} \left|GT_i - SM_i\right| \tag{3}$$

MAE is similar to RMSE, but, because of the squared difference, RMSE imposes a larger penalty on points that lie at a greater distance from the corresponding points in the ground truth contour.
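The difference in penalty between MAE and RMSE can be seen in a small example. The sketch below uses illustrative helper functions and made-up contours: both erroneous contours have the same MAE, but the one with a single large deviation has the larger RMSE.

```python
import math


def mae(gt, sm):
    """Mean absolute error between two contours of equal length."""
    return sum(abs(g - s) for g, s in zip(gt, sm)) / len(gt)


def rmse(gt, sm):
    """Root-mean-square error between two contours of equal length."""
    return math.sqrt(sum((g - s) ** 2 for g, s in zip(gt, sm)) / len(gt))


gt = [220.0, 220.0, 220.0, 220.0]
uniform = [230.0, 230.0, 230.0, 230.0]  # every point is off by 10 Hz
spike = [220.0, 220.0, 220.0, 260.0]    # one point is off by 40 Hz

print(mae(gt, uniform), rmse(gt, uniform))
print(mae(gt, spike), rmse(gt, spike))
```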

F0 Frame Error (FFE)
FFE is the proportion of frames within which an error is made. Therefore, FFE alone can provide an overall performance measure of the accuracy of the pitch detection algorithm [32]. This metric calculates the percentage of points in the estimated pitch contour that are within a Threshold distance of corresponding points in the ground truth pitch contour (4):

$$FFE = \frac{100}{N}\sum_{i=1}^{N} \mathbb{1}\left[\,\left|GT_i - SM_i\right| \le Threshold\,\right] \tag{4}$$

where $N$ is the total number of frames/points, and $\mathbb{1}[\cdot]$ equals 1 when the condition holds and 0 otherwise. For the Threshold, in studies such as [1], a constant value, e.g., 16 Hz, was used as an acceptable variation from the ground truth. However, as is discussed by Faghih and Timoney [5], a fixed distance from the ground truth may not be a good approach, because the perceptual effect of 16 Hz is different when the estimated pitch is 100 Hz compared to 1000 Hz. It is also common to use a percentage, usually 20%, as the threshold [20], and a similar approach is used in this study.
Higher values of this metric indicate a higher similarity between the smoothed pitch contour and the ground truth pitch contour.
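The following is a minimal sketch of the metric as described above, assuming that the percentage threshold is taken relative to the ground truth value at each frame. That reading, the function name, and the default of 20% are assumptions made for illustration; the authors' own implementation is in their repository.

```python
def ffe_percent(gt, sm, rel_threshold=0.20):
    """Percentage of frames whose smoothed value lies within
    rel_threshold * ground truth of the ground truth value."""
    within = sum(1 for g, s in zip(gt, sm) if abs(g - s) <= rel_threshold * g)
    return 100.0 * within / len(gt)


gt = [200.0, 200.0, 200.0, 200.0]
sm = [200.0, 230.0, 400.0, 100.0]  # exact, close, octave up, octave down
print(ffe_percent(gt, sm))
```

With this reading, a frame whose ground truth is 0 Hz (silence) only counts as correct when the smoothed value is also exactly 0.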
It should be mentioned that there are other algorithms for finding the similarities between pitch contours, such as those of Sampaio [13], Wu [14], and Lin et al. [15]. However, these aim to determine perceptual similarity between two pitch contours. In other words, those researchers were seeking to determine the similarity of one melody to another. The purpose of the current paper is not to ascertain the overall similarities between two tunes, but rather the numerical relationship between each point on two different pitch contours. Therefore, those algorithms were not suitable for this study.

Current Contour Smoother Algorithms
Several contour-smoother algorithms are commonly used to smooth pitch contours. This section provides a list of these algorithms (Table 1). In addition, the Python libraries employed to implement these smoothers are listed in Appendix A.
Each of the algorithms is described below.

Gaussian Filter
Generally, in signal processing, filtering removes or modifies unwanted error and noise signals from a series of data. Therefore, Gaussian filters smooth out fluctuations in data by convolution with a Gaussian function [9]. The one-dimensional Gaussian filter is expressed as (5):

$$Sm_i = \sum_{k} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{k^2}{2\sigma^2}\right) Es_{i-k} \tag{5}$$

where $Es_i$ is the original signal at position $i$, $Sm_i$ is the smoothed signal at position $i$, and the sum runs over the offsets $k$ covered by the filter. In addition, $\sigma^2$ indicates the variance of the Gaussian filter. The smoothing degree depends on the variance value size [9]. Although the Gaussian filter smooths out the noise, as shown in Figure 2a, some correctly estimated F0 may also change, i.e., become distorted [9].
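The distortion of correct neighbouring points can be illustrated with a small sketch. It assumes a truncated kernel whose weights are normalised to sum to one and edge values that are repeated at the boundaries; these choices, and the parameter names `sigma` and `radius`, are illustrative assumptions rather than the settings used in the study.

```python
import math


def gaussian_kernel(sigma, radius):
    """Discrete Gaussian weights for offsets -radius..radius, normalised to sum to 1."""
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def gaussian_smooth(es, sigma=1.0, radius=3):
    """Convolve the contour es with a Gaussian kernel (edges repeat the end values)."""
    kernel = gaussian_kernel(sigma, radius)
    last = len(es) - 1
    sm = []
    for i in range(len(es)):
        acc = 0.0
        for k, w in zip(range(-radius, radius + 1), kernel):
            j = min(max(i + k, 0), last)  # clamp index at the contour edges
            acc += w * es[j]
        sm.append(acc)
    return sm


contour = [220.0] * 5 + [440.0] + [220.0] * 5  # one octave-doubling error
smoothed = gaussian_smooth(contour)
print(smoothed[5], smoothed[4])
```

The spike is attenuated but not removed, and its correctly estimated neighbours are pulled upwards.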

Savitzky-Golay Filter
This particular type of low-pass filter was introduced into analytical chemistry, but soon found many applications in other fields [33]. It can be considered a weighted moving average [34], and is defined as follows (6):

$$Sm_i = \sum_{k=-(M-1)/2}^{(M-1)/2} h_k \, Es_{i+k} \tag{6}$$

where $Es_i$ is the original signal at position $i$, and $Sm_i$ is the smoothed signal at position $i$. $M$ is the window length and $h_k$ are the filter coefficients that indicate the boundaries of the data. The drawback of the Savitzky-Golay (SG) filter, according to Schmid et al. [35], is that the data near the edges is prone to artefacts. Figure 2a illustrates its effect on a contour.
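As an illustration of the weighted-moving-average view, the sketch below applies the well-known 5-point quadratic Savitzky-Golay coefficients (-3, 12, 17, 12, -3)/35. The window length and polynomial order are chosen here only for illustration, and the first and last two points are simply left unchanged, which is one simple (assumed) way of handling the edges.

```python
# 5-point, quadratic Savitzky-Golay smoothing coefficients h_k for k = -2..2
SG5_QUADRATIC = [-3 / 35, 12 / 35, 17 / 35, 12 / 35, -3 / 35]


def savgol5(es):
    """Apply the 5-point quadratic SG filter; edge points are copied unchanged."""
    sm = list(es)
    for i in range(2, len(es) - 2):
        sm[i] = sum(h * es[i + k] for h, k in zip(SG5_QUADRATIC, range(-2, 3)))
    return sm


parabola = [float(i * i) for i in range(8)]
print(savgol5(parabola))  # a quadratic contour passes through unchanged

spike = [220.0] * 3 + [440.0] + [220.0] * 3
print(savgol5(spike)[3])  # the spike is reduced but still present
```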

Exponential Filter
This approach is based on weighting the current values by the previously observed data, assuming that the most recent observations are more important than the older ones. The smoothed series starts with the second point in the contour. It is calculated by [36], (7):

$$Sm_2 = Es_1, \qquad Sm_i = \alpha \, Es_{i-1} + (1 - \alpha)\, Sm_{i-1}, \quad 0 < \alpha \le 1, \; i \ge 3 \tag{7}$$

where $\alpha$ is called the smoothing constant. An illustration of this smoothing effect can be seen in Figure 2a.
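A minimal sketch of single exponential smoothing is given below. It assumes the common formulation in which the smoothed series starts at the second point with the value of the first observation, and each later value blends the previous observation with the previous smoothed value; the value of `alpha` is illustrative.

```python
def exponential_smooth(es, alpha=0.5):
    """Single exponential smoothing.

    Returns the smoothed values for the 2nd..last points of es, so the
    result is one element shorter than the input.
    """
    if len(es) < 2:
        return []
    sm = [es[0]]  # smoothed value for the second point is the first observation
    for i in range(1, len(es) - 1):
        sm.append(alpha * es[i] + (1 - alpha) * sm[-1])
    return sm


print(exponential_smooth([10.0, 20.0, 30.0, 40.0], alpha=0.5))
```

Because only past values are used, this smoother needs no look-ahead, but an error keeps influencing all later smoothed values with decaying weight.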

Window-Based Finite Impulse Response Filter
In this approach, a window works as a mask to filter the data series. Different window shapes can be considered for filtering data. Each window point is usually between 0 and 1. Therefore, this method uses weighted windows. If $Es_i$ is considered a signal at index $i$, and a window at index $i$ as $w_i$, the smoothed signal $Sm_i$ is calculated as follows, (8):

$$Sm_i = \sum_{j=0}^{N-1} w_j \, Es_{i-j} \tag{8}$$

where $N$ is the length of the window. The window types used in this study are described below.

Rectangular Window
In the rectangular window, all values equal one (Figure 2b).

Hanning Window
The Hanning window is defined as follows, from [37] (9):

$$w(n) = 0.5 - 0.5\cos\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1 \tag{9}$$

where $N$ is the length of the window; Figure 2b.

Hamming Window
The Hamming window is defined as follows, from [37] (10):

$$w(n) = 0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1 \tag{10}$$

where $N$ is the length of the window; Figure 2b.

Blackman Window
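The Blackman window is defined [38] by (12):

$$w(n) = a_0 - a_1\cos\left(\frac{2\pi n}{N-1}\right) + a_2\cos\left(\frac{4\pi n}{N-1}\right), \quad 0 \le n \le N-1 \tag{12}$$

where $N$ is the window length, and $a_0$, $a_1$ and $a_2$ are constants (13):

$$a_0 = \frac{1-\alpha}{2}, \qquad a_1 = \frac{1}{2}, \qquad a_2 = \frac{\alpha}{2} \tag{13}$$

The $\alpha$ is static, and equals 0.16; Figure 2c.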
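The window definitions above can be sketched in a few lines of plain Python. The smoothing function below centres the window on each point, repeats the end values at the edges, and divides by the sum of the window weights so that the level of the contour is preserved; these three choices are assumptions of this sketch, not details confirmed by the text.

```python
import math


def rectangular(n_points):
    return [1.0] * n_points


def hanning(n_points):
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / (n_points - 1))
            for n in range(n_points)]


def hamming(n_points):
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_points - 1))
            for n in range(n_points)]


def blackman(n_points, alpha=0.16):
    a0, a1, a2 = (1 - alpha) / 2, 0.5, alpha / 2
    return [a0
            - a1 * math.cos(2 * math.pi * n / (n_points - 1))
            + a2 * math.cos(4 * math.pi * n / (n_points - 1))
            for n in range(n_points)]


def window_smooth(es, window):
    """Weighted average of es under a centred window (odd length assumed)."""
    half = len(window) // 2
    total = sum(window)
    last = len(es) - 1
    sm = []
    for i in range(len(es)):
        acc = 0.0
        for k, w in enumerate(window):
            j = min(max(i + k - half, 0), last)  # repeat edge values
            acc += w * es[j]
        sm.append(acc / total)
    return sm


contour = [220.0] * 4 + [440.0] + [220.0] * 4
for make_window in (rectangular, hanning, hamming, blackman):
    print(make_window.__name__, round(window_smooth(contour, make_window(5))[4], 2))
```

The window functions require a length of at least two points, since the formulas divide by N - 1.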

Direct Spectral Filter
In this approach, a time series is smoothed by employing a Fourier Transformation. The essential frequencies remain, and others are removed. It operates similarly to multiplying the frequency domain by a rectangular window. In other words, it is a circular convolution generated by transforming the window in the time domain; Figure 2c.

Polynomial
This approach uses weighted linear regression on an ad-hoc expansion basis to smooth the time series. It is a generalization of the Finite Impulse Response (FIR) filter that can better preserve the desired signal's higher frequency content without removing as much noise as the average [39]. The first derivative of the polynomial evaluated at the midpoint of the N-interval is generated by multiplying the position data $Es_i$ by coefficients and adding these multiplications, as shown in (14) [6]:

$$Sm = \sum_{i=1}^{N} W_i \, Es_i \tag{14}$$

where $W_i$ are the weights (coefficients) of the polynomial fit of degree $p$. The weights depend on the degree $p$, and the number of points, $N$, used in the fit; Figure 2c is an example.

Spline
This approach employs Spline functions to eliminate the noise from the data. It works by estimating the optimum amount of smoothing required for the data. Three types of spline smoothing were used in this study: 'linear' (Figure 2c), 'cubic' (Figure 2d), and 'natural cubic' (Figure 2d). The details of this approach are provided in [7,8].

Binner
This approach applies linear regression on an ad-hoc expansion basis within a time series. The features created by this method are obtained via binning the input space into intervals. An indicator feature is designed for each bin, indicating into which bin a given observation falls. The input space consists of a single continuous increasing sequence in the time series domain [40]; an illustration is shown in Figure 2d.

Locally Weighted Scatterplot Smoothing (LOWESS) Smoother
This is a non-parametric regression method. LOWESS attempts to fit a linear model to each data point, based on local data points; Figure 2e. This makes the procedure more versatile than simply including a high-order polynomial [10,11].

Seasonal Decomposition
One of the considerations in analysing time series data is dealing with seasonality. A seasonal decomposition deconstructs a time series into several components: a trend, a repeating seasonal time series, and the remainder. One of the benefits of seasonal decomposition is its capacity to locate anomalies and errors in data [12]. Seasonal decomposition can estimate the notes and seasons in a pitch contour, but the vibrations sung in each note are removed. Therefore, it can show the movements between seasons and notes in a pitch contour, as shown in Figure 2e,f.
Two seasonal component assessments may be used: 'additive' and 'multiplicative'. In the additive method, the variables are assumed to be mutually independent and calculated by summation of the variables. The multiplicative approach considers that components are dependent on each other, and it is calculated by the multiplication of the variables [41].
Seasonal decomposition can be employed using different smoothing techniques. The smoothing techniques used in this study are Window-based, 'LOWESS', and 'natural_cubic_spline'.

Kalman Filter
The Kalman filter is a set of mathematical equations that provides an efficient recursive means to estimate the state of a process in a way that minimises the norm of the squared error. The Kalman filter uses a form of feedback control, assessing the process state and then obtaining feedback in the form of (noisy) measurements. The equations for the Kalman filter have two parts: time update equations and measurement update equations. The time update equations operate as predictor equations, while the measurement update equations are corrector equations. Thus, the overall estimation algorithm is close to a predictor-corrector algorithm, i.e., correcting to improve the predicted value. In the standard Kalman filter, it is assumed that the noise is Gaussian, which may or may not reflect the reality of the system that is being modelled [42]. Thus, the more accurate the model used in the Kalman algorithm, the better the performance.
The Kalman smoother can be represented in the state space form. Therefore, a matrix representation of all the components is required. Four structure presentations in the contours are considered: 'level', 'trend', 'seasonality' and 'long seasonality', and a combination of these structures can be considered. Examples of the effects of different variations of the Kalman filter are shown in Figure 2f,g.
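To make the predictor-corrector structure concrete, the sketch below implements a one-dimensional Kalman filter with only a 'level' component. It is a forward-only filter, not the state-space smoother with trend and seasonality used in the study, and the two variance parameters are arbitrary illustrative values.

```python
def kalman_level_filter(es, process_var=1.0, measurement_var=100.0):
    """Scalar Kalman filter with a random-walk ('level') state model."""
    x = es[0]             # state estimate (current pitch level)
    p = measurement_var   # variance of the state estimate
    sm = [x]
    for z in es[1:]:
        # Time update (predictor): the level is assumed to persist,
        # so only the uncertainty grows.
        p = p + process_var
        # Measurement update (corrector): blend prediction and measurement.
        gain = p / (p + measurement_var)
        x = x + gain * (z - x)
        p = (1.0 - gain) * p
        sm.append(x)
    return sm


print(kalman_level_filter([220.0, 220.0, 440.0, 220.0]))
```

A larger `measurement_var` makes the filter trust the model more than the measurements, so spikes are attenuated more strongly, at the cost of slower tracking of real note changes.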

Moving Average
This simple filter aims to reduce random noise in a data series [17] by following the Formula (15):

$$Sm_i = \frac{1}{n}\sum_{j=0}^{n-1} Es_{i+j} \tag{15}$$

where $Es$ is the original pitch contour, $Sm$ is the smoothed pitch contour, and $n$ is the number of points analysed at any given time and is referred to as the window length of the filter. The larger the value of $n$, the greater the level of smoothing. An example can be seen in Figure 2h.
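A minimal sketch of the moving average follows. It assumes the window starts at the current point and extends forward, and that the window is simply shortened near the end of the contour; both choices are assumptions made for illustration.

```python
def moving_average(es, n=3):
    """Mean of each point and the n - 1 points that follow it."""
    sm = []
    for i in range(len(es)):
        window = es[i:i + n]  # shorter than n near the end of the contour
        sm.append(sum(window) / len(window))
    return sm


print(moving_average([1, 2, 3, 4, 5], n=3))
```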

Median Filter
The Median filter approach is similar to the moving average. Still, instead of calculating the average of a window of length $n$, the Median of the window is considered (16). Unlike the moving average filter, which is a linear system, this filter is nonlinear, which makes its analysis more complicated:

$$Sm_i = \mathrm{Median}\left(Es_i, Es_{i+1}, \ldots, Es_{i+n-1}\right) \tag{16}$$

where $Es$ is the original pitch contour, $Sm$ is the smoothed pitch contour, and $n$ is the number of points to calculate the Median at each instant. Figure 2h illustrates the effect of this method.
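The sketch below shows a plain median filter using the same windowing assumption as a forward moving average (the window starts at the current point and is shortened at the end of the contour). The example contour is made up.

```python
from statistics import median


def median_filter(es, n=3):
    """Median of each point and the n - 1 points that follow it."""
    return [median(es[i:i + n]) for i in range(len(es))]


contour = [220, 221, 440, 222, 223]  # one octave-doubling error
print(median_filter(contour, n=3))
```

The spike is removed completely, but every other point is also replaced by a window median, so correctly estimated values change as well.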

Okada Filter
This filter combines the moving average and the Median filter. It aims to remove the outliers from a contour while closely retaining its shape, avoiding the softening of contour definition at transitions that is typically observed with smoothing. Each of the estimated points $Es_i$ in a contour is compared with its immediate previous and successive points, $Es_{i-1}$ and $Es_{i+1}$, respectively. If $Es_i$ is the median of $Es_{i-1}$, $Es_i$, and $Es_{i+1}$, then it does not need to be changed, otherwise $Es_i$ will be replaced by the average of $Es_{i-1}$ and $Es_{i+1}$, as shown in (17):

$$Sm_i = \begin{cases} Es_i, & \text{if } Es_i = \mathrm{Median}\left(Es_{i-1}, Es_i, Es_{i+1}\right) \\ \dfrac{Es_{i-1} + Es_{i+1}}{2}, & \text{otherwise} \end{cases} \tag{17}$$

In this case, the first and the last point will not be changed [19].
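The rule described above can be sketched as follows. This is a literal reading of the description in this section, evaluated on the original (unsmoothed) neighbours, and is not taken from the original Okada et al. implementation.

```python
def okada_filter(es):
    """Keep a point if it is the median of itself and its two neighbours;
    otherwise replace it by the mean of the two neighbours."""
    sm = list(es)  # first and last points are never changed
    for i in range(1, len(es) - 1):
        prev_pt, cur, next_pt = es[i - 1], es[i], es[i + 1]
        if cur != sorted((prev_pt, cur, next_pt))[1]:
            sm[i] = (prev_pt + next_pt) / 2
    return sm


contour = [220, 222, 440, 221, 219]  # one octave-doubling error
print(okada_filter(contour))
```

Under this reading, any local minimum or maximum is averaged, so a correct point that is a local extremum next to a spike can be altered too.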

Jlassi Filter
This technique was presented by Jlassi et al. [20]. This approach has two main steps: first, finding the incorrect points in the pitch contour by considering those that exhibit a difference of more than a set threshold from both their previous and successive points; second, replacing the incorrect point with the average of the last two points (18):

$$Sm_i = \begin{cases} \dfrac{Es_{i-1} + Es_{i-2}}{2}, & \text{if } \left|Es_i - Es_{i-1}\right| > Threshold \text{ and } \left|Es_i - Es_{i+1}\right| > Threshold \\ Es_i, & \text{otherwise} \end{cases} \tag{18}$$

The value for Threshold is assumed to be 30, as mentioned in the original paper. Figure 2h illustrates the effect of the algorithm.
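A minimal sketch of the two steps is shown below, based only on the description in this section (the original authors' code may differ, for instance in whether the previous values are taken before or after smoothing).

```python
def jlassi_filter(es, threshold=30.0):
    """Replace a point by the mean of the two previous points when it differs
    by more than threshold from BOTH its previous and its following point."""
    sm = list(es)
    for i in range(2, len(es) - 1):
        jump_from_prev = abs(es[i] - es[i - 1]) > threshold
        jump_to_next = abs(es[i] - es[i + 1]) > threshold
        if jump_from_prev and jump_to_next:
            sm[i] = (es[i - 1] + es[i - 2]) / 2
    return sm


print(jlassi_filter([220, 221, 440, 222, 223]))       # single-point error
print(jlassi_filter([220, 221, 440, 441, 222, 223]))  # two-point error
```

The second example is returned unchanged: because each wrong point is close to its wrong neighbour, an error that spans more than one point is never flagged.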

Smart-Median: A Real-Time Pitch Contour Smoother Algorithm
The approach applied in this study to adjust the incorrectly determined pitch values was based on the Median method, and has been named Smart-Median. The Smart-Median method is based on the belief that each contour should be smoothed based on its data features and intended applications. In other words, a general contour smoother may not be suitable for all applications. The considerations for designing the Smart-Median are given below.

Considerations
The Smart-Median algorithm is based on the following considerations:
1. Only the incorrectly estimated pitches need to be changed. Therefore, it is necessary to decide which jumps in a contour are incorrect.
2. To calculate the median, some of the estimated pitches around the incorrectly detected F0 should be selected. This represents the window length for calculating the median. Therefore, the decision on the number of estimated pitches before and/or after the erroneously estimated pitches provides the median window length. Thus, a delay is required in real-time scenarios to ensure sufficient successive pitch frequencies are available when correcting the current pitch frequency.
3. There is a minimum duration for which a human can sing.
4. There is a minimum duration for which a human can rest between singing two notes.
5. There is a maximum frequency that a human can sing.
6. There is a maximum interval during which humans can move from one note to another when singing.
7. A large pitch interval in a very short time is impossible.
The following section explains our decisions regarding each of the above considerations.

Smart-Median Algorithm
The flowchart shown in Figure 3 illustrates how incorrectly estimated pitches can be distinguished. In addition, it indicates which estimated pitches should be selected to calculate the median for the wrongly detected pitches.  There are several variables and functions in the flowchart, explained as follows: 1. refers to the frequency at index . 2. AFD (Acceptable Frequency Difference) indicates the maximum pitch frequency interval acceptable for jumping between two consequent detected pitches. In two studies on speech contour-smoother algorithms [2,20], 30 Hz was selected as the AFD according to the researchers' experiences. Because the frequency range that humans There are several variables and functions in the flowchart, explained as follows: 1. F i refers to the frequency at index i.

2.
AFD (Acceptable Frequency Difference) indicates the maximum pitch frequency interval acceptable for jumping between two consequent detected pitches. In two studies on speech contour-smoother algorithms [2,20], 30 Hz was selected as the AFD according to the researchers' experiences. Because the frequency range that humans use for singing is wider than for speaking, a larger AFD is needed for singing. According to the dataset used, the largest interval between two consequently notes sung by men was from C4 to F4, at frequencies of approximately 261 Hz and 349 Hz, respectively, so the maximum interval was 88 Hz for men. The largest interval between notes sung by women was C5 to F5, at frequencies of approximately 523 Hz and 698 Hz, respectively. Therefore, the biggest interval for women was 175 Hz. According to our observations of pitch contours, the human voice cannot physically produce such a big jump within a 30 ms timestep; i.e., for moving from C4 to F4 or from C5 to F5, more than 30 ms is needed. Therefore, it was found that an AFD with a value of 75 Hz was an acceptable choice for pitch contours comprised mostly of frequencies less than 300 Hz (male voices). For those with frequencies that mostly greater than 300 Hz (female singers), 110 Hz was a good choice of AFD.

3. noZero: the minimum number of consecutive zero pitch frequencies that should be considered a correctly estimated silence or rest. In this study, 50 milliseconds was regarded as the minimum duration for a silence to be accepted as correct [43]; a shorter silence is adjusted to the local median value.

4. ZeroCounter(i): calculates how many frequencies (pitches) of zero value exist after index i. The number of zero values (silence) is checked to ascertain whether the pitch detector algorithm has estimated a region of silence correctly or in error.

5. Median(i,j): calculates the median of the pitch frequencies from index i to index j.

6. PD (Prior Distance): indicates how many estimated pitches before the current pitch frequency should be considered for the median. In this study, the PD was set to cover three estimated pitch frequencies, approximately 35 and 70 milliseconds for men's and women's voices, respectively. Nevertheless, the algorithm does not need to wait until this duration becomes available, e.g., at a time of 20 milliseconds, covering 20 milliseconds with PD is sufficient.

7. FD (Following Distance): indicates how many estimated pitches after the current pitch frequency should be considered for the median. In this study, FD was set to three, meaning that calculating the median for the current wrongly estimated pitch required 35 milliseconds for women's voices and 70 milliseconds for men's voices. Therefore, in real-time environments a delay is required until three more estimated pitches are available.

8. MaxF0: indicates the maximum acceptable frequency. In this study, MaxF0 was set to 600 Hz for male voices (near the top of the tenor range) and 1050 Hz for female voices (soprano). Male and female voices rarely exceed these boundaries. However, if the singer's voice range is higher than these boundaries, a higher value can be used for MaxF0.
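The interval arithmetic behind the AFD discussion can be checked with a short sketch. It assumes twelve-tone equal temperament with A4 = 440 Hz and MIDI note numbering (C4 = 60); the function name `note_to_hz` is illustrative and is not taken from the paper's code.

```python
def note_to_hz(midi_note: int, a4_hz: float = 440.0) -> float:
    """Equal-tempered frequency of a MIDI note number (A4 = 69). Assumed tuning."""
    return a4_hz * 2.0 ** ((midi_note - 69) / 12.0)


# Largest consecutive-note jumps reported for the dataset.
male_jump = note_to_hz(65) - note_to_hz(60)    # F4 - C4
female_jump = note_to_hz(77) - note_to_hz(72)  # F5 - C5

print(f"C4 -> F4: {male_jump:.1f} Hz")
print(f"C5 -> F5: {female_jump:.1f} Hz")
```

Under these assumptions both jumps come out close to the 88 Hz and 175 Hz quoted above, and both exceed the chosen AFD values of 75 Hz and 110 Hz, which is consistent with treating such a jump within a single timestep as an error.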
The first condition in Figure 3 determines whether the frequency at index i is valid. An estimated pitch frequency is considered invalid when three conditions hold. First, the previously estimated pitch should not be zero, because after a silence there should naturally be a significant difference between the current pitch frequency and the rest. Second, the absolute difference between the current estimated pitch and the previous one should be greater than the AFD. Finally, the number of consecutive zeros from the current index should be less than noZero. This last condition covers the case in which the current estimate is zero but the run of zeros is too short to be considered a proper rest.
If the current estimated pitch is not a valid frequency, the flowchart branches to the right to "Yes". The algorithm then continues by reducing the value of FD until the second condition is no longer true. In other words, the window for calculating the median shrinks until the difference between the calculated median and the previous point is less than the AFD. Finally, the resulting median is held in the Med variable. Med is accepted as a valid replacement value only if it is less than MaxF0; otherwise, a zero is substituted instead.
Since several incorrectly estimated pitches have been observed after silences, the third condition in Figure 3 checks whether the estimated F0 immediately follows a silence. In this case, the difference between the current estimated F0 and the next estimated F0 is considered. If neither the first nor the third condition holds, the estimated F0 is assumed to be accurate, and it does not need to be changed.
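The first condition and the shrinking median window can be sketched as follows. This is a minimal illustration reconstructed from the prose description only, not the authors' implementation: the function names are hypothetical, `no_zero` is expressed in frames with an arbitrary default, the window is assumed to include the current point, the previous point is taken from the already smoothed output, and the post-silence (third) condition is omitted.

```python
from statistics import median


def zero_counter(f0, i):
    """Count consecutive zero-valued estimates starting at index i (assumed reading)."""
    n = 0
    while i + n < len(f0) and f0[i + n] == 0:
        n += 1
    return n


def smart_median_sketch(f0, afd=75.0, no_zero=2, pd=3, fd=3, max_f0=600.0):
    """Replace implausible jumps with a local median (illustrative sketch only)."""
    out = list(f0)
    for i in range(1, len(out)):
        prev, cur = out[i - 1], out[i]
        # First condition: not after a silence, jump larger than AFD,
        # and not the start of a sufficiently long rest.
        invalid = (
            prev != 0
            and abs(cur - prev) > afd
            and zero_counter(out, i) < no_zero
        )
        if not invalid:
            continue
        # Shrink the following distance until the median is within AFD of prev.
        k = fd
        med = median(out[max(0, i - pd): i + k + 1])
        while k > 0 and abs(med - prev) > afd:
            k -= 1
            med = median(out[max(0, i - pd): i + k + 1])
        # Accept the median only below MaxF0; otherwise substitute zero.
        out[i] = med if med < max_f0 else 0.0
    return out


print(smart_median_sketch([220, 221, 222, 444, 446, 223, 224, 225]))
print(smart_median_sketch([220, 221, 0, 0, 0, 222]))
```

In this sketch a two-frame octave error is pulled back to the local median, while a run of zeros at least `no_zero` frames long is left untouched as a rest. In a real-time setting the loop body would run once `fd` further estimates have arrived, which is the source of the buffer delay.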
For more detail, the algorithm's source code is available from the GitHub repository mentioned above.

Results
This section provides the results of the comparisons between the Smart-Median algorithm and the other 35 contour smoothers mentioned in Section 3.3. Three groups of data were obtained for evaluation: 1-the ground truth pitch contour (GT), 2-the original estimated pitches (ES), and 3-the smoothed contour (SM). The metrics explained in Section 3.5 were employed to compare these data groups. The data series were compared pairwise, i.e., GT with ES, GT with SM, and ES with SM.
Tables A4-A8 show the accuracy of each of the pitch detector algorithms, and the accuracy of the contour-smoother algorithms applied to the estimated pitch contours to bring them closer to the ground truth pitch contour. The GT-ES columns show the initial difference between the ground truth and the original estimated pitch contour. Next, the differences between the ground truth and the smoothed contours are shown in the GT-SM columns. Finally, the results of comparing the initially estimated pitch contour and the smoothed pitch contour are provided in the ES-SM columns. The GT-SM metrics are more important than the GT-ES and ES-SM metrics, because the GT-SM values illustrate the improvement supplied by each algorithm. For example, in the Specacf column in Table A7,

According to Tables A4-A7, the Smart-Median was the best algorithm for all pitch contours estimated by Specacf, FComb, MComb, Yin, or YinFFT. For the pitch contours calculated by Praat, however, the best accuracy was recorded by contour smoother code 33 (standard median). The metrics did not agree on the best smoother for pitch contours generated by Schmitt or PYin. Table A8 aggregates all the data in Tables A4-A7. It can be observed in Table A8 that all the metrics agree that the Smart-Median worked better than the other smoother algorithms.
Only the GT-SM column was considered when looking for significant differences between the accuracy of the algorithms. All the algorithms within the range of the column average plus or minus one standard deviation were considered to exhibit similar accuracy. The algorithms with values outside this range were placed in the best or worst category, as shown in Table 2. There were certain agreements and disagreements between the metrics employed to find the best and worst algorithms. For example, smoother code 07 was in the worst category based on the MAE and RMSE metrics, but in the best category based on the FFE metric. These agreements and disagreements are discussed in Section 6.
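The mean plus or minus standard deviation rule can be illustrated with a small sketch. The scores below are invented for illustration and are not values from Table 2 or Tables A4-A8; the use of the sample standard deviation and the function name `categorise` are also assumptions.

```python
from statistics import mean, stdev


def categorise(scores, higher_is_better=False):
    """Label each algorithm as best, worst, or similar using mean +/- one std."""
    mu, sd = mean(scores.values()), stdev(scores.values())
    lo, hi = mu - sd, mu + sd
    labels = {}
    for name, value in scores.items():
        if lo <= value <= hi:
            labels[name] = "similar"
        elif (value > hi) == higher_is_better:
            labels[name] = "best"
        else:
            labels[name] = "worst"
    return labels


# Hypothetical GT-SM error values (lower is better), keyed by smoother code.
example = {"00": 1.0, "07": 9.0, "33": 5.0, "34": 5.0, "35": 5.0}
print(categorise(example))
```

For error metrics such as MAE, RMSE, and FFE the default `higher_is_better=False` applies, while for R² the flag would be set to `True`, which is one way the same algorithm can land in different categories under different metrics.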
An ANOVA test was used to check the accuracy of the smoother algorithms. For all the metrics, the p-value calculated for each smoother algorithm was 0. This means that the accuracy of all the smoother algorithms depended on the errors that occurred in the pitch contours, i.e., the smoother algorithms did not work with the same accuracy when the pitch contours were affected by different sources of error.

Discussion
This section discusses several aspects of the results obtained in Section 5. Because this paper focuses on the Smart-Median method, the only considerations provided here are those relating to comparisons of the accuracy of Smart-Median with that of other smoother algorithms.

Comparing the Results of Each Metric
A higher R² value does not always mean a better fit [44]. For example, Table 3 shows the R-squared scores of four series of predicted data. According to the R-squared (R²) scores in Table 3, the order from the best prediction to the worst was 4, 3, 2, then 1. However, Predict 3 estimated two wrong notes, each one tone above the corresponding ground truth note (A2 instead of G2), while Predicts 1 and 2 each had only one wrongly estimated note (B2 instead of A2). Therefore, musically, the third was the worst, but based on R-squared, it was the second-best. In addition, Predict 1 and Predict 2 were musically similar, and the 0.2 Hz pitch frequency difference between them could easily have resulted from a different method of F0 tracking, yet their R-squared scores were different. In conclusion, two series of smoothed pitches cannot be compared based only on R-squared. According to the RMSE and MAE columns in Table 3, the order from best to worst was 4, 2, 1, then 3. This order is better than that based on R-squared. However, musically, we need to consider the similarity of Predict 1 and Predict 2; based on the FFE column in Table 3, Predicts 1 and 2 both had the same value. As shown in Table 3, Predict 4 was the best according to all the metrics, and musically, it was also the best. Moreover, although Predicts 1 and 2 were musically similar (FFE metric), Predict 2 was more accurate than Predict 1 (R², RMSE, and MAE metrics).
To conclude, a single metric alone cannot provide a clear and accurate evaluation to compare pitch contours, but a firm conclusion can be reached by using all of them.
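To make the comparison concrete, the four metrics can be computed side by side as in the sketch below. It assumes the usual definitions: R² as one minus the ratio of residual to total sum of squares, and FFE as the fraction of frames with either a voicing mismatch or a pitch deviating by more than 20% from the ground truth. The definitions in Section 3.5 take precedence if they differ, and the sample values are invented, not taken from Table 3.

```python
import math


def compare_contours(gt, pred, tol=0.20):
    """Return R2, RMSE, MAE and FFE for two equal-length F0 series (in Hz)."""
    n = len(gt)
    errors = [p - g for g, p in zip(gt, pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    gt_mean = sum(gt) / n
    ss_tot = sum((g - gt_mean) ** 2 for g in gt)  # assumes gt is not constant
    r2 = 1.0 - ss_res / ss_tot
    bad_frames = 0
    for g, p in zip(gt, pred):
        if (g == 0) != (p == 0):                   # voicing decision error
            bad_frames += 1
        elif g != 0 and abs(p - g) > tol * g:      # gross pitch error
            bad_frames += 1
    return {"R2": r2, "RMSE": rmse, "MAE": mae, "FFE": bad_frames / n}


# Hypothetical ground truth and prediction.
metrics = compare_contours([100, 100, 200, 200], [100, 110, 200, 260])
print(metrics)
```

In this invented example the 10 Hz error stays inside the 20% tolerance and does not affect FFE, while it still contributes to MAE, RMSE, and R², which is the kind of disagreement between metrics described above.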

Comparing Moving Average, Median, Okada, Jlassi, and Smart-Median
The main weakness of the Median, Okada [19], and Jlassi [20] filters is that they only correct errors with a duration of one point in the contour. In other words, if more than one consecutive wrongly estimated pitch occurs within a contour, these algorithms cannot smooth the errors. The following example illustrates the operation of the moving average, Median, Okada, Jlassi, and Smart-Median approaches on a data series.
As shown in Table 4, the moving average and Median methods changed some of the correctly estimated values, e.g., the value 102, which was the second item of input data. On the other hand, Okada's and Jlassi's approaches did not change any of the values, because they look for significant differences from the immediately preceding and following points. The Smart-Median, in contrast, is mainly concerned with finding an acceptable jump by comparing the current and previous points. Because of this different approach to the identification of errors, when the pitch contour was already almost smooth (contours estimated by Praat and PYin), there was no significant difference between the accuracy of these approaches (as seen by comparing rows 00, 33, 34, and 35 in the Praat and PYin columns in Tables A4-A7). However, when the pitch contours estimated by the other pitch detection algorithms exhibited several errors, Smart-Median outperformed all other methods (observable in the Specacf, Schmitt, FComb, MComb, Yin, and YinFFT columns in Tables A4-A7). Generally, according to Table A8, the accuracy of Smart-Median based on the four metrics was much better than that of all the other algorithms.
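The single-point limitation of a short median window, and the smearing effect of a moving average, can be reproduced with a small sketch. The series below is hypothetical and is not the data of Table 4; a three-point window with unchanged endpoints is assumed, and the Okada and Jlassi filters are not reproduced here.

```python
from statistics import mean, median


def sliding(series, reducer, half_width=1):
    """Apply reducer over a centred window; endpoints are left unchanged."""
    out = list(series)
    for i in range(half_width, len(series) - half_width):
        out[i] = reducer(series[i - half_width: i + half_width + 1])
    return out


# Hypothetical contour with two consecutive wrong estimates (210, 212).
series = [100, 102, 104, 210, 212, 106, 108]
averaged = sliding(series, mean)
filtered = sliding(series, median)
print(averaged)
print(filtered)
```

With this input the three-point median leaves the two-point burst near 210 in place and shifts the correct value 106 to 108, while the moving average spreads the burst over the neighbouring correct points.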

Accuracy of the Contour Smoother Algorithms
All the contour smoother algorithms provided strong results according to the R² and RMSE metrics (by comparing the GT-ES columns with the GT-SM columns in Table A8). However, only the Smart-Median (00), Median (33), and Jlassi (35) approaches could change the pitch contour so that more of the estimated F0 values fell within 20% of the ground truth pitch contour (Tables A4-A8). Therefore, although all the algorithms smoothed contour errors, many also altered the values of correctly estimated pitches.

Conclusions
This paper has introduced a new pitch-contour-smoother targeted towards the singing voice in real-time environments. The proposed algorithm is based on the median filter and considers the features of fundamental frequencies in singing. The algorithm's accuracy was compared with 35 other smoother techniques, and four metrics evaluated their results: R-Squared, Root-Mean-Square Error, Mean Absolute Error, and F0 Frame Error. The proposed Smart-Median algorithm achieved better results across all the metrics, in comparison to the other smoother algorithms. According to this study, a buffer delay of 35 to 70 milliseconds is required for the algorithm to smooth the contour appropriately.
Most of the general smoother algorithms did not show acceptable accuracy. A general observation is that in the ideal case, a smoother algorithm should be defined based on the essential features of the data in the contour and how that data is to be used after smoothing.
For future work, one short-term task is based on recognizing that the parameters of the Smart-Median can be set according to the specific properties of the sound input, such as those of particular musical instruments or their families, to improve accuracy in a targeted way. Another task considers that the Smart-Median finds the incorrect F0 based on its interval from the previous F0; this approach can be improved by considering a maximum noise duration. For example, if there is a considerable frequency interval between the previous F0 and the current one, or if several immediately subsequent F0s are near to the current F0, then we may not consider the large jump to be noise but rather a new musical articulation. This requires the introduction of an extra decision-making stage into the algorithm. In the longer term, further testing can be carried out on vocal material from a wide variety of genres and techniques. This would require the creation of new, specialist corpora, requiring considerable manual effort in both the gathering and labelling. This can be supported by machine learning. Such a dataset would also benefit the research field at large.

Conflicts of Interest:
The authors declare no conflict of interest. In addition, the funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Table A1. Python libraries used for pitch detection.