Singing Transcription from Polyphonic Music Using Melody Contour Filtering

Abstract: Automatic singing transcription and analysis from polyphonic music recordings are essential to a number of indexing techniques for computational auditory scene analysis. To obtain a note-level sequence, we divide the singing transcription task into two subtasks: melody extraction and note transcription. We construct a salience function in terms of harmonic and rhythmic similarity together with a measurement of spectral balance. Central to our proposed method is the measurement of melody contours, which are calculated using edge searching based on their continuity properties. We calculate the mean contour salience by separating melody analysis from the adjacent breakpoint connective strength matrix, and we select the final melody contour to determine MIDI notes. This unique method, combining audio signals with image edge analysis, provides a more interpretable analysis platform for continuous singing signals. Experimental analysis using Music Information Retrieval Evaluation Exchange (MIREX) datasets shows that our technique achieves promising results both for audio melody extraction and polyphonic singing transcription.


Introduction
The task of singing transcription from polyphonic music has always been worthy of research, and it represents one of the most difficult challenges in analyzing audio information and retrieving music content. The automatic music transcription approach described in [1] exclusively considers instrumental music and does not discuss the effects of drum instruments; the emphasis is on estimating the multiple fundamental frequencies of several concurrent sounds, such as those of a piano. The definition of polyphonic music differs from multiple fundamental frequency estimation and tracking [2] and from automatic music transcription. In music, polyphony is the simultaneous combination of two or more tones or melodic lines; even a single interval made up of two simultaneous tones, or a chord of three simultaneous tones, is rudimentarily polyphonic. According to the description of the MIREX 2020 Singing Transcription task, transcribing polyphonic music containing only monophonic vocals into notes is known as singing transcription from polyphonic music [3]. A piece of music may exhibit a rather moderate degree of polyphony, featuring a predominant singing voice and a light accompaniment. This definition is consistent with the description in [4], in which the singing voice is the source of the main melody in popular music. The data collections used to evaluate melody extraction and singing transcription tasks deal mostly with percussive and rhythmic instrumental music.
In this article, we divide singing transcription into two subtasks based on note-level and frame-level factors. First, from the perspective of notation, note transcription, with notes as the basic units, is consistent with people's cognition of music. Second, according to human auditory effects, melody extraction, which uses signal frames as the basic units, conforms to the decomposition of audio into signal frames through a Fourier transform in audio analysis. The audio signal displayed on the two-dimensional plane can thus be analyzed both in terms of its note content and its rhythmical aspects. The creation of a singing transcription system that is able to transcribe polyphonic music in a karaoke scene, without setting restrictions on the degree of polyphony and instrument types, is still an open question.

The rest of the article is structured as follows. In Section 2, the phase spectral reconstruction using sinusoid extraction and the building of the salient spectrogram by harmonic superposition are presented. The melody contours are selected by a salience function based on edge searching and contour filtering by setting breakpoints and post-processing. In Section 3, the melody contour is segmented using stability region segmentation and a pitch line matching algorithm. In Section 4, we evaluate our methods by MIREX audio melody extraction metrics and the singing transcription criterion. Finally, in Section 5, we present conclusions about our system.

Melody Extraction
The block diagram in Figure 1 illustrates an overview of the proposed algorithm. The following subsections describe successive stages of the singing transcription system.


Phase Processing
In the audio signal processing field, the Fourier transform, which converts between the time and frequency domains, has been widely used. With the preprocessing provided by many open-source projects, including Librosa [25] and Madmom [26], additional preprocessing of input audio signals, such as enhancing high-frequency-band energy with a high-pass filter [10], is no longer necessary. The advantage is that it is only necessary to adjust the parameters and change the function to obtain the optimal result, which improves the program's running efficiency and reduces the difficulty of the basic analysis. One of the most classical and enduring formulas is the discrete Fourier transform, which is formulated as follows:

X_l(k) = Σ_{n=0}^{N−1} x(n + lH) w(n) e^{−j2πkn/N},

where k ∈ [−N/2, N/2] is the bin index; x(n) is the input audio signal; w(n) is the window function; N is the number of points for the FFT (fast Fourier transform); H is the hop length; and l = 0, 1, 2, . . . is the frame index. The original signal sampling rate is 44.1 kHz, N = 8192, and H = 441 (satisfying the requirement of 10 ms per frame; the overlap rate is 94.6%). For a larger window length and a higher frequency resolution, the relative time resolution is correspondingly reduced. Effective spectral information, obtained with a relatively small hop length and a large window length, facilitates the subsequent detection performed by spectral edge searching.
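As an illustration, this analysis front end can be reproduced with standard tools. The following sketch assumes Librosa is available and uses the parameter values quoted in the text (44.1 kHz, N = 8192, H = 441); it is a minimal example rather than the authors' exact implementation, and the file name is a placeholder.

```python
import librosa
import numpy as np

# Load audio at the 44.1 kHz rate used in the paper (mono).
y, sr = librosa.load("song.wav", sr=44100, mono=True)

# STFT with N = 8192 points and a hop of 441 samples (10 ms per frame,
# ~94.6% overlap), matching the parameters quoted in the text.
N, H = 8192, 441
S = librosa.stft(y, n_fft=N, hop_length=H, window="hann")

magnitude, phase = np.abs(S), np.angle(S)
bin_resolution = sr / N  # ~5.38 Hz per FFT bin
print(S.shape, f"{bin_resolution:.2f} Hz/bin")
```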
The accuracy of the FFT bin frequency depends on the sampling rate and the number of points in the FFT, N. According to the Nyquist sampling theorem, each bin represents 5.38 Hz. The most relevant information contained in the spectrogram, comprising sinusoids and noise, is included in the spectral data analysis. Due to the logarithmic relationship between frequency f_0 and pitch, we need to improve the accuracy of the low frequencies more effectively; thus, it is necessary to calculate the instantaneous frequency and amplitude using the Fourier coefficients. The instantaneous frequency calculation proposed by Boashash [27] has been used as a classical algorithm, and the FFT phase spectrum based on the estimation method in [28] can also provide a good reference for the analysis of polyphonic music.
We use the well-known phase vocoder method to calculate the instantaneous frequency and amplitude. However, the calculation of the amplitude is slightly different from the traditional approach: we process the phase angle φ_l of each frame in the spectrogram obtained by the Fourier transform together with the phase angle φ_{l−1} of the previous frame, as follows, where k = 0, 1, . . . , N/2 − 1; Γ is the constructed normalization function; and the phase angle difference falls within a stable distribution in the interval (−2π, 2π). As the value of k gradually increases, the effect of the phase angle change on the high-frequency part is reduced, whereas the phase angle change is larger in the low-frequency area. Γ clearly reflects the rate of phase angle change, and the function effectively captures the fluctuations in the phase angle difference, reflecting the instantaneous frequency change in the low-frequency area between adjacent frames. The instantaneous frequency and amplitude are then calculated with δ = 0.24 and σ = 1.5, where A_l represents the original amplitude of each bin, and the amplitude of each bin is recalculated through the designed kernel function to obtain the instantaneous amplitude. This function reduces the energy of both the low-frequency and high-frequency regions in proportion, reducing the errors in subsequent melody extraction caused by excessive energy in the accompaniment and octaves. The instantaneous frequency F_l is obtained by superposing the frequency deviation f_l onto the center frequency of each bin. In the next step, we retain only the instantaneous frequency and amplitude in the spectrum, and the non-peak points are filtered out of the new spectrum. In this way, we preliminarily retain the main features of the spectrogram obtained through the Fourier transform and reconstruct the information at the peak points.
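For a concrete reference point, the sketch below implements the textbook phase-vocoder instantaneous-frequency estimate from successive STFT frames. The paper's custom normalization function Γ and the kernel reweighting of the amplitude (δ = 0.24, σ = 1.5) are not given in closed form here, so this covers only the standard part and should be read as an assumption-laden approximation.

```python
import numpy as np

def instantaneous_frequency(S, sr, hop):
    """Standard phase-vocoder IF estimate for an STFT matrix S (bins x frames)."""
    n_bins, n_frames = S.shape
    N = (n_bins - 1) * 2                      # FFT size implied by rfft-style bins
    k = np.arange(n_bins)[:, None]
    bin_freq = k * sr / N                     # center frequency of each bin

    phase = np.angle(S)
    dphi = np.diff(phase, axis=1)             # phase advance between frames
    # Remove the expected phase advance of each bin, then wrap to (-pi, pi].
    expected = 2 * np.pi * k * hop / N
    dev = dphi - expected
    dev = np.angle(np.exp(1j * dev))          # principal-value wrapping

    # Deviation in Hz added to the bin center frequency.
    inst_freq = bin_freq + dev * sr / (2 * np.pi * hop)
    return inst_freq                          # shape: (bins, frames - 1)
```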

Constructing the Salience Spectrogram
In this section, we reconstruct the entire energy frequency distribution by constructing a salience graph. Similar to the salience function [29] in some classical algorithms, the energy of the higher harmonics is accumulated downward in an attempt to enhance the energy of the fundamental frequency.
Next, we convert the instantaneous frequency into a MIDI pitch. In order to better match the auditory range of the human voice, we set the pitch range to five octaves, from A1 to A6, where the singing voice is at an effective level; this range is divided into 60 semitones. Analogous to the salience function used by Salamon [17], we covered the frequency range from 55 to 1760 Hz, implying that our bin range was from 11 to 352, while the quantization range was reduced by a factor of 10. According to the human auditory effect, the standard for judging a pitch change is that a new note appears when the fluctuation between adjacent frames exceeds a specific threshold. Similarly, the difference in pitch between adjacent peaks is allowed to be within one semitone, which is analogous to the distance between adjacent pixels on the edge of an image. It is this pitch difference that serves as an important indicator for distinguishing between melody contours. Therefore, in analyzing melodic properties, we quantize pitches into semitones and obtain the relevant MIDI pitch for each peak point l_k of each frame. After the frequency is converted to a MIDI pitch, we use the salience function to conduct a down-scaling superposition of the octaves at the peak points l_k. The energy of the high-order harmonic superposition gradually decreases as the harmonic frequency increases, where δ = B(l_k · h) − τ; I corresponds to the 60 bins; N represents the number of harmonics, experimentally set to 10; τ = 1, 2, 3, . . .; and α is the high-order harmonic attenuation coefficient, chosen as 0.8. The judgment condition for l_k is that the absolute value of the pitch difference between each octave of the peak point and the fundamental frequency is less than a semitone. The obtained peak energy is taken as a cosine change multiplied by a power of the number of harmonics. The cosine function means that the result of the corresponding salience function is a fluctuating distribution, which is conducive to the superposition between peak values, and it increases the convenience of our later contour searching. The result also shows a remarkable effect in terms of finding the potential fundamental frequency and dealing with the misjudgments caused by octave errors. The superposition of harmonic energy can also effectively increase the energy at some peak points, stabilize and enhance the energy distribution of the system, and contribute to the accuracy of detecting a singing voice.
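As a rough illustration of this step, the sketch below converts instantaneous peak frequencies into semitone bins over the 55-1760 Hz range and accumulates down-weighted harmonic energy. The attenuation α = 0.8 and the N = 10 harmonics follow the text; the exact cosine-weighted kernel is not reproduced, so the weighting here is a simplified assumption.

```python
import numpy as np

F_MIN, N_BINS = 55.0, 60          # A1, five octaves of semitone bins

def midi_bin(freq_hz):
    """Map a frequency to a semitone bin index in [0, 59] (A1-A6)."""
    return int(round(12 * np.log2(freq_hz / F_MIN)))

def salience_frame(peak_freqs, peak_amps, n_harmonics=10, alpha=0.8):
    """Accumulate harmonically down-weighted energy into semitone bins.

    Each peak is treated as a potential h-th harmonic of a lower
    fundamental; its energy is added at freq/h with weight alpha**(h-1).
    """
    salience = np.zeros(N_BINS)
    for f, a in zip(peak_freqs, peak_amps):
        for h in range(1, n_harmonics + 1):
            f0 = f / h                      # candidate fundamental
            if f0 < F_MIN or f0 > 1760.0:
                continue
            b = midi_bin(f0)
            if 0 <= b < N_BINS:
                salience[b] += (alpha ** (h - 1)) * a
    return salience
```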
Through further analysis of a large number of graphs, we found that the energy contained in the low-frequency region has a magnitude range substantially higher than that of most of the medium-frequency regions. In order to maintain the relative balance of the bands in different frequency domains, we use a Gaussian weighting function for each frame to reduce the energy in the low-frequency region.
Here, k = 0, 1, 2, . . . and δ = 3.3. Weighting the low-frequency harmonics separately is helpful for enhancing the fundamental frequency result, allowing us to effectively control the overall suppression of the low frequencies. Excessive low-frequency energy is the main factor causing the potential deviation of the fundamental frequency [30]. In regions of similar energy, weighting can make the selection of the fundamental frequency more centralized, which is in line with the human auditory effect. We chose Gaussian weighting due to its versatility and ease of implementation, and the robustness of the system was therefore enhanced.
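Since only the bin index k and δ = 3.3 survive of the weighting equation here, the snippet below shows one plausible Gaussian-shaped attenuation of the low bins, purely as an assumed illustration of the idea rather than the paper's exact formula.

```python
import numpy as np

def low_freq_weight(n_bins, delta=3.3):
    """Assumed Gaussian-style weighting: strongly attenuates the lowest
    bins and approaches 1 for higher bins (one plausible reading of the
    paper's low-frequency suppression; the exact formula may differ)."""
    k = np.arange(n_bins)
    return 1.0 - np.exp(-(k / delta) ** 2 / 2.0)

# Applied per frame to the salience vector:
# salience *= low_freq_weight(len(salience))
```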

Singing Voice Detection
After constructing and processing the salience graph, we filter the spectral peaks in each frame to find the singing voice. We apply edge searching [31], which is widely used in image recognition, to find potential contours. Two characteristics, length and energy, can be used to determine a melody.
In the reconstructed salience graph, the frequency and peak energy correspond to a MIDI pitch and instantaneous amplitude, respectively. Each frame contains multiple peaks as valid information. Two problems are still present in melody extraction: voice detection and octave error [17]. At this time, the frequency of the accompaniment, the number of singing voices, and many other interference sources exist in the overall graph. In contrast to automatic singing transcription from a monophonic signal [32], the presence of an accompaniment presents a huge challenge. The salience spectrum constructed in the previous section highlights the features and quantifies the parameters more accurately; thus, a solution to this dilemma is presented.
On the basis of the salience graph, we first calculate the energy mean µ and the standard deviation σ of the peak points in the whole graph.
We calculate the threshold value of the human voice as v = µ − θσ, where θ = 0.5. Frames S+ in which the maximum peak energy is less than v are recorded as non-voice frames S−. Frame-level energy filtering is perhaps the simplest and most efficient implementation of singing voice filtering.
Similarly, in each frame we filter out the peak points with S < µS+, taking the smaller parameter µ = 0.3. To some extent, this step helps remove the smaller values and reduces the disturbances caused by excessively long contours in edge searching. The interference caused to a new contour by the lingering tail of the previous one is blocked by the discontinuity point within the frame. This kind of interference is a problem, as the long continuation of the previous contour may overlay the new contour. Using the features of the edge searching algorithm alone, it is not possible to determine a highly continuous contour profile from global analysis. Thus, the study of breakpoints is a critical step in our entire experiment. In the following sections, we also propose new algorithms to solve the problem of excessively long contours.
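A compact version of this two-stage filtering, combining the global threshold v = µ − θσ with the per-frame peak filter, might look as follows; the array layout (bins x frames, zeros for non-peaks) is an assumption.

```python
import numpy as np

def detect_voiced_frames(salience, theta=0.5, mu=0.3):
    """Frame-level voicing decision and per-frame peak filtering.

    salience: (bins x frames) array of peak energies (zeros elsewhere).
    Returns the filtered salience and a boolean voiced-frame mask.
    """
    peaks = salience[salience > 0]
    v = peaks.mean() - theta * peaks.std()   # global voicing threshold

    frame_max = salience.max(axis=0)
    voiced = frame_max >= v                  # frames below v -> non-voice

    out = salience.copy()
    out[:, ~voiced] = 0.0
    # Within voiced frames, drop peaks weaker than mu * frame maximum.
    out[out < mu * frame_max[None, :]] = 0.0
    return out, voiced
```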

Edge Searching
As the basic feature of an image, the contour is the essential element of object detection and segmentation. Similarly, for the salience graph, searching each contour, filtering most of the invalid information, and improving the SNR (signal-to-noise ratio) of the overall graph are all important steps. Conveniently, the energy of the singing voice is distributed beyond that of the accompaniment, and it is distinct and clear. The method of edge searching over the entire graph proceeds as follows:
• Binarization is performed for all peak points, which are classified as S+;
• For a continuous contour, the distance between two peak points in neighboring frames must not be greater than √2;
• When two points S+_i, S+_{i+1} are discontinuous, the point S+_{i+1} starts a new contour, and S+_i is the end of the previous contour;
• The process is repeated until the search of the entire salience graph is complete.
Based on this searching principle, we extract all the edges Ω_E in the graph. The peak points in the middle region are used as the characteristics of their contours. For each E_l ∈ Ω_E, there are two important properties: length and average energy. The length is used to ensure that the length of the shortest note is not less than a fixed threshold α. Any E_l ∈ Ω_E with Len(E_l) < α is filtered out; the constructed spectrum then presents a sparse distribution, and the SNR is consequently higher. As shown in Figure 2, the contour information in the figure is still rich.
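A compact version of this contour search over a binarized peak map might look like the following. The √2 neighborhood is interpreted here as allowing a pitch-bin step of at most one between consecutive frames, an assumption consistent with the image-edge analogy, and min_len stands in for the note threshold α.

```python
import numpy as np

def search_contours(peak_map, min_len=5):
    """Track contours through a boolean (bins x frames) peak map.

    A contour continues when a peak in the next frame lies within one
    bin of the current peak (diagonal step <= sqrt(2) in grid distance).
    """
    n_bins, n_frames = peak_map.shape
    used = np.zeros_like(peak_map, dtype=bool)
    contours = []
    for l in range(n_frames):
        for b in np.flatnonzero(peak_map[:, l] & ~used[:, l]):
            path = [(l, b)]
            used[b, l] = True
            fr, bin_ = l, b
            while fr + 1 < n_frames:
                # Candidate continuations in the next frame, within one bin.
                cands = [nb for nb in (bin_ - 1, bin_, bin_ + 1)
                         if 0 <= nb < n_bins
                         and peak_map[nb, fr + 1] and not used[nb, fr + 1]]
                if not cands:
                    break                     # discontinuity: contour ends
                bin_ = min(cands, key=lambda nb: abs(nb - bin_))
                fr += 1
                used[bin_, fr] = True
                path.append((fr, bin_))
            if len(path) >= min_len:          # drop contours shorter than alpha
                contours.append(path)
    return contours
```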
Next, we look for the breakpoints in the contours of different syllables so as to cut them such that the contours are as close to the length of a note as possible without falling below the threshold α.


Sliding Window
In Section 2.4, we proposed a method for intra-frame breakpoint screening to remove the interference caused by long end notes or homophonic accompaniment. Unfortunately, this did not achieve intermittent contours between adjacent frames. In this section, we use a sliding window algorithm to find the breakpoints in all the contours Ω_E. In essence, the subtle fluctuation in energy generated by two adjacent notes at the moment of a sound change is the change in syllables that produces one note after another. Our algorithm detects the energy difference at the change in syllables. The sliding window method is as follows:
(1) Superpose the peak points l_k in each frame to obtain the total frame energy E_l, and compute the frame-to-frame energy differences, as shown in Figure 3;
(2) Select a window W with a specific length and a hop length of W/2, zero-padding the tail window when its length is insufficient;
(3) Search for E_max in each window, constituting a collection Ω_E;
(4) When the same E_max appears in two neighboring windows as the local maximum, all such E_max form a new collection Ω_E;
(5) Repeat steps (3)-(4) until the search of the whole salience graph is completed.
The same E_max, occurring in two successive windows, is considered to be a breakpoint; the actual minimum length is therefore 1.5W. The purpose of the algorithm is to find the local minimum value in a region considered to have no breakpoint; if there are multiple local minimum values, we search the next region. The corresponding frame is then regarded as the breakpoint E_min of all the contours containing that frame, which is used to segment the melody. After our experiments, W = 50 was selected as the best value.
Figure 3. Sliding window filtering used for searching for the maximum energy difference to obtain breakpoints. The red box is our sliding window of length W = 50.
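One way to realize this breakpoint search, under the assumption that a breakpoint is an extremum of the frame-energy difference confirmed by two successive half-overlapping windows, is sketched below; the window length W = 50 frames follows the text, while the confirmation rule is our reading of steps (3)-(4).

```python
import numpy as np

def find_breakpoints(frame_energy, W=50):
    """Return frame indices confirmed as breakpoints by two successive
    half-overlapping windows over the energy-difference curve."""
    diff = np.abs(np.diff(frame_energy))
    pad = (-len(diff)) % (W // 2)
    diff = np.pad(diff, (0, pad))            # zero-pad the tail window

    hop = W // 2
    hits = {}
    for start in range(0, len(diff) - W + 1, hop):
        seg = diff[start:start + W]
        peak = start + int(np.argmax(seg))   # local extremum in this window
        hits[peak] = hits.get(peak, 0) + 1

    # A frame seen as the extremum of two neighboring windows is a breakpoint.
    return sorted(f for f, c in hits.items() if c >= 2)
```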


Contour Filtering
Energy characteristics of the contours are regarded as the main feature in melody searching. In this section, the interfering contours are largely eliminated first; afterwards, the unique melody contour is determined within each minimal region. Under the segmentation of Ω_E (which can be regarded as filtering all the peak points in the corresponding frames and temporarily identifying them as non-voice frames), we conduct edge searching on the graph again.
We calculate the average energy E_k of the peak points of the salience graph and calculate the average energy E_l = E_l/len(l) of each contour. Contours with energy E_l < E_k are filtered out, and the remaining contours are shown in Figure 4. Next, we need to select the final contours with the most obvious features in each region from Ω_E.
Obviously, there are multiple contours between adjacent breakpoints (shown in Figure 4), including accompaniment contours, singing voice contours, and octave contours. We selected the range between two neighboring breakpoints in Ω_E to facilitate the comparison of the same melody contours and ensure the fundamental frequency of the human voice. In the previous construction of the salience function, we included the downward superposition of high-order harmonics, which increases the energy of a fundamental frequency with an abundant octave series to the maximum extent. Our proposed contour filtering algorithm is as follows:
(1) Compare the size of each contour E_l in the region (Ω_k, Ω_{k+1}) to find the contour E_l with the maximum energy (shown in Figure 5, the red lines);
(2) Set the head and the tail of E_l as i, j to filter the remaining peak points in the region (i, j);
(3) If the distance of either the head i or the tail j from the boundaries Ω_k, Ω_{k+1} is greater than the note threshold α, contour filtering is conducted on the region (Ω_k, Ω_i) ∨ (Ω_j, Ω_{k+1}) to find the next largest contour E_lp;
(4) Otherwise, melody contour searching ends in this area and the method proceeds to the next region (Ω_{k+1}, Ω_{k+2});
(5) Repeat the above steps until the search of the entire map is completed.
In the above algorithm, the regions are calculated in sections, which are filtered according to the strength of the contour energy, and multiple non-overlapping contour searches are performed on the remaining regions to determine the unique singing melody. The preprocessing of the contours filters out most of the interference information, and the multiple edge searches do not degrade the performance of the program.
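This region-by-region selection can be phrased as a recursion over breakpoint intervals. The sketch below assumes contours are (start_frame, end_frame, mean_energy) tuples and keeps the strongest non-overlapping contours per region, which is our reading of steps (1)-(5); the default α of 5 frames is a placeholder.

```python
def select_contours(contours, region, alpha=5, selected=None):
    """Recursively keep the strongest non-overlapping contours in a region.

    contours: list of (start, end, mean_energy) tuples.
    region:   (left, right) frame boundaries between two breakpoints.
    alpha:    minimum note length in frames.
    """
    if selected is None:
        selected = []
    left, right = region
    inside = [c for c in contours if c[0] >= left and c[1] <= right]
    if not inside or right - left < alpha:
        return selected

    best = max(inside, key=lambda c: c[2])   # maximum-energy contour
    selected.append(best)
    i, j = best[0], best[1]
    # Recurse into the leftover sub-regions if they can still hold a note.
    if i - left > alpha:
        select_contours(contours, (left, i), alpha, selected)
    if right - j > alpha:
        select_contours(contours, (j, right), alpha, selected)
    return selected
```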


Post-Processing
In the previous section, we determined the contours of each region and ensured the uniqueness of the contours through non-overlapping filtering. However, the search within a region can only determine a locally optimal solution; over the whole piece, jumps to abnormal contours cannot be avoided. Based on the human auditory effect, the notes are distributed in a fluctuating way, so they maintain relative stability at the same level. For the few anomalous cases, we introduce the concept of an abnormal contour, namely one that deviates from the correct path (shown in Figure 6). On this basis, we propose a post-processing method to correct abnormal contours:
(1) Calculate the average pitch P of all current contours;
(2) For the mean pitch P_l of each contour, see Equation (11);
(3) If the number of abnormal contours exceeds 1/4 of the total number, do not perform contour processing;
(4) Repeat steps (2)-(3) until all contours are processed.
According to the conversion characteristics of pitch, a frequency doubling (one octave) corresponds to a pitch shift of ±12 semitones, and a frequency tripling corresponds to ±19 semitones. At the same time, this corresponds to the previous quantization of the pitch range into 60 semitones. Only the interference from frequency-doubling and frequency-tripling misjudgments is considered here, where (α_1, α_2, α_3) are (17, 10, 13), respectively.
Correcting abnormal contours is an important step that avoids contour jumps and greatly enhances the robustness and accuracy of the program. For polyphonic music with a strong background sound, the Fourier transform causes extremely high octave interference. The proposed post-processing method has a great effect on continuous abnormal contours, and it improved the overall result by about 5%.
Due to the existence of edge branches, the edge searching algorithm can only group branches into the same contour, and the energy mean calculation includes the peak points of each branch. If additional branches exist, a fundamental pitch p_0 is generated by applying the strongest-energy principle in each frame, and p_0 is then reprocessed by median filtering. The use of median filtering has two main advantages: the first is the reduction of the pitch fluctuation caused by the selection of peak points among the branches, and the second is the filling of the breakpoints caused by the sliding window algorithm.
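The per-frame strongest-peak selection followed by median smoothing can be written compactly; SciPy's medfilt is used here, and the kernel size of 5 frames is an assumed value, not one given in the text.

```python
import numpy as np
from scipy.signal import medfilt

def fundamental_from_branches(salience):
    """Pick the strongest bin per frame, then median-filter the pitch track.

    salience: (bins x frames) array; frames with no energy yield NaN (unvoiced).
    """
    p0 = np.argmax(salience, axis=0).astype(float)
    p0[salience.max(axis=0) <= 0] = np.nan    # keep unvoiced frames empty

    voiced = ~np.isnan(p0)
    smoothed = p0.copy()
    # Median filtering reduces branch-induced fluctuation and fills
    # short gaps left by the sliding-window breakpoints.
    smoothed[voiced] = medfilt(p0[voiced], kernel_size=5)
    return smoothed
```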


Two Fundamental Frequency Sequences
Transcription on the basis of melody extraction is a process of effectively segmenting the fundamental frequency to extract notes. Unlike in the multiple fundamental frequency estimation task, where the audio involved is mostly generated by instrument playing [33], the aspiration and vibrato characteristics of the singing voice make the composition more complex, the note boundaries less obvious, and the judgment of notes more complicated.
After extracting the overall melody, we propose two fundamental frequency selection methods. The methods we used in the two MIREX tasks are described in detail below:
• Baseline: Retain the extracted integer frequency sequences, which serve only as the input for the subsequent singing transcription task.
• Extension: Readjust the precision of the pitch to an accuracy of 10 cents (where 100 cents correspond to a semitone). The fundamental frequency accuracy required for the MIREX audio melody extraction competition is within 25 cents. Therefore, we recalculate the salience function using the same instantaneous frequency and amplitude obtained after the phase processing step, with a new pitch-frequency conversion formula. The parameters of the original function remain unchanged, producing a more accurate salience graph. After readjusting, we obtain 600 bins, and the quantization range expands 10-fold. Finally, the extracted melody results are mapped into the new graph: in the new salience graph, the pitch with the highest energy within the range of 2θ around the semitone pitch p_0 is taken as the fundamental pitch, with the parameter θ set to 5. The result of such a conventional mapping concept is high accuracy and a small error range. The disadvantage is that recalculating the salience function with a span of 10 cents greatly increases the running time of the whole program. The extension results were uploaded to the melody extraction task of MIREX 2020. The comparative evaluation is shown in Section 4.1.
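A plausible form of this refinement, assuming the 600 bins are spaced 10 cents apart over the same 55-1760 Hz range, is shown below; the bin formula is inferred from the 10-fold expansion of the 60-semitone grid and is not quoted verbatim from the paper.

```python
import numpy as np

F_MIN = 55.0  # Hz, bottom of the A1-A6 range

def cent_bin(freq_hz):
    """Map a frequency to a 10-cent bin index in [0, 599] (assumed grid)."""
    return int(round(120 * np.log2(freq_hz / F_MIN)))

def refine_pitch(semitone_bin, fine_salience, theta=5):
    """Refine a semitone-level pitch using the 10-cent salience graph.

    The strongest bin within +/- theta bins (i.e. a 2*theta range, here
    +/- 50 cents) of the coarse pitch is taken as the fundamental.
    """
    center = semitone_bin * 10                # semitone bin -> 10-cent bin
    lo = max(0, center - theta)
    hi = min(len(fine_salience), center + theta + 1)
    return lo + int(np.argmax(fine_salience[lo:hi]))
```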

Stability Region Division
The composition of a note includes three features: onset, pitch, and offset. The more critical of these criteria are onset and pitch. Similar to the solo singing task, discrimination can be obtained by using the two characteristics of pitch change and energy fluctuation, and even higher accuracy can be achieved by the latter [34]. To distinguish the boundaries of two notes, we turn to the search for stability regions. A stability region is defined as one in which each pitch fluctuation is within a certain range. Analogous to Section 2.5.1, we apply the sliding window concept to onset detection. As seen in Figure 7, our proposed stability region division algorithm is detailed as follows:
(1) The initial position l_0 of each contour l is regarded as an onset. If the contour length is less than θ, the contour is indecomposable; its end l_end is regarded as an offset, and we move on to the segmentation of the next contour;
(2) Create a sliding window W (as shown in the block in Figure 7), where the initial position is one window length, P_max is the maximum pitch in the window, P_0 is the first pitch, and P_7 is the last pitch;
(3) For each window, there are the following operating conditions:
(a) If P_max − P_0 > α, the subscript of P_0 is denoted as Off_0, and the hop length is one window length;
(b) If P_max − P_7 > α, the subscript of P_7 is denoted as Off_0, and the hop length is two window lengths;
(c) If neither of the above is true, we consider the window to have attained stability, and the hop length is one frame. We then repeat Step (3) until the distance between P_0 and l_end is less than a unit length of 12.

Regarding the minimum length of a note, we consider at least eight frames to be a stable note. Simultaneously, a contour length greater than 20 frames is considered separable. We ensure that the tail note meets the minimum length condition even after segmentation. The positional relationship of the pitch differences is calculated to determine the syncopation of notes for the two categories of rising tones and falling tones. With regard to the selection of offsets, we take the last frame of the former melody as the offset and add a new offset, starting with the second onset as the end of the previous note. Comparing the two methods, the only difference is in the threshold α. Based on the temporal rhythm, our baseline α takes a value of 1. On the basis of our experiments, our extension α takes a value of 1.2.
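To make the control flow concrete, the following sketch walks a contour's pitch track with an 8-frame window and emits segment boundaries according to conditions (a)-(c). The window size, hop rules, α, and the 8-frame minimum note length follow the text; condition (b) and the boundary bookkeeping are our assumptions.

```python
import numpy as np

def split_stability_regions(pitch, W=8, alpha=1.0, min_note=8):
    """Segment a contour's pitch track (one value per frame) into notes.

    Follows the windowed conditions (a)-(c) described above: a large rise
    from the window's first or last pitch marks a boundary; otherwise the
    window slides one frame until the contour end is reached.
    """
    boundaries = [0]
    i = W                                     # start after the first window
    while i + W <= len(pitch):
        win = pitch[i:i + W]
        p_max, p_first, p_last = win.max(), win[0], win[-1]
        if p_max - p_first > alpha:           # condition (a): boundary at win[0]
            boundaries.append(i)
            i += W
        elif p_max - p_last > alpha:          # condition (b): boundary at win[-1]
            boundaries.append(i + W - 1)
            i += 2 * W
        else:                                 # condition (c): stable, slide by one
            i += 1
    # Keep only boundaries that leave every note at least min_note frames long.
    notes, prev = [], boundaries[0]
    for b in boundaries[1:] + [len(pitch)]:
        if b - prev >= min_note:
            notes.append((prev, b))
            prev = b
    return notes
```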

Pitch Line Matching
The importance of pitch issues in note transcription has generally been ignored, and most of the criteria selected represent the mean value of the pitch within the segmentation interval [35]. This method is certainly the most efficient; however, there are some deviations for the singing voice. McNab [36] proposed a local pitch histogram to estimate the correct pitch. As shown in Figure 8, the existence of a glide will affect the judgement of the entire pitch. Nevertheless, the stable area of a singing voice mostly appears in the tail. Our proposed pitch line matching algorithm is as follows:
(1) Calculate the average pitch P of the note and round it to an integer MIDI pitch;
(2) Determine the five pitch lines P, P ± 1, P ± 2, as shown by the dashed blue lines in Figure 8;
(3) For the pitch P_i of each frame, if the interval between a certain pitch line and P_i is less than 50 cents, P_i is matched to that pitch line. Lastly, the most frequently matched pitch line is recorded as the final pitch of the note;
(4) Repeat the above steps to determine the pitch of all notes.
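Read literally, this is a voting scheme over five candidate pitch lines; a direct transcription follows, where the 50-cent tolerance corresponds to half a semitone in MIDI units.

```python
import numpy as np
from collections import Counter

def pitch_line_match(frame_pitches):
    """Vote each frame's pitch onto the nearest of five candidate lines.

    frame_pitches: MIDI pitch per frame of one note (floats).
    Returns the most frequently matched integer MIDI pitch.
    """
    p = int(round(np.mean(frame_pitches)))    # step (1): rounded mean pitch
    lines = [p - 2, p - 1, p, p + 1, p + 2]   # step (2): five pitch lines

    votes = Counter()
    for pi in frame_pitches:                  # step (3): 50-cent matching
        for line in lines:
            if abs(pi - line) < 0.5:          # 50 cents = 0.5 semitone
                votes[line] += 1
                break
    return votes.most_common(1)[0][0] if votes else p
```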
We furthermore considered the characteristics of singing voices to avoid the interference of aspiration. This subtle improvement has a huge impact on our entire experiment process. Our proposed algorithm effectively improves the accuracy of pitch by over 4% on average. less than 50 cents, will be matched to that pitch line. Lastly, the most frequently matched pitches are recorded as the final pitch of the note; (4) Repeat the above steps to determine the pitch of all notes.
Dataset
We used four classically used data collections from previous years, containing approximately 400 audio samples. The audio format of all collections was WAV, the sampling rate was 44.1 kHz, the bit depth was 16 bits, and only the mono format was included. The datasets contained mostly vocal melodies, such as English songs, and covered all types of content, although some non-voice audio was also included. Details of the test are as follows:

Metrics
There were four kinds of definitions for the samples, which are explained as follows:
• TP: true positives. These were frames in which the voicing was correctly detected, where TPC means a correct pitch, TPCch means a correct chroma, and TPI means an incorrect pitch but truly voiced:
TP = TPC + TPCch + TPI. (14)
• TN: true negatives. These were frames in which non-voice was correctly detected;
• FP: false positives. These frames were actually unpitched but were detected as pitched;
• FN: false negatives. These frames were actually pitched but were detected as unpitched.
All results were evaluated in MIREX in terms of five metrics: overall accuracy (OA), raw pitch accuracy (RPA), raw chroma accuracy (RCA), voicing detection rate (VR), and voicing false alarm rate (VFA). The equations are defined below:

VR = TP/(TP + FN), VFA = FP/(FP + TN), RPA = TPC/(TP + FN), RCA = (TPC + TPCch)/(TP + FN), OA = (TPC + TN)/(TP + TN + FP + FN).

For the correct voiced pitch (f_0) and chroma, we allowed tone buffers within ±1/4 tone compared to the ground truth. For comparison, we selected the results from the competitions of the past two years, dominated by results from 2019. The statistical result for each metric is the average of the algorithms over the presented datasets.
The overall accuracy of our algorithm (HZ4) was lower than that of KN4 and close to that of AH1 (Table 1). Both of the former algorithms were realized through neural network modeling. Our system focuses on the innovation of contour searching, in contrast to the traditional frame-by-frame filtering mode. Significantly, we tuned the contour analysis directly on existing datasets, especially the threshold values in each detection step, which seemed to be acceptable. Due to the existence of a large number of thresholds in the experiment, there is still considerable room for progress in melody extraction. Owing to the precision limitations of MIREX, we only submitted the extension system. In addition to the estimation of accuracy, we also present the performance of each algorithm for the different datasets, mainly including the statistical results for the overall accuracy (OA). As shown in Figure 9, for the ADC04 dataset, the algorithm proposed in this paper reached an optimal accuracy rate of 0.824; barring an accuracy rate of 0.815 for the KD1 algorithm, no other algorithm reached a value of more than 80%. For the MIREX05 dataset, BH1 achieved the highest accuracy rate of 0.679, and the overall accuracy rate for this dataset was the lowest. For the MIREX08 and MIREX09 datasets, the highest accuracy rates were 0.793 and 0.828, respectively, and the results were within the applicable range.

The curve in Figure 10a is steeper, which indicates that θ has a greater impact on the singing voice detection results. It can be seen from Figure 10b that, when µ is selected as 0.5, the OA is higher. This also illustrates the process of selecting the parameters in our experiments.


Dataset
A new mission presented online in MIREX 2020 fitted the scope of our research exactly. The purpose of this task was to transcribe the singing voice in polyphonic music into a chain of notes, where each note is indicated by three parameters: the onset, the offset, and the score pitch. Only one collection was used to evaluate the proposed system. The detailed description is as follows: this collection contained multiple accompanied melodies and one singing melody. The audio sampling rate was 44.1 kHz, the sampling size was 16 bits, and the dataset contained two-channel waves and mostly included Chinese songs. The training set can be downloaded at https://drive.google.com/file/d/15b298vSP9cPP8qARQwa2X_0dbzl6_Eu7/edit (accessed on 21 May 2021).


Metrics
We evaluated the accuracy of the transcription by computing the COnP and COn metrics [32], i.e., by counting the correctly transcribed notes against the ground truth. The following rules were utilized to determine whether notes were successfully matched:
• The onset difference was less than 100 ms in this competition;
• The pitch difference was less than 50 cents in this competition.
COnP requires the satisfaction of both conditions, while COn only requires the satisfaction of the first one. We computed the F-measure (FM), precision (Pr), and recall (Rec) on the overall results. The FM is given by the following equation:

FM = 2 · Pr · Rec / (Pr + Rec).
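These note-matching rules correspond closely to what the mir_eval transcription module computes. The snippet below is a sketch of such an evaluation, with offset_ratio=None so that offsets are ignored, as in COnP/COn; the note arrays are placeholders.

```python
import numpy as np
import mir_eval

# Placeholder arrays: each note is (onset, offset) in seconds plus a pitch in Hz.
ref_intervals = np.array([[0.50, 0.90], [1.00, 1.40]])
ref_pitches = np.array([220.0, 246.9])
est_intervals = np.array([[0.52, 0.88], [1.03, 1.45]])
est_pitches = np.array([220.0, 261.6])

# COnP-style: onset within 100 ms and pitch within 50 cents; offsets ignored.
pr, rec, fm, _ = mir_eval.transcription.precision_recall_f1_overlap(
    ref_intervals, ref_pitches, est_intervals, est_pitches,
    onset_tolerance=0.1, pitch_tolerance=50.0, offset_ratio=None)
print(f"COnP  Pr={pr:.3f} Rec={rec:.3f} FM={fm:.3f}")

# COn-style: onset-only matching, via the onset evaluation.
pr_on, rec_on, fm_on = mir_eval.transcription.onset_precision_recall_f1(
    ref_intervals, est_intervals, onset_tolerance=0.1)
print(f"COn   Pr={pr_on:.3f} Rec={rec_on:.3f} FM={fm_on:.3f}")
```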

Results
Our submission was HZ_SingingTranscription, which only contained the baseline system. We used mir_eval [41] to evaluate the accuracy of our extension. Figure 11 shows the performance of our two proposed systems. The result is evident from the above description: in the pitch estimation, the advantage of the pitch line matching algorithm increased the overall accuracy of the extension to 0.468, about 5% higher than the 0.411 accuracy of the baseline algorithm. Nevertheless, the consequence was that our running time increased substantially; as a rough estimate, the time required doubled. There is apparently much room for improvement in distinguishing a singing voice from polyphonic music.
Therefore, a more accurate extension pitch sequence is more advantageous for calculating the pitch of a note. We readjusted the threshold of the onset difference to 150 ms. Figure 12 shows the accuracy results of the two algorithms within the new onset tolerance. It can be seen from the figure that the results of both algorithms improved after adjusting the threshold: the accuracy of the extension algorithm for COnP reached 0.556, about 9% higher than with the 100 ms error range, and the final onset detection result reached 0.702, breaking through the 70% accuracy level.


Conclusions
In this paper, we have proposed a method for singing transcription based on traditional signal processing and analysis, which includes melody extraction as a subtask. In contrast to the currently widespread neural network approaches, our method aims to break down the content of each step and present the whole pipeline in a clear and visible state. In addition, aspiration is mostly present at the beginning of a sung note, leading to a greater backward deviation in onset detection. Non-voice sounds have a lower energy component and lower frequency, and we thoroughly investigated the contours and analyzed their features by introducing the edge contour characteristics of images, instead of being limited to frame-level information. The main advantage of our method is that it avoids the multiple corrections of abnormal points that lead to program redundancy, meaning we can enhance the efficiency of operation. The essence of using a Fourier transform to analyze contours lies in the characteristics of the contour; algorithms such as the sliding window, stability-region note splitting, and pitch line matching were introduced, and the importance of contour lines was demonstrated for the task of melody extraction from polyphonic music signals. For the selection of each parameter, our system used the results obtained from existing standard datasets, and multiple experiments were conducted to adjust the optimal parameters.
The accuracy of the algorithm in this paper depends greatly on the characteristics of the salience contours in the spectrogram. On many datasets, the accuracy and stability of the melody contours calculated by edge detection and contour filtering have been proven to be effective. However, in the singing voice detection task, there are still some shortcomings in terms of misjudgments and octaves. Among them, the low threshold value selected for human voice detection by this method preserves the outline of the human voice as much as possible while also retaining the outline of the accompaniment, which affects the accuracy of overall melody recognition. Therefore, it is hoped that a better signal processing method will be developed to distinguish between the different characteristics of the singing voice and its accompaniment.

Informed Consent Statement: Not applicable.

Data Availability Statement: Data are available from https://www.music-ir.org/mirex/wiki/2020:Audio_Melody_Extraction (accessed on 21 May 2021) and https://www.music-ir.org/mirex/wiki/2020:Singing_Transcription_from_Polyphonic_Music (accessed on 21 May 2021).

Conflicts of Interest: The authors declare no conflict of interest.