Efficient Melody Extraction Based on Extreme Learning Machine

Abstract: Melody extraction is an important task in the music information retrieval community, and it remains unresolved due to the complex nature of real-world recordings. In this paper, the melody extraction problem is addressed in the extreme learning machine (ELM) framework. More specifically, the input musical signal is first pre-processed to mimic the human auditory system. The music features are then constructed by the constant-Q transform (CQT), and a concatenation strategy is introduced to make use of contextual information. Afterwards, the rough melody pitches are determined by the ELM network according to its pre-trained parameters. Finally, the rough melody pitches are fine-tuned using the spectral peaks around the frame-wise rough pitches. The proposed method can extract melody from polyphonic music efficiently and effectively, where pitch estimation and voicing detection are conducted jointly. Experiments have been conducted on three publicly available datasets. The results reveal that the proposed method achieves higher overall accuracies at very fast speed.


Introduction
Melody extraction, also known as main melody extraction or predominant F0 estimation, aims to extract the predominant pitch sequence of the melody (the lead voice or instrument) from polyphonic music [1]. It can be used in applications such as query-by-humming [2], version identification [3], music retrieval [4], and so on.
Various methods have been proposed since Goto first put forward the melody extraction problem [5]. Salience and temporal continuity are two principles commonly utilized in the literature. In the early studies, researchers tried various ways to formulate the melody extraction problem based on these two principles. For example, Fuentes et al. designed a translation-invariant model to track the lead melody based on probabilistic latent component analysis [6]; expectation maximization was utilized to estimate the model parameters. Salamon et al. defined a set of contour characteristics, studied their distributions, and devised rules to distinguish melodic from non-melodic contours [7]. This approach works especially well on singing melody extraction due to its preference for singing pitch contours. Later, Bosch and Gómez combined a salience function based on a smoothed instantaneous mixture model with pitch tracking based on pitch contour characterization [8,9]. To alleviate the influence of strong low-frequency accompaniment, Zhang et al. generalized the Euclidean algorithm, which was designed for computing the greatest common divisor of two natural numbers, to positive real numbers [10]. More recently, the extreme learning machine (ELM) has been successfully applied in many areas, such as fault diagnosis [20], biomedical signal processing [21], animal emotion recognition [22], and so on.
The main contributions of this paper are as follows: the melody extraction is formulated in the ELM framework, which can extract melody efficiently and effectively; pitch estimation and voicing detection are conducted jointly, which reduces their mutual inhibition; and the melodic pitches are fine-tuned after coarse melody estimation, which preserves the tiny dynamics of the melody.
The rest of this paper is organized as follows. ELM is presented in Section 2. The ELM-based melody extraction method is elaborated in Section 3. The experimental results and discussions are provided in Section 4. Finally, the conclusions are drawn in Section 5.

Extreme Learning Machine
Feedforward neural networks can approximate complex nonlinear mappings directly from training data and thus provide models for phenomena that are difficult to describe with classical parametric techniques. Traditionally, the parameters of feedforward networks need to be tuned iteratively, leading to dependency between different layers. Gradient descent-based methods are commonly used for parameter learning. However, these methods are often time-consuming due to improper step sizes or convergence to local minima. In this section, the single-hidden-layer feedforward network (SLFN) and the ELM are presented in detail.
Given N arbitrary distinct samples (x_j, y_j), where x_j = [x_{j1}, x_{j2}, ..., x_{jn}] ∈ R^n and y_j = [y_{j1}, y_{j2}, ..., y_{jm}] ∈ R^m, standard SLFNs with Ñ hidden nodes and an activation function g(·) are modeled as:

$$\sum_{i=1}^{\tilde{N}} \beta_i \, g(w_i \cdot x_j + b_i) = o_j, \quad j = 1, 2, \ldots, N, \qquad (1)$$

where o_j is the output for x_j, w_i = [w_{i1}, w_{i2}, ..., w_{in}]^T is the input weight vector connecting the i-th hidden unit and the input units, β_i = [β_{i1}, β_{i2}, ..., β_{im}]^T is the output weight vector connecting the i-th hidden unit and the output units, b_i is the bias of the i-th hidden unit, and w_i · x_j denotes the inner product of w_i and x_j. The architecture of the ELM network that maps x_j to o_j is illustrated in Figure 1.
That standard SLFNs with Ñ hidden nodes and an activation function g(·) can approximate these N samples with zero error means that

$$\sum_{j=1}^{N} \| o_j - y_j \| = 0. \qquad (2)$$

In other words, there exist β_i, w_i, and b_i satisfying

$$\sum_{i=1}^{\tilde{N}} \beta_i \, g(w_i \cdot x_j + b_i) = y_j, \quad j = 1, 2, \ldots, N. \qquad (3)$$

The above N equations can be written in matrix form as:

$$H \beta = Y, \qquad (4)$$

where the hidden layer output matrix H, the weight matrix β, and the label matrix Y are, respectively, defined as:

$$H = \begin{bmatrix} g(w_1 \cdot x_1 + b_1) & \cdots & g(w_{\tilde{N}} \cdot x_1 + b_{\tilde{N}}) \\ \vdots & \ddots & \vdots \\ g(w_1 \cdot x_N + b_1) & \cdots & g(w_{\tilde{N}} \cdot x_N + b_{\tilde{N}}) \end{bmatrix}_{N \times \tilde{N}}, \qquad (5)$$

$$\beta = \begin{bmatrix} \beta_1^T \\ \vdots \\ \beta_{\tilde{N}}^T \end{bmatrix}_{\tilde{N} \times m}, \qquad (6) \qquad Y = \begin{bmatrix} y_1^T \\ \vdots \\ y_N^T \end{bmatrix}_{N \times m}. \qquad (7)$$

If the activation function g(·) is infinitely differentiable, the number of hidden nodes satisfies Ñ ≤ N.
It is proven in [19] that the input weights w_i and the hidden layer biases b_i can be randomly initialized and the output matrix H then calculated. Furthermore, the least-squares solution β̂ of the linear system in Equation (4) satisfies

$$\| H \hat{\beta} - Y \| = \min_{\beta} \| H \beta - Y \|. \qquad (8)$$

If Ñ = N, matrix H is square and invertible when the input weights w_i and the hidden layer biases b_i are randomly initialized. In this case, the SLFN can approximate the training samples with zero error.
In most cases, the number of hidden nodes is much less than that of the training samples, i.e., Ñ ≪ N, so H is not square. The smallest norm least-squares solution of Equation (4) is then:

$$\hat{\beta} = H^{\dagger} Y, \qquad (9)$$

where H† is the Moore-Penrose generalized inverse of matrix H [23]. Huang et al. also proved that β̂ expressed in Equation (9) is the least-squares solution with the smallest norm, and that this minimum norm least-squares solution of Equation (4) is unique. Moreover, g(·) can be any infinitely differentiable activation function, such as the sigmoid, radial basis, sine, cosine, or exponential function.
The detailed ELM training is presented in Algorithm 1.

Algorithm 1. ELM Training.
Steps:
(1) Randomly assign the input weights w_i and the hidden layer biases b_i;
(2) Calculate the hidden layer output matrix H using Equation (5);
(3) Compute the output weights β̂ according to Equation (9).

Compared with traditional gradient-based learning algorithms, ELM has several advantages, such as faster learning speed, better generalization, and reaching the solution directly without being trapped in local minima or suffering from over-fitting.
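To make Algorithm 1 concrete, below is a minimal NumPy sketch of ELM training and prediction, assuming a sigmoid activation; the helper names elm_train and elm_predict are ours, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elm_train(X, Y, n_hidden):
    """Algorithm 1: random input weights/biases, least-squares output weights.
    X has shape (N, n) and Y has shape (N, m)."""
    n_features = X.shape[1]
    W = rng.uniform(-1.0, 1.0, (n_hidden, n_features))  # input weights w_i
    b = rng.uniform(-1.0, 1.0, n_hidden)                # hidden biases b_i
    H = sigmoid(X @ W.T + b)                            # hidden layer output matrix, Eq. (5)
    beta = np.linalg.pinv(H) @ Y                        # Moore-Penrose solution, Eq. (9)
    return W, b, beta

def elm_predict(X, W, b, beta):
    return sigmoid(X @ W.T + b) @ beta
```

Because Equation (9) is solved in closed form via the pseudo-inverse, no iterative weight tuning is involved, which is where the speed advantage of ELM comes from.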

Constant-Q Transform
Given a discrete time-domain signal x(n), its CQT representation is defined as [24]:

$$X^{CQ}(k, n) = \sum_{m=0}^{N-1} x(m) \, a_k^{*}(m - n), \qquad (10)$$

where k and n denote the frequency and time indices, respectively, N is the length of the input signal x(n), and the atoms a_k^*(·) are the complex-conjugated window functions, defined as:

$$a_k(m) = g_k(m) \exp\left( i 2 \pi m \frac{f_k}{f_s} \right), \qquad (11)$$

with a zero-centered window function g_k(m), bin center frequency f_k, sampling rate f_s, and i = √−1. The center frequencies f_k are geometrically distributed as

$$f_k = f_0 \cdot 2^{k/b}, \quad k = 0, 1, \ldots, K - 1, \qquad (12)$$

where f_0 is the lowest frequency, b is the number of frequency bins per octave, and K is the total number of frequency bins. The Q-factor of the CQT, Q = f_k / Δf_k, is constant, so the frequency resolution Δf_k at the k-th frequency bin is

$$\Delta f_k = \frac{f_k}{Q}. \qquad (13)$$

Substituting Equation (12) into Equation (13) yields

$$\Delta f_k = \frac{f_0}{Q} \cdot 2^{k/b}. \qquad (14)$$

It can be seen that the frequency resolution of the CQT is also geometrically spaced along the frequency bins. That is, higher frequency resolution is obtained in the lower frequency bands, while lower frequency resolution is obtained in the higher frequency bands, in accordance with the pitch intervals of notes.
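For illustration, a CQT amplitude spectrogram of this kind can be computed with the librosa library; the file name, the 55 Hz lower bound, and the 7-octave span are assumptions made for this sketch (36 bins per octave matches the setting later used in Section 4.3).

```python
import numpy as np
import librosa

# Hypothetical input file; 16 kHz matches the down-sampling rate of Section 4.3.
y, sr = librosa.load("mixture.wav", sr=16000, mono=True)

# 36 bins per octave over 7 octaves (55 Hz to 7040 Hz, below the 8 kHz Nyquist).
C = np.abs(librosa.cqt(y, sr=sr, fmin=55.0, n_bins=252, bins_per_octave=36))
# C has shape (252, n_frames): geometrically spaced bins with finer
# frequency resolution at low frequencies, as in Equation (14).
```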

Equal Loudness Filter
The human auditory system perceives sounds at different sound pressure levels for different frequencies. More specifically, human listeners are more sensitive to sounds in the mid-frequency bands [25]. Due to this fact, an equal loudness filter is usually introduced in the music information retrieval community to pre-filter musical signals. It is commonly implemented as a cascade of a 10th-order infinite impulse response (IIR) filter followed by a second-order high-pass filter [25]. Following previous works, we use the same implementation to enhance the components to which the human auditory system is more sensitive. The amplitude frequency response curve of the equal loudness filter is illustrated in Figure 2.
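Structurally, the filter is a cascade of two linear filters; a minimal SciPy sketch follows, where the coefficient vectors are assumed to come from the design in [25] and are not reproduced here.

```python
from scipy.signal import lfilter

def equal_loudness(x, iir_b, iir_a, hp_b, hp_a):
    """Apply the equal loudness cascade: a 10th-order IIR filter followed by
    a 2nd-order high-pass filter. The coefficient vectors (iir_b, iir_a,
    hp_b, hp_a) are assumed to be taken from the design in [25]."""
    return lfilter(hp_b, hp_a, lfilter(iir_b, iir_a, x))
```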

Extreme Learning Machine-Based Melody Extraction
To extract melody from polyphonic music efficiently and effectively with good generalization, ELM-based melody extraction is proposed. The block diagram of the proposed method is shown in Figure 3. In detail, the audio mixture is first down-sampled and processed by the equal loudness filter to simulate the human auditory system. Then, the CQT is utilized to analyze the audio with multiple resolutions, and several CQT amplitude spectra are concatenated around the center frame to construct the input vectors for the ELM. Next, the coarse pitches are estimated by the pre-trained ELM. Finally, a post-processing step is employed to obtain a smoother melody contour. The pre-processing, rough melody pitch estimation, post-processing, and computational complexity analysis are presented in detail in this section.

Pre-Processing
Previous studies illustrate that the spectra of musical signals decay by 3 to 12 dB per octave [26], which implies that the amplitudes of higher frequency components drop dramatically. Therefore, down-sampling is utilized to reduce the data quantity and accelerate processing.
In music, notes are spaced logarithmically, with 12 semitones per octave. Hence, only estimates falling within about 3% of the ground truth are considered correct. According to the Heisenberg uncertainty principle, the time and frequency resolutions cannot be increased at the same time: higher frequency resolution can only be obtained at the expense of lower time resolution. Hence, the CQT is introduced to achieve variable-resolution spectral analysis [24]. With the CQT, higher frequency resolution is obtained in the lower frequency bands and higher time resolution is obtained in the higher frequency bands.
As musical signals are non-stationary and vary considerably, it is helpful to make use of contextual information. Hence, the input vectors of the ELM are constructed by concatenating the CQT amplitude spectra of several frames before and after the current frame, as depicted in Figure 4.
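A minimal sketch of this construction, assuming C is a CQT amplitude matrix of shape (bins × frames) and using edge-padding at the recording boundaries; stack_context is a hypothetical helper name.

```python
import numpy as np

def stack_context(C, context=3):
    """Build ELM input vectors by concatenating the CQT amplitude spectra of
    `context` frames on each side of every frame (edge-padded at boundaries).
    Returns an array of shape (n_frames, n_bins * (2 * context + 1))."""
    n_bins, n_frames = C.shape
    padded = np.pad(C, ((0, 0), (context, context)), mode="edge")
    return np.stack([padded[:, t:t + 2 * context + 1].ravel(order="F")
                     for t in range(n_frames)])
```

With context = 3, each input vector spans 7 adjacent frames, the setting adopted in Section 4.3.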

Coarse Melody Pitch Estimation
In this work, the coarse melody pitches are estimated by the ELM at one-semitone intervals. The ELM parameters need to be trained on a training set. The training audios are first pre-processed, as described in Section 3.1, and the input vectors are generated by concatenating the CQT amplitude spectra of adjacent frames.
Pitch estimation and voicing detection are two sub-problems of the melody extraction task, and the performance of either affects the other. To avoid their mutual inhibition, pitch estimation and voicing detection are conducted simultaneously. That is, the unvoiced situation is treated as one pattern, in the same way as the pitches. More specifically, the labels are one-hot vectors with the first element equal to one for the unvoiced frames; the other elements correspond to the individual pitches.
During the training stage, given the training set {(x_j, y_j)}_{j=1}^{l}, where x_j is an input feature vector, y_j is the corresponding label, and l is the number of training samples, the input weights w_i and hidden layer biases b_i are first generated following the uniform distribution on [−1, 1]. Then, the hidden layer output matrix H is determined according to Equation (5), where the activation function is the sigmoid function, i.e.,

$$g(x) = \frac{1}{1 + e^{-x}}. \qquad (15)$$

Afterwards, the output weights β̂ can be calculated according to Equation (9). All parameters of the ELM are then known, including the input weights w_i, the hidden layer biases b_i, and the output weights β̂.
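For illustration, the labels can be built as follows, assuming 40 semitone classes starting at 110 Hz (the range chosen in Section 4.3) and ground-truth pitch values of 0 Hz for unvoiced frames; make_labels is our naming.

```python
import numpy as np

def make_labels(pitches_hz, f_low=110.0, n_pitches=40):
    """One-hot labels of dimension n_pitches + 1: index 0 marks unvoiced
    frames, indices 1..n_pitches are semitone classes from f_low upward."""
    labels = np.zeros((len(pitches_hz), n_pitches + 1))
    for j, f in enumerate(pitches_hz):
        if f <= 0:  # 0 Hz denotes an unvoiced frame
            labels[j, 0] = 1.0
        else:
            k = int(np.rint(12 * np.log2(f / f_low)))  # nearest semitone index
            labels[j, 1 + min(max(k, 0), n_pitches - 1)] = 1.0
    return labels
```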
During the testing stage, the coarse melody pitches can be obtained based on the trained ELM parameters. Specifically, given a recording, the input feature vectors are first constructed as described in Section 3.1.
Then, the output for a sample x_j is computed as f(x_j) = h(x_j) β̂, where h(x_j) denotes the hidden layer output vector of x_j. Next, the function argmax(·) is utilized to locate the maximum value of f(x_j). If the maximum value is the first element of the vector f(x_j), the corresponding frame is considered unvoiced. Otherwise, the pitch corresponding to the located element is taken as the coarse melody pitch of that frame.
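A sketch of this decoding step, with the 110 Hz lower bound from Section 4.3; decode is our naming, and outputs is the frame-wise prediction matrix produced by elm_predict in the earlier sketch.

```python
import numpy as np

def decode(outputs, f_low=110.0):
    """Map frame-wise ELM output vectors to coarse pitches. Index 0 is the
    unvoiced class (encoded as 0 Hz); indices 1..40 are semitone classes."""
    idx = np.argmax(outputs, axis=1)
    pitches = f_low * 2.0 ** ((idx - 1) / 12.0)
    pitches[idx == 0] = 0.0  # unvoiced frames
    return pitches
```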

Pitch Fine-Tuning
In real-world recordings, pitches are continuous, whereas in this paper they are quantized into discrete classes. The estimated pitch contour might therefore exhibit pitch jumps between adjacent frames. Hence, pitch fine-tuning is employed to derive a smoother melody contour.
As sinusoidal components exhibit spectral peaks in the CQT amplitude spectrum, pitch fine-tuning is conducted by searching around the estimated rough pitches. Suppose the rough pitch at frame t is f_{t,r}; the peak search interval is set to [f_{t,r} − δ, f_{t,r} + δ], where δ is the search radius. If peaks are found, the frequency with the highest amplitude is chosen as the final pitch of the frame. If no peak is found in this interval, the rough pitch is kept as the final one.
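A sketch of the fine-tuning step, assuming freqs holds the CQT bin center frequencies in ascending order and delta is the search radius expressed in Hz (the 2/3-semitone radius of Section 4.3 translates into a pitch-dependent value in Hz, which is left to the caller); fine_tune is our naming.

```python
import numpy as np
from scipy.signal import find_peaks

def fine_tune(f_rough, freqs, spectrum, delta):
    """Search spectral peaks within [f_rough - delta, f_rough + delta];
    return the peak frequency with the largest amplitude, else f_rough."""
    lo, hi = np.searchsorted(freqs, [f_rough - delta, f_rough + delta])
    peaks, _ = find_peaks(spectrum[lo:hi])
    if peaks.size == 0:
        return f_rough  # no peak found: keep the rough pitch
    return freqs[lo + peaks[np.argmax(spectrum[lo:hi][peaks])]]
```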

Computational Complexity Analysis
In this subsection, the computational complexity of the proposed method is analyzed. As commonly assumed, one addition, subtraction, multiplication, or division of two floating-point numbers is treated as one basic floating-point operation (Flop), and each contributes equally to the overall computational load [27].
The computational load of the proposed method mainly originates from the CQT and from ELM training and testing. Let L_l and L_t denote the down-sampled signal lengths of the training and testing recordings, respectively.
Assume that l and t are the numbers of training and testing samples, respectively, and suppose that there are m pitch classes and Ñ hidden neurons. The CQT calculations of the training and testing recordings require Flops on the order of O(L_l log₂ L_l) and O(L_t log₂ L_t) [24], respectively. The computational cost of the ELM solution is

$$3\tilde{N}^2 + 2\tilde{N}^2 l + 2(l-1)\tilde{N}^2 + \tilde{N}^3 + \tilde{N}^2 l + \tilde{N} l^2 + \tilde{N} l m + \tilde{N}(\tilde{N}-1) l + \tilde{N}(l-1) l + \tilde{N}(l-1) m.$$

As Ñ ≫ 1, l ≫ 1, l ≫ m, and l ≫ Ñ, the computation of ELM training is on the order of O(Ñ³l + Ñl²). Similarly, ELM testing is on the order of O(Ñ³t). Therefore, the computational cost of ELM training is on the order of O(L_l log₂ L_l + Ñ³l + Ñl²), and that of ELM testing is on the order of O(L_t log₂ L_t + Ñ³t).

Experimental Results and Discussions
Some experiments were conducted to evaluate the performance of the proposed method. In this section, the experimental results and discussions are provided in detail.

Evaluation Metrics
In this paper, we chose three metrics that are commonly used in the melody extraction literature, i.e., overall accuracy (OA), raw pitch accuracy (RPA), and raw chroma accuracy (RCA) [11]. They are defined as

$$\mathrm{OA} = \frac{\#\{\text{true positive frames}\}}{\#\{\text{all frames}\}}, \quad \mathrm{RPA} = \frac{\#\{\text{voiced frames with correct pitch}\}}{\#\{\text{voiced frames}\}}, \quad \mathrm{RCA} = \frac{\#\{\text{voiced frames with correct chroma}\}}{\#\{\text{voiced frames}\}},$$

where the term 'true positive' means that the estimated pitch falls within a quarter tone of the ground truth on a given frame or that a frame is correctly identified as unvoiced [28]. RCA differs from RPA in that octave errors are forgiven by mapping pitches onto a single octave.
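In practice these metrics need not be re-implemented by hand; for instance, the mir_eval library computes them from time-stamped pitch sequences in which 0 Hz marks unvoiced frames (the toy numbers below are ours):

```python
import numpy as np
import mir_eval

# Toy example: frame times in seconds, pitches in Hz, 0 Hz = unvoiced.
ref_time = np.arange(5) * 0.01
ref_freq = np.array([0.0, 220.0, 220.0, 246.9, 0.0])
est_time = ref_time.copy()
est_freq = np.array([0.0, 220.0, 440.0, 246.9, 110.0])  # one octave error, one false voiced

scores = mir_eval.melody.evaluate(ref_time, ref_freq, est_time, est_freq)
print(scores["Overall Accuracy"],
      scores["Raw Pitch Accuracy"],
      scores["Raw Chroma Accuracy"])
```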

Evaluation Collections
In this paper, the evaluation experiments were carried out on three publicly available collections: ISMIR2004, MIREX05 train, and MIR-1K. ISMIR2004 was collected by the Music Technology Group of Pompeu Fabra University. It contains 20 excerpts of different genres, such as Jazz, R&B, Pop, Opera, and so on. The sampling rate is 44.1 kHz, the durations of the recordings are about 20 s, and the reference melody pitches are labeled every 5.8 ms.

MIREX05 train (also referred to as MIREX05 for simplicity) was provided by Graham Poliner and Dan Ellis (LabROSA, Columbia University). It involves 13 excerpts lasting between 24 s and 39 s, with a sampling rate of 44.1 kHz. The ground truths are labeled every 10 ms.

MIR-1K was gathered by the MIR Lab of National Taiwan University. It contains 1000 song clips chopped from 110 Karaoke songs. The total length of this dataset is 133 min, with each recording lasting from 4 s to 13 s. The sampling rate of this dataset is 16 kHz. A 10 ms labeling interval is also used for this collection.

In this paper, the recordings of all collections were mixed together and randomly split into three subsets: 150 recordings for training, 100 recordings for validation, and the remaining 783 for testing. An overview of the training, validation, and testing sets is given in Table 1.

Parameter Setting
Several parameters needed to be set before evaluation. As mentioned before, the musical signals are down-sampled to reduce the data quantity and accelerate processing; in this work, the mixtures were re-sampled to 16 kHz. The MATLAB CQT toolbox implemented by Schörkhuber et al. was utilized for multi-resolution spectral analysis [24], with a spectral analysis range of [0, 8 kHz]. As the frequency tolerance is half a semitone around the ground truth and there are 12 semitones per octave, the CQT bins were geometrically spaced with 36 bins per octave, fine enough to satisfy the tolerance. The melody pitch range was empirically set from 110 Hz to 1000 Hz, covering 40 notes at one-semitone intervals. Hence, the ELM output one-hot vector was of dimension 41, since the unvoiced frames are also assigned one pattern. The search radius δ for fine-tuning was set to 2/3 of a semitone (i.e., two bins). If δ were greater than 2/3 of a semitone, the fine-tuned pitch might shift to a note other than the coarse one; if it were smaller, i.e., 1/3 of a semitone, the search range would not cover the frequency tolerance. Moreover, the tiny margin between 2/3 of a semitone and the tolerance helps track occasional singing glides. To make use of contextual information, the input feature vectors of the ELM cover 7 adjacent frames centered at the current frame.
Besides the aforementioned parameters, the hidden neuron number of the ELM also needs to be set. In this work, we evaluated the training accuracy, validation accuracy, training time, and validation time as the neuron number ranged from 1000 to 7000. The experimental results are given in Table 2. It can be seen from Table 2 that the training accuracy grows as the hidden neuron number increases, presumably because more complicated nonlinear mapping functions can be approximated with more neurons. However, the validation accuracy first grows and then declines slightly as the neuron number increases, revealing that the ELM network suffers from over-fitting to a small extent. As far as processing times are concerned, the training time was prolonged dramatically as the hidden neuron number increased, and the validation time was also extended approximately linearly. Considering both the accuracies and the time efficiency, the hidden neuron number was set to 5000 in the following experiments.
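For reference, the settings fixed in this section can be collected in one place; the values are as reported above, while the dictionary keys are our naming:

```python
# Parameter summary of Section 4.3 (values as reported; key names are ours).
PARAMS = {
    "sample_rate": 16000,        # Hz, after down-sampling
    "cqt_bins_per_octave": 36,   # three bins per semitone
    "pitch_range_hz": (110.0, 1000.0),
    "n_pitch_classes": 40,       # semitone classes
    "output_dim": 41,            # 40 pitches + 1 unvoiced class
    "fine_tune_radius": 2 / 3,   # semitones (two CQT bins)
    "context_frames": 7,         # centered at the current frame
    "n_hidden": 5000,            # ELM hidden neurons
}
```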

Experimental Results on Test Sets
The performance of the proposed method was compared with several typical methods, including Melodia [7], the source/filter model incorporating contour characteristics (BG1) [9], harmonic cluster tracking (HCT) [29], the probabilistic model (PM) [6], the modified Euclidean algorithm (MEA) [10], and the particle filter and dynamic programming (PFDP) [11] methods. These methods were chosen because they are typical and because we could obtain their source code or Vamp plug-ins, ensuring unbiased results. In more detail, we used the Vamp plug-in of Melodia and the source code of BG1, HCT, and PM provided by the authors; MEA and PFDP are two methods previously proposed by us. The detailed results and discussions are elaborated in this subsection.

Accuracies on Different Collections
As mentioned before, the three collections were mixed together and divided into training, validation, and testing sets; thus, only results on the testing set are reported. To be fair, the results of the reference methods were also computed on the testing set. The results are presented for individual collections to provide deeper insight.
The OAs, RPAs, and RCAs on ISMIR2004 are shown in Figure 5. It can be observed that the accuracies did not vary much among the different methods. The proposed method obtained the highest OA, while its RPA and RCA were not high. This implies that the proposed method was superior in voicing detection, while inferior in pitch estimation. Table 1 shows that recordings from ISMIR2004 contributed only 2.7% of the whole training set. Hence, it can be concluded that the pitch estimation results relied more on the training set than voicing detection did.

The OAs, RPAs, and RCAs on MIREX05 are depicted in Figure 6. It can be seen that the OA of the proposed method was much higher than that of all the compared methods, while its RPA and RCA were much lower. Only one recording of this collection was covered in the training set, while the other 12 were in the testing set. This great gap confirms our conclusion that pitch estimation relied more on the training data than voicing detection. Moreover, both the RPA and RCA of the other methods were higher than their OA, while the OA of the proposed method was higher than the other two metrics. Similar results can be observed on ISMIR2004.
The OAs, RPAs, and RCAs on MIR-1K are provided in Figure 7. The results were very diverse on this collection: the OAs ranged from 26% to 64%, the RPAs varied between 31% and 66%, and the RCAs covered the range of 39% to 69%. The proposed method obtained the highest OA, while Melodia achieved the highest RPA and RCA, possibly because the latter involves strategies that preferably select the singing melody. RPA and RCA indicate the accuracies among the voiced frames, while OA takes both voiced and unvoiced frames into account. It can thus be inferred that the voicing detection performance of the proposed method still outperformed the others on this dataset. Its RPA and RCA on this dataset were much better than those on ISMIR2004 and MIREX05. Table 1 reveals that the training set was mostly contributed by this collection, which may be the reason for the RPA and RCA improvement.

Statistical Significance Analysis
To determine if the proposed method achieves significantly higher OA and lower RPA compared with the reference methods, the statistical significance analysis was performed. A paired-sample t-test was conducted between the proposed and reference methods in terms of OA and RPA.
The statistical significance results for OA are reported in Table 3. It can be seen from this table that the proposed method performed better than the reference methods with respect to OA on ISMIR2004, but the differences were not significant. It outperformed the other methods significantly on both MIREX05 and MIR-1K.

The statistical significance results for RPA are listed in Table 4. It can be found from this table that Melodia and PFDP achieved significantly higher RPAs than the proposed method on ISMIR2004, while the proposed method obtained RPA comparable with the remaining methods on this dataset. However, all methods, except MEA, outperformed the proposed method significantly on MIREX05. As only one recording of MIREX05 was covered in the training set, it can be inferred that training samples were important for ELM parameter training, as in other machine learning-based methods. As far as MIR-1K is concerned, the proposed method surpassed BG1 and HCT significantly, was comparable with MEA, and was significantly inferior to Melodia and PFDP with respect to RPA.

The statistical significance results with respect to RCA were similar to those for RPA, so they are not given here. Overall, the statistical results confirm that the proposed method performs better on OA than on RPA and RCA, indicating its superiority in voicing detection.
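For reproducibility, a paired-sample t-test of this kind can be run with SciPy; the per-recording OA values below are toy numbers, and the 0.05 significance level is our assumption.

```python
import numpy as np
from scipy import stats

# Per-recording OAs of the proposed method and one reference method on a
# collection (toy values; real numbers come from the evaluation runs).
oa_proposed = np.array([0.71, 0.64, 0.69, 0.75, 0.66])
oa_reference = np.array([0.63, 0.61, 0.70, 0.68, 0.60])

t_stat, p_value = stats.ttest_rel(oa_proposed, oa_reference)
significant = p_value < 0.05  # assumed significance level
print(t_stat, p_value, significant)
```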

Discussions
The averaged OAs, RPAs, and RCAs of all methods over the three collections are reported in Table 5. It can be seen that Melodia achieved the highest RCA, PFDP obtained the highest RPA, and the proposed method gained the highest OA. These results indicate the superiority of the proposed method with respect to OA. However, its RPA and RCA were not as high, indicating that the proposed method works much better on voicing detection than on pitch estimation compared with the reference methods. Several limitations of this method remain to be studied in the future. First, the performance of the proposed method relies on the training set; it would help to use an abundant dataset instead of randomly selecting recordings to construct the training set. In addition, the proposed method does not take temporal dependency into account; incorporating some temporal strategies might be another future direction.

Conclusions
In this paper, melody extraction from polyphonic music is addressed in the ELM framework, where pitch estimation and voicing detection are carried out simultaneously. More specifically, the musical signals are first down-sampled, processed by an equal loudness filter to mimic the human auditory system, and analyzed by the CQT. The frame-wise CQT spectra are then concatenated around the center frame to build the input feature vectors for the ELM, and the output one-hot vectors of the ELM are generated from the labels. Next, during the testing stage, the output prediction matrix of the ELM is computed according to the pre-trained parameters, and the rough melody pitches are obtained by locating the maximum value of each frame. Finally, the rough pitches are fine-tuned by searching the spectral peaks around them. The proposed method can learn high-level melody features very efficiently while performing pitch estimation and voicing detection simultaneously. Experimental results demonstrate that the proposed method achieves higher overall accuracies compared with the reference methods.