4.1. Comparison of Melody Extraction Performance
Figure 3a shows the results of the five melody extraction evaluation metrics for the compared models and outputs. In general, networks were superior to in terms of OA, and among the networks, networks that used the sum of the two outputs in the loss function were more accurate than that used only the output of the auxiliary network.
Both RPA and RCA increased significantly in all networks, especially and . This is mainly attributed to the increase in VR; that is, the networks detected singing voice activity more responsively and produced fewer miss errors. The average RPA and RCA of were 76.1% and 78.1%, respectively, while those of were 84.7% and 86.0%, respectively (p-value < 0.01). However, this aggressive detection made both VR and VFA high, which degraded OA. On the other hand, predicted the voice activity more reliably by significantly reducing VFA. The average VFA of was 17.7%, but that of was 9.0%. As a result, achieved the highest average OA (85.7%, p-value < 0.01), outperforming the two networks.
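For reference, the five frame-level metrics used here (VR, VFA, RPA, RCA, and OA) can be computed with the mir_eval library. The sketch below is only illustrative: the per-frame time and frequency arrays are made up, with 0 Hz marking unvoiced frames, and the 10 ms hop size is an arbitrary choice rather than the setting used in this work.

```python
import numpy as np
import mir_eval

# Hypothetical per-frame annotations: times in seconds, frequencies in Hz,
# with 0 Hz marking unvoiced (non-melody) frames.
ref_time = np.arange(0, 3.0, 0.01)                   # 10 ms hop, for illustration
ref_freq = np.where(ref_time < 1.5, 220.0, 0.0)      # voiced for the first 1.5 s

est_time = ref_time.copy()
est_freq = np.where(est_time < 1.6, 222.0, 0.0)      # slightly sharp, late voicing offset

# mir_eval returns Voicing Recall (VR), Voicing False Alarm (VFA),
# Raw Pitch Accuracy (RPA), Raw Chroma Accuracy (RCA), and Overall Accuracy (OA).
scores = mir_eval.melody.evaluate(ref_time, ref_freq, est_time, est_freq)
for name in ['Voicing Recall', 'Voicing False Alarm',
             'Raw Pitch Accuracy', 'Raw Chroma Accuracy', 'Overall Accuracy']:
    print(f'{name}: {scores[name]:.3f}')
```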
This result indicates that the voice detection output of the main network was more conservative than the output of the auxiliary network. This is because the main network had more classes (i.e., pitch labels) with which to compete. However, comparing to , the main network in became more sensitive to voice activity due to the influence of the auxiliary network. This reveals that combining with in calculating the voice detection loss function (Equation (4)) contributed to driving more tightly coupled classification and detection, thereby improving the performance of melody extraction.
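As a rough illustration of this coupling (not the exact form of Equation (4)), the sketch below assumes that the main network emits pitch-class logits with a dedicated non-voice class, that the auxiliary network emits its own voice/non-voice logits, and that the voice detection loss is a cross-entropy applied to a combination of the two voice predictions (here averaged in probability space, one possible reading of "the sum of the two outputs"). All tensor shapes, names, and the weight `alpha` are hypothetical.

```python
import torch
import torch.nn.functional as F

def joint_loss(pitch_logits, aux_voice_logits, pitch_target, voice_target, alpha=0.5):
    """Hypothetical joint loss: pitch classification + combined voice detection.

    pitch_logits:     (batch, frames, n_pitch_classes + 1), last class = non-voice
    aux_voice_logits: (batch, frames, 2), auxiliary voice/non-voice logits
    pitch_target:     (batch, frames) integer pitch labels (non-voice = last class)
    voice_target:     (batch, frames) integers {0: non-voice, 1: voice}
    """
    # Pitch classification loss from the main network.
    pitch_loss = F.cross_entropy(pitch_logits.transpose(1, 2), pitch_target)

    # Voice probability implied by the main network: 1 - P(non-voice class).
    main_voice_prob = 1.0 - pitch_logits.softmax(dim=-1)[..., -1]
    aux_voice_prob = aux_voice_logits.softmax(dim=-1)[..., 1]

    # Combine the two voice predictions and apply binary cross-entropy,
    # so the main and auxiliary branches are trained jointly on voice detection.
    combined = 0.5 * (main_voice_prob + aux_voice_prob)
    voice_loss = F.binary_cross_entropy(combined, voice_target.float())

    return pitch_loss + alpha * voice_loss
```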
The overall performance of was generally higher than that of , but it did not outperform . The average OA of was comparable to , and the performance was lower than that of . Experimental results also showed that the deviations of RPA and RCA of the proposed models were high, except for and . Since the proposed models were trained for both pitch estimation and voice detection at different levels of abstraction, they were sensitive to initialization.
Figure 3b shows the results of overall accuracy (OA) on the four test sets for the compared models and outputs. The performance gap varied by up to 10% depending on the dataset, indicating that the models were affected by the characteristics of each test set (e.g., genre). Again, we see that the performances of the JDC networks were generally superior to that of for all test datasets.
Comparing to in each of the three cases (, , and ), the average OA of the three and networks was 83.5% and 84.9%, respectively. networks were generally superior to networks. The average OA of was improved by 3.17% over that of . With regard to OA, a t-test revealed statistically significant differences between and ; the p-values were ADC04 (0.025), MIREX05 (0.01), MedleyDB (0.027), and RWC (0.043). increased the average OA with respect to for ADC04, which is an especially challenging dataset. The average OA of was 83.7%, which was 6.1% higher than the 77.6% of .
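Significance tests of this kind can be reproduced, in principle, with a t-test over per-track OA scores on each dataset. The sketch below is purely illustrative: the score arrays are made up, and a paired test is assumed because both models are evaluated on the same tracks (the paper does not state which variant was used).

```python
import numpy as np
from scipy import stats

# Hypothetical per-track OA scores for two models on the same test tracks.
oa_model_a = np.array([0.81, 0.79, 0.85, 0.88, 0.76, 0.83])
oa_model_b = np.array([0.84, 0.82, 0.86, 0.90, 0.80, 0.85])

# Paired t-test: the same tracks are scored by both models.
t_stat, p_value = stats.ttest_rel(oa_model_b, oa_model_a)
print(f't = {t_stat:.3f}, p = {p_value:.4f}')
```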
To summarize, in the training phase, the most effective models were networks that used both the main and auxiliary outputs for voice detection in the loss function. In the inference stage, the most effective output was , which used only the output of the main network. As a result, the best performance was obtained by . The overall performances of and were lower than those of the JDC networks. The JDC network had only 3.8 M parameters, while and had 7.6 M and 5.3 M parameters, respectively. This also shows that the JDC network is an efficient architecture for melody extraction.
4.2. Comparison of Voice Detection Performance
Figure 4 shows the average performances of singing voice detection for the , , , and networks evaluated on the four test sets. achieved the best voice detection performance, leading to improved melody extraction performance. The F1 score of was 91.0%, and that of was 93.3% (p-value < 0.05). The F1 scores of the other JDC networks were higher than , but there were no significant differences. For , , voice detection performance was significantly lower (the F1 scores were 87.5% and 88.9%, respectively). This seems to be because the training set used had a higher percentage of voice segments than non-voice segments. If more data were available for model training, SVD performance could be further improved.
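For reference, the frame-level F1 score reported here can be computed from binary voice/non-voice decisions per frame. The sketch below uses scikit-learn with made-up frame labels; the arrays and their lengths are purely illustrative.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical frame-level voice activity labels (1 = voiced, 0 = unvoiced).
ref_voicing = np.array([1, 1, 1, 0, 0, 1, 1, 0, 0, 0])   # ground truth
est_voicing = np.array([1, 1, 0, 0, 1, 1, 1, 0, 0, 0])   # model decision

precision, recall, f1, _ = precision_recall_fscore_support(
    ref_voicing, est_voicing, average='binary')
print(f'precision={precision:.3f}, recall={recall:.3f}, F1={f1:.3f}')
```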
Figure 5 displays the performances of the proposed networks evaluated on the Jamendo dataset, which is dedicated to singing voice detection and was unseen during model training. As observed in the melody extraction results, the voice detection output of the main network was more conservative. This led to low VR and VFA. On the other hand, the networks that had the separate singing voice detector became more responsive, having higher VR and VFA. When comparing the two families of JDC networks, was more conservative than , as the voice loss function contained the voice output from the main network. A similar result was found among the voice detection outputs. That is, JDC with had lower VR and VFA than JDC with or . While the JDC networks returned comparable results, the best performance in terms of accuracy was obtained by . The average VR of was 18.3% higher than that of , while maintaining a low VFA of 22.6%.
In Table 2, we compare the voice detection results with those of other state-of-the-art algorithms. Lee et al. [46] reproduced each algorithm using the Jamendo dataset as the training data under the same conditions, and we used their results for comparison. The performance of was lower; however, considering that the compared models were in fact trained with the same Jamendo dataset (using different splits for training and testing), the result from our proposed model was highly encouraging, showing that it generalized to some extent.
4.3. Comparison with State-of-the-Art Methods for Melody Extraction
We compared our best melody extraction model, , with state-of-the-art methods using deep neural networks [17,18,21,22]. For a comparison under the same conditions, we used ADC04, MIREX05, and MedleyDB as the test sets, as mentioned in Section 3.1.2.
Table 3 lists the melody extraction performance metrics on the three test datasets. The pre-trained model and code of Bittner et al. [17] are publicly available online, and the results in Table 3 were reproduced by [21] for vocal melody extraction. The results show that the proposed method had high VR and low VFA, leading to high RPA and RCA, and that it outperformed the state-of-the-art methods. In addition, we confirmed that the proposed method had more stable performance across all datasets than the other state-of-the-art methods. This also shows that combining the two tasks of melody extraction, i.e., pitch classification and singing voice detection, through the proposed JDC network and loss function was helpful for improving performance.
4.4. Case Study of Melody Extraction on MedleyDB
We evaluated the models with a tolerance of a half semi-tone (50 cents), following the standard melody extraction evaluation rule. However, we should note that our proposed model can predict the pitch with a higher resolution (1/16 semi-tone).
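To make this resolution concrete, a frequency can be mapped to a quantized pitch label using 16 bins per semitone relative to a reference frequency. In the sketch below, the reference frequency F_MIN (D2, about 73.4 Hz) is chosen only for illustration and may differ from the exact label range used by the model.

```python
import numpy as np

BINS_PER_SEMITONE = 16          # 1/16 semi-tone resolution, as stated above
F_MIN = 73.416                  # hypothetical lowest pitch of the label range (D2)

def freq_to_label(freq_hz):
    """Map a frequency in Hz to a quantized pitch label index."""
    semitones_above_min = 12.0 * np.log2(freq_hz / F_MIN)
    return int(round(semitones_above_min * BINS_PER_SEMITONE))

def label_to_freq(label):
    """Map a label index back to its center frequency in Hz."""
    return F_MIN * 2.0 ** (label / (12.0 * BINS_PER_SEMITONE))

# Example: 220 Hz (A3) and a tone 1/16 semi-tone higher fall into adjacent labels.
print(freq_to_label(220.0), freq_to_label(220.0 * 2 ** (1 / (12 * 16))))
```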
Figure 6a shows the spectrogram of an audio clip (top) and the corresponding melodic pitch prediction along with the ground truth (bottom). Our proposed model can track nearly continuous pitch curves, preserving natural singing styles such as pitch transition patterns or vibrato.
While the proposed model achieved improved performance in singing melody extraction, the overall accuracy was still below 90%. We found that errors occurred more frequently in particular cases.
Figure 6b,c gives examples of bad cases where VR and RPA were less than 60%. In both examples, the failures were mainly attributed to voice detection errors.
In Figure 6b, the harmonic patterns of the vocal melody were not clearly distinguished from the background music because the vocal track was relatively softer than the accompaniment track. This weak vocal volume was identified as a cause of poor singing voice detection in [46]. Since our melody extraction model was trained in a data-driven way, this could be addressed to some degree by augmenting the training data, for example by adjusting the vocal gain during mixing (if the vocals are available as separate tracks).
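A minimal sketch of such gain-based augmentation is shown below. It assumes that vocal and accompaniment stems are available as separate, equal-length waveform arrays; the gain range and the placeholder signals are arbitrary choices for illustration, not the augmentation actually used in this work.

```python
import numpy as np

def remix_with_random_vocal_gain(vocal, accompaniment, rng,
                                 gain_db_range=(-10.0, 0.0)):
    """Create an augmented mixture by attenuating the vocal stem.

    vocal, accompaniment: mono waveforms of equal length (float arrays).
    gain_db_range: vocal gain range in dB; negative values simulate
                   recordings where the voice is softer than the backing track.
    """
    gain_db = rng.uniform(*gain_db_range)
    gain = 10.0 ** (gain_db / 20.0)
    mix = gain * vocal + accompaniment
    # Normalize only if the augmented mixture would clip.
    peak = np.max(np.abs(mix))
    if peak > 1.0:
        mix = mix / peak
    return mix

rng = np.random.default_rng(0)
vocal = rng.standard_normal(16000) * 0.1          # placeholder 1 s stems at 16 kHz
accompaniment = rng.standard_normal(16000) * 0.1
augmented = remix_with_random_vocal_gain(vocal, accompaniment, rng)
```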
In Figure 6c, a strong reverberation effect was imposed on the singing voice; thus, the harmonic patterns of the singing voice remained even after the voice became silent. The algorithm then detected the reverberated tone as vocals and predicted the pitch from it. This case is somewhat debatable because it could also be seen as a problem of the ground-truth annotation. When we excluded these types of heavily processed audio clips in MedleyDB (“PortStWillow-StayEven” and “MatthewEntwistle-Lontano”), we observed a significant increase in performance (about 5% in OA on MedleyDB).