5.1. Parameter Optimization Study
As discussed in Section 4, we investigate two different data partition strategies. In this experiment, we study the influence of the data augmentation method, the skip connection type, and the network capacity of the U-net approach on the transcription performance on the validation set. For each strategy, we compare 64 hyperparameter configurations based on the parameter settings defined in Table 3. The sets of hyperparameters for the best performing models in both scenarios are listed in Table 4.
For the mixed data partition, skip connection strategy B, where the intermediate activations are transferred, outperforms strategy C, which involves transferring the max pooling indices. This finding is in line with the proposed method for melody transcription in [2]. Larger models combined with RandomEQ data augmentation consistently showed the best results. The highest overall accuracy value achieved was 0.82. We conjecture that this relatively high number is due to the model overfitting to the Mixed Genre Set, from which both the training and the validation set were drawn.
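To make the two skip connection variants concrete, the following PyTorch sketch contrasts strategy B, which transfers the encoder activations themselves (here via concatenation, the classic U-net skip), with strategy C, which transfers only the max pooling indices for SegNet-style unpooling. The tensor shapes and layer configuration are illustrative assumptions, not the exact settings of the evaluated models.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 16, 64, 64)  # (batch, channels, freq, time) feature map

# Strategy B: transfer the intermediate encoder activations themselves.
# The decoder concatenates them with the upsampled feature map
# (the classic U-net skip connection).
pool_b = nn.MaxPool2d(2)
up_b = nn.Upsample(scale_factor=2)
encoded = pool_b(x)
decoded_b = torch.cat([up_b(encoded), x], dim=1)  # skip carries activations

# Strategy C: transfer only the max pooling indices.
# The decoder uses them to place values back at the argmax positions
# (SegNet-style unpooling); the activations themselves are not copied.
pool_c = nn.MaxPool2d(2, return_indices=True)
unpool_c = nn.MaxUnpool2d(2)
encoded_c, indices = pool_c(x)
decoded_c = unpool_c(encoded_c, indices)  # skip carries positions only

print(decoded_b.shape, decoded_c.shape)  # (1, 32, 64, 64) and (1, 16, 64, 64)
```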
For the Jazz data partition, the highest overall accuracy is 0.6 and therefore significantly lower than for the mixed data partition. Note that, in this case, the validation set contains only jazz ensemble recordings while the training set includes various music genres. Presumably, this indicates that the bass transcription task is more complex for jazz ensemble recordings due to the predominance of the melody instruments. Skip connection strategy B and pitch shifting data augmentation seem beneficial for this data partition, although no clear trends could be observed across different hyperparameter configurations. The best models BassUNet_Mixed and BassUNet_Jazz, obtained from the Mixed and Jazz data partition strategies, respectively, will be evaluated in a comparative study against three state-of-the-art bass transcription algorithms, as described in the following section.
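For illustration, the following sketch shows minimal versions of the two augmentation families compared above. Since the exact parameter settings are defined in Table 3 rather than here, the shift range, band count, and gain range below are assumptions, and random_eq is only an assumed stand-in for RandomEQ.

```python
import numpy as np
import librosa

def random_eq(y, n_bands=8, max_gain_db=6.0):
    """Toy RandomEQ: apply a smooth, randomly drawn gain curve over
    frequency (band count and gain range are assumptions)."""
    spec = np.fft.rfft(y)
    band_gains_db = np.random.uniform(-max_gain_db, max_gain_db, n_bands)
    # interpolate the per-band gains to a smooth curve over all FFT bins
    gains_db = np.interp(np.linspace(0, n_bands - 1, spec.size),
                         np.arange(n_bands), band_gains_db)
    return np.fft.irfft(spec * 10.0 ** (gains_db / 20.0), n=y.size)

sr = 22050
y = librosa.tone(110.0, sr=sr, duration=1.0)  # synthetic A2 test tone
# Pitch shifting: transpose the audio by a random interval; note that the
# frame-level pitch labels must be transposed by the same number of steps.
n_steps = np.random.uniform(-2.0, 2.0)        # assumed shift range (semitones)
y_aug = random_eq(librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps))
```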
After identifying the optimal models BassUNet_Mixed and BassUNet_Jazz, we report in Table 5 the results of an ablation study. This table shows how the overall model accuracy decreases when data augmentation and skip connections are omitted, both separately and jointly, during model training. The results show that both components are important for the performance of the U-net model. Similar findings were reported for the skip connections in U-nets for singing voice separation [26] as well as for the use of data augmentation for singing voice detection [24] and music transcription [27].
5.2. Comparison to the State of the Art
In this experiment, we compare the two best configurations of the proposed method, BassUNet_Mixed and BassUNet_Jazz, as identified in Section 5.1, with three reference bass transcription algorithms as listed in Table 6. As test set, we use the remaining 80% of the Jazz Set (compare Section 4 and Table 2), i.e., the full Jazz Set without the validation set of the Jazz data partition.
The first reference algorithm (BI18) is encapsulated in a deep neural network for joint estimation of melody, multiple F0s, and bass, as proposed by Bittner et al. [11]. The network processes harmonic CQT representations of audio signals with a cascade of multiple convolutional layers for multitask feature learning. We use an available online implementation (https://github.com/marl/superchip/blob/master/superchip/transcribe_f0.py (accessed on 11 March 2021)).
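As background, a harmonic CQT stacks one CQT per harmonic multiple of a base minimum frequency, so that the h-th harmonic of any F0 is aligned across the resulting channels. A minimal sketch follows, using the widely used deep-salience-style configuration (six harmonics, 60 bins per octave), which is an assumption rather than the exact setting of BI18.

```python
import numpy as np
import librosa

def hcqt(y, sr, fmin=32.7, n_bins=360, bins_per_octave=60,
         harmonics=(0.5, 1, 2, 3, 4, 5)):
    """Harmonic CQT: stack one CQT per harmonic multiple of fmin, so the
    h-th harmonic of any F0 lands in the same bin across channels."""
    return np.stack([
        np.abs(librosa.cqt(y, sr=sr, fmin=fmin * h,
                           n_bins=n_bins, bins_per_octave=bins_per_octave))
        for h in harmonics
    ])

sr = 22050
y = librosa.tone(55.0, sr=sr, duration=10.0)  # synthetic A1 test tone
H = hcqt(y, sr)
print(H.shape)  # (n_harmonics, n_bins, n_frames)
```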
The second reference algorithm (AB17) was proposed by Abeßer et al. in [1]. Here, a fully-connected neural network maps a CQT spectrogram to a bass pitch activity representation. Again, we use an available online implementation (https://github.com/jakobabesser/walking_bass_transcription_dnn (accessed on 11 March 2021)). Both algorithms AB17 and BI18 output independent pitch salience values for different F0 candidates on a frame level. Voicing estimation is implemented using a fixed minimum salience threshold: each time frame is considered unvoiced if all of its pitch salience values fall below this threshold. We optimize this threshold independently for both algorithms on the full training set.
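A minimal sketch of this decoding and voicing-thresholding step, assuming a salience matrix of shape (candidates, frames) and a matching F0 candidate grid; the helper decode_f0, the toy values, and the commented threshold search are hypothetical.

```python
import numpy as np

def decode_f0(salience, f0_grid, threshold):
    """Frame-level decoding for AB17/BI18-style salience outputs: pick the
    strongest F0 candidate per frame; a frame is unvoiced (f0 = 0) if all
    of its salience values stay below the fixed threshold."""
    f0 = f0_grid[salience.argmax(axis=0)]       # strongest candidate per frame
    f0[salience.max(axis=0) < threshold] = 0.0  # unvoiced frames
    return f0

f0s = np.array([41.2, 43.7, 46.2])             # toy F0 candidate grid in Hz
S = np.array([[0.90, 0.20, 0.10],
              [0.05, 0.70, 0.20],
              [0.05, 0.10, 0.30]])             # (candidates, frames)
print(decode_f0(S, f0s, threshold=0.5))        # [41.2 43.7  0. ]

# Hypothetical threshold search on the training set, maximizing overall
# accuracy (OA); `overall_accuracy` is an assumed helper, e.g., via mir_eval.
# best_t = max(np.linspace(0.05, 0.95, 19),
#              key=lambda t: overall_accuracy(ref_f0, decode_f0(S, f0s, t)))
```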
The third reference algorithm (SA12) is based on a version of the Melodia melody estimation algorithm [28], which was modified to transcribe lower fundamental frequencies as described in [29]. In contrast to the aforementioned data-driven algorithms, this algorithm combines music domain knowledge with several audio signal processing steps. Furthermore, it analyzes only two octaves from 27.5 Hz to 110.0 Hz. Therefore, its pitch estimation performance can only be meaningfully compared with the other algorithms based on the raw chroma accuracy (RCA), which disregards the detected octave positions.
We use five common evaluation measures to evaluate the pitch estimation and voicing estimation as defined in [30]. Raw pitch accuracy (RPA) is the fraction of voiced frames, i.e., frames with an annotated pitch, whose pitch is estimated correctly (within a given tolerance). Raw chroma accuracy (RCA) additionally maps all frequencies into one octave and therefore focuses on pitch class estimation. In order to evaluate the voicing estimation quality, voicing recall (VR) measures the fraction of voiced frames that are correctly identified, and voicing false alarm rate (VFA) measures the fraction of unvoiced frames which are incorrectly estimated to be voiced. A well-performing transcription algorithm should have high VR values and low VFA values, as indicated by upward and downward arrows in Table 6. Finally, overall accuracy (OA) measures the percentage of frames with correctly estimated voicing and pitch.
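These five measures are available, for example, in the mir_eval library. The following toy example evaluates a made-up six-frame estimate against a reference on a shared time grid, with 0 Hz marking unvoiced frames.

```python
import numpy as np
import mir_eval

# Toy frame-level annotations: timestamps and F0 values in Hz,
# where 0 Hz marks an unvoiced frame (values are made up for illustration).
times    = np.arange(6) * 0.01
ref_freq = np.array([ 0.0, 55.0, 55.0, 110.0, 110.0, 0.0])
est_freq = np.array([55.0, 55.0, 55.3, 220.0,   0.0, 0.0])

scores = mir_eval.melody.evaluate(times, ref_freq, times, est_freq)
for k in ('Raw Pitch Accuracy', 'Raw Chroma Accuracy',
          'Voicing Recall', 'Voicing False Alarm', 'Overall Accuracy'):
    print(f'{k}: {scores[k]:.2f}')
```

Note how the octave error at the fourth frame (220 Hz vs. 110 Hz) is penalized by RPA but not by RCA, which only considers the pitch class.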
Table 6 lists the five evaluation scores for each investigated bass transcription algorithm, averaged over all test set files. While the proposed method BassUNet_Jazz showed a lower OA value on the validation set of the Jazz data partition strategy (see Section 5.1), it outperforms all other algorithms on the test set by around 5 percent in overall accuracy (OA). This algorithm represents a model configuration that is optimized for transcribing bass lines in jazz ensemble recordings. We believe that the main reason for this is the similar data distribution between its validation set, which guided the model training process, and the final test set.
The BassUNet_Mixed model, on the other hand, which was not optimized for the jazz scenario, shows a lower overall accuracy, which results from both lower voicing and pitch detection scores. While the RPA improvement of BassUNet_Jazz over the best performing reference algorithm AB17 is only of minor size, the main improvement was achieved in voicing detection, which is particularly evident in the reduced voicing false alarm rate (VFA) of BassUNet_Jazz compared to AB17. We consider this to be the main contribution of the proposed U-net architecture, since it explicitly learns to predict the frame-level instrument activity (voicing) without any additional thresholding operation. Similar findings were reported for the melody estimation task for some of the evaluated datasets in [2]. When looking at the pitch estimation performance (RPA, RCA), the BassUNet_Mixed model performs similarly to the reference methods BI18 and AB17. Notably, the reference algorithm SA12 achieves the highest VR and a raw chroma accuracy (RCA) almost similar to that of the proposed method.