SampleCNN: End-to-End Deep Convolutional Neural Networks Using Very Small Filters for Music Classification

Abstract: Convolutional Neural Networks (CNN) have been applied to diverse machine learning tasks for different modalities of raw data in an end-to-end fashion. In the audio domain, a raw waveform-based approach has been explored to directly learn hierarchical characteristics of audio. However, the majority of previous studies have limited their model capacity by taking a frame-level structure similar to short-time Fourier transforms. We previously proposed a CNN architecture which learns representations using sample-level filters beyond typical frame-level input representations. The architecture showed comparable performance to the spectrogram-based CNN model in music auto-tagging. In this paper, we extend the previous work in three ways. First, considering that the sample-level model requires much longer training time, we progressively downsample the input signals and examine how this affects the performance. Second, we extend the model using a multi-level and multi-scale feature aggregation technique and subsequently conduct transfer learning for several music classification tasks. Finally, we visualize filters learned by the sample-level CNN in each layer to identify hierarchically learned features and show that they are sensitive to log-scaled frequency.


Introduction
Convolutional Neural Networks (CNN) have been applied to diverse machine learning tasks. The benefit of using a CNN is that the model can learn hierarchical levels of features from high-dimensional raw data. This end-to-end hierarchical learning has been mainly explored in the image domain since the breakthrough in image classification [1]. However, the approach has recently been attempted in other domains as well.
In the text domain, a language model is typically built in two steps, first by embedding words into low-dimensional vectors and then by learning a model on top of the word-level vectors. While the word-level embedding plays a vital role in language processing [2], it has the limitation that the embedding space is learned separately from the word-level model. To handle this problem, character-level language models that learn from the bottom-level raw data (e.g., alphabet characters) were proposed and shown to yield results comparable to those of word-level models [3,4].
In the audio domain, raw waveforms are typically converted to time-frequency representations that better capture patterns in complex sound sources. For example, spectrograms and more concise representations such as mel-filterbanks are widely used. These spectral representations play a role similar to that of word embeddings in language models, in that the mid-level representations are computed separately from the learning model and are not specifically optimized for the target task. This issue has been addressed by taking raw waveforms directly as input in different audio tasks, for example, speech recognition [5][6][7], music classification [8][9][10] and acoustic scene classification [11,12].
However, the majority of previous work has focused on replacing the frame-level time-frequency transforms with a convolutional layer, expecting that the layer can learn parameters comparable to the filter banks. The limitation of this approach was pointed out by Dieleman and Schrauwen [8]. They conducted a music classification experiment using a simple CNN that takes either raw waveforms or mel-spectrograms. Unexpectedly, their CNN models with raw waveforms as input did not produce better results than those with spectral data as input. The authors attributed this outcome to three possible causes. First, their CNN models were too simple (e.g., a small number of layers and filters) to learn the complex structure of polyphonic music. Second, end-to-end models need an appropriate non-linearity function that can replace the log-based amplitude compression in the spectrogram. Third, the first 1D convolutional layer takes raw waveforms at the frame level, where a frame is typically several hundred samples long. The filters in the first 1D convolutional layer must therefore learn all possible phase variations of periodic waveforms within that length. In a spectrogram, the phase variation is removed.
We recently tackled these issues by stacking 1D convolutional layers with very small filters instead of using a single 1D convolutional layer with frame-level filters, inspired by the VGG networks in image classification, which are built with a deep stack of 3×3 convolutional layers [13,14]. The sample-level CNN model has filters with very small granularity (e.g., 3 samples) in time for all convolutional layers. The results were comparable to those using mel-spectrograms in music auto-tagging. In this paper, we term the sample-level CNN architecture SampleCNN and extend the previous work in three ways. First, we should note that SampleCNN takes about four times longer to train than a comparable CNN model that takes mel-spectrograms. In order to reduce the training time, we progressively downsample the waveforms and report the effect on performance. By reducing the bandwidth of the music audio this way, we can find the cut-off frequency at which the performance starts to degrade. Second, we extend SampleCNN using multi-level and multi-scale feature aggregation [15]. The technique proved to be highly effective in music classification tasks. We additionally evaluate the extended model in transfer learning settings where the features extracted from SampleCNN are used for three different datasets in music genre classification and music auto-tagging. We show that the proposed model achieves state-of-the-art results. Third, we visualize the learned intermediate layers of SampleCNN to observe how the filters with small granularity process music signals in a hierarchical manner. In particular, we visualize them for each of the sampling rates.

Related Work
A considerable number of CNN models take raw waveforms as input. The majority of them used large filters in the first convolutional layer, with various stride sizes, to capture frequency-selective responses that were carefully designed for their target problems. We term this approach the frame-level raw waveform model because the filter and stride sizes of the first convolutional layer were chosen to be comparable to the window and hop sizes of the short-time Fourier transform, respectively [5][6][7][8][9][10][11].
A few studies used small filter and stride sizes in the first convolution layer (an 8-sample filter [16] and a 10-sample filter [17,18] at 16 kHz). However, these CNN models have only two or three convolution layers, which is not sufficient to learn the complex structure of acoustic signals. In SampleCNN, we deepen the network even more, thereby reducing the filter and stride sizes of the first convolution layer down to two or three samples.

Learning Models
Figure 1 illustrates the three CNN models for music auto-tagging that we compare in our experiments. Note that they are general architectures and so can be applied to any audio classification task. In this section, we describe the three models in detail.

Frame-Level Mel-Spectrogram Model
This is the most common CNN model used in music classification. The time-frequency representation is usually regarded as either a two-dimensional image [19,20] or a one-dimensional sequence of vectors [8,21]. We used only the one-dimensional (1D) CNN model for experimental comparisons because the performance gap between 1D and 2D models is not significant and the 1D model is directly comparable to models using raw waveforms.

Frame-Level Raw Waveform Model
In this model, a strided convolution layer is added beneath the bottom layer of the frame-level mel-spectrogram model. The strided convolution layer is expected to learn a filter-bank that returns a time-frequency representation. In this model, once the first strided convolution layer slides over the raw waveforms, the output feature map has the same dimensions as the mel-spectrogram. This is because the stride size, filter size, and the number of filters in the first convolution layer correspond to the hop size, window size, and the number of mel-bands in the mel-spectrogram, respectively. This configuration was used for the music auto-tagging task in [8,9] and thus we used it as a baseline model.

Sample-Level Raw Waveform Model: SampleCNN
As described in Section 1, the approach using raw waveforms should be able to address log-scale amplitude compression and phase-invariance. Simply adding a strided convolution layer is not sufficient to overcome these issues. To improve on this, we add multiple layers beneath the frame level such that the first convolution layer can handle a much smaller number of samples. For example, if the stride of the first convolution layer is reduced from 729 ($=3^6$) to 243 ($=3^5$), a 3-size convolution layer and a max-pooling layer are added to keep the output dimensions in the subsequent convolution layers unchanged. If we repeatedly reduce the stride of the first convolution layer this way, six convolution layers (five pairs of a 3-size convolution and a max-pooling layer following one 3-size strided convolution layer) will be added (we assume that the temporal dimensionality reduction occurs only through max-pooling and striding, while zero-padding is used in convolution to preserve the size).
We generalize the configuration as $m^n$-SampleCNN, where $m$ refers to the filter size (or the pooling size) of the intermediate convolution layer modules and $n$ refers to the number of modules. The first convolutional layer differs from the intermediate convolutional layers in that its stride size is equal to its filter size. An example of $m^n$-SampleCNN is shown in Table 1, where $m$ is 3 and $n$ is 9. Note that the network is composed of convolution layers and max-pooling only, and so the input size is determined to be the stride size of the first convolutional layer $\times\, m^n$. In Table 1, as the stride size of the first convolution layer is 3, the input size is set to 59,049 ($= 3 \times 3^9$).
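As a quick check of the arithmetic above, the following Python sketch computes the input size of an $m^n$-SampleCNN and the temporal length after each layer. The function names are hypothetical, and the sketch assumes, as stated earlier, that only striding and max-pooling reduce the temporal dimension.

```python
from typing import List, Optional


def samplecnn_input_size(m: int, n: int, first_stride: Optional[int] = None) -> int:
    """Input length in samples: first-layer stride times m**n."""
    if first_stride is None:
        first_stride = m  # assumption: first-layer stride equals m, as in the m = 3 example
    return first_stride * m ** n


def temporal_lengths(m: int, n: int, first_stride: Optional[int] = None) -> List[int]:
    """Temporal length after the first strided layer and after each of the n modules."""
    stride = m if first_stride is None else first_stride
    length = samplecnn_input_size(m, n, stride) // stride
    lengths = [length]
    for _ in range(n):
        length //= m  # each module pools by a factor of m
        lengths.append(length)
    return lengths


if __name__ == "__main__":
    print(samplecnn_input_size(3, 9))  # 59049
    print(temporal_lengths(3, 9))      # ends at a single time step
```

With a first-layer stride of 729 and four modules, the same function again gives 59,049 samples, which is how the frame-level and sample-level models can share one input size.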

Multi-Level and Multi-Scale Feature Aggregation
Music classification tasks, particularly music auto-tagging, have a wide variety of labels in terms of genre, mood, instruments and other song characteristics. In particular, the labels are positioned at different hierarchical levels and time-scales. For example, some labels related to instruments, such as guitar and saxophone, describe objective sound sources which are usually local and repetitive within a song, whereas other labels related to genre or mood, such as rock and happy, depend on a larger context of the music and are more complicated. In order to address this issue, we recently proposed a multi-level and multi-scale feature aggregation technique [15].
The technique combines multiple CNN models. It assumes that the hidden layers of each CNN model represent different levels of features and that models with different input sizes provide even richer feature representations by capturing both local and global characteristics of the music. In [15], it was shown that features at different levels and time-scales have different performance sensitivity to individual tags, and thus combining them all together is the best strategy to improve performance. In this work, we replace the simple CNN architectures that take mel-spectrograms as input in [15] with SampleCNNs taking different input sizes (e.g., 700 ms to 3.5 s). Once we have trained the SampleCNNs as supervised feature extractors, we slide each of them over a song clip (e.g., about 30 s) and obtain features from the last three hidden layers. We then summarize them by a combination of max-pooling and average-pooling. Finally, we concatenate the multi-level and multi-scale features and feed them to a simple neural network with two fully-connected layers to make a final prediction.
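The summarization step described above can be sketched in plain Python as follows. This is only an illustration: the text does not specify how max-pooling and average-pooling are combined, so the sketch assumes that both are taken over all segments of a clip and concatenated. The function names and the toy feature values are hypothetical.

```python
from typing import List, Sequence


def pool_over_segments(segment_features: Sequence[Sequence[float]]) -> List[float]:
    """Summarize per-segment feature vectors of one layer into one clip-level vector.

    Assumption: global max-pooling and global average-pooling over segments,
    concatenated as [max..., mean...].
    """
    num_segments = len(segment_features)
    dims = len(segment_features[0])
    max_pooled = [max(seg[d] for seg in segment_features) for d in range(dims)]
    avg_pooled = [sum(seg[d] for seg in segment_features) / num_segments for d in range(dims)]
    return max_pooled + avg_pooled


def aggregate(levels_and_scales: Sequence[Sequence[Sequence[float]]]) -> List[float]:
    """Concatenate the pooled vectors from several layers (levels) and models (scales)."""
    clip_vector: List[float] = []
    for segment_features in levels_and_scales:
        clip_vector.extend(pool_over_segments(segment_features))
    return clip_vector


if __name__ == "__main__":
    layer_a = [[1.0, 4.0], [3.0, 2.0]]  # two segments, two feature dimensions
    layer_b = [[0.0], [2.0]]            # two segments, one feature dimension
    print(aggregate([layer_a, layer_b]))
```

The concatenated clip-level vector is what would then be passed to the small fully-connected classifier.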

Transfer Learning
The multi-level and multi-scale feature aggregation approach can be used in a transfer learning setting by using different datasets or target tasks for the final classification after training the SampleCNNs. In particular, when the target dataset is small relative to the model capacity, transferred parameters can yield better performance on the target task than parameters trained on the target dataset itself. The applicability of transfer learning using a frame-level raw waveform model has been explored in the speech domain [17]. Here, we examine it using the sample-level raw waveform model for music genre classification and music auto-tagging with different datasets.

Datasets
We validate the effectiveness of the proposed method on datasets of different sizes for music genre classification and auto-tagging. All dataset splits are available at the link in [22]. The details of each dataset are as follows. The numbers in parentheses indicate the split into training, validation and test sets.
• GTZAN [23]: 930 songs (443/197/290), genre classification (10 genres). This is a fault-filtered split designed to avoid the repetition of artists across the training, validation and test sets [24].
We primarily examined the proposed model on MTAT and then verified the effectiveness of our model on MSD, which is much larger than MTAT (MTAT contains 170 h of audio and MSD contains 1955 h of audio in total). We filtered the tags and used the 50 most frequently labeled tags in both datasets, following previous work [8,19,20]. Also, all songs in the two datasets were trimmed to 29.1 s. For the transfer learning experiments, the model is first trained with the largest dataset, MSD, and the pre-trained networks are transferred to the other three datasets. The evaluation is conducted with the area under the receiver operating characteristic curve (AUC) for auto-tagging datasets and accuracy for genre classification datasets.

Training Details
We used sigmoid activation for the output layer and binary cross entropy loss as the objective function to optimize. For every convolution layer, we used batch normalization [28] and ReLU activation. We should note that, in our experiments, batch normalization plays a vital role in training the deep models that take raw waveforms. We applied dropout of 0.5 to the output of the last convolution layer and minimized the objective function using stochastic gradient descent with Nesterov momentum of 0.9. The learning rate was initially set to 0.01 and decreased by a factor of 5 when the validation loss did not decrease for more than 3 epochs. The learning rate was decreased 4 times in total, so the learning rate in the final training stage was 0.000016. Also, we used a batch size of 23 for MTAT and 50 for MSD.
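The learning-rate schedule can be illustrated with a small framework-independent sketch. The class name `PlateauDecay` is hypothetical, and the sketch assumes one reading of the rule above: the rate is divided by 5 once the validation loss has failed to improve on its best value for more than 3 consecutive epochs, at most 4 times.

```python
class PlateauDecay:
    """Divide the learning rate by `factor` when validation loss stops improving."""

    def __init__(self, lr: float = 0.01, factor: float = 5.0,
                 patience: int = 3, max_decays: int = 4) -> None:
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.max_decays = max_decays
        self.best = float("inf")
        self.bad_epochs = 0
        self.decays = 0

    def step(self, val_loss: float) -> float:
        """Update with the validation loss of one epoch and return the current rate."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        # Assumption: "more than 3 epochs" means strictly more than `patience`.
        if self.bad_epochs > self.patience and self.decays < self.max_decays:
            self.lr /= self.factor
            self.decays += 1
            self.bad_epochs = 0
        return self.lr


if __name__ == "__main__":
    schedule = PlateauDecay()
    schedule.step(1.0)            # first epoch sets the best loss
    for _ in range(16):           # 16 epochs without improvement -> 4 decays
        schedule.step(1.0)
    print(schedule.lr)            # approximately 0.01 / 5**4 = 0.000016
```

Four decays by a factor of 5 starting from 0.01 give 0.01 / 625 = 0.000016, which matches the final learning rate reported above.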

Mel-Spectrogram and Raw Waveforms
In the mel-spectrogram experiments, window sizes of $3^6$, $3^5$ and $3^4$ are used to match the filter sizes in the first convolution layer of the raw waveform model, as shown in Table 2. The FFT size was set to 729 ($=3^6$) in all experiments. When the window is shorter than the FFT size, we zero-padded the windowed frame. The linear frequency in the magnitude spectrum is mapped to 128 mel-bands, and magnitude compression is applied with a nonlinear curve, $\log(1 + C|A|)$, where $A$ is the magnitude and $C$ is set to 10. Also, we normalized the input simply by subtracting the mean and then dividing by the standard deviation, both computed over the entire input data. On the other hand, we did not perform input normalization for raw waveforms.
As described in Section 3.3, $m$ refers to the filter size (which can be compared to the window size of the FFT in the spectrogram) or the pooling size (which can be compared to the hop size of the FFT in the spectrogram) of the intermediate convolution layer modules, and $n$ refers to the number of modules. In our previous work, we adjusted $m$ from 2 to 5 and increased $n$ according to the configuration of $m^n$-SampleCNN [13]. Among these, the $3^9$-SampleCNN model with 59,049 samples as input worked best, and thus we fix our baseline model to it. In this configuration, we can increase the filter size and stride size in the first layer while decreasing the layer depth in order to compare the frame-level models with the sample-level model. For example, if the hop size or the stride size of the first convolutional layer is 729 in either the frame-level mel-spectrogram model or the frame-level raw waveform model, 4 convolutional modules with 3-sized filters are added when the input size is 59,049 samples.
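The magnitude compression and input normalization described above for the mel-spectrogram input can be written out as a minimal sketch. It operates on plain lists of floats rather than on an actual mel-spectrogram, and the function names are hypothetical; only the curve $\log(1 + C|A|)$ with $C = 10$ and the mean and standard deviation normalization come from the text.

```python
import math
from typing import List, Sequence


def compress_magnitude(magnitudes: Sequence[float], c: float = 10.0) -> List[float]:
    """Apply log(1 + C * |A|) element-wise."""
    return [math.log1p(c * abs(a)) for a in magnitudes]


def standardize(values: Sequence[float]) -> List[float]:
    """Subtract the mean and divide by the standard deviation of all values."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(variance)
    if std == 0.0:
        raise ValueError("cannot standardize constant input")
    return [(v - mean) / std for v in values]


if __name__ == "__main__":
    mel_magnitudes = [0.0, 0.1, 1.0, 5.0]  # toy values, not real data
    compressed = compress_magnitude(mel_magnitudes)
    print(standardize(compressed))
```

In practice the mean and standard deviation would be computed once over the whole training input, as the text states, and then reused for validation and test data.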

Downsampling
The downsampling experiments are performed using the MTAT dataset. The $3^9$-SampleCNN model is used with audio input sampled at 22,050 Hz. For the experiments with other sampling rates, we slightly modified the model configuration so that the models used for different sampling rates have an architecture and an input duration similar to those used at 22,050 Hz. In our previous work [13], we found that the filter size did not significantly affect performance once it reaches the sample level (e.g., 2 to 5 samples), while the input size of the network and the total layer depth are important. Thus, we configured the models as described in Table 3. For example, if the sampling rate is 2000 Hz, the first four modules use 3-sized filters and the remaining 6 modules use 2-sized filters to make the total layer depth similar to that of the $3^9$-SampleCNN. Also, 3-sized filters are used for the first four modules in all models so that the learned filters can be visualized fairly.
Table 3. Models, input sizes and number of parameters used in the downsampling experiment. In the third column (Models), each digit from left to right stands for the filter size (or the pooling size) of the convolutional module of SampleCNN from bottom to top. Thus, the number of digits represents the layer depth of each model.
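To make the relation between per-module filter sizes, input size and input duration concrete, here is a small sketch. It assumes that a model is written as a string of digits, one per layer from bottom to top, that the first digit is the strided first layer, and that each layer reduces the temporal length by its digit. The two strings below are built from the descriptions in the text under that assumption and are not copied from Table 3.

```python
def input_size_from_modules(filter_sizes: str) -> int:
    """Product of the per-layer filter (or pooling) sizes gives the input length."""
    size = 1
    for digit in filter_sizes:
        size *= int(digit)
    return size


def input_seconds(filter_sizes: str, sampling_rate: int) -> float:
    """Duration in seconds covered by one network input."""
    return input_size_from_modules(filter_sizes) / sampling_rate


if __name__ == "__main__":
    # Ten 3-sized layers at 22,050 Hz (assumed encoding of the 3^9-SampleCNN).
    print(input_size_from_modules("3333333333"), input_seconds("3333333333", 22050))
    # Four 3-sized layers followed by six 2-sized layers at 2000 Hz.
    print(input_size_from_modules("3333222222"), input_seconds("3333222222", 2000))
```

Under these assumptions both configurations cover roughly 2.6 to 2.7 s of audio, which illustrates what "similar input duration" means across sampling rates.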

Transfer Learning
The source task for the transfer learning is fixed to music auto-tagging using MSD because the dataset contains the largest set of music. In this experiment, the $3^9$-SampleCNN was used. We examined the proposed model on three target datasets for genre classification and auto-tagging. We also examined the performance differences when using features from multiple levels of the pre-trained CNNs and their combinations.

Mel-Spectrogram and Raw Waveforms
Table 2 shows that the sample-level raw waveform model achieves results comparable to the frame-level mel-spectrogram model. Specifically, we found that using a smaller hop size (81 samples ≈ 4 ms) worked better than the hop size of conventional approaches (about 20 ms) in the frame-level mel-spectrogram model. However, if the hop size is less than 4 ms, the performance degrades. An interesting finding from the results of the frame-level raw waveform model is that when the filter length is larger than the stride, the accuracy is slightly lower than that of the models with the same filter length and stride. We attribute this result to the ability to learn phase variance: as the filter size decreases, the extent of phase variance that the filters must learn is reduced.

Effect of Downsampling
During the experiments, we observed that training the proposed SampleCNN takes about four times longer than training the frame-level mel-spectrogram model because the proposed model has more network parameters with deeper layers. In order to reduce the training time, we downsampled the audio to a set of lower sampling rates: 2000, 4000, 8000, 12,000, 16,000 and 20,000 Hz. This can be regarded as a time-domain counterpart of the linear-to-mel mapping in that both reduce the dimensionality of the input and preserve low-frequency content. The results in Table 5 show that the performance is maintained down to 8000 Hz but starts to degrade at 4000 Hz. This may indicate that the information relevant to the task is concentrated below 4000 Hz (the Nyquist frequency of 8000 Hz). Also, we report the ratio of the training time of the models taking re-sampled audio to that of the model using the 22,050 Hz signal as input. At the expense of accuracy, the training time can be reduced to about half.
Table 5. Effect of downsampling on the performance and training time. MTAT is used in the experiments. We matched the depth of the models taking different sampling rates to the $3^9$-SampleCNN. For example, if the sampling rate is 2000 Hz, the first four convolutional modules use 3-sized filters and the remaining 6 modules use 2-sized filters to make the total layer depth similar to the $3^9$-SampleCNN.

Effect of Multi-Level and Multi-Scale Features
To measure the effect of multi-level and multi-scale feature combination, we experimented with several settings, shown in Table 4. The SampleCNN models are first trained on the MTAT dataset, and then these pre-trained networks are used as feature extractors for the MTAT dataset again. The results show that the performance increases as more features are fused. This can be viewed as similar to an ensemble method; however, our approach differs in that the feature aggregation is performed on the activations of the hidden layers, not on the prediction values.

Transfer Learning and Comparison to State-of-the-Arts
In Table 6, we show the performance of the SampleCNN model and the transfer learning experiments (the bottom four lines). The proposed approach achieved state-of-the-art results on three of the datasets, the exception being MSD. However, when considering that the model used in [15] utilized both multi-level [...]
Note that we set the input waveform estimate to 729 samples in length because, if we initialize and back-propagate to the whole input size of the networks, the estimated filters will have large dimensions, such as 59,049 samples, when computing the spectrum. Thus, the results are equivalent to spectra from a typical frame size. Layer 1 shows the three distinctive filter bands that are possible with a filter size of 3 samples (say, a DFT size of 3). The center frequency of the filter banks increases linearly in the low-frequency filter banks but, as the layer goes up, it becomes progressively steeper in the high-frequency filter banks. This nonlinearity was found in filters learned with frame-level end-to-end learning [8] and also in perceptual pitch scales such as mel or bark.
Finally, we visualized the spectrum of the learned filters for each sampling rate up to the 4th layer. In Figure 4, we can observe that all SampleCNN models focus (or zoom in) on the important low-frequency bands. We can also see that they show non-linear patterns similar to those in Figure 3.

Song-Level Similarity Using t-SNE
We extracted features from SampleCNN and aggregated them at different hierarchical levels of layers for each audio clip. We then embedded the song-level features into 2-D vectors using t-Distributed Stochastic Neighbor Embedding (t-SNE). Figure 5 visualizes the 2-D embedded features at different layer levels for selected tags to examine how the multi-level feature aggregation technique enhances the performance. Songs with a genre tag (Techno) are more closely clustered in the higher layer (−1 layer). On the other hand, songs with an instrument tag (Piano) are more closely clustered in the lower layer (−3 layer). This may indicate that the optimal layer of feature representations can differ depending on the type of label. Thus, combining different levels of features can improve the performance.

Conclusions
In this article, we extended our previously proposed SampleCNN for music classification. First, through the experiments, we found that downsampling music audio down to 8000 Hz does not significantly degrade performance while it saves training time. Second, transfer learning experiments with the multi-level and multi-scale technique showed state-of-the-art results on most of the datasets we tested. Finally, we visualized the spectrum of the learned filters for each sampling rate and found that the SampleCNN model actively focuses (or zooms in) on important low-frequency bands. As future work, we will analyze why the sample-level architecture works well without input normalization and without a nonlinear function that compresses the amplitude, both of which are important when a spectrogram is used as input. Also, we will investigate different filter visualization techniques to better interpret the hierarchically learned filters.

Figure 1. Comparison of (a) frame-level model using mel-spectrogram; (b) frame-level model using raw waveforms and (c) sample-level model using raw waveforms.

Figure 2. Examples of learned filters at each layer.

Figure 3. Spectrum of the estimated filters in the intermediate layers of SampleCNN, sorted by the frequency of the peak magnitude. The x-axis represents the index of the filter, and the y-axis represents the frequency, ranging from 0 to 11 kHz. The model used for visualization is the $3^9$-SampleCNN with 59,049 samples as input. Visualization was performed using the gradient ascent method to obtain the accumulated gradient-based, waveform-like input signal that maximizes the activation of a filter in the layers. To effectively find the filter characteristics, we set the input size to 729 samples, which is close to a typical frame size.

Figure 4. Spectrum visualization of learned filters for different sampling rates. The x-axis represents the index of the filter, and the y-axis represents the frequency, ranging from 0 to half the sampling rate. 3-sized filters are used for the first four modules in all models so that the learned filters can be visualized fairly.

Figure 5. Feature visualization on songs with Piano tag and songs with Techno tag on MTAT using t-SNE. Features are extracted from (a) −3 layer and (b) −1 layer of the $3^9$-SampleCNN model pre-trained with MSD.

Table 1. SampleCNN configuration. In the first column (Layer), "conv 3-128" indicates that the filter size is 3 and the number of filters is 128.

Table 2. Comparison of three CNN models with different window sizes (filter sizes) and hop sizes (stride sizes). $n$ represents the number of intermediate convolution and max-pooling layer modules; thus, $3^n$ times the hop (stride) size of each model is equal to the number of input samples.

Table 4. Comparison of various multi-scale feature combinations. Only the MTAT dataset was used.