Self-Supervised Transfer Learning from Natural Images for Sound Classification

Abstract: We propose transfer learning from natural images to audio-based images using self-supervised learning schemes. Through self-supervised learning, convolutional neural networks (CNNs) can learn general representations of natural images without labels. In this study, a CNN was pre-trained on natural images (ImageNet) via self-supervised learning and subsequently fine-tuned on the target audio samples. Pre-training with the self-supervised scheme significantly improved sound classification performance on the following benchmarks: ESC-50, UrbanSound8K, and GTZAN. The network pre-trained via self-supervised learning achieved a level of accuracy similar to that of networks pre-trained using a supervised method that requires labels. Therefore, we demonstrated that transfer learning from natural images contributes to improvements in audio-related tasks, and that self-supervised learning with natural images is an adequate pre-training scheme in terms of simplicity and effectiveness.


Introduction
Deep learning refers to neural networks, inspired by the structure of the human brain, that learn and analyze the relationship between data and labels. Recently, deep learning has been widely applied in audio-related tasks and has achieved superior performance compared to traditional methods [1][2][3][4], for example in real-time smartphone-based voice activity detection [5], voice-driven typing software [6], sound event classification in noisy environments [7,8], environmental sound classification, such as car horn and air conditioner sounds [9], and speech emotion recognition [10][11][12]. Among deep learning methods, recurrent neural networks (RNNs) and convolutional neural networks (CNNs) play an important role in audio tasks. An RNN processes sequential inputs by feeding the output of the preceding step in as input to the subsequent step; it can extract temporal features from 1-dimensional signals. A CNN processes spatial information by sliding kernels over its input in successive layers; it can extract spatial features from 2-dimensional images. Yoshimura et al. [13] utilized an RNN to recognize speech from a 1D waveform by extracting temporal information from time-series data. Other studies have utilized 2D CNNs for audio-related tasks. For a CNN, the 1D audio signal must be converted into a 2D representation. The spectrogram is a visual representation that shows how the frequency content of a signal varies over time, and many studies [5,14] have adopted it for representing sound. Sehgal et al. [5] utilized a 2D CNN for speech recognition, which takes as input a 2D spectrogram converted from a 1D waveform. In contrast to an RNN, the CNN focuses on the spatial features in the spectrogram images. Supervised deep learning-based methods require a large set of labeled data, whereas unsupervised learning does not.
Although deep learning-based methods perform well on audio-related tasks, they require a large set of labeled data to generalize appropriately to test data. To train deep neural networks for new tasks, a large labeled dataset must be constructed. However, labeling a large set of data is labor intensive, especially for audio data. This drawback has motivated the search for efficient learning schemes, such as transfer learning [15,16]. Transfer learning is a method that reuses a network pre-trained on a large, readily available dataset for other relevant tasks. Based on ImageNet [17], a large-scale image benchmark that consists of 14 million images, transfer learning has achieved significant improvements on many image-related tasks [18,19]. However, in the audio domain, no audio benchmark with enough samples for transfer learning is available. Instead, Palanisamy et al. [14] proposed transfer learning from natural images to audio-based images. They trained a CNN using the images and corresponding labels of ImageNet and fine-tuned it on audio benchmark datasets. By achieving state-of-the-art results on three benchmarks (ESC-50 [20], GTZAN [21], and UrbanSound8K [22]), they demonstrated that a network pre-trained on a massive natural image dataset can be used to improve performance on tasks in another domain.
However, the transfer learning method proposed in [14] requires labels along with the images for pre-training. Transfer learning is usually applied when few labels are available for the target task, so requiring additional labels for pre-training reduces its effectiveness. Additionally, extremely large unlabeled image datasets exist, but they cannot be used for transfer learning that requires labels; such transfer learning is limited to labeled datasets, which have far fewer samples than the unlabeled ones. To apply transfer learning across domains (from natural images to the audio domain), we propose to use self-supervised learning methods for pre-training instead of supervised ones. Self-supervised learning is a method for learning general representations of images, without labels, by learning the dissimilarity among different images and the similarity between views of the same image [23][24][25][26][27][28]. Therefore, we pre-trained a network on a large number of unlabeled natural images using a self-supervised scheme and fine-tuned it on audio-based images. Through experiments on sound classification tasks, we demonstrated that transfer learning from natural images to the audio domain via a self-supervised learning scheme can significantly improve performance, to a level similar to that of supervised pre-training. In addition, constructing an unlabeled dataset is significantly easier than constructing a labeled dataset. Therefore, self-supervised transfer learning can yield better results than supervised transfer learning when a considerably larger number of unlabeled images is used.
In summary, our proposed training architecture comprises two steps: first, pre-training a CNN with natural images via self-supervised learning; second, fine-tuning the CNN with labeled 2D-spectrogram audio samples. Our main contributions can be summarized as follows:
• We demonstrated that networks pre-trained with natural images can improve the performance of audio-related tasks, given an appropriate pre-training scheme.
• Networks pre-trained through self-supervised learning have similar effects on the performance of audio tasks as those pre-trained through supervised methods.
• When the self-supervised transfer learning scheme was validated using general sound classification datasets (ESC-50, GTZAN, and UrbanSound8K), the classification accuracy was significantly improved.
The rest of the paper is organized as follows. In Section 2, we introduce the transfer learning and self-supervised learning methods related to our work. In Section 3, we describe the pre-processing method, the self-supervised learning scheme, and our training mechanism with transfer learning and self-supervised learning. Section 4 presents the experimental results on the audio benchmarks and discusses them. Finally, Section 5 concludes the paper.

Transfer Learning on Natural Image Domain
Transfer learning is a method that uses a network pre-trained on a large, readily available dataset as a starting point [15,16]. After pre-training, the network is trained on the target dataset. This simple method can improve performance in various cases, especially in image-related tasks. For instance, Tajbakhsh et al. [15] demonstrated that a CNN trained from scratch exhibits significantly inferior performance for medical image analysis compared with one fine-tuned from a network pre-trained on ImageNet [17]. Marmanis et al. [16] showed that a network fine-tuned with a small fraction of the target samples outperforms networks trained from scratch with all samples. Other studies [18,19] also showed that a pre-trained CNN can improve the discriminativeness of unsupervised image clustering. These results reinforce the concept that CNNs pre-trained on large image datasets can extract well-generalized features that can be transferred to other tasks.

Transfer Learning on Audio Domain
Likewise, some studies used pre-trained CNN models to improve the performance of audio-related tasks [29][30][31]. Choi et al. [29] proposed a transfer learning framework from a network pre-trained for music tagging for general music-related tasks. It utilized the concatenated feature maps of multiple layers in a pre-trained CNN as efficient transferable knowledge and showed that transfer learning using concatenated features improved the performance of various general music-related tasks. Lee et al. [31] proposed music auto-tagging networks that used the aggregated features from pre-trained networks with different shapes and types of audio features. Kong et al. [30] utilized transfer learning by pre-training networks with large-scale audio data and achieved state-of-the-art performance in several audio tasks. This indicates that transfer learning schemes are effective in the audio domain.
However, there is a lack of large-scale public audio datasets for pre-training. Therefore, a new paradigm for transfer learning has emerged in which a network pre-trained with natural images is transferred to audio-related tasks. Palanisamy et al. [14] showed that, even though the pre-trained networks were not designed for audio tasks, a fine-tuned network pre-trained with ImageNet [17] can achieve state-of-the-art results in general sound classification tasks. This result demonstrated that knowledge from natural images can help in audio-based spectrogram analysis and that it can be transferred through pre-training.

Self-Supervised Learning
Pre-training with natural images can improve the performance of audio tasks; thus, we designed transfer learning schemes based on self-supervised learning instead of supervised learning. Through self-supervised learning, useful representations of images can be learned without their corresponding labels, so no additional human labor is required for pre-training. Early studies on self-supervised learning relied on heuristic pretext tasks [32][33][34][35]. For example, Zhang et al. [33] proposed a training framework that estimated the two associated color channels of an image from the given single-channel image in the CIE Lab color space. Noroozi et al. [34] proposed a self-supervised learning method in which the networks were trained to solve jigsaw puzzle-like problems that predicted the manually occluded parts of images. Gidaris et al. [35] pre-trained networks to predict the angles of images that had been randomly rotated in advance. Although these studies suggested that pre-training on heuristic pretext tasks can help in learning general representations of images, these methods require well-defined pretext tasks and do not always work. Recently, contrastive self-supervised learning has demonstrated superior performance in numerous tasks, significantly improving on previous state-of-the-art methods [23][24][25][26][27][28]. Contrastive learning, which learns the similarity between images, was proposed with the Siamese network [36] for one-shot classification. However, it requires labels indicating whether the given images show the same object. To use contrastive learning in a self-supervised setting, augmentations were adopted to generate positive samples, which stand in for same-class samples, without manually annotated labels. Dosovitskiy et al. [24] proposed an automated self-supervised learning scheme without pretext tasks by training a network to estimate the number of original images from the given augmented images. The network could learn the similarity between samples in order to count the original images among the augmented samples. Bachman et al. [26] developed a self-supervised learning scheme that maximized the mutual information between the features extracted from crops taken at multiple locations of an original image. Recently, Chen et al. [28] proposed a semi-supervised learning scheme that used a network pre-trained with self-supervised learning as a teacher and fine-tuned the teacher network using a small fraction of target samples; then, the logits from the fine-tuned teacher network were distilled into the target network using a large number of unlabeled samples and a small number of labeled samples. This procedure achieved a top-1 accuracy of 73.9% on ImageNet using only 1% of the labeled ImageNet data, while a standard supervised ResNet-50 [37] achieved an accuracy of 25.4% under the same conditions. Thus, self-supervised learning can be utilized for pre-training without labels.

Data Pre-Processing
We used a 2D CNN for audio classification; therefore, the 1D waveform was converted into a 2D representation. Previous studies found that mel-spectrograms are a suitable 2D representation of sound for the benchmarks used in this study [14]. Therefore, we transformed the audio clips into mel-spectrograms by applying the short-time Fourier transform (STFT) and mel-filters. The STFT is a Fourier-related transform used to determine the sinusoidal frequency and phase content of local sections of a signal as it changes over time. In Equation (1), $x_{raw}$ and $x_{spec}$ denote the 1D raw waveform and the 2D spectrogram, respectively. When a raw discrete digital signal is input to the STFT, a segment of the raw signal $x[n]$ and the window function ($W[t] = e^{-j 2\pi f n}$) are used to calculate the spectrum, where $t$ and $f$ indicate time and frequency, respectively. $N$, the number of bins for the STFT, was set to 4410 for ESC-50 and 2205 for the other datasets.
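The STFT step described above can be illustrated with a minimal pure-Python sketch. It is not the implementation used in this study: the Hann window, the frame length of 64 samples, the hop of 16 samples, and the function names `hann_window` and `stft_magnitude` are all assumptions chosen to keep the example small.

```python
import cmath
import math


def hann_window(length):
    """Periodic Hann window (an assumed choice; the text does not name one)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / length) for n in range(length)]


def stft_magnitude(signal, n_fft, hop):
    """Return |STFT| as a list of frames, each with n_fft // 2 + 1 frequency bins."""
    window = hann_window(n_fft)
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        segment = [signal[start + n] * window[n] for n in range(n_fft)]
        spectrum = []
        for k in range(n_fft // 2 + 1):
            # Naive DFT of the windowed segment at frequency bin k.
            acc = sum(
                segment[n] * cmath.exp(-2j * math.pi * k * n / n_fft)
                for n in range(n_fft)
            )
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames


# Toy input: a sinusoid that completes exactly 8 cycles per 64-sample frame.
n_fft, hop = 64, 16
tone = [math.sin(2.0 * math.pi * 8 * n / n_fft) for n in range(256)]
spec = stft_magnitude(tone, n_fft, hop)
peak_bins = [max(range(len(frame)), key=frame.__getitem__) for frame in spec]
print(len(spec), len(spec[0]), peak_bins)
```

Each row of `spec` is one time frame and each column one frequency bin, which is the 2D layout that the subsequent mel-filtering operates on. In practice an FFT-based library routine would replace the naive DFT loop.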
After the STFT, mel-filter banks generated by the mel-filter function ($H_m(k)$) were applied to extract features from the frequency domain with greater focus on low-frequency regions. Here, $f(m)$ denotes the frequency in hertz calculated from the mel value $m$, and the mel-filter bank is the collection of mel-filters for the various $k$. As illustrated in Figure 1, the filters are placed more densely in the low-frequency region than in the high-frequency region to emphasize differences at low frequencies, inspired by the human auditory system.
Figure 1. Visualization of the mel-filter bank over the frequency range 300-8000 Hz. A mel-filter bank is a collection of filters with different bandwidths. Filters in the low-frequency region are denser than those in the high-frequency region, which is inspired by the human auditory system and emphasizes low-frequency differences.
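To make the denser low-frequency spacing concrete, the following sketch places filter centers at equal intervals on the mel scale and converts them back to hertz. It assumes the common HTK-style conversion $m = 2595 \log_{10}(1 + f/700)$, which may differ from the exact form used in this study; the function names and the choice of 10 filters are hypothetical, while the 300-8000 Hz range follows Figure 1.

```python
import math


def hz_to_mel(freq_hz):
    """HTK-style Hz-to-mel conversion (assumed; other mel variants exist)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filter_points(f_min, f_max, n_filters):
    """Edge and center frequencies in Hz, equally spaced on the mel scale."""
    mel_min, mel_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (mel_max - mel_min) / (n_filters + 1)
    # n_filters + 2 points: the lower edge, each filter's center, the upper edge.
    return [mel_to_hz(mel_min + i * step) for i in range(n_filters + 2)]


def triangular_filter(freq_hz, left, center, right):
    """Response of one triangular mel filter at freq_hz."""
    if left < freq_hz <= center:
        return (freq_hz - left) / (center - left)
    if center < freq_hz < right:
        return (right - freq_hz) / (right - center)
    return 0.0


points = mel_filter_points(300.0, 8000.0, n_filters=10)
gaps = [b - a for a, b in zip(points, points[1:])]
print([round(p) for p in points])
print([round(g) for g in gaps])
```

The gaps between neighboring points grow with frequency, so the filters are narrow and closely packed at low frequencies and wide at high frequencies.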
The mel-spectrogram, shown in Figure 2, is generated by passing the input spectrogram through the mel-filters, as calculated by Equation (4). Here, $x_{mel}$ indicates the mel-spectrogram; $T$ and $F$ denote the time scale and the number of frequency bins, respectively.
Figure 2. Outline of the pre-processing method that converts the 1D waveform into a 2D mel-spectrogram. Because we utilized a 2D convolutional neural network (CNN) pre-trained on ImageNet, three different single-channel mel-spectrograms were concatenated to generate the 3-channel format.
Through mel-spectrogram generation, the 1D audio signal is expanded into two dimensions (time and frequency). As a result, the spectrogram and mel-spectrogram each have a single channel in the sense of image dimensions. However, natural images have three channels per pixel; therefore, the mel-spectrogram must be converted into three channels for transfer learning. As proposed in a previous study [14], we generated three different mel-spectrograms from a single signal by applying three settings with different window sizes and hop lengths to the same signal (Figure 2). For the first, second, and third channels, mel-spectrograms generated with a window size and hop length of 25 and 10 ms, 50 and 10 ms, and 100 and 50 ms, respectively, were used. Owing to the differences in window size and hop length, the mel-spectrograms were of different sizes. Therefore, all mel-spectrograms were resized to the size of the largest one. As a result, the mel-spectrograms used for ESC-50 and UrbanSound8K had an input size of 3 × 128 × 250, and those for GTZAN had an input size of 3 × 128 × 1500, which is significantly larger than the size of typical images.
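The channel-stacking step can be sketched as follows. The interpolation method used for resizing is not specified in the text, so nearest-neighbour resizing along the time axis is an assumption here, and the helper names and the tiny constant-valued spectrograms are hypothetical stand-ins for real mel-spectrograms.

```python
def resize_time_axis(spec, target_frames):
    """Nearest-neighbour resize of a [n_mels][n_frames] spectrogram along time."""
    n_frames = len(spec[0])
    return [
        [row[(t * n_frames) // target_frames] for t in range(target_frames)]
        for row in spec
    ]


def stack_three_channels(specs):
    """Resize each single-channel spectrogram to the widest one and stack them."""
    target_frames = max(len(spec[0]) for spec in specs)
    return [resize_time_axis(spec, target_frames) for spec in specs]


def constant_spectrogram(n_mels, n_frames, value):
    """Hypothetical stand-in for a real mel-spectrogram, filled with one value."""
    return [[value] * n_frames for _ in range(n_mels)]


# Toy spectrograms standing in for the three window/hop settings; the coarser
# hop of the third setting yields fewer time frames than the first two.
channels = [
    constant_spectrogram(n_mels=4, n_frames=10, value=1.0),
    constant_spectrogram(n_mels=4, n_frames=10, value=2.0),
    constant_spectrogram(n_mels=4, n_frames=2, value=3.0),
]
image = stack_three_channels(channels)
shape = (len(image), len(image[0]), len(image[0][0]))
print(shape)
```

The result has the channel-first layout (channels × mel bins × frames) expected by an ImageNet-style CNN.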

Deep Convolutional Neural Network for Sound Event Detection
We utilized ResNet-50 [37], which has been widely used in many studies [14,38], as the baseline network for our classification task. ResNet includes several residual blocks that consist of convolutional layers, batch normalization layers, and ReLU activation functions. They are used to extract useful features through repeated filters and to stabilize the training process. In addition, ResNet has skip connections that merge the input and output of each residual block to prevent vanishing gradients as the network goes deeper. Owing to the skip connections, ResNet architectures can be extended to deeper networks and achieve significant improvements by extracting more precise features using a series of convolutional filters. Additionally, the overall architecture can be separated into two parts: an encoder ($f(\cdot)$) and a linear projection head ($g(\cdot)$). The encoder extracts 2-dimensional features from the 2D input by sliding convolutional filters with learnable parameters. The linear projection head projects the embedded features into the target domain, which is classification in Figure 3.
Figure 3. Overall architecture of ResNet-50 that takes the mel-spectrogram as input. ResNet includes several residual blocks that consist of convolutional layers, batch normalization layers, and ReLU activation functions. They are used to extract spatial features from the input mel-spectrogram by sliding the convolutional kernels in the convolutional blocks. The extracted features are categorized into target classes by the classification layers, which consist of linear layers and ReLU activation functions.

Self-Supervised Learning for Pre-Training
In comparison to image datasets, there is a lack of public audio datasets. Therefore, we developed a transfer learning scheme from natural images to the audio domain to deliver well-tuned representations. As a pre-training method, we utilized a self-supervised learning scheme that does not require additional labels for extracting useful representations from samples. We conducted our experiments using the SimCLR framework [27] as the self-supervised learning method (Figure 4).
In the SimCLR framework, each image sample is randomly augmented into two new images via cropping and resizing, color distortion, and Gaussian blur (Figure 5). After the augmentation, the two images are input to the encoder ($f(\cdot)$) and the linear projection head ($g(\cdot)$), and the output ($z_i$) of the self-supervised network is defined as $g(f(x_i))$. The contrastive loss ($L_{contrast}$) is constructed to minimize the distance between the output features ($z_i$ and $z_j$ in Equation (6)) from the same image and maximize the distance between the output features from different images. The contrastive loss [27] is defined in Equations (5) and (6).
The main difference between supervised learning and self-supervised learning is that labels are required for training in supervised schemes but not in self-supervised ones. Equation (7) is the cross-entropy loss, which is the most widely used training objective for supervised learning. In Equation (7), $x_i$ and $y_i$ denote the input data and the target label, respectively. To train the network using the cross-entropy loss ($L_{CE}$), the target label ($y_i$) that corresponds to the input data ($x_i$) must be given. However, in the self-supervised learning framework, only the pairs of images ($x_i$ and $x_j$) generated by augmentation are required to calculate the output vectors ($z_i$ and $z_j$); the contrastive loss ($L_{contrast}$) in Equations (5) and (6) can then be computed from those output vectors by measuring the similarity between them, without any labels.
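The contrast between the two objectives can be sketched in a few lines of pure Python. The contrastive term below assumes the normalized temperature-scaled cross-entropy (NT-Xent) form with cosine similarity that SimCLR [27] describes; the temperature of 0.5, the pairing convention (consecutive entries form a positive pair), and the function names are assumptions made for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent_loss(z, temperature=0.5):
    """NT-Xent loss over 2N projections; z[2k] and z[2k + 1] are a positive pair."""
    total = 0.0
    for i in range(len(z)):
        j = i + 1 if i % 2 == 0 else i - 1  # index of the positive partner
        logits = {
            k: cosine_similarity(z[i], z[k]) / temperature
            for k in range(len(z))
            if k != i
        }
        log_denominator = math.log(sum(math.exp(v) for v in logits.values()))
        total += log_denominator - logits[j]
    return total / len(z)


def cross_entropy_loss(logits, label):
    """Supervised counterpart: requires the integer class label of the sample."""
    log_sum = math.log(sum(math.exp(v) for v in logits))
    return log_sum - logits[label]


a, b = [1.0, 0.0], [0.0, 1.0]
aligned = nt_xent_loss([a, a, b, b])     # both views of each image agree
mismatched = nt_xent_loss([a, b, a, b])  # views of the same image disagree
print(aligned < mismatched)
```

Note that `nt_xent_loss` receives only projection vectors, whereas `cross_entropy_loss` cannot be evaluated without `label`.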

Transfer Learning from Natural Images to Audio Domain
To transfer the knowledge learned from massive unlabeled natural images, we first trained the encoder ($f(\cdot)$) and the linear projection head ($g(\cdot)$) using the contrastive loss described in Equations (5) and (6). The encoder network consists of a series of convolutional layers that extract the image representations, and the linear projection head consists of a single fully connected layer that transforms the extracted representations into the target domain. When the pre-trained network is transferred to the audio domain, the encoder network with its pre-trained weights is fine-tuned on the target spectrogram data, and the linear projection head is trained from scratch to map the newly learned representations into the target domain. Figure 4 illustrates our self-supervised transfer learning scheme, which has two stages: a pre-training stage using unlabeled natural images and a fine-tuning stage using the target domain's data.
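The hand-over between the two stages can be illustrated with a framework-free sketch in which a model is simply a dictionary of parameters. Everything here is hypothetical (the dictionary layout, the helper names, the toy dimensions, and the initialization range); a real implementation would use a deep learning framework's own checkpoint loading.

```python
import copy
import random


def init_linear(in_features, out_features, rng):
    """Fresh fully connected layer: small random weights and zero biases."""
    weight = [
        [rng.uniform(-0.01, 0.01) for _ in range(in_features)]
        for _ in range(out_features)
    ]
    return {"weight": weight, "bias": [0.0] * out_features}


def prepare_for_finetuning(pretrained, feature_dim, num_classes, seed=0):
    """Keep the pre-trained encoder, drop the old head, and attach a new head."""
    rng = random.Random(seed)
    return {
        # Encoder weights are copied over and stay trainable during fine-tuning.
        "encoder": copy.deepcopy(pretrained["encoder"]),
        # The head from contrastive pre-training is discarded; a new one is
        # initialised from scratch with one output per target class.
        "head": init_linear(feature_dim, num_classes, rng),
    }


# Hypothetical toy checkpoint: a 4-dimensional encoder output and an
# 8-dimensional projection head left over from the pre-training stage.
pretrained = {
    "encoder": {"conv1": [0.1, -0.2, 0.3]},
    "head": init_linear(4, 8, random.Random(1)),
}
model = prepare_for_finetuning(pretrained, feature_dim=4, num_classes=50)
print(len(model["head"]["weight"]), len(model["head"]["weight"][0]))
```

Only the encoder carries knowledge across domains; the size of the new head depends solely on the number of target classes.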

t-Stochastic Neighbor Embedding Analysis
t-Stochastic Neighbor Embedding (t-SNE) is a non-linear dimension reduction technique that maps high-dimensional features into a low-dimensional space [39]. In deep learning, t-SNE is commonly used to visualize embedded features to check whether a network is well trained for classification: if the network is well trained, embedded features from samples of the same class lie close together, whereas features from samples of different classes lie far from each other. t-SNE reduces the dimension of vectors by assigning a high probability to the points located near a reference point. After assigning the probabilities based on distances, it defines a similar probability distribution over the points in the low-dimensional space by minimizing the Kullback-Leibler (KL) divergence between the two distributions in the high- and low-dimensional spaces.
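The quantity that t-SNE minimizes can be illustrated for a single reference point. This sketch is deliberately simplified and is not a full t-SNE: it uses a fixed Gaussian bandwidth instead of a perplexity search, conditional rather than symmetrized joint probabilities, and compares two hand-picked 1D layouts instead of optimizing one. All function names and data are hypothetical.

```python
import math


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def gaussian_probabilities(points, ref_index, sigma=1.0):
    """Neighbour probabilities around a reference point in the high-dim space."""
    weights = {
        j: math.exp(-squared_distance(points[ref_index], p) / (2.0 * sigma ** 2))
        for j, p in enumerate(points)
        if j != ref_index
    }
    total = sum(weights.values())
    return {j: w / total for j, w in weights.items()}


def student_t_probabilities(points, ref_index):
    """Heavy-tailed neighbour probabilities used in the low-dimensional map."""
    weights = {
        j: 1.0 / (1.0 + squared_distance(points[ref_index], p))
        for j, p in enumerate(points)
        if j != ref_index
    }
    total = sum(weights.values())
    return {j: w / total for j, w in weights.items()}


def kl_divergence(p, q):
    """KL(P || Q) for two distributions given as dicts over the same keys."""
    return sum(p[j] * math.log(p[j] / q[j]) for j in p if p[j] > 0.0)


high_dim = [[0.0, 0.0, 0.0], [0.5, 0.0, 0.0], [3.0, 3.0, 3.0]]
p = gaussian_probabilities(high_dim, ref_index=0)
good_map = [[0.0], [0.5], [5.0]]  # keeps point 1 close to point 0
bad_map = [[0.0], [5.0], [0.5]]   # swaps the near and far neighbours
kl_good = kl_divergence(p, student_t_probabilities(good_map, 0))
kl_bad = kl_divergence(p, student_t_probabilities(bad_map, 0))
print(kl_good < kl_bad)
```

A layout that preserves which points are near the reference yields the smaller divergence, which is why well-separated classes in feature space appear as separate clusters in the 2D plot.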

Datasets
We conducted experiments to validate self-supervised transfer learning using the following three general audio benchmark datasets: ESC-50, UrbanSound8k, and GTZAN Dataset; they are constructed for environmental sound classification, noise-like urban sound classification, and music genre classification, respectively.
(1) ESC-50: It [20] is composed of 2000 audio clips (duration = 5 s) that are labeled into 50 classes of environmental sound, such as door knock, dog, and rain. Each class has 40 audio clips, and each clip was sampled at 44.1 kHz.
(2) UrbanSound8K: It [22] is composed of 8732 audio clips of urban sound. They are labeled into 10 classes, namely, air conditioner, car horn, children playing, dog bark, drilling, engine idling, gun shot, jackhammer, siren, and street music. Each class has 800-1000 clips; each clip was sampled at a rate of 16-44.1 kHz. To use the audio samples as input for deep learning, we fixed their length to 4 s by resizing.
(3) GTZAN Dataset: It [21] is composed of 1000 music clips labeled with 10 classes of music genres. Each class has 100 clips, and each clip was sampled at 22.05 kHz. All clips are 30 s long.

Experiments Settings
ResNet-50 [37] was set as the baseline network; the same parameter settings as in a previous study [14] were used. The Adam optimizer was used with a learning rate of 1e-4 and a weight decay of 1e-3. In addition, the batch size was set to 32, and the networks were trained on an NVIDIA TITAN-Xp GPU for 70 epochs. For the pre-trained networks, we used ResNet-50 pre-trained with ImageNet via supervised [17] and self-supervised [27] methods. Furthermore, we validated the results through k-fold cross-validation using the predefined training/validation folds for the three benchmarks, namely, GTZAN, ESC-50, and UrbanSound8K. The number of folds (k) was 1 for GTZAN, 5 for ESC-50, and 10 for UrbanSound8K.
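Cross-validation with predefined folds differs from random k-fold splitting in that each clip already carries a fold identifier. A minimal sketch of this protocol is given below; the function names and the toy fold assignment are hypothetical and do not correspond to any specific benchmark.

```python
def predefined_fold_splits(fold_ids):
    """Yield (held_out_fold, train_indices, val_indices) for each predefined fold."""
    for held_out in sorted(set(fold_ids)):
        train = [i for i, f in enumerate(fold_ids) if f != held_out]
        val = [i for i, f in enumerate(fold_ids) if f == held_out]
        yield held_out, train, val


def mean_accuracy(per_fold_accuracy):
    """Average the validation accuracy over all folds."""
    return sum(per_fold_accuracy) / len(per_fold_accuracy)


# Toy example: 10 clips assigned to 5 hypothetical folds.
fold_ids = [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
splits = list(predefined_fold_splits(fold_ids))
for held_out, train, val in splits:
    print(held_out, train, val)
```

The network is retrained from the same initialization for every split, and the reported accuracy is the mean over the held-out folds.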

Evaluation Results on GTZAN, ESC-50, and UrbanSound8K
It is evident from Table 1 that the pre-trained networks achieved superior classification performance in comparison to the vanilla network. Well-learned features from natural images improve the performance of audio-related tasks [14]. However, the previous study used supervised transfer learning, which pre-trained the networks with a massive image dataset with corresponding labels. Unlike supervised transfer learning, our self-supervised transfer learning scheme does not require any additional label for pre-training and achieves a similar level of accuracy as supervised learning. For ESC-50 and UrbanSound8K, the accuracy of the supervised method was slightly higher than that of the self-supervised method (0.40%p and 0.04%p increase in average accuracy, respectively). Both pre-trained networks demonstrated the same accuracy for GTZAN on average.
In addition, the loss of the pre-trained networks converged faster than that of the vanilla network (by almost 10 epochs). The training losses of the networks pre-trained with supervised and self-supervised learning showed similar patterns, which indicates the similarity between the supervised and self-supervised pre-training mechanisms with respect to transfer learning from natural images to the audio domain (Figure 6). In these experiments, both pre-trained networks were trained with the same number of natural images (ImageNet). Annotating labels for numerous images requires significant human labor, which limits the construction of massive pre-training image datasets [17,27]. However, self-supervised learning does not require labels and can therefore utilize considerably larger image datasets [27]. Although the network trained with self-supervised learning in [27] showed lower performance on the ImageNet classification task than the one trained with supervised learning, it was the first report that a self-supervised pre-training scheme can achieve above 70% accuracy on ImageNet. From our results, we further demonstrated that transfer learning across domains using self-supervised pre-training can achieve performance similar to that of the supervised pre-training scheme. Therefore, pre-training with self-supervised learning is simpler and more efficient for transfer learning from natural images to the audio domain.

Linear Evaluation of Self-Supervised Learning Model in Audio Domain
To measure the effectiveness of pre-training with self-supervised learning, we froze the encoder trained by self-supervised learning and trained only the single fully connected (FC) layer attached to the encoder. The accuracy of linear evaluation was 77.00% and 63.40% for GTZAN and ESC-50, which is 89.02% and 81.02% of the accuracy of the vanilla network for each dataset, respectively (Table 2). For UrbanSound8K, linear evaluation achieved an accuracy of 74.24%, which is only 1.87%p less than that of the vanilla network. As illustrated in Figure 6, the training loss of the linear evaluation network also decreased and converged at a similar time to that of the vanilla network, especially for UrbanSound8K. These results demonstrate that a network pre-trained with natural images can perform the sound classification task by simply training the FC layer. The backbone network, ResNet-50, comprises 35,318,034 parameters, and the FC layer has 102,450 parameters, which is 0.3% of the total network. By training only the FC layer's weights in the self-supervised pre-trained network, it achieved 80-90% of the vanilla network's accuracy; that is, 0.3% of the total network (the FC layer) was trained using the labeled target mel-spectrogram images, and the remaining 99.7% of the parameters (the encoder) were pre-trained using natural images without labels and kept fixed during transfer learning. This result implies that the encoded features of natural images help sound classification tasks and can be utilized for transfer learning across domains.
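The parameter counts quoted above can be checked with simple arithmetic. The sketch assumes that the FC layer maps ResNet-50's 2048-dimensional pooled feature vector to 50 output classes (that is, the ESC-50 configuration), which is consistent with the reported 102,450 parameters but is not stated explicitly in the text.

```python
FEATURE_DIM = 2048             # width of ResNet-50's final pooled feature vector
NUM_CLASSES = 50               # assumed: the ESC-50 configuration
TOTAL_PARAMETERS = 35_318_034  # total reported in the text


def linear_layer_parameters(in_features, out_features):
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features


fc_parameters = linear_layer_parameters(FEATURE_DIM, NUM_CLASSES)
trainable_fraction = fc_parameters / TOTAL_PARAMETERS
print(fc_parameters)                       # 102450
print(f"{100 * trainable_fraction:.2f}%")  # 0.29%
```

Under this assumption, the trainable share in linear evaluation is about 0.29% of the network, which rounds to the 0.3% stated above.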

t-SNE Analysis
Using t-SNE [39], we visualized the output features of residual block 4 of the vanilla network, the network pre-trained with supervised learning, and the network pre-trained with self-supervised learning. t-SNE captures the relevant structure of high-dimensional features and projects them into a low-dimensional space such that neighboring points in the high-dimensional space tend to remain neighbors in the low-dimensional space. As illustrated in Figure 7, the features from the pre-trained networks clustered more tightly for samples of the same class than those from the vanilla network. Samples from classes 1 and 4 (colored dark blue and red, respectively) showed especially remarkable differences between the vanilla and pre-trained networks. In addition, both pre-trained networks exhibited similar patterns in the t-SNE results, which indicates that pre-training with the self-supervised scheme, without labels, enhances the network representations in a way similar to the supervised scheme.

Conclusions
We studied self-supervised transfer learning from natural images to audio images based on the assumption that well-tuned features learned from a large number of natural images with self-supervised learning can be transferred across domains. To extract the well-tuned features, we pre-trained the CNN using self-supervised learning methods that learn the similarity between images without labels; afterwards, the network was fine-tuned using the target audio mel-spectrograms. The CNN pre-trained with natural images using the self-supervised scheme achieved high performance, at a level similar to that of the supervised scheme, without the corresponding labels. Both pre-training schemes outperformed the vanilla network by a large margin. Therefore, our approach can be applied to general tasks in the audio domain to achieve significant performance improvements by simply training the target networks from pre-trained networks; additional labeling or computational resources, similar to ImageNet pre-training in the image domain, are not required. In addition, this can considerably benefit the construction of large pre-training datasets without labels; we will validate our self-supervised pre-training scheme using larger natural image datasets without labels, such as ImageNet-21K. We believe that our self-supervised transfer learning across domains can be generalized to other tasks, such as medical imaging analysis and spectral imaging analysis, which are fields in which it is hard to obtain a sufficient amount of data.
Author Contributions: Conceptualization, original draft preparation, experiments, S.S.; investigation, J.K. and Y.Y.; dataset pre-processing, S.L.; project supervision and paper writing, K.L. All authors have read and agreed to the published version of the manuscript.
Funding: This work was supported by the Institute for Information & Communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-01335), for the development of AI technology to generate and validate the task plan for assembling furniture in real and virtual environments by understanding unstructured multi-modal information.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.

Data Availability Statement: The data presented in this study are openly available in [20][21][22].