Deformer: Denoising Transformer for Improved Audio Music Genre Classification

Abstract: Audio music genre classification categorizes audio music into genres. Traditional approaches based on convolutional recurrent neural networks do not consider long-term temporal information, and their sequential structures result in longer training times and convergence difficulties. To overcome these problems, a transformer-based approach was introduced. However, this approach employs pre-training based on momentum contrast (MoCo), a technique that increases computational costs owing to its reliance on extracting many negative samples and its use of highly sensitive hyperparameters. Consequently, this complicates the training process and increases the risk of learning imbalances between positive and negative sample sets. In this paper, a method for audio music genre classification called Deformer is proposed. Deformer learns deep representations of audio music data through a denoising process, eliminating the need for MoCo and additional hyperparameters, thus reducing computational costs. In the denoising process, it employs a prior decoder to reconstruct the audio patches, thereby enhancing the interpretability of the representations. By calculating the mean squared error loss between the reconstructed and real patches, Deformer can learn a more refined representation of the audio data. The performance of the proposed method was experimentally compared with that of two baseline models: one based on S3T and one employing a residual neural network-bidirectional gated recurrent unit (ResNet-BiGRU). Deformer achieved an accuracy of 84.5%, surpassing both the ResNet-BiGRU-based (81%) and S3T-based (81.1%) models.


Introduction
Developments in multimedia technology have resulted in a sharp increase in the variety of digital music and its listening volume, necessitating urgent advancements in music information retrieval (MIR), which involves utilizing computer technology to automatically analyze, recognize, retrieve, and understand music. Audio music genre classification is an MIR task that involves assigning labels to each piece of music based on characteristics such as genre [1,2], mood [3,4], and artist type [5,6]. Audio music genre classification enables the automatic categorization of audio music based on different styles or types, facilitating a deeper understanding and organization of music libraries.
The evolution of deep learning has profoundly affected music genre classification, ushering in an era of automatic feature learning. Convolutional neural networks (CNNs) are proficient in discerning the complex spatial features inherent in audio data [7][8][9][10]. However, they are limited in their ability to account for the long-term temporal information inherent in musical compositions. To address this limitation, convolutional recurrent neural networks (CRNNs) [11][12][13], which combine the strengths of both CNNs and recurrent neural networks (RNNs), are employed in music classification. In the specific context of music genre classification, CRNNs have demonstrated a marked advantage over CNNs, proficiently discerning both localized features and short-term temporal inter-relationships. Unfortunately, CRNNs still struggle to capture the long-term temporal dependencies that are often crucial in complex musical compositions.
Transformer-based music genre classification approaches, which are fortified with attention mechanisms, have been introduced to address these issues; they have achieved success, particularly in recognizing long-term information in music. Various transformer-based models, such as MusicBERT [14] and MidiBERT [15], have been developed to focus on different aspects of music genre classification. MusicBERT is equipped with specialized encoding and masking techniques that capture complex musical structures, whereas MidiBERT focuses on single-track piano scores. These models can effectively recognize long-term dependencies in music, especially in the context of symbolic music data such as the Musical Instrument Digital Interface (MIDI). Most existing transformer-based models for music classification are primarily tailored for symbolic music data such as MIDI, and there is a notable lack of models that can handle continuous audio data.
A Swin transformer-based approach has emerged as a targeted solution to solve the issues of traditional transformer-based models in handling continuous high-dimensional audio data [16]. This advanced architecture employs a pre-training strategy known as momentum contrast (MoCo), which is a form of contrastive learning. This strategy aims to create similar representations for similar data points while pushing dissimilar data points apart in the feature space by maintaining a dynamic dictionary. Unfortunately, the MoCo pre-training strategy presents its own set of challenges. First, it incurs significantly increased computational costs, owing to the need to maintain and update this large dictionary. Second, the contrastive loss function can be sensitive to hyperparameter choices, thereby complicating the model optimization process. Third, MoCo-based approaches typically suffer from low interpretability, making it difficult to understand the model decisions or identify the learned features that contributed to the classification results.
Additionally, denoising has been extensively researched. Denoising approaches based on self-supervised learning [17,18] via the noise-removal process can effectively capture features and learn deep representations. They have many similarities with self-supervised pre-training strategies, thus making the integration of the denoising concept into pre-training feasible.
In this paper, a novel method for audio music genre classification is proposed. The proposed method is characterized by denoising, which not only reduces computational costs compared to MoCo-based strategies but also offers a robust performance that is uninhibited by hyperparameter dependency. Uniquely, the proposed method incorporates a prior decoder, which substantially enhances the interpretability of the decision-making process. The main contributions of this method are as follows.

• The proposed method includes a novel pre-trained model called Deformer and utilizes unsupervised learning to fully leverage unlabeled data for pre-training.
• The proposed method design includes a prior decoder that assists Deformer in completing the pre-training effectively; it harnesses the potential of transformers in processing image-like audio data. Notably, this prior decoder improves the interpretability of the results obtained by the method.
• The proposed method was experimentally proven to not only lower the computational cost but also achieve better results compared with existing approaches.
The remainder of this paper is organized as follows. Section 2 describes related work on audio-based music genre classification, and Section 3 introduces the proposed music-classification method based on audio data. Then, Section 4 details the experimental process and results. Finally, Section 5 concludes the paper.

Related Works
This section provides an overview of the evolution of classification techniques for musical audio data, tracing the transition from methods relying on manual feature extraction to end-to-end models.

Music Genre Classification Based on Audio Data
Researchers achieved music classification by converting audio data into spectrograms and Mel-frequency cepstral coefficient (MFCC) images and subsequently applied texture analysis approaches for feature extraction. Various classifiers, such as K-nearest neighbors (KNN), Gaussian models, and support vector machines (SVMs), were utilized for classification. Notably, the KNN algorithm was effective in classifying classical music [19]. This approach was extended by introducing local binary patterns (LBPs) to extract textural features from spectrograms [20]. The extended version explored partitioning techniques to capture local information, emphasizing the importance of local features in enhancing the classification performance. Although these approaches were effective in specific scenarios, they were generally constrained by their focus on timbral features and failed to capture aspects of music that were potentially crucial for a more comprehensive understanding and music classification.
Informative musical patterns could be automatically identified using CNNs [7]. Nonetheless, these rudimentary CNN models were restricted in generalizing to previously unencountered music datasets. To overcome the limitations of these models in generalization and handling long-term temporal information, researchers proposed a hybrid model that combined residual neural networks (ResNets) and gated recurrent units (GRUs). This model used visual spectrograms as inputs and aimed to analyze music data more comprehensively. This approach [13] could improve the performance of music-recommendation systems through more accurate genre classification, thereby addressing the shortcomings of traditional machine learning and basic CNN models in handling the complexity of music data.
By contrast, an approach has been proposed using S3T [16], which is a self-supervised pre-training approach with the Swin Transformer that effectively handles long-term information in music classification. This approach primarily aimed to learn meaningful music representations from a large corpus of unlabeled music data. It employed the momentum-based paradigm MoCo to serve as a feature extractor in the time-frequency domain of music and utilized a music data-augmentation pipeline and two specially designed preprocessors to further optimize the learning of music representations. However, this approach faced challenges such as an increased computational overhead owing to the management of large dynamic dictionaries, sensitivity to hyperparameter selection in the optimization process, and a lack of model interpretability.
Progress in music classification has shifted from a focus on timbral features to more advanced and automated feature identification and extraction. Each stage of this evolution was associated with different challenges, ranging from limited generalization and a narrow focus on certain musical aspects to increased computational demands and complex training requirements. This underscores the ongoing need for more accurate and computationally efficient solutions that balance model complexity and interpretability.

Comparison of Music Genre Classification Based on Audio Data
The comparison primarily considers two dimensions: data types and model structures. First, regarding input data, some researchers treated music classification as a visual task, converting audio music data into spectrograms [13,16,19,20] and MFCC images [7]. These approaches emphasized the visual characteristics of music data. Spectrograms could capture the local characteristics of audio signals in the time-frequency dimension in an intuitive and computationally efficient manner, thereby providing a robust and information-rich feature representation for music-classification tasks.
Second, regarding model structures, there was an evolution from initially employing classifiers [19], such as KNN, the Gaussian mixture model (GMM), and SVM, to incorporating deep CNN structures and combining them with RNNs. However, RNNs faced challenges such as difficulties in learning long-term dependencies. These shortcomings were addressed by introducing S3T-based models, which offered advantages in capturing long-range dependencies. Unfortunately, these models faced challenges in terms of computational costs and robustness as they evolved from simple to complex and singular to multifaceted. These technological differences and advancements became pivotal in addressing the challenge of effectively capturing and processing the long-term information of music. Notably, the proposed method offers a distinct approach that specifically addresses current limitations by expanding upon existing methodologies. The specifications of the proposed and existing approaches are listed in Table 1.

Denoising Transformer-Based Audio Music Genre Classification
The architecture and training strategies for the Deformer-based method are detailed next. First, the data representation techniques are discussed; then, the pre-training and fine-tuning stages are outlined.

Overview
A method utilizing pre-training techniques based on Deformer is proposed for audio music genre classification. The proposed method consists of pre-training and fine-tuning stages, as shown in Figure 1. First, unlabeled or labeled audio data are preprocessed into a normalized Mel spectrogram. In the pre-training stage, a noise-injection operation is additionally applied, and Deformer learns deep representations of audio music from unlabeled audio data. For this, a prior decoder is utilized to restore the denoised Mel spectrogram from the low-dimensional hidden states, which are obtained from Deformer. In the fine-tuning stage, the pre-trained Deformer and a classifier are further trained using labeled audio data to perform music genre classification. The flowchart for the proposed method is shown in Figure 2.

Preprocessing and Noise Injection
The preprocessing is employed in both the pre-training and fine-tuning stages, as illustrated in Figure 3. Initially, audio data are converted into Mel spectrograms with dimensions $T$ and $F$, corresponding to time and frequency, respectively. These spectrograms are then resized using the librosa library to new dimensions, $W$ and $H$, which are determined based on the experimental hardware. The values of $W$ and $H$ should be chosen carefully, as excessively large dimensions increase computational resource usage. Finally, the resized Mel spectrograms are normalized by scaling the values to the range from zero to one.
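The final normalization step can be illustrated with a minimal sketch. It assumes plain min-max scaling, which is only one way to map values to the range from zero to one; the text does not state the exact formula. The function name `min_max_normalize`, the nested-list representation, and the toy decibel values are illustrative and not taken from the original implementation.

```python
def min_max_normalize(spec):
    """Scale a 2-D spectrogram (list of rows) to the range [0, 1].

    Assumption: min-max scaling; the paper only states that values
    are scaled to fit within zero to one.
    """
    flat = [v for row in spec for v in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0:
        # A constant spectrogram has no contrast; map it to all zeros.
        return [[0.0 for _ in row] for row in spec]
    return [[(v - lo) / span for v in row] for row in spec]


# Toy 2 x 3 "spectrogram" with made-up decibel values.
toy = [[-80.0, -40.0, 0.0], [-60.0, -20.0, -80.0]]
normalized = min_max_normalize(toy)
```

In practice the same operation would be applied to the full resized spectrogram, typically with an array library rather than nested lists.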

Pre-Training Stage
The objective of the pre-training stage is to enable Deformer to understand the deep representation of audio music through unsupervised denoising. The role of the prior decoder within this framework is to restore patches from low-dimensional hidden states obtained from Deformer during pre-training. An autoencoder (AE), which comprises an encoder and a decoder, was designed to train the decoder, as shown in Figure 4. The encoder consists of max-pooling and convolutional layers, and the decoder consists of convolutional and up-sampling layers. The encoder compresses the input patches into low-dimensional vectors, and the decoder restores the low-dimensional hidden states to the original patches. The mean squared error (MSE) loss is calculated to update the AE parameters.
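The MSE loss used to train the autoencoder can be written out as a short sketch. The helper `mse_loss` and the 2 × 2 toy patches are hypothetical; a real implementation would operate on 16 × 16 patches and would normally use a framework-provided loss.

```python
def mse_loss(reconstructed, original):
    """Mean squared error between two equally sized 2-D patches."""
    if len(reconstructed) != len(original):
        raise ValueError("patches must have the same number of rows")
    total, count = 0.0, 0
    for rec_row, orig_row in zip(reconstructed, original):
        if len(rec_row) != len(orig_row):
            raise ValueError("patches must have the same number of columns")
        for r, o in zip(rec_row, orig_row):
            total += (r - o) ** 2
            count += 1
    return total / count


# Toy patches: only one of the four values differs (by 0.5).
original = [[0.0, 0.5], [1.0, 0.25]]
reconstructed = [[0.0, 0.5], [0.5, 0.25]]
loss = mse_loss(reconstructed, original)
```

Here the single squared difference of 0.25 is averaged over four values, so a perfect reconstruction would give a loss of zero.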

Fine-Tuning Stage
In the fine-tuning stage, the Deformer that has already learned the deep representation of audio music is applied to music genre classification. Figure 6 shows the process of fine-tuning the pre-trained Deformer as a classification network. Unlike the pre-training stage, the fine-tuning stage is supervised, and the prior decoder is not utilized. Normalized Mel spectrogram patches $P_1, P_2, P_3, \ldots, P_i, \ldots, P_N$ without added noise are fed into the model, along with a classifier token (CLS) [21]. The CLS token serves as a condensed representation of all input patches. As opposed to pre-training, which involves decoding layers, fine-tuning employs a genre prediction layer connected to the final Deformer position. This layer is a linear component that uses a SoftMax function to predict the probability distribution over the genre classes based on the hidden state of the CLS token. The cross-entropy loss is then computed using the predicted and target genres to fine-tune the Deformer and the genre prediction layer, enhancing Deformer's ability to classify music genres effectively.
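The genre prediction layer and its loss can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the names `genre_logits`, `softmax`, and `cross_entropy` are hypothetical, and the hidden state is shrunk to two dimensions and the genre set to three classes so the numbers stay readable (the paper uses a hidden size of 768 and ten genres).

```python
import math


def genre_logits(cls_hidden, weights, bias):
    """Linear layer: one logit per genre from the classifier-token hidden state."""
    return [sum(w * h for w, h in zip(row, cls_hidden)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target_index):
    """Negative log-probability assigned to the target genre."""
    return -math.log(probs[target_index])


# Toy values (assumptions): 2-D hidden state, 3 genres.
cls_hidden = [1.0, -1.0]
weights = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
bias = [0.0, 0.0, 0.0]

logits = genre_logits(cls_hidden, weights, bias)
probs = softmax(logits)
loss = cross_entropy(probs, target_index=0)
```

During fine-tuning, the gradient of this loss would update both the linear layer and the pre-trained Deformer parameters.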

Experiments and Results
Three experiments, namely, prior decoder training, Deformer pre-training, and Deformer fine-tuning, were performed to thoroughly evaluate the effectiveness of the proposed Deformer-based method in terms of audio music genre classification. First, in the prior decoder training experiment, an autoencoder was trained to convert low-dimensional hidden states into patches of normalized Mel spectrograms. In the Deformer pre-training experiment, Deformer was pre-trained to understand musical deep representations by restoring the original Mel spectrograms from noisy Mel spectrograms. Finally, in the Deformer fine-tuning experiment, the pre-trained Deformer classified the audio music genre through supervised learning. To evaluate the performance and effectiveness of the proposed method, two baseline models are introduced for comparison. The first model [13] utilizes a residual neural network-bidirectional gated recurrent unit (ResNet-BiGRU), while the second relies on S3T [16]. These models serve as benchmarks, helping to underscore the advantages of the proposed technique for music genre classification.

Experimental Environment
Table 2 summarizes all the hyperparameters used in the three experiments. The autoencoder comprises an encoder and a decoder; the encoder consists of three convolutional layers that use the same kernel size but different numbers of channels. The Deformer hyperparameters include 196 patches, a patch size of 16 × 16, a hidden size of 768, an intermediate-size multiplier of four, 12 hidden layers, and 12 attention heads.
During the training of the three models, the resized Mel spectrogram size ($W \times H$) was set to 224 × 224, which can be adjusted according to the hardware of the experimental environment. As mentioned before, noise injection was applied in pre-training. The noise injection ratio $\alpha$ was determined experimentally; the highest classification performance was obtained when $\alpha$ was set to 0.75. Similarly, 0.75 was used as the mask ratio in [22], which also achieved good results. As the input to the prior decoder is a single patch, it allows for a larger batch size than the other models. The parameters of the AdamW optimizer were nearly identical across the three models. When setting the learning rate, it was considered that pre-training requires warmup. Unlike other approaches that use a fixed learning rate, pre-training employed a dynamically changing learning rate based on the WarmupDecayLR scheduler.
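To make the idea of a warmup-then-decay learning rate concrete, the following sketch implements a simplified schedule with a linear warmup followed by a linear decay to zero. It is not DeepSpeed's implementation of WarmupDecayLR, and the function name, argument names, and numeric values (8000 total steps, 1000 warmup steps, a peak rate of 1e-4) are assumptions chosen for illustration, not values from Table 2.

```python
def warmup_decay_lr(step, total_steps, warmup_steps, max_lr, min_lr=0.0):
    """Learning rate at `step`: linear warmup, then linear decay to zero.

    Simplified illustration only; argument names and values are assumptions.
    """
    if step < warmup_steps:
        # Ramp from min_lr up to max_lr over the warmup phase.
        return min_lr + (max_lr - min_lr) * step / warmup_steps
    # Decay linearly from max_lr (end of warmup) to 0 (at total_steps).
    remaining = max(total_steps - step, 0)
    return max_lr * remaining / max(total_steps - warmup_steps, 1)


# Placeholder configuration; not taken from the paper.
steps = (0, 500, 1000, 4500, 8000)
schedule = [
    warmup_decay_lr(s, total_steps=8000, warmup_steps=1000, max_lr=1e-4)
    for s in steps
]
```

The rate rises during the first 1000 steps, peaks at the end of warmup, and then falls steadily, which avoids large updates while the randomly initialized model is still unstable.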
The experiments were conducted on a system running Windows 10 with two Xeon(R) Silver 4310 CPUs, four NVIDIA GeForce RTX 3090 GPUs, and 128 GB of DDR4 RAM. The proposed method was developed in Python 3.10.12 and implemented using PyTorch 2.0.0, complemented by the DeepSpeed acceleration engine for enhanced performance. In addition to these experiments, a comparative assessment was performed with two baseline models. The first baseline model [13] employed a hybrid approach, combining ResNet18 and Bi-GRU. ResNet18 utilizes residual connections and comprises 18 weighted layers: an initial convolutional layer, four residual stages (each containing two blocks of two convolutional layers), and a fully connected layer; it also includes a max-pooling layer and an average pooling layer. Bi-GRU is a recurrent neural network designed for processing sequential data, consisting of a GRU layer and a fully connected layer.
The second baseline model [16] adopted S3T, leveraging the Swin Transformer as a feature extractor in the time-frequency domain of music. It integrates the momentum-based MoCo paradigm. The feature extractor follows the Swin-T configuration, the compact version of the Swin Transformer, with a hidden channel number of 96 and four stages comprising 2, 2, 6, and 2 layers, respectively.

Experimental Data
Two distinct datasets, each divided into an 80% training set and a 20% test set, were employed in the genre classification experiment involving audio music data. The MAESTRO dataset [23] was used for feature extraction via an autoencoder and for pre-training Deformer. This dataset encompasses a broad spectrum of musical instruments and styles, with contributions from both professional and amateur musicians. Mel spectrograms derived from raw audio files were used as inputs. The training set was used for model optimization using techniques such as gradient descent, and the test set was designated for performance evaluation using metrics such as MSE. The GTZAN [24] music dataset was exclusively used to fine-tune Deformer. Renowned in genre classification, this dataset consists of one thousand 30 s audio segments across ten distinct genres, including blues, classical, and hip hop. The audio clips were transformed into Mel spectrograms to serve as inputs for the model. The training process involved iterative Deformer updates based on loss minimization, and the test phase assessed the genre-classification capabilities of Deformer in terms of the precision and recall metrics.

Experimental Results
The results from the autoencoder training, pre-training, and fine-tuning experiments were analyzed. The initial results indicated a rapid loss function convergence, validating the effectiveness of decoder training. Further findings from the pre-training and fine-tuning processes revealed that Deformer exhibited superior performance in music data processing, outperforming the baseline models in multiple key performance metrics.

Prior Decoder Training Results
Figure 7 shows the prior decoder training experiment results, which involved 8000 steps. The MSE loss decreased rapidly from 0.14 to 0.02. Subsequently, the loss continued to decrease at a slower pace, eventually converging to approximately 0.001. Figure 8 shows the test results of the prior decoder experiment and the reference for comparison. Figure 8a shows the Mel spectrograms assembled from the patch output reconstructed by the decoder, while Figure 8b shows the original Mel spectrograms used for comparison with the reconstructed version; subtle local differences can be observed in the areas marked with red boxes. Interestingly, the Mel spectrograms assembled from the reconstructed patches in Figure 8a were almost indistinguishable from the original Mel spectrograms in Figure 8b, demonstrating that the prior decoder could effectively reconstruct the Mel spectrograms from the low-dimensional hidden states.

Pre-Training Results
Figure 9 shows two distinct phases in the loss curve during the model training process. Initially, the loss value rapidly decreased from a higher level to approximately 0.15, after which the rate of decline slowed considerably, and the loss eventually stabilized at approximately 0.01 after approximately 8000 steps. Figure 10 shows the pre-training stage of Deformer using Mel spectrograms constructed from the patches. Figure 10a shows the Mel spectrograms assembled from patches injected with noise, which served as the inputs to the model during the pre-training stage. Figure 10b shows the Mel spectrograms assembled from the denoised patches, which are the outputs of Deformer. Finally, Figure 10c shows the original noise-free Mel spectrograms. The principal features and trends shown in Figure 10c are successfully captured, as shown in Figure 10b, albeit with some loss of detail, demonstrating the capabilities of Deformer in terms of noise reduction and learning meaningful representations of music data. These observations further emphasize the effectiveness of the pre-training stage as well as the preparedness of the pre-trained Deformer for the next fine-tuning stage.
To assess the classification efficacy of Deformer across different music genres, the confusion matrix depicted in Figure 12 is provided. The confusion matrix presents true-positive, true-negative, false-positive, and false-negative results, providing a clear classification performance evaluation. The fact that the predicted results are clearly distributed along the diagonal of the confusion matrix indicates that most of the predictions are correct. Deformer performed best in the "classical" and "pop" categories, with zero misclassifications within these genres. This outcome highlights its acute understanding of the unique attributes associated with these music genres. However, it exhibited inaccuracies within the "rock" and "blues" genres. Specifically, a few samples falling under the "rock" category were incorrectly classified as "blues" and "metal". Likewise, a subset of "blues" samples was inaccurately classified as "jazz" and "metal". These misclassifications suggest potential limitations of Deformer, particularly when differentiating between genres having nuanced or overlapping traits. Analyzing the confusion matrix is vital as it paves the way for prospective refinements and emphasizes the need to improve the discriminatory capabilities of the model when classifying music belonging to closely related genres such as "rock" and "blues".
Table 3 presents the accuracy, precision, recall, and F1 scores for the proposed pre-trained Deformer, Deformer without pre-training used for the ablation experiment, and ResNet-BiGRU and S3T as two baseline models. To complete the comparison, two additional results [25,26] are given, which demonstrated high accuracy in audio music genre classification. The pre-trained Deformer reached a classification accuracy of 84.5%, which is 3.5 percentage points higher than that of ResNet-BiGRU (81%), 3.4 percentage points higher than that of S3T (81.1%), 0.6 percentage points higher than that of M2D [25] (83.9%), and 4.8 percentage points higher than that of the Jukebox model pre-trained with CALM (79.7%) [26]. The pre-trained Deformer significantly outperformed its non-pre-trained counterpart in terms of accuracy, precision, recall, and F1 score, with the latter only achieving an accuracy and recall of 0.37, a precision of 0.3334, and an F1 score of 0.3464. This comparison highlights the importance of pre-training in enhancing the performance of Deformer for music classification. It is worth noting that all data presented in Table 3 were obtained through testing on the GTZAN dataset.
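For readers who wish to relate a confusion matrix to the four reported metrics, the sketch below computes accuracy together with macro-averaged precision, recall, and F1. Macro averaging is an assumption, since the averaging scheme is not stated in the text; the function name `classification_metrics` is hypothetical, and the 3 × 3 matrix is made-up toy data, not the matrix from Figure 12.

```python
def classification_metrics(confusion):
    """Accuracy and macro-averaged precision, recall, and F1.

    confusion[i][j] is the number of samples whose true class is i
    and whose predicted class is j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n))
    precisions, recalls, f1s = [], [], []
    for c in range(n):
        tp = confusion[c][c]
        predicted_c = sum(confusion[i][c] for i in range(n))  # column sum
        actual_c = sum(confusion[c])                          # row sum
        p = tp / predicted_c if predicted_c else 0.0
        r = tp / actual_c if actual_c else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    return {
        "accuracy": correct / total,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Made-up toy matrix with three balanced classes (10 samples each).
toy_confusion = [[8, 2, 0], [1, 9, 0], [0, 1, 9]]
metrics = classification_metrics(toy_confusion)
```

When every class has the same number of test samples, as in this toy matrix, macro-averaged recall equals accuracy, which is consistent with the identical accuracy and recall values reported for the model without pre-training.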

Conclusions
A pre-trained model, Deformer, was introduced to address the specific challenges associated with existing Swin transformer-based approaches in music genre classification within the context of MIR. These challenges include the computational burden associated with managing large dynamic dictionaries, the sensitivity of the contrastive loss function to hyperparameter choices, and the low level of model interpretability commonly observed in MoCo-based approaches. Utilizing a two-stage process of pre-training and fine-tuning, the proposed model leveraged unlabeled audio data during the pre-training stage. The experimental results underscore the significance of incorporating Deformer in the realm of deep learning architectures for audio music classification. The proposed method achieved an accuracy of 84.5%, outperforming the ResNet-BiGRU-based (81%) and S3T-based (81.1%) models. This highlights the contribution of Deformer to improved performance in audio classification over traditional approaches.
Regarding its limitations, the proposed model was not assessed on larger or more diverse datasets, creating gaps in information regarding its generalizability. Future research directions could involve restructuring the architecture of the model to enable it to better handle genres that have subtle similarities, such as "rock" and "blues". The focus should be on enhancing the ability of the model to distinguish between closely aligned genres. Further work should also evaluate the performance of the model across a more diverse set of music genres and use cases. By pursuing these avenues, this research would not only add to the growing literature in the domain of music genre classification but also set a strong performance standard for subsequent investigations.

Figure 1. Overview of the method: pre-training and fine-tuning stages.

Figure 2. Flowchart of pre-training and fine-tuning under the proposed method.

Figure 3. Preprocessing and noise injection of data. Noise injection is additionally applied only in the pre-training stage.

Figure 4. Structure of the autoencoder.

To complete the denoising, patches of noisy Mel spectrograms $\tilde{P}_1, \tilde{P}_2, \tilde{P}_3, \ldots, \tilde{P}_i, \ldots, \tilde{P}_N$ are passed into Deformer, as shown in Figure 5. The position embedding layer, which is trainable, utilizes absolute numerical embedding to integrate positional information into these patches. The transformer layers then process the embedded patches with multi-head self-attention and feed-forward neural networks, producing low-dimensional hidden states. Subsequently, the prior decoder restores these low-dimensional hidden states back into the restored patches $P_1^{*}, P_2^{*}, P_3^{*}, \ldots, P_i^{*}, \ldots, P_N^{*}$. Then, the MSE loss is calculated based on the restored patches $P_1^{*}, P_2^{*}, P_3^{*}, \ldots, P_i^{*}, \ldots, P_N^{*}$ and original patches $P_1, P_2, P_3, \ldots, P_i, \ldots, P_N$ for training Deformer. This training strategy allows Deformer to gain a deep understanding of the contextual relationships and interdependencies among patches.

Figure 5. Pre-training stage of Deformer.

Algorithm 1 describes the pre-training stage, where $\mathrm{Deformer}(\cdot)$ represents Deformer and $\theta$ represents the Deformer parameters. $S$ denotes the number of training steps. $\tilde{P}_i$ represents one of the patches from the noisy Mel spectrograms, and $P_i^{*}$ is the restored patch. $\mathrm{MSE}(\cdot)$ represents the loss function used to calculate the loss $\mathcal{L}$ between $P_i^{*}$ and $P_i$, where $P_i^{*}$ is generated only from the patches injected with noise $\epsilon$ and $P_i$ is the patch of the normalized Mel spectrogram without noise. $\theta$ is updated based on a gradient step, which is calculated as $\eta \cdot \nabla_{\theta}(\mathcal{L})$, where $\eta$ is the learning rate.

Figure 6. Fine-tuning stage of Deformer.

Algorithm 2 details the fine-tuning process, where $P$ represents the normalized patches, $y$ represents the target genre, $\hat{y}$ represents the genre predicted by Deformer, and $\mathrm{Cross\_Entropy}(\cdot)$ calculates the loss between $\hat{y}$ and $y$ for performing updates.

Figure 7. MSE loss in the prior decoder training experiment.

Figure 8. Output of the decoder. (a) Reconstructed Mel spectrograms and (b) original Mel spectrograms. The red boxes indicate subtle differences between (a,b).

Figure 9. MSE loss in the pre-training experiment.

Fine-Tuning Results
Figure 11 shows the loss changes of Deformer during fine-tuning. The orange line (pre-trained) demonstrates a rapid decline in loss during fine-tuning, indicating a high level of learning efficiency. The blue line (without pre-training) exhibits a slower loss decrease. At 20,000 steps, the fine-tuning process based on pre-training demonstrated a significant performance advantage compared to that without pre-training.

Figure 11. Cross-entropy loss in fine-tuning of Deformer and ablation experiment.

Figure 12. Confusion matrix of audio music genre classification performed by Deformer.

Table 1. Differences between existing approaches and the proposed method.

The normalized Mel spectrograms are divided into $N$ equal-sized patches through a matrix-division operation. Each patch is derived by dividing the Mel spectrogram into $N$ sections, resulting in patches with dimensions of $W/\sqrt{N} \times H/\sqrt{N}$, where $W$ and $H$ are the width and height of the transformed spectrogram, respectively. Patches $P_1, P_2, P_3, \ldots, P_i, \ldots, P_N$ follow the order from left to right and top to bottom. The noise ratio $\alpha$ dictates the fraction of patches that receive noise. A fraction $\alpha$ of the patches is injected with noise $\epsilon$ drawn from a Gaussian distribution $\mathcal{N}(0, 1)$; the remaining patches are left unchanged. Finally, a noisy Mel spectrogram, which is the combination of $\tilde{P}_1, \tilde{P}_2, \tilde{P}_3, \ldots, \tilde{P}_i, \ldots, \tilde{P}_N$, is obtained.
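The patch division and noise injection described here can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the noise is assumed to be added to the selected patches (the text does not say whether noise is added to a patch or replaces it), and the patches to corrupt are assumed to be chosen uniformly at random. The sizes (224 × 224 spectrogram, 196 patches, ratio 0.75) follow the experimental settings reported in the paper.

```python
import math
import random


def split_into_patches(spec, n_patches):
    """Split an H x W spectrogram (list of rows) into n_patches equal
    patches, ordered left to right, then top to bottom."""
    side = math.isqrt(n_patches)
    if side * side != n_patches:
        raise ValueError("n_patches must be a perfect square")
    height, width = len(spec), len(spec[0])
    if height % side or width % side:
        raise ValueError("spectrogram size must be divisible by sqrt(n_patches)")
    ph, pw = height // side, width // side
    patches = []
    for top in range(0, height, ph):
        for left in range(0, width, pw):
            patches.append([row[left:left + pw] for row in spec[top:top + ph]])
    return patches


def inject_noise(patches, ratio, rng):
    """Return (noisy_patches, noisy_ids).

    Assumption: N(0, 1) noise is ADDED to a randomly chosen `ratio`
    fraction of patches; all other patches are copied unchanged.
    """
    n_noisy = round(ratio * len(patches))
    noisy_ids = set(rng.sample(range(len(patches)), n_noisy))
    noisy = []
    for i, patch in enumerate(patches):
        if i in noisy_ids:
            noisy.append([[v + rng.gauss(0.0, 1.0) for v in row] for row in patch])
        else:
            noisy.append([row[:] for row in patch])
    return noisy, noisy_ids


# Constant placeholder spectrogram; real inputs are normalized Mel spectrograms.
spec = [[0.5] * 224 for _ in range(224)]
patches = split_into_patches(spec, n_patches=196)
noisy, noisy_ids = inject_noise(patches, ratio=0.75, rng=random.Random(0))
```

With these settings each patch is 224 / 14 = 16 values on a side, and 147 of the 196 patches receive noise while the remaining 49 give the model clean context.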

Table 2. Hyperparameters used in the autoencoder, pre-training, and fine-tuning experiments.

Table 3. Performance comparison: audio music classification models.