3D-DCDAE: Unsupervised Music Latent Representations Learning Method Based on a Deep 3D Convolutional Denoising Autoencoder for Music Genre Classification

Abstract: With unlabeled music data widely available, it is necessary to build an unsupervised latent music representation extractor to improve the performance of classification models. This paper proposes an unsupervised latent music representation learning method based on a deep 3D convolutional denoising autoencoder (3D-DCDAE) for music genre classification, which aims to learn common representations from a large amount of unlabeled data to improve the performance of music genre classification. Specifically, unlabeled MIDI files are applied to 3D-DCDAE to extract latent representations by denoising and reconstructing input data. Next, a decoder is utilized to assist the 3D-DCDAE in training. After 3D-DCDAE training, the decoder is replaced by a multilayer perceptron (MLP) classifier for music genre classification. Through the unsupervised latent representations learning method, unlabeled data can be applied to classification tasks so that the problem of limiting classification performance due to insufficient labeled data can be solved. In addition, the unsupervised 3D-DCDAE can consider the musicological structure to expand the understanding of the music field and improve performance in music genre classification. In the experiments, which utilized the Lakh MIDI dataset, a large amount of unlabeled data was utilized to train the 3D-DCDAE, obtaining a denoising and reconstruction accuracy of approximately 98%. A small amount of labeled data was utilized for training a classification model consisting of the trained 3D-DCDAE and the MLP classifier, which achieved a classification accuracy of approximately 88%. The experimental results show that the model achieves state-of-the-art performance and significantly outperforms other methods for music genre classification with only a small amount of labeled data.


Introduction
In recent years, a series of methods represented by hierarchical and deep layers have been proposed, which give hope for training deep models. These methods have been successful in several application areas, such as music information retrieval [1], computer vision (CV) [2], and natural language processing (NLP) [3]. However, progress with deep learning models relies on large labeled datasets, which leads to two drawbacks. First, supervised learning requires large labeled datasets, which often makes the development of deep learning models expensive. Second, while a deep model excels in solving a given task, there are limitations in its ability to provide insight into the problem domain. Therefore, it is worthwhile to explore an unsupervised latent representation extractor for music genre classification utilizing musicological knowledge.
Early research has shown that a variety of supervised classifiers based on machine learning can meet the requirements of music genre classification, such as support vector machines (SVMs) with radial basis function (RBF) kernel, K-nearest neighbors (K-NN), and naïve Bayes (NB) [4]. Recently, with the increase in the amount of data available, deep learning methods have been widely used in music genre classification, and there has been a tremendous improvement in classification performance. A large number of current efforts are usually combined with features extracted by fixed-parameter feature engineering, such as mel-spectrogram [5], short-time Fourier transform (STFT) [6], etc. Neural networks are then applied for music genre classification [7,8]. With the development of computational power, deep learning models have performed quite well in various research fields. Because of the hierarchical nature of the data itself and the representability of hierarchical features, deep learning models also perform well in music genre classification [9]. In fact, initializing the deep model layer-by-layer can be considered as a process of feature learning. The hidden layers abstract the original input step-by-step to learn latent representations from raw input, so that the performance of the classification model can be improved. However, most of the research still utilizes supervised learning methods.
Although supervised learning methods have been widely utilized for music genre classification, supervised learning methods require sufficient labeled music data, which leads to the limitation that classification models can only understand a given problem domain. Moreover, labeled data requires expert knowledge and a lot of work, and there is no strict fundamental theory to define genres for music, which leads to the problem of ambiguity when tagging genre labels for music. Therefore, it is difficult to obtain sufficient labeled music data. In contrast, a large amount of unlabeled music data is available from the Internet, which is relatively abundant and easily accessible. In this case, unsupervised learning methods have attracted increasing attention in various research fields. Unsupervised learning [10] is applied in music genre classification to improve the performance of classification models. Specifically, a large amount of unlabeled data is used to extract latent representations, and then classification is performed based on the learned latent representations.
In CV and NLP, it has become popular to use unsupervised learning to extract latent representations from large corpora. Even when only a limited amount of labeled data is available for training classification models, common classifiers that utilize latent representations as input can perform well in a given domain. In the research on audio classification [9], which is somewhat similar to music genre classification, an autoencoder is utilized to extract latent representations from a large speech corpus. Then, with the latent representations, a multilayer perceptron (MLP) is utilized to build a speech classifier, which makes the classifier more robust when facing unknown samples. This approach can be summarized as a combination of a trained autoencoder and a popular classifier for a given classification task. First, an unsupervised autoencoder is trained with an assisting decoder. The purpose of the autoencoder is to discover mapping relationships between latent representations and original inputs, a process that is considered encoding. The decoder assists the training of the encoder by reconstructing the input from the latent representations. Second, a classifier is connected after the trained autoencoder for classification. By globally fine-tuning the parameters of the trained model and the classifier, favorable classification performance can be obtained. Compared with end-to-end training, this technique has been shown empirically to keep the model from falling into the poor solutions associated with random initialization.
There are many different types of autoencoders, e.g., vanilla multilayer autoencoders, denoising autoencoders, convolutional autoencoders, and variational autoencoders, in the existing research, whose basic theory is the same. In [11], Zhou et al. proposed a denoising autoencoder to extract latent representations for sound event classification. These latent representations contain the features of spectrograms. By denoising and reconstructing the training data with noise, the autoencoder can be better adapted to real-life data with noise, which can improve the generalization ability of the model.
For music data, there are usually two main music formats. The first music format records changes in audio intensity over time based on sound signals, for example, MP3, WAV, and so on. The second music format is based on the music score representation, which is one type of symbolic music, such as Musical Instrument Digital Interface (MIDI), Humdrum, MusicXML, MEI, and so on [11]. Because the two formats represent data differently, the algorithms for processing them also differ. Symbolic music files include the details of music, such as pitch, duration, start time, chord, the intensity of each note, and so on. When compared to audio files, MIDI provides richer details on music elements, so that deep learning models with musicological insights can be built. Based on these features, music genres can be analyzed and defined separately or jointly. Sarkar et al. [12] proposed a feature set as input based on multi-frequency domain features for music genre classification. This feature set includes timbre, tonality, pitch, and statistical features. Compared with the classification methods using a single feature, better performance is obtained. Therefore, it is advantageous to use MIDI music, which can provide more detailed features. In [13], MIDI files are used as input, and the authors classify music genres based on algorithms such as SVM, decision trees, and random forests. However, the training set is small, which leads to weak generalization ability of the model. Therefore, it is necessary to carry out research on music genre classification combined with a feature extraction algorithm based on unsupervised learning.
In this paper, an unsupervised music latent representation learning method based on a deep 3D convolutional denoising autoencoder (3D-DCDAE) for music genre classification is proposed. Specifically, during MIDI preprocessing, the MIDI files are embedded by MIDI-to-Image (MIDI2Img) to obtain MIDI images considering music features such as pitch, volume, and bar length, which allows the model to consider multiple features of music and multiple tracks. To force the model to consider the musicological structure, a multi-level random noise (MRN) process is applied to MIDI images. Then, combining the 3D-DCDAE model and a decoder, noised MIDI images are denoised and reconstructed. This process enables the 3D-DCDAE to extract latent representations. The inner structure of 3D-DCDAE consists of a deep 3D visual geometry group (VGG) network [14], which can consider the spatial relationship between multiple music tracks. The output of 3D-DCDAE is the latent representations, which include the context of music and use a low-dimensional space to represent high-dimensional inputs. Finally, a multilayer perceptron (MLP) classifier can accomplish music genre classification based on latent representations. Based on the trained 3D-DCDAE, the MLP classifier can converge effectively and obtain better performance even though labeled MIDI files are insufficient and unbalanced. Experiments were carried out on the Lakh MIDI dataset [15] to verify the feasibility of the proposed method. The experimental results show that the 3D-DCDAE extracts robust latent representations of music from a large amount of data and considers the musicological structure of music, resulting in the proposed model having strong generalization ability. As a result, the performance of music genre classification was improved. The core contributions of this paper can be summarized as follows.

• Unlike most existing research, the proposed method utilizes music data in MIDI format as input, which allows the model to consider a variety of music features. In addition, the MRN process is designed based on the musicology structure of music, which allows the model to focus on the unique structure of the music.
• Based on unsupervised learning, 3D-DCDAE can extract latent representations from large amounts of unlabeled data, which can increase the generalization ability of the model. The 3D-DCDAE can be considered a powerful feature extractor. In addition, the inner structure of 3D-DCDAE is 3D-VGG so that the spatial relationships of multi-track MIDI files can be explored.
• The MLP classifier is connected after the trained 3D-DCDAE to implement music genre classification. Experimental results show that even when a small amount of labeled data is utilized and the dataset is not balanced, the model can still perform well in music genre classification.
• Experiments were conducted on the Lakh MIDI dataset to evaluate the performance of the proposed method. The results indicate that the proposed method is superior to other methods.
The remainder of this paper is organized as follows. Section 2 outlines the related research on music classification. Section 3 introduces the proposed method in detail.

Related Works
In this section, existing research on music genre classification is reviewed. Early techniques of music genre classification based on machine learning algorithms are introduced. With the development of data volume and computational speed, deep learning algorithms have been applied to music genre classification research due to their outstanding achievements in various fields. Models based on unsupervised algorithms, such as the autoencoder, are paid special attention, with the purpose of enhancing the generalization ability of the classification model and the understanding of the problem domain.

Music Genre Classification
Early research on music genre classification was primarily based on machine learning algorithms. In the research conducted by Fulzele et al. [16], an SVM classifier was utilized to classify music genres in the GTZAN dataset by combining seven features, including MFCC, spectral roll-off, zero-crossing rate, chroma frequency, and rhythm histogram. The accuracy of music genre classification obtained by the SVM classifier was 84%. However, the SVM algorithm is difficult to apply to large training datasets, and it is difficult to choose a suitable kernel function.
In recent years, music genre classification has largely benefited from the success of deep learning. For example, in [17], combining mel-spectrogram which is a kind of feature engineering as the input of the model, Choi et al. proposed a convolutional recurrent neural network for music genre classification. The mel-spectrogram requires feature engineering with designed parameters with time-frequency transformation. Therefore, some researchers have utilized different suboptimal spectrogram settings in music genre classification depending on the given domain. For example, a 128-bin mel-spectrogram is a common choice. Its performance has been improved compared to early machine learning supervised algorithms. However, the music time-context information is lost during feature engineering, which is detrimental to the understanding of the music data by the classification model. To avoid loss of data features during feature engineering, Song et al. [9] proposed an automatic music tagging algorithm utilizing deep recurrent neural networks (RNNs) with a scattering transform as input. They indicate that the stacking of RNNs is probably beneficial for reducing the phase variation of the time-domain convolution. However, RNNs cannot perform parallel operations due to the structure of the RNNs themselves. In addition, the scattering transform is a set of linear transformations used to represent the original data based on wavelet change design. However, there is also a loss of data features compared with raw music data.
Therefore, to eliminate feature engineering and constraints on expertise, some researchers started to experiment with end-to-end approaches utilizing raw waveforms as input. Dieleman and Schrauwen [18] applied raw waveforms to convolutional neural networks (CNNs) for automatic music classification. It was shown that CNNs can autonomously learn time-frequency relationships, as well as phase- and translation-invariant feature representations, from raw audio. However, the model does not perform as well as mel-spectrogram-based research on classification due to the complexity of the data and the fact that large time-domain filters can lead to severe phase variation problems. To train a CNN correctly and suppress the problem of phase variation when using raw waveforms, the authors of [19] obtained impressive performance based on a deep CNN called SampleCNN. This neural network takes raw waveforms as input and has very small filters. When the network is deep enough, the model with the raw waveform as input can compete with the model utilizing the mel-spectrogram as input. However, supervised learning is only concerned with music features related to classification, which leads to the problem of weak generalization of the model. In addition, there is little research based on unsupervised feature extractors for music genre classification. In this paper, this shortcoming is remedied by building a classification model combined with an unsupervised feature extractor.
Various unlabeled high-dimensional data, such as music, images, and audio, have the characteristics of large data volume, complex structure, and noise. It is difficult to consider all the features using traditional feature engineering. Therefore, it is worth exploring a simple, automatic, and intelligent feature extraction method. Masci et al. proposed a stacked convolutional autoencoder [20] to complete handwritten digit recognition (MNIST) and object recognition (CIFAR10), which is a hierarchical unsupervised feature extractor. This model can compress high-dimensional input into a low-dimensional latent representation. To avoid the autoencoder learning identity function, Vincent et al. [21] added random noise for input data to build the denoising autoencoder, which is a variant of the common autoencoder. In the field of classification, the performance of the classification model is improved compared with the common autoencoder. Because data with random noise simulates the data with noise that exists in the real world, the latent representations can be obtained by denoising, which improves the generalization ability of the model. Inspired by existing research similar to music genre classification (e.g., speech recognition, speech emotion recognition), a denoising autoencoder is applied. Chorowski et al. [22] trained WaveNet autoencoder with a large amount of speech data as input. By reconstructing the input processed by the bottleneck, the model learned the latent representations in the speech data. Ideally, the latent representations retain the useful features of the input data. In [23], the high-level audio representation, log mel-spectrogram, was transformed from raw audio as input. Two different autoencoder structures were used to extract features, and an MLP trained on the latent space representations to recognize known samples and reject unknown samples. However, the existing research rarely considers the structure of MIDI music to establish an autoencoder. 
Therefore, in this paper, an effective autoencoder serving as a latent representation extractor and an MLP classifier are designed for music genre classification. Table 1 shows the differences between current methods for music genre classification and the proposed method. The proposed method has two main advantages. First, in the preprocessing, based on the MIDI files, various features of multiple tracks are embedded by MIDI2Img into MIDI images. After the MIDI2Img, noise is added by MRN to the MIDI images according to musicological theory. Second, whereas most existing research uses small-scale datasets for training, which leads to weak model generalization ability, the proposed method first utilizes a large amount of data to train a 3D-DCDAE based on unsupervised learning. Therefore, 3D-DCDAE can represent high-dimensional input data by low-dimensional latent representations, which is the foundation for the robustness of the model. To implement music genre classification, an MLP combined with the trained 3D-DCDAE is utilized as a classifier based on supervised learning. Compared with existing methods in music genre classification, this method requires fewer training epochs to converge, which improves the deployment efficiency of industrial applications. In addition, it addresses the limitations of current methods caused by insufficient labeled data and the unbalanced distribution of data.

Music Genre Classification System Based on 3D-DCDAE
In this section, the proposed method is introduced in detail. First, through MIDI2Img and MRN, MIDI files are embedded into noised MIDI images as the input of the proposed method. Next, the 3D-DCDAE is utilized to extract latent representations. After the 3D-DCDAE training, the decoder for assisting training is discarded and the MLP classifier is connected to the 3D-DCDAE for music genre classification.

Overview
In this paper, an unsupervised feature extraction method based on a deep 3D convolutional autoencoder for music genre classification is proposed. As shown in Figure 1, at first, during the MIDI preprocessing, various musical features in MIDI files, such as pitch, note, track, etc., are embedded by MIDI2Img to MIDI images. According to the musicological structure of MIDI files, MRN is applied to obtain noised MIDI images, so as to force the model to consider the musicological structure of the MIDI files. Second, the 3D-DCDAE based on deep 3D convolution, VGG, learns the latent representations by denoising and reconstructing noised MIDI images. A large amount of unlabeled MIDI files is used to train 3D-DCDAE, which helps to enhance the robust generalization ability of the model. Moreover, 3D convolution is beneficial for the model to explore the spatial relationship between different music tracks. Third, to implement music genre classification, the MLP classifier is connected after the 3D-DCDAE is trained to predict genre labels.

MIDI Preprocessing
To make the time attribute of music in MIDI format easier to be considered by deep learning algorithms, MIDI preprocessing is proposed, including MIDI2Img and MRN. MIDI2Img is a method of expressing MIDI files as images. MRN is a method for data augmentation.
During the MIDI preprocessing, as shown in Figure 1, notes (N) and chords (C) in multiple tracks are extracted by the Extractor. The notes and chords include the information of pitches and the corresponding volume, such as (G#4, 80), where G#4 is the pitch and 80 is the corresponding volume. MIDI2Img embeds N and C into MIDI images (i) in time order. The pitch is converted into an index from 0 to 127, which serves as the y-axis of the image pixels. The time at which the note is played is considered the x-axis of the pixel. The volume is set to the value of the pixel, and its value range is 0 to 255. Each track in a MIDI file is regarded as an image channel, which is the z-axis of the MIDI image. Different tracks play different roles in music playing. Compared to compressing all notes to the same channel, this embedding method helps preserve the spatial information between the tracks. In addition, the conflict problem of the same note in different music tracks is avoided. Due to the multi-channel structure of the MIDI image, it is feasible to use a 3D CNN.
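The embedding can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `midi2img`, the `(time_step, midi_pitch, volume)` event format, and the array layout (time, pitch, track) are assumptions chosen for illustration, and G#4 is mapped to MIDI pitch 68 under the convention that middle C is C4 = 60.

```python
import numpy as np

def midi2img(tracks, width=128, n_pitches=128):
    """Embed note events into an array of shape (width, n_pitches, n_tracks).

    `tracks` is a list of tracks; each track is a list of
    (time_step, midi_pitch, volume) tuples. This event format is a
    hypothetical simplification, not the paper's data structure.
    """
    image = np.zeros((width, n_pitches, len(tracks)), dtype=np.uint8)
    for z, events in enumerate(tracks):
        for time_step, pitch, volume in events:
            # Ignore events that fall outside the image bounds.
            if 0 <= time_step < width and 0 <= pitch < n_pitches:
                # x-axis: time, y-axis: pitch, z-axis: track, value: volume
                image[time_step, pitch, z] = np.clip(volume, 0, 255)
    return image

# The same pitch played at the same time in two tracks does not collide,
# because each track is stored in its own channel.
tracks = [[(0, 68, 80), (1, 70, 90)], [(0, 68, 60)]]
img = midi2img(tracks)
```

Keeping one channel per track is what makes the array three-dimensional, so that a 3D convolution can later slide across neighbouring tracks as well as across time and pitch.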
To force the model to consider the musicological structure of MIDI files, such as the relationships between notes and between bars, MRN is performed to obtain noised MIDI images (i_N). Multi-level noise is defined as two types of unit noise: note-level noise and bar-level noise. In the MRN process, α% of the content is selected for processing. Specifically, β_0% of the selected data is replaced with zero, β_1% is replaced with a random volume, and the remaining β_2% is kept unchanged. Therefore, through denoising and reconstruction, 3D-DCDAE can focus on the musicological structure of MIDI files.
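A rough sketch of the note-level part of MRN is given below. The function name `mrn_note_level`, the choice to sample only among sounding notes (non-zero pixels), and the default values are assumptions for illustration; bar-level noise would apply the same three-way rule to bar-wide slices along the time axis rather than to single pixels.

```python
import numpy as np

def mrn_note_level(image, alpha=0.15, beta0=0.8, beta1=0.1, rng=None):
    """Corrupt a fraction `alpha` of the sounding notes (hypothetical sketch).

    Of the selected notes, roughly beta0 are zeroed, beta1 receive a
    random volume, and the rest are left unchanged.
    """
    rng = np.random.default_rng() if rng is None else rng
    noised = image.copy()
    notes = np.argwhere(image > 0)              # coordinates of sounding notes
    n_selected = int(round(alpha * len(notes)))
    if n_selected == 0:
        return noised
    chosen = rng.choice(len(notes), size=n_selected, replace=False)
    draws = rng.random(n_selected)
    for (x, y, z), u in zip(notes[chosen], draws):
        if u < beta0:
            noised[x, y, z] = 0                     # replace with zero
        elif u < beta0 + beta1:
            noised[x, y, z] = rng.integers(1, 256)  # random volume in 1..255
        # otherwise keep the original value
    return noised

rng = np.random.default_rng(0)
image = np.zeros((128, 128, 2), dtype=np.uint8)
image[np.arange(100), 60, 0] = 80               # 100 sounding notes in track 0
noised = mrn_note_level(image, rng=rng)
```

The autoencoder is then trained to map `noised` back to `image`, so it can only succeed by using the surrounding notes and bars as context.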
Another useful property of this preprocessing is that it helps infer music genres even if some MIDI files are not readily present in the training dataset. This is important because there may be noisy and incomplete samples in real data. The proposed model can learn general latent representations from noisy data to overcome the gap between training data and real data. The classification based on the general latent representations is reliable. In this way, the generalization of the classification model can be improved.

3D-DCDAE for Latent Representations
The 3D-DCDAE is a feature extractor designed for MIDI images, which performs denoising and reconstruction with the assistance of the decoder to automatically learn effective latent representations from unlabeled data. The structure of the 3D-DCDAE is shown at the left of Figure 2. With noised MIDI images (i_N) as input, the 3D-DCDAE consists of 3D convolutional layers (c_n) and 3D max pooling layers (p_n), which together form the 3D-VGG. The output of the 3D-DCDAE is the latent representations, which contain the context of the music. The decoder is defined to denoise and reconstruct the low-dimensional latent representations into denoised and reconstructed MIDI images (i_D&R) to assist 3D-DCDAE training. The i_D&R is compared with the MIDI images (i) to obtain the loss value, and the loss function applies the mean squared error (MSE) [24]. The parameters of the 3D-DCDAE are then updated based on the back-propagation algorithm. After the 3D-DCDAE and the decoder are trained to reconstruct noised MIDI images well, the 3D-DCDAE can be utilized to extract latent representations that store the context of MIDI files. The 3D-DCDAE, a music feature extractor based on unsupervised learning, is used in the following three steps: (1) 3D-DCDAE is constructed; (2) the decoder assists 3D-DCDAE training by denoising and reconstructing the noised inputs; and (3) the proposed 3D-DCDAE is utilized to extract unsupervised latent representations from the MIDI images.
The input of 3D-DCDAE is noised MIDI images (i_N), which contain the pitch and volume information of multiple tracks in time-series order. To simultaneously explore the spatial relationship between different music tracks and the time series of MIDI files, the 3D-VGG structure is adopted in 3D-DCDAE, where 3D convolution is implemented by stacking the results of 3D convolutional filters that consider multiple consecutive tracks together. As shown in Figure 3, x, y, and z represent the three axes of the input. The input can be represented as I, whose size is D^I_x × D^I_y × D^I_z. The weights of the 3D convolutional filters are denoted as W, and W_{p,q,r} is the (p, q, r)th weight of the 3D convolutional filters connected to the previous layer. The size of the 3D convolutional filters is denoted as D^F_x, D^F_y, D^F_z, whose values are less than the input size. Writing the strides of the 3D filter along the three dimensions as S_x, S_y, S_z, the output O of the 3D convolutional filters at position (x, y, z) takes the standard form:

$$O_{x,y,z} = \sum_{p=0}^{D^F_x - 1} \sum_{q=0}^{D^F_y - 1} \sum_{r=0}^{D^F_z - 1} W_{p,q,r} \, I_{x S_x + p,\; y S_y + q,\; z S_z + r}$$

where, for an unpadded convolution, the size of the output O along each axis is ⌊(D^I − D^F) / S⌋ + 1 with the corresponding axis subscript.
To implement denoising and reconstruction, the decoder consists of 3D transposed convolution layers (ct_n). As shown at the right of Figure 2, the decoder takes latent representations as input. Based on the output of 3D-DCDAE, the decoder denoises and reconstructs the noised MIDI images to assist 3D-DCDAE training. Specifically, the convolution operation acts as the encoder of the network and is used to extract low-dimensional features from high-dimensional data. Transposed convolution is the opposite operation, which maps low-dimensional features back into the high-dimensional input space, and it is also a trainable upsampling algorithm. Therefore, transposed convolution is used to build the decoder for denoising and reconstruction.
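The 3D convolution described in this section can be illustrated with a naive NumPy loop for a single filter. This is a sketch for intuition only, assuming no padding and the cross-correlation convention used by most deep learning libraries; the function name `conv3d_valid` is hypothetical, and a real 3D-VGG would rely on an optimized library layer with many filters.

```python
import numpy as np

def conv3d_valid(inp, weights, stride=(1, 1, 1)):
    """Naive single-filter 3D convolution (cross-correlation, no padding)."""
    sx, sy, sz = stride
    fx, fy, fz = weights.shape
    # Output size along each axis: floor((input - filter) / stride) + 1
    ox = (inp.shape[0] - fx) // sx + 1
    oy = (inp.shape[1] - fy) // sy + 1
    oz = (inp.shape[2] - fz) // sz + 1
    out = np.zeros((ox, oy, oz))
    for x in range(ox):
        for y in range(oy):
            for z in range(oz):
                # Patch of the input covered by the filter at this position
                patch = inp[x * sx:x * sx + fx,
                            y * sy:y * sy + fy,
                            z * sz:z * sz + fz]
                out[x, y, z] = np.sum(weights * patch)
    return out

inp = np.arange(4 * 4 * 3, dtype=float).reshape(4, 4, 3)  # toy (time, pitch, track) input
w = np.ones((2, 2, 2))                                    # filter spanning 2 consecutive tracks
out = conv3d_valid(inp, w)                                # unit stride
out_strided = conv3d_valid(inp, w, stride=(2, 2, 1))
```

Because the filter extends along the track axis, each output value mixes information from neighbouring tracks, which is the property the text attributes to 3D convolution.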

MLP Classifier
As shown in Figure 4, once the 3D-DCDAE has been well trained by denoising and reconstruction, the decoder is replaced by the MLP classifier for music genre classification. Latent representations (r) are taken as the input of the MLP classifier, which then predicts genre labels (g′). The final prediction is compared with the real genre label (g), and the loss value, calculated with the cross-entropy [25] algorithm, is used to update the global parameters of the classification model consisting of the trained 3D-DCDAE and the MLP classifier.
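The classification step can be sketched in NumPy as a forward pass through a small MLP followed by a cross-entropy loss. The latent dimension (64), hidden width (128), single hidden layer, ReLU activation, and random weights are all assumptions for illustration; only the batch size of 32 and the 13 genres come from the experimental settings in this paper.

```python
import numpy as np

def mlp_forward(r, w1, b1, w2, b2):
    """One-hidden-layer MLP returning unnormalized genre scores (logits)."""
    hidden = np.maximum(0.0, r @ w1 + b1)       # ReLU activation
    return hidden @ w2 + b2

def cross_entropy(logits, labels):
    """Mean cross-entropy between softmax(logits) and integer labels."""
    shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

rng = np.random.default_rng(0)
batch, latent_dim, hidden_dim, n_genres = 32, 64, 128, 13
r = rng.normal(size=(batch, latent_dim))        # stand-in for latent representations
g = rng.integers(0, n_genres, size=batch)       # stand-in for real genre labels
w1 = rng.normal(scale=0.1, size=(latent_dim, hidden_dim))
b1 = np.zeros(hidden_dim)
w2 = rng.normal(scale=0.1, size=(hidden_dim, n_genres))
b2 = np.zeros(n_genres)

logits = mlp_forward(r, w1, b1, w2, b2)
loss = cross_entropy(logits, g)
```

In the full model, the gradient of this loss would flow through both the MLP and the pretrained encoder, which corresponds to the global fine-tuning described in the text.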

Experiment
In this section, the experimental objectives, environment, parameters, hardware, and software components are described. The experimental results of each process are displayed. Finally, the classification results are compared with the performance of existing music genre classification research.

Experimental Objectives
Experiments were conducted to verify whether music genre classification could perform well when the latent representations extracted from the 3D-DCDAE were utilized.

Experimental Environment
In the MIDI preprocessing step, as shown in Table 2, the MRN is performed on 15% of the content in MIDI images, according to the noise addition strategy in [26]. In the noise addition process, 80% of the selected content is replaced by zero, 10% is replaced by a random value, and the remaining 10% keeps its original value.
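As a quick arithmetic check of these settings, the short sketch below computes what fraction of the whole content ends up in each category. The variable names and the example count of 1000 notes are hypothetical.

```python
alpha = 0.15                                   # fraction of content selected by MRN
beta = {"zero": 0.80, "random": 0.10, "unchanged": 0.10}

# Fraction of the whole content that falls into each category
overall = {name: alpha * share for name, share in beta.items()}

# Expected counts for a hypothetical clip containing 1000 notes
n_notes = 1000
expected = {name: round(n_notes * frac) for name, frac in overall.items()}
```

Under these values, about 12% of the content is zeroed and about 1.5% is given a random value, so most of each MIDI image still reaches the encoder intact.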

The proposed method has two main training processes. Training parameters for the 3D-DCDAE are shown in Table 3. First, noised MIDI images were utilized as inputs, with size (batch size, width, height, tracks). The batch size was set to 32. Width, height, and tracks correspond to the time, pitch, and track axes in Figure 3, respectively. The width was set to 128, which means that the 3D-DCDAE can consider 10-s MIDI files. The height was set to 128 because MIDI defines 128 pitches, numbered from 0 to 127. The Adam [27] stochastic optimization algorithm was chosen as the 3D-DCDAE optimizer. The output of the decoder was the denoised and reconstructed MIDI image, so the output size was equal to the input size. The learning rate followed a warm-up schedule, increasing gradually from 0 and reaching 0.01 after 1000 steps. A total of 20 epochs were trained, and each epoch included 5000 steps.
When training the classification model consisting of the 3D-DCDAE and the MLP classifier, the input size remained the same as in 3D-DCDAE training. The training parameters of the MLP classifier are shown in Table 4. Because the trained 3D-DCDAE required only minor parameter adjustment, its learning rate was set to a relatively small value of 5 × 10⁻⁵. The learning rate of the MLP classifier was set to 5 × 10⁻⁴. The batch size was set to 32. The Adam optimization algorithm was selected as the optimizer. Because there are 13 MIDI genre categories in total, the size of the predicted output was (32, 13).
The experiments were conducted in an environment consisting of Windows 10, an Intel i7-7700 with 8 cores, an Nvidia Titan RTX (24 GB), and 40 GB of DDR4 memory. The proposed method was implemented in Python 3.6.8. MIDI files were processed with the Music21 library [28], which not only contains the standard feature extraction tools offered by other toolkits but also allows researchers to customize powerful feature extraction methods. Because of the large volume of data, Python multiprocessing across 8 processor cores was utilized for MIDI preprocessing. To improve computing efficiency, the 3D-DCDAE and the MLP classifier were implemented with the deep learning library PyTorch, which can utilize the graphics processing unit (GPU) of the Nvidia Titan RTX.
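The warm-up behaviour of the 3D-DCDAE learning rate described in the training parameters can be written as a small function. The paper states only that the rate rises from 0 to 0.01 over 1000 steps. The linear shape of the ramp and the constant rate after warm-up are assumptions of this sketch, as is the function name `warmup_learning_rate`.

```python
PEAK_LR = 0.01          # learning rate reached after warm-up (from the paper)
WARMUP_STEPS = 1000     # number of warm-up steps (from the paper)
TOTAL_STEPS = 20 * 5000  # 20 epochs of 5000 steps each (from the paper)


def warmup_learning_rate(step, peak_lr=PEAK_LR, warmup_steps=WARMUP_STEPS):
    """Learning rate at a given training step.

    Assumption: the rate grows linearly during warm-up and then stays
    constant; the paper does not specify either detail.
    """
    if step < 0:
        raise ValueError("step must be non-negative")
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr


# Learning rate at the start, in the middle of warm-up, and at the end of training.
lr_start = warmup_learning_rate(0)
lr_mid = warmup_learning_rate(500)
lr_end = warmup_learning_rate(TOTAL_STEPS)
```

Starting from a small rate avoids large, unstable updates while the Adam moment estimates are still unreliable.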

Experimental Data
Music genre classification based on MIDI files lacks a proven and reliable dataset. The Lakh MIDI dataset [15] is one of the relatively reliable datasets for music genre classification. However, this dataset still has shortcomings, such as incorrect labels and noisy data. In order to overcome these shortcomings, unsupervised 3D-DCDAE is used to comprehensively consider the features of the samples in the Lakh MIDI dataset.
The Lakh MIDI dataset, which contains 176,581 MIDI files, was utilized in the experiment. These MIDI files were used for training the 3D-DCDAE. After MIDI preprocessing, each file was sliced into segments of 128 frames and embedded into noised MIDI images with the hybrid features. More than 200,000 MIDI images were obtained after preprocessing; 80% were utilized as training data, and the remainder was equally divided into validation and test datasets (10% each). Training the model on a large amount of data can provide a strong generalization ability.
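The slicing step can be illustrated with frame indices alone. The sketch below assumes non-overlapping windows and drops a trailing segment shorter than 128 frames; the paper does not say how leftover frames are handled, so both choices, as well as the function name `slice_into_windows`, are assumptions.

```python
WINDOW = 128  # frames per MIDI image (from the paper)


def slice_into_windows(num_frames, window=WINDOW):
    """Return (start, end) frame ranges of the full, non-overlapping windows.

    Assumption: a trailing segment shorter than `window` is discarded.
    """
    return [(start, start + window)
            for start in range(0, num_frames - window + 1, window)]


# A hypothetical file with 300 frames yields two full windows; 44 frames are dropped.
windows = slice_into_windows(300)
```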
As the Lakh MIDI dataset is aligned with the labeled data in the Million Song Dataset [29], 11,946 MIDI files have genre labels. They cover 13 categories, such as pop/rock, electronic, country, R&B, jazz, Latin, and international. The distribution of each MIDI genre is shown in Table 5. The labeled data were utilized to train the classification model consisting of the 3D-DCDAE and the MLP classifier for music genre classification. After MIDI preprocessing, each file was sliced into segments of 128 frames and embedded into MIDI images with the various features. Thus, 68,811 labeled MIDI images were obtained. Following the dataset division strategy in [17], the labeled dataset was divided with 85% of instances used for training, 5% for validation, and 10% for testing.
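The 85/5/10 division of the 68,811 labeled MIDI images can be sketched as an index split. The random shuffle, the fixed seed, the rounding down of the training and validation sizes, and the function name `split_indices` are assumptions; the paper does not state whether the split is made per image or per file.

```python
import random

NUM_LABELED = 68811  # labeled MIDI images reported in the paper


def split_indices(n_items, train=0.85, val=0.05, seed=0):
    """Shuffle item indices and split them into train, validation, and test sets.

    The test set receives whatever remains after the train and validation
    sizes have been rounded down.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_train = int(n_items * train)
    n_val = int(n_items * val)
    train_idx = indices[:n_train]
    val_idx = indices[n_train:n_train + n_val]
    test_idx = indices[n_train + n_val:]
    return train_idx, val_idx, test_idx


train_idx, val_idx, test_idx = split_indices(NUM_LABELED)
```

Under these assumptions the three sets contain 58,489, 3,440, and 6,882 images.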

Experimental Results
In this section, the experimental results of MIDI preprocessing, 3D-DCDAE, and the classification model consisting of the 3D-DCDAE and the MLP classifier are described.
As shown at the top of Figure 5, the hybrid features in the MIDI files, which include pitches, volume, durations, and tracks, are embedded into MIDI images. The bottom of Figure 5 shows the result of adding noise according to the multi-level random operation. Point-shaped noise was added as a single note, and strip-shaped noise was added as a bar. During the denoising and reconstruction tasks, the 3D-DCDAE can extract latent representations, which are the music context.
Figure 6 shows the training results of the 3D-DCDAE. As shown in Figure 6a, during the 3D-DCDAE training period, the loss drops sharply before the third epoch and then drops slowly until convergence. The corresponding accuracy trend is shown in Figure 6b. The accuracy increases sharply before the third epoch and then slowly increases to 98%. The noised MIDI images were denoised and reconstructed by the 3D-DCDAE and the decoder, and the reconstructions were compared with the original MIDI images to calculate the loss. Therefore, the decoder can assist the 3D-DCDAE training and verify whether the latent representations extracted by the 3D-DCDAE are useful.
Figure 7 shows the training results of the classification model consisting of the 3D-DCDAE and the MLP classifier. As shown in Figure 7a, during the classification model training period, the loss drops sharply before the sixth epoch and then drops slowly until convergence. The corresponding accuracy trend is shown in Figure 7b. The accuracy increases sharply before the sixth epoch and then slowly increases to 88%.
Table 6 shows the performance of the classification model on the test dataset. Because the distribution of the dataset is unbalanced, accuracy, precision, recall, and F1-score are all used to evaluate model performance. All the values exceeded 88%, which shows that the prediction results of the model are reliable.
Table 7 displays a comparison between the proposed method and previous research [30,31]. Two evaluation criteria were considered: classification accuracy and the area under the receiver operating characteristic curve (ROC-AUC). ROC-AUC is commonly utilized to evaluate performance on datasets with imbalanced distributions. The ROC-AUC was calculated separately for each genre, and the per-genre scores were averaged into a single score.
Table 7. Comparison between the proposed method and existing research.

Model                              ROC-AUC   Accuracy
jSymbolic 2.2 [30]                 -         0.7760
P2 pattern query algorithm [31]    0.8160    0.6410
The proposed method                0.8691    0.8830

Two existing studies of music genre classification [30,31] based on the Lakh MIDI dataset were selected for comparison. In [30], McKay et al. utilized SVMs based on a sequential minimal optimization algorithm to classify MIDI files whose genre labels include ten categories. The accuracy obtained in [30] was 0.776. In [31], Ferraro et al. utilized the P2 pattern query algorithm to classify 13 MIDI genres. In this case, the obtained ROC-AUC was 0.8160, and the accuracy was 0.6410. For the proposed method, accuracy and ROC-AUC were obtained on the test dataset. The ROC-AUC was 0.8691, and the accuracy was 0.8830. The experimental results show that the proposed classification model improves performance compared with existing research.
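The per-genre averaging of ROC-AUC used in the comparison can be sketched in plain Python. The example computes a one-vs-rest AUC for each genre from pairwise score comparisons and then takes the unweighted mean. The three-genre toy data and the function names `binary_roc_auc` and `macro_roc_auc` are hypothetical, and the paper does not state whether its average is weighted.

```python
def binary_roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def macro_roc_auc(true_genres, score_rows, num_genres):
    """Unweighted mean of the one-vs-rest AUC computed for each genre."""
    aucs = []
    for genre in range(num_genres):
        labels = [1 if t == genre else 0 for t in true_genres]
        scores = [row[genre] for row in score_rows]
        aucs.append(binary_roc_auc(labels, scores))
    return sum(aucs) / num_genres


# Hypothetical toy data: four samples, three genres.
true_genres = [0, 1, 2, 1]
score_rows = [
    [0.70, 0.20, 0.10],
    [0.10, 0.60, 0.30],
    [0.20, 0.30, 0.50],
    [0.50, 0.25, 0.25],
]
macro_auc = macro_roc_auc(true_genres, score_rows, num_genres=3)
```

In this toy case genres 0 and 2 are ranked perfectly and genre 1 reaches 0.75, so the averaged score is 11/12. The pairwise computation is quadratic in the number of samples, so a rank-based implementation would be preferable on the full test set.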

Conclusions
This paper proposed an unsupervised method for learning latent music representations based on the 3D-DCDAE for music genre classification. First, through MIDI2Img, notes and chords from multiple tracks in MIDI files were embedded into MIDI images in time order. The notes and chords included pitch information and the corresponding volumes. Next, the 3D-DCDAE learned latent representations from large amounts of unlabeled MIDI images while being assisted in training by a decoder. The latent representations can therefore be extracted automatically through unsupervised learning, which provides the 3D-DCDAE with a powerful generalization ability. Through the proposed unsupervised latent representation learning method, unlabeled data can be applied to classification tasks, which addresses the problem of classification performance being limited by insufficient labeled data. After the 3D-DCDAE training, the decoder was discarded and the MLP classifier was connected after the 3D-DCDAE to perform music genre classification based on supervised learning. Built on the trained 3D-DCDAE, the MLP classifier can converge effectively. The experimental results show that the proposed method can improve classification performance compared with existing research. The accuracy, precision, recall, and F1-score exceeded 88%. Although the labeled MIDI files are insufficient and unbalanced, music genre classification performance was improved effectively.