A Middle-Level Learning Feature Interaction Method with Deep Learning for Multi-Feature Music Genre Classification

Abstract: Music genre classification is an active area that attracts considerable research attention. The multi-feature model is acknowledged as a desirable technology for this task. However, the major branches of the multi-feature models used in most existing works are relatively independent and do not interact, which results in insufficient learning features for music genre classification. In view of this, we explore the impact of learning feature interaction among different branches and layers on the final classification results in a multi-feature model. We then propose a middle-level learning feature interaction method based on deep learning. Our experimental results show that the designed method can significantly improve the accuracy of music genre classification. The best classification accuracy on the GTZAN dataset reaches 93.65%, which is superior to most current methods.


Introduction
With the rise of music streaming services, tens of thousands of digital songs have been uploaded to the Internet. The key feature of these services is the playlist, which is usually grouped by genre [1]. Different music genres have no strict boundaries, but music of the same genre shares similar features. By analyzing these characteristics, people can label many musical works according to their genre. These labels may come from the people who publish the songs, but this does not yield a reliable division. In recent years, with the rapid development and popularization of the Internet and multimedia technology, the number of musical works has grown explosively. The traditional approach, in which professionals analyze and classify music manually, has gradually become inadequate. Using computer programs to classify music genres automatically can greatly reduce the burden on professionals and improve classification efficiency.
At present, several mature traditional machine learning algorithms can be used for music genre classification, such as Support Vector Machine (SVM), K-Nearest Neighbor (KNN), and Gaussian Mixture Models (GMM) with different fusion strategies [2][3][4]. For traditional machine learning algorithms, performance often stops improving once the amount of data exceeds a certain threshold. In recent years, intelligent algorithms [5][6][7], especially deep learning, have received extensive attention. Compared with deep learning, traditional machine learning needs extra feature engineering. Feature engineering is the process of applying domain knowledge to create feature extractors that reduce the complexity of the data and make patterns more visible to learning algorithms. With deep learning, however, one can feed the labeled data directly to a neural network without developing a new feature extractor for each problem. Therefore, in the fields of computer vision and natural language processing, the performance of most machine learning algorithms depends more on the accuracy of feature recognition and extraction than on the sheer amount of data. Nowadays, deep learning has an excellent ability to solve lar [...]
[...] the middle-level learning feature refers to the learning features of other layers between the input and the classifier. In the following, these are simply called learning features. Whether the model is a multi-feature model or a single-input model, the final extracted learning features are concatenated or sent directly to the classifier for classification. Considering the influence of residual learning [29], the bottom learning feature has a particular gain effect on the upper learning feature. Therefore, we speculate that different types of learning features may also have this interactive relationship. To explore this problem, we propose a Middle-level Learning Feature Interaction (MLFI) method.
The method includes two modes: one-way interaction and two-way interaction. In one-way interaction, one module of the multi-feature model learns from the learning features of the other module. In two-way interaction, which builds on one-way interaction, the modules learn from each other's learning features. Concretely, middle-level learning feature interaction means concatenating learning features A and B and sending them to the following layers of B for training. In this paper, the model uses visual features and audio features as the input. Therefore, the one-way interaction mode includes the one-way interaction (audio) mode and the one-way interaction (vision) mode. Based on the two one-way interaction modes, the two-way interaction mode additionally exchanges the roles of A and B. In this paper, the original network architecture is used to verify the effectiveness of our method. The model consists of a Visual Feature Extraction (VFE) module, an Audio Feature Extraction (AFE) module, and a classifier. The VFE module uses a parallel convolution layer to improve the CRNN model of Choi et al. [24] and extract more low-level and time-series information from the Mel-frequency spectrogram. The AFE module uses multiple dense layers to process audio features.
The main contributions are listed as follows:
1. MLFI optimizes multi-feature models. Among the methods reported so far, it maximizes classification accuracy while preserving speed: compared with the BBNN model, which has the highest accuracy (93.7%), the running time of our method is only half.
2. MLFI is a simple and very general method for multi-feature models. Firstly, as long as the appropriate interaction mode is found, the classification accuracy can be greatly improved. Secondly, our research verifies that in a multi-feature model using MLFI, the learning features close to the input and output contribute more to improving the classification accuracy, which is an important and core contribution of this research.
3. The MLFI method also shows that using more learning features as interactive information may either produce a gain effect or inhibit other learning features from playing a role.
4. As mentioned above, the interaction between middle-level learning features and its impact on the classification results of the multi-feature model are not discussed in the existing methods.
The rest of this paper is organized as follows. Section 2 introduces the typical visual and audio features in MIR. The details of our networks are described in Section 3, followed by the experimental setup and results in Section 4. Finally, we conclude and describe potential future work in Section 5.

Visual Features
Research shows that people do not perceive frequency on a linear scale: human beings distinguish low frequencies better than high frequencies. For example, listeners can easily tell the difference between 500 Hz and 1000 Hz, but find it hard to sense the difference between 10,000 Hz and 10,500 Hz, even though the two pairs have the same span. The Mel scale is a perceptual scale of pitches judged by listeners to be equal in distance from one another. Stanley Smith Stevens, John Volkmann, and Newman named it [30] based on the definition of frequency. The following is an approximate mathematical conversion between the Mel scale and the linear frequency scale in hertz:
$$ f_{\mathrm{mel}} = 2595 \log_{10}\left(1 + \frac{f}{700}\right). \quad (1) $$
Then, the energy spectrum is passed through a set of Mel-scale triangular filter banks (as shown in Figure 1) to obtain the Mel-frequency spectrogram. The frequency response of the triangular filter is defined as:
$$ H_m(k) = \begin{cases} 0, & k < f(m-1) \\ \dfrac{k - f(m-1)}{f(m) - f(m-1)}, & f(m-1) \le k \le f(m) \\ \dfrac{f(m+1) - k}{f(m+1) - f(m)}, & f(m) < k \le f(m+1) \\ 0, & k > f(m+1) \end{cases} $$
where $\sum_{m=0}^{M-1} H_m(k) = 1$, $f(m)$ is the center frequency of the $m$-th filter, and $M$ is the number of filters.
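The perceptual compression and the filter-bank normalization described above can be checked numerically. The following is a minimal sketch rather than the authors' implementation: it assumes the common approximation mel = 2595 log10(1 + f/700) and unit-peak triangular filters, and the five-filter bank and 8000 Hz upper limit are hypothetical values chosen only for illustration.

```python
import math


def hz_to_mel(f_hz):
    """Assumed Mel approximation: 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def triangular_bank(edges, freqs):
    """Unit-peak triangular filters evaluated at the given frequencies.

    Filter m rises from edges[m] to a peak of 1 at edges[m + 1] and
    falls back to 0 at edges[m + 2].
    """
    bank = []
    for m in range(len(edges) - 2):
        lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
        row = []
        for f in freqs:
            if lo <= f <= mid:
                row.append((f - lo) / (mid - lo))
            elif mid < f <= hi:
                row.append((hi - f) / (hi - mid))
            else:
                row.append(0.0)
        bank.append(row)
    return bank


# Equal 500 Hz steps shrink on the Mel scale as frequency grows.
low_gap = hz_to_mel(1000) - hz_to_mel(500)
high_gap = hz_to_mel(10500) - hz_to_mel(10000)
print(f"500->1000 Hz: {low_gap:.1f} mel, 10000->10500 Hz: {high_gap:.1f} mel")

# Hypothetical toy bank: 5 filters whose edges are equally spaced in Mel.
n_filters, f_max = 5, 8000.0
step = hz_to_mel(f_max) / (n_filters + 1)
edges_hz = [mel_to_hz(i * step) for i in range(n_filters + 2)]

# Between the first and last peaks, neighbouring filters sum to 1.
span = edges_hz[n_filters] - edges_hz[1]
freqs = [edges_hz[1] + span * i / 50 for i in range(51)]
bank = triangular_bank(edges_hz, freqs)
column_sums = [sum(col) for col in zip(*bank)]
print(min(column_sums), max(column_sums))
```

The same 500 Hz step covers far fewer mels at the high end, which is why Mel-spaced filters become wider as frequency increases.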
[Figure 1: Mel-scale triangular filter bank, with the frequencies f[1] to f[5] marked on the frequency axis.]
Figure 2 visualizes the Mel-frequency spectrogram and the spectrogram. Some pronounced differences between genres are captured, which means that the proposed model can learn good visual features. A Mel-frequency spectrogram is a spectrogram whose frequencies are converted to the Mel scale, so it closely represents how humans perceive audio. The Mel scale is a scale for measuring the psychological magnitude of pitch [30]. It transforms the frequency scale so that pitches that sound equally far apart are equally spaced. Therefore, it may be easier for a neural network to extract features from a Mel-frequency spectrogram. The y-axis (frequency) is mapped onto the Mel scale to form the Mel-frequency spectrogram. Then, the model extracts significant differences from the Mel-frequency spectrogram of each genre and performs the audio classification task through autonomous learning. The window size is 2048, and the hop length is 512.

Audio Features
We choose nine audio features along different audio dimensions. Apart from the tempo feature, all features are extracted in the form of mean and variance. The window size of each frame is 2048, and the hop length is 512. This set of audio features (as shown in Table 1) provides the best results in our experiments. The chroma feature is the general name for the chroma vector and the chromagram. The chroma vector contains 12 elements, representing the energy of the 12 pitch classes in a period (such as a frame). The energy of the same pitch class in different octaves is accumulated, and the chromagram is the sequence of chroma vectors.
The spectral centroid [31] represents the position of the "centroid" of the sound and contains important information about the frequency distribution and energy distribution of the sound signal. It is calculated as the weighted average of the sound frequencies. Compared with blues songs of the same length, the frequency distribution of metal songs is denser toward the end. So, the spectral centroid of blues songs will be in the middle of the spectrum, while the spectral centroid of metal songs will be near the end of the spectrum. Spectral roll-off is a measure of the shape of a signal: it is the frequency below which a specified percentage of the spectral energy (such as 85%) lies.
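The two spectral descriptors above can be illustrated on a toy spectrum. This is a minimal sketch under stated assumptions: the bin frequencies and the "dark" and "bright" spectra are made-up values, and the per-bin numbers are treated as the energy weights. It is not the exact Librosa computation used in the paper.

```python
def spectral_centroid(freqs, weights):
    """Weighted average of the bin frequencies."""
    return sum(f * w for f, w in zip(freqs, weights)) / sum(weights)


def spectral_rolloff(freqs, weights, percent=0.85):
    """Lowest bin frequency at which the cumulative weight reaches
    `percent` of the total."""
    threshold = percent * sum(weights)
    running = 0.0
    for f, w in zip(freqs, weights):
        running += w
        if running >= threshold:
            return f
    return freqs[-1]


# Hypothetical 5-bin spectra: one concentrated at low frequencies,
# one concentrated at high frequencies.
freqs = [0.0, 1000.0, 2000.0, 3000.0, 4000.0]
dark = [4.0, 3.0, 2.0, 1.0, 0.0]
bright = [0.0, 1.0, 2.0, 3.0, 4.0]

for name, spectrum in (("dark", dark), ("bright", bright)):
    print(name, spectral_centroid(freqs, spectrum),
          spectral_rolloff(freqs, spectrum))
```

A spectrum with more high-frequency energy moves both the centroid and the roll-off frequency upward, which is the behavior the text attributes to metal songs relative to blues songs.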
The Mel-frequency cepstral coefficients (MFCCs) of the signal are a set of 20 features that concisely describe the overall shape of the spectral envelope and model speech features. The zero-crossing rate (ZCR) reflects how often the signal crosses zero and thus its frequency characteristics. That is, it counts the number of times the signal changes from positive to negative or from negative to positive in each frame. This feature has been widely used in speech recognition and music information retrieval, and it is usually higher for highly percussive sounds such as metal and rock. The root mean square (RMS) value is the effective value of the total waveform, computed as the square root of the mean of the squared sample amplitudes. In audio, it represents the amount of energy in the waveform. The peaks of the waveform are averaged into the total loudness, which is a more persistent quantity than the rapidly changing volume.
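ZCR and RMS are simple enough to compute by hand for a single frame. The sketch below is illustrative only: the two sine tones (220 Hz and 4000 Hz) are hypothetical test signals, while the frame length of 2048 and the sampling rate of 22,050 Hz follow the values given in the text.

```python
import math


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / len(frame)


def rms(frame):
    """Square root of the mean of the squared samples."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def sine_frame(freq_hz, n_samples=2048, sample_rate=22050):
    """One frame of a unit-amplitude sine tone (synthetic test signal)."""
    return [
        math.sin(2.0 * math.pi * freq_hz * n / sample_rate)
        for n in range(n_samples)
    ]


low_tone = sine_frame(220.0)
high_tone = sine_frame(4000.0)

print("ZCR low/high:", zero_crossing_rate(low_tone),
      zero_crossing_rate(high_tone))
print("RMS low/high:", rms(low_tone), rms(high_tone))
```

The higher tone crosses zero far more often, while both unit-amplitude tones have an RMS close to 1/sqrt(2), so the two features capture different aspects of the signal.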

Other Features
Tempo, measured in beats per minute (bpm), is the speed at which music is played. The content and style of a piece determine its playing speed, which can be roughly divided into three categories: slow, medium, and fast. Tempo is an important index for describing the rhythmic content of music, and it affects the listener's experience. The typical tempo range is 60-100 bpm for hip-hop and 115-130 bpm for house music.
Generally speaking, many sounds are composed of two kinds of elements: harmonic and percussive components. Harmonic components are the ones that we perceive to have a certain pitch. Percussive components often stem from the collision of two objects. Percussion has no pitch, but it has a clear location in time. These two kinds of components provide the harmonic [32] feature and the percussive [33] feature, respectively.

Proposed Design and Approach
This paper discusses the influence of learning feature interaction between different branches and layers on the final classification results in the multi-feature model. Considering that a Mel-frequency spectrogram is still an image, the model uses a CRNN structure for visual feature extraction. For audio feature combination, the model uses a direct and effective DNN structure for audio feature extraction. Thus, the model consists of a VFE module, an AFE module, and a classifier.
We mainly refer to the parameter settings in papers [22][23][24][25][26][27][28]. Starting from these references, a random search determines the hyper-parameters of the model. We define a statistical distribution for each hyper-parameter. In each iteration, the random search sets the hyper-parameters by sampling from these distributions. The sampled hyper-parameters yield a model, whose classification accuracy is evaluated through 10-fold cross-validation. Finally, the best hyper-parameter combination is selected to form the final model.
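The random search loop described above can be sketched as follows. Everything specific here is an assumption made for illustration: the search space (`SEARCH_SPACE`), its ranges, and `toy_evaluate`, which stands in for the mean 10-fold cross-validation accuracy of a trained model, are hypothetical and are not taken from the paper.

```python
import math
import random

# Hypothetical search space: one sampling function per hyper-parameter.
SEARCH_SPACE = {
    "dropout": lambda rng: rng.uniform(0.1, 0.5),
    "learning_rate": lambda rng: 10 ** rng.uniform(-4, -2),
    "batch_size": lambda rng: rng.choice([32, 64, 128]),
}


def random_search(evaluate, n_iter, seed=0):
    """Sample n_iter configurations and keep the best-scoring one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {name: draw(rng) for name, draw in SEARCH_SPACE.items()}
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_evaluate(params):
    """Synthetic stand-in for cross-validated accuracy (not a real model)."""
    return (-abs(params["dropout"] - 0.4)
            - abs(math.log10(params["learning_rate"]) + 2.0))


best_params, best_score = random_search(toy_evaluate, n_iter=20)
print(best_params, best_score)
```

In a real run, `evaluate` would train the network with the sampled settings and return the averaged validation accuracy, which makes each iteration expensive and is the reason a fixed iteration budget is used.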

The Visual Features Extractor (VFE) Module
Based on the CRNN model [24], the VFE module is optimized with parallel convolution. It includes three two-dimensional convolution layers (as shown in Figure 3), one parallel convolution layer, and two RNN layers. In the parallel convolution, one branch output uses max-pooling, and the other uses average-pooling. The benefit is that it provides more statistical information for the following layers and further enhances the recognition ability of the model.
In each convolution operation, the first convolution layer has 64 kernels of equal size, and the other convolution layers have 128 kernels. The size of each convolution kernel is 3 × 3, and the stride is 1. Each convolution kernel forms a mapping relationship with all the underlying features. The kernel is placed over the corresponding position of the input, each value in the kernel is multiplied by the value of the corresponding input pixel, and the sum of these products is the value of the target pixel in the output. This operation is repeated for all positions of the input. After each convolution, Batch Normalization (BN) [34] and the Rectified Linear Unit (ReLU) are applied. We also add a max-pooling layer (which only works on one branch of the parallel convolution layers) to reduce the parameters. It also helps the model expand the receptive field and introduces non-linearity. The pooling filters are 2 × 2 with stride 2 for the first pooling operation, 3 × 3 with stride 3 for the second, and 4 × 4 with stride 4 for the others. The function of the convolution and pooling layers is to map the original data to the hidden-layer feature space.
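The multiply-and-sum operation and the pooling step described above can be written out directly. This is a minimal sketch for a single channel and a single kernel; zero ("same") padding is an assumption, since the text does not state the padding scheme, and BN is omitted for brevity.

```python
def conv2d_same(image, kernel):
    """Stride-1 convolution (cross-correlation) with zero padding."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in range(kh):
                for dj in range(kw):
                    r, c = i + di - ph, j + dj - pw
                    if 0 <= r < h and 0 <= c < w:
                        acc += kernel[di][dj] * image[r][c]
            out[i][j] = acc
    return out


def relu(fmap):
    """Element-wise max(x, 0)."""
    return [[max(x, 0.0) for x in row] for row in fmap]


def max_pool(fmap, size):
    """Non-overlapping max-pooling with stride equal to the window size."""
    h, w = len(fmap) // size, len(fmap[0]) // size
    return [
        [
            max(fmap[i * size + a][j * size + b]
                for a in range(size) for b in range(size))
            for j in range(w)
        ]
        for i in range(h)
    ]


# Toy 4 x 4 input and a 3 x 3 kernel of ones.
image = [[1.0] * 4 for _ in range(4)]
kernel = [[1.0] * 3 for _ in range(3)]

features = relu(conv2d_same(image, kernel))
pooled = max_pool(features, 2)
print(features)
print(pooled)
```

With a kernel of ones, each output value counts how many input pixels fall under the kernel, so interior positions give 9 and corners give 4; the 2 × 2 pooling then halves each spatial dimension.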
Figure 3. Convolution block. To simplify the network representation, each convolution block contains four different layers. Convolution blocks are divided into two types according to their pooling operations: the Conv-Max block and the Conv-Aver block, labeled in the form Conv(filter num)-Aver(filter size) in the figure. The dropout ratio is set to 0.1 [24]. BN denotes Batch Normalization.
The VFE module uses two RNN layers with Gated Recurrent Units (GRU) to summarize temporal patterns on top of the three two-dimensional convolution layers and the parallel convolution layer (as shown in Figure 4). Considering that humans may pay more attention to a prominent rhythm when recognizing the music genre, only the branch output of the parallel convolution that uses max-pooling is fed into the RNNs. Instead of simply adding the outputs together, we concatenate the output of the RNNs and the branch output of the parallel convolution that uses average-pooling, to avoid losing information. This yields a vector with a length of 160.
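The length of 160 can be sanity-checked with some shape arithmetic. The sketch below rests on assumptions the text does not state: "same"-padded convolutions (so only pooling changes the spatial size), floor division in pooling, and a 128 × 130 input as described in the preprocessing. The RNN output width is then inferred as the remainder; it is not a value given in the paper.

```python
def vfe_feature_map_sizes(freq_bins=128, frames=130, pools=(2, 3, 4, 4),
                          channels=(64, 128, 128, 128)):
    """Track (frequency, time, channels) after each convolution block."""
    sizes = []
    f, t = freq_bins, frames
    for pool, ch in zip(pools, channels):
        # Assumed 'same'-padded 3 x 3 convolution keeps f and t;
        # non-overlapping pooling divides them (rounding down).
        f, t = f // pool, t // pool
        sizes.append((f, t, ch))
    return sizes


sizes = vfe_feature_map_sizes()
for block, size in enumerate(sizes, start=1):
    print(f"after block {block}: {size}")

# Flattened length of the average-pooling branch of the last block.
avg_branch_len = sizes[-1][0] * sizes[-1][1] * sizes[-1][2]

# Inferred, not stated in the text: what is left for the RNN output.
rnn_output_len = 160 - avg_branch_len
print(avg_branch_len, rnn_output_len)
```

Under these assumptions the last block collapses to a 1 × 1 map with 128 channels, so the average-pooling branch contributes 128 values and the remaining 32 would come from the RNN output.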

The Audio Features Extractor (AFE) Module
The AFE module consists of five dense layers with sizes of 1024, 512, 256, 128, and 64, respectively. We use ReLU as the activation function and then immediately apply the BN transform to regularize the model. A dropout layer with a rate of 0.4 is added after each BN layer [35] to alleviate over-fitting in the experiments. Finally, the AFE module outputs a vector with a length of 64.
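To give a sense of the size of this branch, the number of dense-layer weights can be counted directly. The sketch assumes the 55-dimensional audio feature input described in the preprocessing and counts only weights and biases; BN parameters are left out.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


def afe_param_count(n_features=55, widths=(1024, 512, 256, 128, 64)):
    """Total dense parameters of the five-layer stack."""
    total, n_in = 0, n_features
    for n_out in widths:
        total += dense_params(n_in, n_out)
        n_in = n_out
    return total


total_params = afe_param_count()
print(total_params)
```

Most of the parameters sit in the 1024-to-512 layer, so the width of the first two layers dominates the cost of this module.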

Network Structure
As shown in Figure 4, the VFE module, the AFE module, and the classifier constitute the whole network model. The model in Figure 4 is the benchmark against which our method is measured, and we call it the original model in the following. We concatenate the outputs of the two modules to form a feature vector with a length of 224. Fully Connected (FC) layers usually play the role of the "classifier" in a neural network. In this paper, however, we use only one FC layer with the SoftMax function for classification to reduce the parameters. Compared with traditional multi-layer fully connected classifiers, it makes the correspondence between feature maps and genres easier to interpret and is less prone to over-fitting. Since the last layer uses the SoftMax function, its output is a probability distribution over the genres. In general, the SoftMax function is defined as:
$$ \mathrm{SoftMax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \quad i = 1, \dots, K, $$
where $z_i$ is the $i$-th output of the FC layer and $K$ is the number of genres.
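The classifier stage can be illustrated with a small numerical example. This is a sketch only: the logits are made-up numbers, and subtracting the maximum before exponentiating is a standard numerical-stability step rather than something the paper specifies.

```python
import math

GENRES = ["blues", "classical", "country", "disco", "hip-hop",
          "jazz", "metal", "pop", "reggae", "rock"]


def softmax(logits):
    """Numerically stable SoftMax over a list of scores."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical FC-layer outputs for one 3-s segment.
logits = [0.2, -1.0, 0.5, 0.1, 0.0, -0.3, 3.1, 0.4, -0.2, 1.9]
probs = softmax(logits)
predicted = GENRES[probs.index(max(probs))]
print(predicted, round(max(probs), 3))

# One FC layer on the 224-dimensional concatenated vector.
fc_params = 224 * len(GENRES) + len(GENRES)
print(fc_params)
```

The probabilities sum to 1 and preserve the ranking of the logits, and a single FC layer on the 224-dimensional vector needs only a few thousand parameters.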

Middle-Level Learning Feature Interaction Method
One-way interaction (audio) mode: The model shown in Figure 5 is in one-way interaction (audio) mode. Visual learning features play a complementary role in this mode. The output of the VFE module and the output of a dense layer in the AFE module are concatenated and then fed into the next dense layer for training. In this mode, the VFE module provides two kinds of learning features: A (the visual learning features obtained through the RNN layer) and B (A plus the visual learning features of the Conv-Aver layer). According to these two learning features, one-way interaction (audio) mode is divided into one-way interaction sub-mode A and one-way interaction sub-mode B. The AFE module provides four learning features in different layers, thus forming four pairs of connection paths: (1, 1'), (2, 2'), (3, 3'), (4, 4'). The two paths in each pair are mutually exclusive. For example, connection paths 1 and 1' cannot exist simultaneously in the current mode.
One-way interaction (vision) mode: The model shown in Figure 6 is in one-way interaction (vision) mode. Audio learning features play a complementary role in this mode. The output of a dense layer in the AFE module and the output of the fourth Conv-Max layer are concatenated and then fed into the RNN layer for training. In this mode, the AFE module provides five learning features on different layers, thus forming five connection paths: a, b, c, d, and e.
Two-way interaction mode: The two-way interaction mode is a combination of the two one-way interaction modes above. The AFE module can provide four audio learning features, and the corresponding paths are b, c, d, and e. In total, this mode yields eight different situations. For easier description, according to the two learning features provided by the VFE module, we divide the two-way interaction mode into two-way interaction sub-mode A and two-way interaction sub-mode B.
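The interaction modes above amount to choosing where a concatenation happens, which the following sketch enumerates. The feature lengths are partly assumptions: 160 for learning feature B and the dense widths come from the text, whereas 32 for learning feature A and 128 for the fourth Conv-Max output are illustrative guesses. The helper names are hypothetical.

```python
from itertools import product

AUDIO_DENSE_WIDTHS = (1024, 512, 256, 128, 64)
# A: RNN output (length assumed); B: A plus the Conv-Aver output.
VISUAL_FEATURE_LEN = {"A": 32, "B": 160}


def interact(own_features, other_features):
    """MLFI step: concatenate, then pass on to the branch's next layer."""
    return list(own_features) + list(other_features)


def one_way_audio_configs():
    """(sub-mode, dense layer, input width of the next dense layer)."""
    return [
        (sub, width, width + VISUAL_FEATURE_LEN[sub])
        for sub, width in product(("A", "B"), AUDIO_DENSE_WIDTHS[:-1])
    ]


def one_way_vision_configs(conv_feature_len=128):
    """(dense layer, width of the vector fed to the RNN layer)."""
    return [(width, conv_feature_len + width)
            for width in AUDIO_DENSE_WIDTHS]


def two_way_configs():
    """Sub-mode A or B combined with one of the four audio paths."""
    return list(product(("A", "B"), AUDIO_DENSE_WIDTHS[:-1]))


audio_cfgs = one_way_audio_configs()
vision_cfgs = one_way_vision_configs()
both_cfgs = two_way_configs()
print(len(audio_cfgs), len(vision_cfgs), len(both_cfgs))

# Example: dense(128) output joined with learning feature B.
merged = interact([0.0] * 128, [0.0] * VISUAL_FEATURE_LEN["B"])
print(len(merged))
```

The counts match the eight, five, and eight situations reported for the three modes; only the input width of the layer that receives the concatenated vector changes between situations.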

Dataset
G. Tzanetakis and P. Cook collected the GTZAN dataset [36] for their well-known paper on music genre classification [37]. The dataset has ten genres (blues, classical, country, disco, hip-hop, jazz, metal, pop, reggae, and rock) with 100 songs per genre. Each file is sampled at 22,050 Hz with 16 bits and stored as a 30-s monophonic .wav file. Although the dataset is small and old, it still widely serves as a benchmark in academia.

Preprocessing
First, we load the audio files as source data and split them into windows of nearly 3 s [22]. Specifically, 66,149 sampling points are kept for each 3-s window, and fragments of insufficient length are discarded. This step increases the amount of data and simplifies the transformation process (such as computing the Mel-frequency spectrogram). At the same time, to help the model learn the characteristics of each genre better, the order of the segments is shuffled after the dataset is segmented. Experiments show that this is very important. Table 2 shows the GTZAN dataset description. For the input of the VFE module, the Librosa library is used to extract Mel-frequency spectrograms, each of which contains about 130 frames and 128 Mel-filter bands. A set of 55 audio features is fed into the AFE module, extracted with a frame length of 2048 and a hop length of 512 (25% of the frame length).
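The numbers in this preprocessing step can be reproduced with simple arithmetic. The sketch below is an illustration under assumptions: frames are counted as 1 + n // hop, which corresponds to centered (padded) framing, and the breakdown of the 55 audio features (tempo as one value, MFCCs as 20 coefficients, and seven other features, each summarized by mean and variance) is inferred from the text rather than taken from Table 1.

```python
SAMPLE_RATE = 22_050       # Hz, from the dataset description
SEGMENT_SAMPLES = 66_149   # samples kept per ~3-s segment
HOP_LENGTH = 512
N_MELS = 128


def segments_per_track(track_seconds=30):
    """Full segments in one track; the short remainder is discarded."""
    return (track_seconds * SAMPLE_RATE) // SEGMENT_SAMPLES


def n_frames(n_samples, hop_length=HOP_LENGTH):
    """Frame count assuming centered (padded) framing."""
    return 1 + n_samples // hop_length


def n_audio_features(n_mfcc=20, n_other=7):
    """Tempo (1 value) plus mean and variance of every other feature."""
    return 1 + 2 * (n_other + n_mfcc)


print("segments per 30-s track:", segments_per_track())
print("spectrogram shape:", (N_MELS, n_frames(SEGMENT_SAMPLES)))
print("audio feature vector length:", n_audio_features())
```

Under these assumptions each 30-s track yields ten segments, each segment gives a 128 × 130 Mel-frequency spectrogram, and the audio feature vector has 55 entries, in agreement with the sizes quoted in the text.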

Training and Other Details
All files in the GTZAN dataset are transformed into Mel-frequency spectrograms and audio features separately by the preprocessing program presented in Section 4.2: the Mel-frequency spectrogram with size 128 × 130 is the input of the VFE module, and the audio features with size 55 × 1 are the input of the AFE module.
For the choice of hyper-parameters, we refer to the experimental results of many academic papers and industry standards. For example, in the first convolution layer of the VFE module, the kernel size is set to 3 × 3. We make some other minor adjustments by referring to other papers and specific examples to meet our data requirements. For example, since the dataset we use is small, we split it into training, validation, and test sets with a ratio of 7/2/1, rather than the ratio of 8/1/1 used in many papers [22,23,25]. Likewise, instead of using a 3-s window with an overlap rate of 50% [22,23], we choose a 3-s window with an overlap rate of 0 according to the classification accuracy, running time, and data volume.
During training, the cross-entropy loss function is applied to the last dense layer (which uses the SoftMax function) and is minimized. Cross-entropy describes the distance between the actual value and the expected output. The smaller the value of cross-entropy is, the closer the two probability distributions are. Suppose that the probability distribution p is the expected output, the probability distribution q is the actual value, H(p, q) is the cross-entropy, and P and Q are the probability density functions of p and q with respect to r. Then
$$ H(p, q) = -\int P(r) \log Q(r) \, dr. $$
The optimizer we use is Adam with an initial learning rate of 0.01 [25]. If the initial learning rate is set too small, the network converges very slowly, which increases the time needed to find the optimal value. In addition, the network is likely to converge to a local extremum and fail to find the global optimal solution. Kingma and Ba proposed the Adam optimizer, which combines the advantages of the Adaptive Gradient Algorithm (AdaGrad) and Root Mean Square Prop (RMSProp) [38]. The optimizer deals better with sparse gradients, that is, gradients that are 0 at many steps.
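For a classifier with a finite number of genres, the cross-entropy reduces to a sum over classes. The sketch below shows this discrete form on made-up distributions; the clipping constant `eps` is an implementation detail added here to avoid log(0) and is not part of the paper.

```python
import math


def cross_entropy(p, q, eps=1e-12):
    """Discrete cross-entropy H(p, q) = -sum_i p_i * log(q_i)."""
    return -sum(pi * math.log(max(qi, eps)) for pi, qi in zip(p, q))


# Expected output: a one-hot label for the third of four classes.
target = [0.0, 0.0, 1.0, 0.0]

# Two hypothetical SoftMax outputs.
confident = [0.02, 0.03, 0.90, 0.05]
uncertain = [0.25, 0.25, 0.30, 0.20]

loss_confident = cross_entropy(target, confident)
loss_uncertain = cross_entropy(target, uncertain)
print(loss_confident, loss_uncertain)
```

With a one-hot target, the loss is simply the negative log-probability assigned to the correct class, so the prediction that places more mass on the true genre receives the smaller loss.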
With Adam, our models trained more quickly and did not plateau as early. Furthermore, our model uses the ReduceLROnPlateau callback to decrease the learning rate when the monitored metric stops improving. The model is trained for 150 epochs with a mini-batch size of 128 instances.
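The idea behind reducing the learning rate on a plateau can be shown with a simplified scheduler. This is a sketch of the general technique, not the Keras callback itself: the reduction factor, the patience, the minimum learning rate, and the loss sequence are hypothetical values, since the text does not report them.

```python
def reduce_lr_on_plateau(val_losses, lr=0.01, factor=0.5, patience=3,
                         min_lr=1e-5):
    """Return the learning rate in effect after each epoch.

    The rate is multiplied by `factor` whenever the monitored loss has
    not improved for `patience` consecutive epochs.
    """
    best = float("inf")
    wait = 0
    history = []
    for loss in val_losses:
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                lr = max(lr * factor, min_lr)
                wait = 0
        history.append(lr)
    return history


# Hypothetical validation losses that stall after the second epoch.
losses = [1.0, 0.9, 0.9, 0.9, 0.9, 0.8]
schedule = reduce_lr_on_plateau(losses)
print(schedule)
```

The rate stays at its initial value while the loss improves and is halved once the loss has stalled for three epochs, which lets training start with large steps and finish with finer ones.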
Metric: The classification accuracy is evaluated through 10-fold cross-validation [39,40] across all experiments to get more reliable results. Classification accuracy is the ratio of the number of correct predictions to the total number of input samples. It works well only if the number of samples belonging to each class is equal. Since the number of samples is the same for every genre in this paper, we take classification accuracy as the standard for measuring the classification results. We balance the number of songs for each genre in the training, validation, and test sets. Finally, the total classification accuracy is the average over the 10 folds.
Experiment Platform: We use the Keras [41] and TensorFlow [42] libraries to build our model in Python. The Librosa library is used to process audio data. All the experiments are done on Google's Colaboratory platform.
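The metric and the balanced folds can be sketched in a few lines. This is an illustrative stand-in rather than the authors' evaluation code: the labels are synthetic (ten classes with 100 items each), and the helper `stratified_folds` is a hypothetical name for a simple class-balanced split.

```python
import random


def accuracy(y_true, y_pred):
    """Correct predictions divided by the number of samples."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def stratified_folds(labels, k=10, seed=0):
    """Split sample indices into k folds with equal class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


# Synthetic labels: 10 genres with 100 samples each.
labels = [genre for genre in range(10) for _ in range(100)]
folds = stratified_folds(labels)

# Per-fold class counts for the first fold.
first_fold_counts = [
    sum(1 for idx in folds[0] if labels[idx] == genre)
    for genre in range(10)
]
print(len(folds), len(folds[0]), first_fold_counts)

# The reported score is the mean of the per-fold accuracies.
fold_scores = [0.93, 0.92, 0.94]  # made-up values for illustration
print(sum(fold_scores) / len(fold_scores))
```

Because every fold contains the same number of samples from each genre, plain accuracy is a fair summary and no class weighting is needed.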

Classification Results of One-Way Interaction (Audio) Mode
As shown in Figure 5, the network topology in one-way interaction (audio) mode includes two sub-modes. The VFE module provides two visual learning features, and the AFE module provides four audio learning features. To distinguish them, we label each audio learning feature by the number of units in its dense layer. The experimental results in Table 3 cover the eight different situations formed by these learning features. Sub-mode B additionally concatenates the output of the Conv-Aver block on top of sub-mode A. The highest accuracy of sub-mode A is 93.14%, and the average accuracy is 92.5675%. The highest accuracy of sub-mode B is 93.65%, and the average accuracy is 92.42%. Compared with sub-mode A, the classification effect of sub-mode B improves visibly for the two cases of dense (1024) and dense (128). Since sub-mode B contains more learning features, one might expect that more interactive information means more sufficient learning features and therefore a better classification effect. However, in the other cases, the classification result of sub-mode B decreases. Therefore, using more learning features as interactive information may either produce a gain effect or inhibit other learning features from playing a role.

Classification Results of One-Way Interaction (Vision) Mode
The network topology in one-way interaction (vision) mode is shown in Figure 6. Due to the limitation of the model itself, we only consider one visual learning feature of the VFE module. The AFE module provides five audio learning features. The experimental results in Table 4 include the five situations formed by these learning features. The highest accuracy is 93.11%, and the average accuracy is 91.66%. Compared with the two sub-modes of the one-way interaction (audio) mode, the improvement of the classification effect in this mode is less distinct. The experimental results indicate that different learning features used as supplementary information play different roles in the classification results.
Because the two one-way interaction modes perform well in the classification task, we want to explore whether combining them can further improve the classification effect of the model. Since dense (64) is the last layer of the feature extractor, the AFE module provides only four audio learning features. Since the VFE module provides two learning features A and B, the two-way interaction mode also includes two sub-modes. For ease of expression, we use A and B to denote the two sub-modes. As shown in Table 5, the highest accuracy of two-way interaction sub-mode A is 93.01%, and the average accuracy is 92.39%. However, its worst classification result, for dense (256), is 90.79%. The highest accuracy of two-way interaction sub-mode B is 93.28%, and the average accuracy is 92.97%, which is the highest average classification accuracy. On the one hand, although two-way interaction sub-mode A has more learning features than one-way interaction, its average accuracy (92.39%) is lower than that of both one-way interaction (audio) sub-modes. This shows that using more learning features as interactive information may inhibit other learning features from playing a role.
On the other hand, two-way interaction sub-mode B is built on two-way interaction sub-mode A, yet it has the highest average accuracy. This further verifies that using more learning features as interactive information may have a gain effect. Therefore, combining the two one-way interaction modes can improve the classification effect to a certain extent. However, the final classification result is not the sum of the learning feature effects of the two one-way interaction modes. Furthermore, the role of a learning feature in one mode depends on the value of the learning feature in the other mode. The key to getting better classification results is to find the appropriate interaction mode and model structure.

Comparison of Each Mode
The classification accuracy of the original model is 90.67%. Figure 7 shows that the classification results of all modes are better than those of the original model, which demonstrates the effectiveness of the proposed method. By observing the classification accuracy of the five modes under different dense layer markers, we can see that the classification accuracy of all modes is high at both ends and low in the middle. This shows that applying the middle-level learning feature interaction method to the layers close to the input and output of the model improves the classification effect more effectively than applying it to other layers. As for why the optimal results often appear at both ends, we offer the following two possible reasons:

1. Considering that dense (128) is closer to the output layer and has a similar degree of abstraction to the input and output of the RNN layer, the restriction between the two learning features will be smaller, and the gain effect will be more prominent.
2. Considering that dense (1024) is closer to the input layer, it is equivalent to increasing the depth of the model for the input and output of the RNN, which is conducive to extracting more effective learning features.
From Figure 8, we can see that all the modes have good classification performance in the four genres of blues, disco, metal, and reggae. The classification accuracy of country, pop, and rock is low. On the one hand, we can speculate that the middle-level learning feature interaction method still has limitations and cannot effectively classify some genres. On the other hand, we can also infer that these genres have great similarities in their audio and visual features.
The classification accuracy is regarded as the primary metric for measuring the quality of the model. A confusion matrix is used to visualize model performance, with each column representing the predicted category and each row representing the actual category. All the correct predictions are on the diagonal, so the errors are easy to locate in the confusion matrix because they are all off the diagonal. The confusion matrix allows us to do more analysis than the precision and recall values alone.
As shown in Figure 8, none of the modes can effectively classify the three genres of country, pop, and rock. We analyze the optimal confusion matrix, obtained in the one-way interaction (audio B) mode and shown in Figure 9. This mode correctly classifies only 90% of country music and wrongly classifies the other 10% as pop and rock. The classification accuracy of pop music is 92%, and most misclassifications identify the music as hip-hop or disco. Rock music has the lowest classification accuracy, at only 88%, and many rock pieces are classified as country, metal, or pop. On the one hand, we can infer that these genres have many similarities in their visual and audio characteristics. On the other hand, rock music has indeed influenced many other types of music throughout history.
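The per-genre accuracies discussed above are read off the rows of the confusion matrix. The following sketch shows that computation on a small synthetic example; the three genres and the predictions are made up for illustration and are not the results in Figure 9.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows are actual classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for actual, predicted in zip(y_true, y_pred):
        matrix[index[actual]][index[predicted]] += 1
    return matrix


def per_class_accuracy(matrix):
    """Diagonal entry divided by the row total for each actual class."""
    return [
        row[i] / sum(row) if sum(row) else 0.0
        for i, row in enumerate(matrix)
    ]


labels = ["country", "pop", "rock"]

# Synthetic ground truth and predictions (10 samples per genre).
y_true = ["country"] * 10 + ["pop"] * 10 + ["rock"] * 10
y_pred = (["country"] * 9 + ["rock"]
          + ["pop"] * 10
          + ["rock"] * 8 + ["country", "pop"])

matrix = confusion_matrix(y_true, y_pred, labels)
for label, row in zip(labels, matrix):
    print(label, row)
print(per_class_accuracy(matrix))
```

Off-diagonal entries show which genres absorb the errors; in this toy case, the missed rock samples are split between country and pop.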

Comparison to State-of-the-Art
In Table 6, we compare our method with previous excellent results on the GTZAN dataset. Experimental results show that the best classification accuracy of the method (one-way interaction (audio) mode) on the GTZAN dataset is 93.65% (10-fold cross-validation), which is on par with the state of the art on the GTZAN dataset. Our classification result is slightly worse than the 93.7% of the BBNN. However, since BBNN is composed of multiple Inception blocks, it needs a tremendous number of convolution operations. As shown in Table 7, our method is much faster than BBNN.
Figure 9. Optimal confusion matrix of one-way interaction (audio B) mode.
Table 6. Classification accuracy (%) on the GTZAN dataset compared across recently proposed methods. SSD denotes Statistical Spectrum Descriptor.

Conclusions
This paper proposes a middle-level learning feature interaction method using deep learning. The method aims to solve the problem that the branches of a multi-feature model in existing music genre classification methods are relatively independent and do not interact, resulting in insufficient learning features for music genre classification. The classification results of each mode are better than those of the original network architecture. We have also shown the effectiveness of our method by comparing it with state-of-the-art methods. In addition, the experimental results show that the final classification result is not the total sum of the learning feature effects. Using more learning features as interactive information may either produce a gain effect or inhibit other learning features from playing a role. Besides, the classification results of all modes show that the learning features near the input and output have a better gain effect on the classification results. These conclusions indicate that the proper use of the middle-level learning feature interaction method in a multi-feature model can effectively promote the gain effect between different learning features and improve the classification results.
In the future, we will try new methods, such as adopting acoustic features (e.g., SSD, Rhythm Histogram (RH)) as the input of the model or fusing attention mechanisms to give neural networks the ability to focus on their input subset. Meanwhile, we will also research how to distinguish those genres with similar spectral characteristics.