1. Introduction
With the rapid growth of digital music resources across social networks, music information retrieval (MIR) [
1,
2] has become vital for improving accessibility within music recommendation systems. Music retrieval is broadly similar to image retrieval. Instead of content-based retrieval, music is retrieved based on descriptive tags such as music genre (e.g., rock, jazz, classical), instrument type (e.g., piano, violin, symphony), and composer [
3]. Music genre classification [
4] is therefore crucial for boosting retrieval efficiency. Traditionally, music genre classification only considers audio information, whereas recent work collects and utilizes multi-modal information—e.g., texts, scores, and cover images—to increase classification performance. In this study, we address multi-modal music genre classification by proposing a deep fusion network operating in the kernel map space of features derived from different modalities.
Regarding music genre classification [
5,
6,
7,
8], music information retrieval research traces back to Lee and Downie’s [
9] survey for online shops and streaming services, as well as composer classification [
3,
10,
11]. Most of these methods adopt a traditional classification pipeline, i.e., feature extraction followed by a machine learning algorithm. Feature extraction can be divided into two categories: methods based on the audio and those based on the score/lyrics [
12]. On the one hand, since audio is a one-dimensional time-series signal, Fourier transform-based techniques [
13] are usually adopted to extract frequency or time–frequency features. However, the representation capability of such features is relatively limited because they are manually designed. More recently, deep neural networks (e.g., LSTM [
14], VGGish [
15], and Transformer [
16]) have been designed to extract more discriminative high-level audio features. On the other hand, music scores and lyric descriptions are further kinds of music representation that express a rich symbolic abstraction of the audio. Some methods have been proposed to automatically transcribe the audio to symbolic notes or to generate the audio from symbolic notes by using deep neural networks [
7,
17,
18]. In this work, we aim to combine the audio and text/lyric information to improve music genre classification performance.
Recently, multi-modal music genre classification has attracted increasing research attention for two main reasons. On the one hand, large-scale music collections naturally contain diverse modal information, including audio signals, album cover images, and textual lyrics, all of which can be exploited to support genre recognition. On the other hand, the classification performance of single-modal methods has approached a bottleneck, leaving little room for further improvement. In this context, the rational use of complementary multi-modal information becomes an effective way to achieve substantial performance gains, and multi-modal fusion therefore plays a crucial role in exploring complementary information from different modalities. In the literature, there are two ways to work with multi-modal information [
19]: (i) early fusion at the feature level, where features from different modalities are fused within the network, which then outputs a single decision; (ii) late fusion at the decision level, where an initial decision confidence is obtained independently for each modality and these confidences are then fused to deliver a final decision. Compared to late fusion, early fusion allows the model to learn the intrinsic interactions between multi-modal signals. Oramas et al. [
20] proposed a multi-modal network that maximizes the similarities between different modalities. The authors first trained each modality network, and the normalized features were then concatenated in the last layer for classification, which can be regarded as a simple fusion strategy. Li et al. [
21] investigated feature concatenation, decision weighting, and hybrid fusion for the Mel-spectrogram features from audio data and semantic features from lyric data. Oguike and Primus [
22] studied multi-modal Sotho-Tswana music genre classification by extracting the features from the visual modality, audio modality, and lyric data and by employing a late fusion strategy. In this work, we also consider music audio and text description data for multi-modal music genre classification. Our method differs from the aforementioned models in that features extracted from audio and text data are first embedded into distinct Hilbert kernel spaces via separate elementary kernels, and are then further fused within a unified kernel map network to generate highly nonlinear and discriminative representations, which effectively boost overall music genre classification performance.
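The difference between the two fusion schemes described above can be illustrated with a minimal sketch. The linear scorers, the toy feature vectors, the weight matrices, and the mixing factor `alpha` are all hypothetical and chosen only for illustration; they are not the models used in the cited works.

```python
import math

def softmax(scores):
    """Convert raw class scores into probabilities."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def linear_scores(features, weights):
    """Score each class as a dot product between the features and one weight row."""
    return [sum(f * w for f, w in zip(features, row)) for row in weights]

def early_fusion(audio_feat, text_feat, joint_weights):
    """Feature-level fusion: concatenate the modalities, then classify once."""
    fused = list(audio_feat) + list(text_feat)
    return softmax(linear_scores(fused, joint_weights))

def late_fusion(audio_feat, text_feat, audio_weights, text_weights, alpha=0.5):
    """Decision-level fusion: classify each modality, then mix the confidences."""
    p_audio = softmax(linear_scores(audio_feat, audio_weights))
    p_text = softmax(linear_scores(text_feat, text_weights))
    return [alpha * a + (1.0 - alpha) * t for a, t in zip(p_audio, p_text)]

# Toy, hypothetical features and weights for a two-class problem.
audio = [0.2, 0.8]
text = [1.0, 0.1, 0.4]
W_audio = [[1.0, -1.0], [-1.0, 1.0]]
W_text = [[0.5, 0.0, -0.5], [-0.5, 0.0, 0.5]]
# One weight row per class over the concatenated (audio + text) features.
W_joint = [a + t for a, t in zip(W_audio, W_text)]

p_early = early_fusion(audio, text, W_joint)
p_late = late_fusion(audio, text, W_audio, W_text)
```

In the early-fusion path a single classifier sees both modalities at once and can, in principle, learn cross-modal interactions; in the late-fusion path the modalities only meet when their confidences are averaged.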
Regarding classification methods, kernel learning is a classical approach to pattern classification [
23,
24], for instance, support vector machines (SVMs), whose training can be formulated as a quadratic optimization problem with a guaranteed optimal solution. The common classification framework is first to extract features from the data, then to calculate the kernel similarity between them, and finally to classify them using SVMs; the kernel similarity between the data is therefore a crucial operation. Multiple kernel learning (MKL) learns a linear combination of multiple elementary kernels, and deep kernel networks (DKNs) [
25] are further designed to capture complex nonlinear similarity between the data, thus exhibiting excellent classification performance, especially when sufficient data are not available. However, on large-scale datasets, optimization becomes infeasible because the cost of evaluating the kernels grows quadratically with the number of samples. According to the Representer Theorem [
26], any positive semi-definite kernel can be written as the inner product of the corresponding kernel maps in a high-dimensional Hilbert space. Based on this property, a deep kernel map network (DMN) [
27,
28] has also been proposed to approximate its counterpart DKN and thereby avoid the heavy computation of deep kernels. However, traditional MKL, DKN, and DMN methods only consider the elementary kernels of features from a single modality, and they do not study learning adaptability in the multi-modal scenario. For multi-modal data, the similarity between the data can be calculated as a combination of elementary kernels from the different modalities, which can be regarded as multi-modal fusion. Recently, Wang et al. [
29] proposed a non-sparse multi-kernel combination for multi-modal data fusion by imposing a regularized label softening term. Liu et al. [
30] designed a multi-modal fusion model based on a multiple kernel learning algorithm with convolution margin-dimension constraints for sentiment analysis. These kernel-based multi-modal fusion methods usually treat the elementary kernels as independent components and learn the interaction weights between the kernels. However, the elementary kernels across different modalities are potentially redundant and coarse-grained, and therefore fail to capture precise feature correlations. In this work, we further advance kernel-based multi-modal fusion: the proposed method learns not only the interactions between the kernel features within a single modality, but also the complex interactions between the kernel features across different modalities through end-to-end supervised learning, thereby achieving superior discrimination performance.
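The link between kernel combination and kernel maps used in this paragraph can be checked numerically. The sketch below is only an illustration under stated assumptions: it uses a linear kernel and a homogeneous second-order polynomial kernel, whose explicit maps are known in closed form, and hypothetical combination weights `w`. It shows that a non-negative weighted sum of kernels equals the inner product of the concatenated, square-root-weighted kernel maps.

```python
import numpy as np

def linear_kernel(X, Y):
    """Linear kernel k(x, y) = x . y."""
    return X @ Y.T

def poly2_kernel(X, Y):
    """Homogeneous second-order polynomial kernel k(x, y) = (x . y)^2."""
    return (X @ Y.T) ** 2

def linear_map(X):
    """The explicit map of the linear kernel is the identity."""
    return X

def poly2_map(X):
    """Explicit map of (x . y)^2: all pairwise products x_a * x_b."""
    n, d = X.shape
    return np.einsum("ia,ib->iab", X, X).reshape(n, d * d)

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
w = np.array([0.7, 0.3])  # hypothetical non-negative combination weights

# Kernel-level view: weighted sum of elementary kernels (as in MKL).
K_combined = w[0] * linear_kernel(X, X) + w[1] * poly2_kernel(X, X)

# Map-level view: concatenate the sqrt-weighted explicit maps.
Phi = np.hstack([np.sqrt(w[0]) * linear_map(X), np.sqrt(w[1]) * poly2_map(X)])
K_from_maps = Phi @ Phi.T
```

Working at the map level costs one feature vector per sample instead of one kernel value per pair of samples, which is the motivation for replacing a deep kernel network with its kernel map counterpart.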
In this work, we study the deep nonlinear fusion strategy for audio and text features in a high-dimensional Hilbert space through the deep multi-modal kernel map network for music genre classification. Specifically, our proposed model not only captures the correlations among various elementary kernels from single-modality features, but also explicitly models cross-modal interactions between different modalities at the feature level, rather than at the kernel level. The fusion weights for different features from different modalities can automatically be learned. The study contributions are therefore as follows:
The deep multi-modal kernel map network (DM2KMN) is proposed, jointly learning the combination weights between the modalities and classifier parameters in an end-to-end fashion;
A multi-modal piano genre dataset (the dataset is available at
https://github.com/IntelligentSystemGroup-ZZU/Multi-modal-Piano-Genre-Dataset (accessed on 2 June 2026)) is collected, containing audio recordings of classical piano pieces and corresponding humdrum files. To the best of our knowledge, we are the first to build a multi-modal piano genre dataset with audio and humdrum files;
Extensive experiments on the GTZAN dataset, multi-modal piano genre dataset, and 4MuLA dataset are conducted, and the results validate the effectiveness of the proposed network.
The remainder of this paper is organized as follows: related works are discussed in
Section 2, and the deep kernel network and proposed DM2KMN are discussed in
Section 3, followed by the experimental results in
Section 4 and finally the conclusion in
Section 5.
2. Related Works
Here we discuss related works on music genre classification and kernel learning methods.
For the music genre classification task [
5], music audio representation is usually the first step. There are two different types of features: manually designed features and features learned by deep neural networks. Manually designed features include, for instance, fast Fourier transform-based techniques [
13], and Mel Frequency Cepstral Coefficients (MFCCs) [
31]. However, the discriminative ability of these features is usually relatively poor. In the last decade, extensive research has been conducted on deep neural networks (e.g., convolutional neural networks (CNNs) and long short-term memory (LSTM) [
14]) to extract high-level audio features for music genre classification [
32,
33,
34,
35]. Singh and Biswas [
32] performed a deep analysis on the robustness of commonly used musical and non-musical features against deep learning models and found that Mel-Scale-based features and Swaragram features showed high robustness across different datasets. They further introduced a lightweight CNN [
36] incorporating a genetic algorithm-based approach with a stochastic hyperparameter selection for music genre classification. Ba et al. [
35] compared different deep neural networks, such as CNN, LSTM, gated recurrent units (GRUs), and capsule neural networks (CSNs), and found that CSNs with a Mel spectrogram produced excellent results. Yu et al. [
37] proposed a deep attention model based on a bidirectional recurrent neural network for music genre classification. Chen et al. [
38] proposed a capsule neural network with an upgraded version of the ideal gas molecular movement optimization algorithm. Zhang and Li [
18] treated the CNN model as an ensemble system that takes discrete wavelet transforms, MFCCs, and short-time Fourier transform (STFT) characteristics as inputs, and adopted the capuchin search algorithm to search each model’s hyperparameters, leading to excellent performance for music genre classification.
Another line of music genre classification focuses on the utilization of symbolic features. The experiments conducted in early research also validate that the performance of audio features is typically no better than symbolic features from music scores [
12], which is a kind of symbolic abstraction. Different music properties can be obtained from the symbolic files, for instance, rhythm [
39], pitch [
40], harmony [
41], and melody [
42]. These features are usually reshaped as a histogram vector. Recent work has focused on the automatic transcription of audio to symbolic notes or on generating audio from symbolic notes [
17]. There are several other noteworthy works on multi-modal music genre classification, especially in employing LLMs to utilize lyrics. Oramas et al. [
43] collected a multi-modal music genre dataset (MuMu dataset) that included cover images, text reviews, and audio tracks, and combined several feature embeddings learned from state-of-the-art deep learning networks for classification. They further proposed to apply dimensionality reduction [
20] for the target labels, leading to major improvements in multi-label classification regarding not only the accuracy but also the diversity of predicted genres. Wadhwa and Mukherjee [
44] proposed a multi-modal fusion network approach and a multiframe convolutional recurrent neural network by utilizing both the textual (lyrics) and musical features (Mel spectrogram) for music genre classification. Vatolkin and McKay [
45] also analyzed the performance of six modalities: audio signals, semantic tags inferred from the audio, symbolic MIDI representations, album cover images, playlist co-occurrences, and lyric texts for music classification, and showed the relative significance of different modalities. The Music4All A+A multi-modal dataset with music artists and albums was also collected for experimentation via the CLIP network in [
46]. Christodoulou et al. [
47] discussed in depth the definition of multi-modality across music disciplines and provided a task-based categorization of multi-modal music datasets, highlighting directions for multi-modal music processing. It can be seen that the methods proposed for multi-modal music genre datasets are usually simple, because multi-modal music data are heterogeneous and not aligned; it is therefore challenging to fuse them in a unified framework. In this work, we make full use of the audio and text features of the music, and learn their interactions in the kernel map space to obtain better fused features for multi-modal classification.
Our work is also closely related to image–text cross-modal retrieval [
48], which aims to achieve information retrieval across heterogeneous modalities, including text, images, and audio, by adopting queries from a single modality. A rich body of studies has focused on text–image cross-modal retrieval. Representative methods in this field can be summarized as follows: CNN-RNN-based frameworks [
49,
50] adopt convolutional neural networks (CNNs) for visual representation learning and recurrent neural networks (RNNs) for textual modeling. These methods extract uni-modal features independently and construct positive and negative sample pairs according to aligned image–text annotations. In recent years, advanced techniques, including residual learning [
50], character-level convolution [
51], and spatial attention mechanisms [
52], have been introduced into image–text matching tasks to further strengthen the discriminative capability of multi-modal features. In addition, graph neural networks are widely utilized to construct modality-specific graph structures and implement cross-modal retrieval via graph feature matching [
53]. With the rapid advancement of large language models (LLMs) [
54], vision–language pre-training (VLP) models have enabled efficient extraction of high-quality semantic embeddings. Benefiting from powerful universal representation capabilities, these models have significantly promoted the overall performance of cross-modal retrieval [
55,
56,
57]. Beyond mainstream image–text retrieval research, audio–text retrieval has emerged as a comparable task, which targets retrieving semantically matched audio samples given text-based descriptive queries. Early audio–text retrieval methods heavily relied on predefined category labels to establish cross-modal correspondence [
58]. Recent studies, such as [
59,
60], have further incorporated audio captions and natural language supervision into model training to improve alignment robustness. For example, the audio–text retrieval (ATR) framework [
59] leverages well-pre-trained audio backbones to extract discriminative acoustic features from large-scale audio datasets, and integrates NetRVLAD pooling [
61] to generate comprehensive joint audio–text embeddings. Moreover, optimal metric learning (OML) [
60] employs CNNs to capture robust acoustic characteristics and introduces adaptive metric learning constraints to strengthen fine-grained semantic alignment between audio and textual embeddings. Although cross-modal retrieval and multi-modal music genre classification have different objectives, both must extract features from different modalities and represent them in a common latent feature space.
Kernel-based methods are an extension of multiple kernel learning (MKL) [
62,
63,
64], which aims to learn a linear convex combination of elementary kernels for better pattern representation. Although different optimization algorithms (e.g., constrained quadratic programming [
62], “simpleMKL” based mixed-norm regularization [
64]) have been proposed to guarantee a theoretically optimal solution, MKL has two main limitations: (i) a convex linear combination cannot easily express more complex patterns; (ii) the capability of a shallow architecture is limited. Inspired by the success of deep learning, deep nonlinear kernel networks have therefore been proposed, for instance, acyclic directed graphs between the kernels [
65], nonlinear combination of polynomial kernels [
66], and arc-cosine kernels [
67], which can simulate the forward pass of a large network. Recently, multiple nonlinear layers of MKL [
68,
69] have been proposed and several nonlinear activation functions investigated, delivering better discrimination performance. Our work also relates to kernel approximation. The main computational cost of kernel-based methods is the calculation of the kernel Gram matrix between the data, which can be accelerated through the inner product between the kernel maps in a high-dimensional Hilbert space according to kernel theory [
26]. Different algorithms have been proposed to obtain the approximated maps for different kernels, for instance, Nyström expansion from uniform random samples without replacement [
70], random Fourier sampling [
71] for stationary kernels and the extension to group-invariant kernels [
72], convolutional kernel networks that approximate the Gaussian kernels [
73,
74], and deep hybrid neural–kernel networks [
75] together with features and kernels. Recently, a deep kernel map network [
76] has been proposed to handle any deep nonlinear kernels.
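As an illustration of the kernel approximation idea mentioned above, the following sketch builds random Fourier features for an RBF kernel and compares the resulting inner products with the exact kernel values. It is a generic textbook construction, not the approximation used in the proposed network; the bandwidth `sigma`, the number of random features, and the toy data are assumptions made for the example.

```python
import numpy as np

def rbf_kernel(X, Y, sigma):
    """Exact RBF kernel exp(-||x - y||^2 / (2 * sigma^2))."""
    sq = (X ** 2).sum(1)[:, None] + (Y ** 2).sum(1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

def random_fourier_map(X, n_features, sigma, rng):
    """Random Fourier features whose inner products approximate the RBF kernel."""
    d = X.shape[1]
    # Frequencies are drawn from the Fourier transform of the RBF kernel.
    W = rng.normal(scale=1.0 / sigma, size=(d, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))   # toy data, 8 samples of dimension 4
sigma = 2.0                   # hypothetical bandwidth

K_exact = rbf_kernel(X, X, sigma)
Z = random_fourier_map(X, n_features=5000, sigma=sigma, rng=rng)
K_approx = Z @ Z.T
max_err = float(np.max(np.abs(K_exact - K_approx)))
```

The approximation error shrinks as the number of random features grows, while the cost of computing the map stays linear in the number of samples.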
4. Experiments
There are several widely used benchmark datasets for music genre classification, for instance, the GTZAN [
78] and Extended Ballroom [
79] datasets. However, these datasets only contain audio files; therefore, they are inappropriate for multi-modal music genre classification. Recently, multi-modal music datasets have been explored for cross-modal music retrieval with different modalities, for instance, the 4MuLA dataset [
80] with audio and lyrics, the MuMu dataset [
20] integrating audio, images, and genre tags, and the LMD-aligned dataset [
45] with six modalities: audio, lyrics, symbolic data, model-based data (e.g., semantic descriptors), album cover images, and playlists. With the help of a professional piano musician, we also compiled a multi-modal piano genre dataset in which each piece has an audio file and a corresponding humdrum file containing a hierarchical symbolic description of the music score.
In the following, to fully investigate the performance of the proposed network, we first evaluate the proposed method in a single-modality context on the GTZAN dataset, and then study its adaptability on multi-modal music datasets (i.e., the multi-modal piano genre dataset and the 4MuLA dataset). All the experiments were performed on a workstation with a 4-core 3.20 GHz Intel Xeon(R) W-2104 CPU and an NVIDIA GeForce RTX 3090 GPU. Note that the random seeds for all the experiments were generated from the system runtime.
4.1. Data Preprocessing
Feature extraction from audio file: Each audio piece is first divided into a set of short overlapping windows of length 0.03 s. MFCCs are then extracted for each window as a set of coefficients obtained through a linear cosine transformation of the log power spectrum on the nonlinear Mel frequency scale:
$$\mathrm{Mel}(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right),$$
where $f$ is the frequency value in Hz. We can thus obtain a spectrogram image for each audio file, and a pre-trained ResNet-101 model is then used to extract the features that represent the audio data.
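The windowing and Mel mapping steps can be sketched as follows. This is a minimal illustration, assuming the commonly used form of the Mel mapping, m = 2595 log10(1 + f/700); the hop length of 0.015 s and the 16 kHz sample rate in the usage lines are hypothetical, since the text only states that the 0.03 s windows overlap.

```python
import math

def hz_to_mel(f_hz):
    """Map a frequency in Hz to the Mel scale (common 2595 * log10 form, assumed)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def frame_starts(n_samples, sample_rate, win_s=0.03, hop_s=0.015):
    """Start indices of overlapping analysis windows; hop_s is an assumed value."""
    win = int(round(win_s * sample_rate))
    hop = int(round(hop_s * sample_rate))
    return [s for s in range(0, n_samples - win + 1, hop)], win

def mel_band_edges(f_min, f_max, n_bands):
    """Band edges in Hz that are equally spaced on the Mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_bands + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_bands + 2)]

# One second of audio at a hypothetical 16 kHz sample rate.
starts, win = frame_starts(n_samples=16000, sample_rate=16000)
edges = mel_band_edges(0.0, 8000.0, n_bands=26)
```

The band edges are dense at low frequencies and sparse at high frequencies, which is what makes the Mel scale nonlinear in Hz.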
Symbolic features from text file: For the multi-modal piano genre dataset, the text comes from humdrum files; humdrum is a robust metadata format with a wide range of symbolic features, and is widely used in computational music analysis. For the 4MuLA dataset, the text information comes from the lyrics. For both datasets, we extract the symbolic features by using the pre-trained RoBERTa model, which tokenizes the input text sequence, extracts the embeddings, and uses the [CLS] token to represent the entire text sequence. RoBERTa is a robustly optimized BERT model; the variant used here follows the “BERT-Large” architecture, which contains 16 attention heads, 24 hidden layers, and 1024 hidden units. Compared to the original BERT-Large model, it was pre-trained on a larger dataset (approximately 160 GB) and adopted a dynamic masking strategy. In our experiments, for each text file, we first segment the whole text into a set of clips of 512 tokens; a frozen pre-trained RoBERTa model is then employed to extract a 1024-dimensional semantic feature for each clip, and max pooling over the clips is finally performed to represent the text file.
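The clip-wise embedding and pooling procedure can be sketched as below. The function `toy_encoder` is a hypothetical stand-in for the frozen text encoder and returns a 4-dimensional vector instead of a 1024-dimensional one; only the segmentation into 512-token clips and the element-wise max pooling follow the description above.

```python
def split_into_clips(token_ids, clip_len=512):
    """Cut a token sequence into consecutive clips of at most clip_len tokens."""
    return [token_ids[i:i + clip_len] for i in range(0, len(token_ids), clip_len)]

def max_pool(clip_embeddings):
    """Element-wise maximum over the clip embeddings."""
    return [max(column) for column in zip(*clip_embeddings)]

def toy_encoder(clip, dim=4):
    """Hypothetical stand-in for a frozen encoder that returns one vector per clip."""
    return [float(sum(clip[k::dim])) / len(clip) for k in range(dim)]

def embed_text(token_ids, encoder=toy_encoder, clip_len=512):
    """Embed every clip with the frozen encoder and max-pool into one vector."""
    clips = split_into_clips(token_ids, clip_len)
    return max_pool([encoder(clip) for clip in clips])

# A hypothetical document of 1100 token ids.
tokens = list(range(1100))
clip_lengths = [len(c) for c in split_into_clips(tokens)]
doc_vector = embed_text(tokens)
```

Max pooling gives a fixed-size representation regardless of how many clips a file produces, so short lyrics and long humdrum files end up in the same feature space.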
4.2. Results on the GTZAN Dataset
The GTZAN dataset, one of the first and most widely used benchmarks for music genre classification, comprises 1000 music clips, each with a duration of 30 s and a sampling rate of 22.05 kHz. All audio files are recorded in WAV format. The dataset encompasses ten distinct music genres: Pop, Reggae, Rock, Hip Hop, Jazz, Blues, Country, Disco, Classical, and Metal. Each genre has 100 audio samples. Since this dataset has no lyric information, we first validated the performance of the deep kernel map network on audio features alone, rather than on a combination of different modalities. Following [
34], the data is randomly split into 70% for training and 30% for testing. The performance was measured according to classification accuracy on the test set.
We computed the MFCCs for each audio file, and a frozen pre-trained ResNet-101 was used to produce deep features of dimensionality 1000. For comparison with the MFCC-based features, we also applied a frozen pre-trained VGGish [
15] model to obtain another kind of deep feature of dimensionality 128. For both kinds of features, we calculated four different elementary kernels (i.e., linear, second-order polynomial, RBF, and histogram intersection kernels) as well as their exact/approximated kernel maps. The standard deviation of the RBF kernel is calculated over all the samples in the dataset. For the RBF kernel map approximation, the eigenvalues carrying 99% of the energy in the kernel PCA are preserved. For the histogram intersection kernel map approximation, the maximum quantization level
Q is set to be 10. We first built deep kernel networks for the MFCCs and VGGish [
15] features, respectively, where the depth of the network is chosen empirically to be 3 and the unit number in the hidden layer is twice that of the input units according to [
25]. The learned deep kernel network can then be reformulated as a corresponding deep kernel map network according to
Section 3.2, where the energy threshold is also set to be 99% and
is set to all the training samples when the intermediate deep kernel maps are computed. The weights
are initialized with a Gaussian distribution
, the learning rate is set to be
, and the maximum number of learning epochs is set to 400,000. A stochastic gradient descent algorithm with a constant learning rate is applied to update the weights. The model is selected via five-fold cross-validation on the training set. To further improve the performance, the kernels of both features can be combined to build a deep multi-modal kernel network and a deep multi-modal kernel map network.
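The four elementary kernels and the energy-thresholded kernel PCA map described in this subsection can be sketched as follows. This is a minimal illustration under stated assumptions: the polynomial kernel is taken in its homogeneous form, the RBF bandwidth `sigma` is a placeholder value, and the toy features are random non-negative vectors rather than MFCC or VGGish features.

```python
import numpy as np

def elementary_kernels(X, sigma):
    """Linear, second-order polynomial, RBF and histogram intersection kernels."""
    G = X @ X.T
    sq = np.diag(G)[:, None] + np.diag(G)[None, :] - 2.0 * G
    return {
        "linear": G,
        "poly2": G ** 2,  # homogeneous form, assumed
        "rbf": np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2)),
        "hist": np.minimum(X[:, None, :], X[None, :, :]).sum(axis=2),
    }

def kernel_pca_map(K, energy=0.99):
    """Kernel map of the training samples keeping `energy` of the eigenvalue mass."""
    vals, vecs = np.linalg.eigh(K)
    order = np.argsort(vals)[::-1]
    vals, vecs = np.clip(vals[order], 0.0, None), vecs[:, order]
    ratio = np.cumsum(vals) / vals.sum()
    r = int(np.searchsorted(ratio, energy) + 1)  # smallest rank reaching the threshold
    return vecs[:, :r] * np.sqrt(vals[:r])

rng = np.random.default_rng(0)
X = rng.random((20, 5))                   # toy non-negative features
kernels = elementary_kernels(X, sigma=1.0)  # sigma is a placeholder value
maps = {name: kernel_pca_map(K) for name, K in kernels.items()}
kept_energy = {
    name: float(np.trace(maps[name] @ maps[name].T) / np.trace(kernels[name]))
    for name in kernels
}
```

Each map reproduces its kernel on the training samples up to the discarded 1% of eigenvalue mass, and the map dimension is usually much smaller than the number of training samples when the spectrum decays quickly.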
Table 1 shows the results of the different methods on the GTZAN dataset, where the accuracy values of the comparison methods are taken directly from the references, except for those of the three-layer DKN and three-layer DM2KMN. All the experiments were conducted on the same training/test splits. For the three-layer DKN and three-layer DM2KMN, we performed three independent runs and computed the mean accuracies with their standard deviations. The following can be observed: (i) The methods based on deep neural networks (i.e., CNN, Bi-LSTM and PCNN) usually delivered better performance, demonstrating that the deep learned features are discriminative. (ii) The performance of the deep kernel network on the MFCC and VGGish features is competitive, and their counterpart deep kernel map networks deliver slightly better performance. (iii) The performance on the VGGish features is slightly worse than that on MFCCs, which is in accordance with the empirical results in [
32]. (iv) The deep multi-modal kernel network from MFCC and VGGish features exhibits better performance, and their kernel map counterparts also obtain impressive results. From the results, it is empirically validated that the proposed deep multi-modal kernel map is effective for multiple types of features from a single modality. In the following section, we will show the performance of the proposed network for multiple modalities, especially for the feature maps from audio files and text information.
4.3. Results on the Multi-Modal Piano Genre Dataset
We now apply the proposed deep multi-modal kernel map network to the multi-modal piano genre dataset, which contains 985 piano pieces for piano music education from four genres (i.e., Baroque: 243 pieces; Classical: 236 pieces; Romantic: 258 pieces; Modern: 248 pieces), including works by well-known composers such as Johann Sebastian Bach and Ludwig van Beethoven (all the piano pieces are out of the copyright protection period). The details of the selected piano samples for each genre are shown in
Table A1. For each piece, we collected the audio file (MID format) and its corresponding Kern file (a large number of audio files and their humdrum-format score files are available in the KernScores library at
http://kern.humdrum.org/ (accessed on 2 June 2026)) (humdrum format) from the score file containing hierarchical description information about key signature, dynamics, tempos, notes, etc. The piano audio and humdrum data cannot be well aligned, since the Kern file is a text file encoding the high-level abstract semantics of piano scores, rather than a conventional score composed of sequential note symbols. When the corresponding humdrum file was not available, we replaced it with the
MusicXML file downloaded from the Musescore website (Musescore’s website is
https://musescore.com/ (accessed on 2 June 2026)), which can be easily converted into a humdrum file by using the Verovio Humdrum Viewer software. The genre label of each piano sample was cross-checked by three professional piano professors, because some piano works were created in the intermediate period spanning two genres.
In the experiments, we randomly split the dataset into two subsets: 600 samples for training and the rest for testing. The sample number of each genre in the training and test sets is shown in
Table 2. It can be observed that the number of training samples differs slightly across genres; therefore, a set of class weights inversely proportional to the number of training samples is assigned to mitigate data imbalance. For each sample, MFCCs and VGGish features are extracted from the audio, and symbolic features are extracted from the humdrum files by using the pre-trained RoBERTa model. The same four elementary kernels (i.e., linear, second-order polynomial, RBF, and histogram kernels) and their exact/approximated kernel maps are calculated as in the experiments on the GTZAN dataset. Similarly, the weight initialization and the learning rate follow the same settings as those used on the GTZAN dataset. The discrimination model is selected by five-fold cross-validation on the training subset. The performance (i.e., accuracy) is evaluated on the test set.
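The class weighting used to mitigate the imbalance can be sketched as follows. The per-genre training counts below are hypothetical placeholders that merely sum to 600 (the actual values are those reported in Table 2), and the normalization to an average weight of 1 is an assumption made for readability.

```python
def inverse_frequency_weights(train_counts):
    """Class weights inversely proportional to the number of training samples."""
    inverse = {genre: 1.0 / n for genre, n in train_counts.items()}
    # Rescale so that the weights average to 1 across classes (assumed convention).
    scale = len(inverse) / sum(inverse.values())
    return {genre: value * scale for genre, value in inverse.items()}

# Hypothetical per-genre training counts summing to 600; not the values of Table 2.
counts = {"Baroque": 148, "Classical": 144, "Romantic": 157, "Modern": 151}
weights = inverse_frequency_weights(counts)
```

With such weights, every genre contributes the same total weight to the training loss, so the slightly larger genres do not dominate the optimization.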
4.3.1. Ablation Study
We first investigate the performance of different elementary kernels from different modalities. For comparison with the MFCC features, we also compute deep features via VGGish [
15]. Their performance is shown in
Table 3. The following can be seen: (i) the performance of the VGGish features is worse than that of the MFCC features because the VGGish features are more appropriate for describing audio scenes; (ii) the symbolic features clearly outperform the audio features by
in four elementary kernels, which validates that the semantic features of symbolic notes from the humdrum format are more discriminative.
We further investigate the performance of multi-modal deep kernel networks. It is empirically found that the 3-layer multi-modal deep kernel network achieves the best trade-off between performance and computational complexity [
25]. The elementary kernel settings remain the same as those in the GTZAN dataset, such that there are in total eight elementary kernels for both modalities in the input layer. In
Table 4, different numbers of hidden units are investigated, and it is observed that 16 hidden units exhibit better performance; the drop for larger numbers of hidden units is possibly due to overfitting. A deep multi-modal kernel network with three layers and 16 hidden units is therefore chosen to construct the deep multi-modal kernel map network. In addition, we also study the performance of the audio and text modalities separately in the DKNs, where the number of hidden units is twice that of the input units. The results are shown in
Table 5. For the audio modality, the average accuracy is 72.03%, while that of the text modality reaches 83.46%, indicating a stronger representation capability of the semantic features.
4.3.2. Performance Comparison
According to
Section 3.2, an initial DM2KMN is built using the learned 3-layer multi-modal kernel network. The hidden unit number is set to 16 according to
Table 4. For the intermediate layer, all the training samples are initialized via kernel PCA computation, as shown in Equation (
4), to maximize the approximation ability. We compare the multi-modal DKNs, the initial DM2KMN, and the learned DM2KMN based on three aspects: (i) accuracy; (ii) relative approximation error (RE) between the learned DM2KMNs w.r.t. their counterpart multi-modal DKN, which is defined on the given data
as
where
is the output kernel map of the learned DM2KMN, and
is the original kernel value of the DKNs; (iii) relative importance (RI) of each kernel map in the input layer w.r.t. the output for the learned multi-modal DKN and DM2KMN, which is defined as
where the subscript
q,
and
k refer to the input unit,
p is the hidden unit and
o is an output unit, and
is the weight from unit
q to unit
p in the
-layer. The relative importance of an input unit to the output unit considers all the impacts from each hidden unit.
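Both measures can be illustrated with a short sketch. Since the exact definitions are not reproduced here, the code relies on assumed forms: the relative error is taken as the summed absolute difference between the kernel map inner products and the deep kernel values, normalized by the summed absolute kernel values, and the relative importance follows a Garson-style weight-partitioning scheme for a network with one hidden layer. The weights and the grouping of the eight inputs into two modalities are random, hypothetical examples.

```python
import numpy as np

def relative_error(K_deep, Phi):
    """Assumed RE: sum |<phi_i, phi_j> - K_ij| divided by sum |K_ij|."""
    K_map = Phi @ Phi.T
    return float(np.abs(K_map - K_deep).sum() / np.abs(K_deep).sum())

def relative_importance(W_in, w_out):
    """Garson-style importance of each input unit for one output unit (assumed form).

    W_in has shape (n_inputs, n_hidden); w_out has shape (n_hidden,).
    """
    A = np.abs(W_in)
    # Share of each input unit within every hidden unit.
    share = A / A.sum(axis=0, keepdims=True)
    # Weight each share by the strength of the hidden-to-output connection.
    contribution = share * np.abs(w_out)[None, :]
    ri = contribution.sum(axis=1)
    return ri / ri.sum()

rng = np.random.default_rng(0)

# A kernel that is exactly reproduced by its map has zero relative error.
Phi = rng.normal(size=(5, 3))
re_exact = relative_error(Phi @ Phi.T, Phi)

# Eight input kernel maps (four per modality, hypothetical) and 16 hidden units.
W_in = rng.random((8, 16))
w_out = rng.random(16)
ri = relative_importance(W_in, w_out)
audio_total, text_total = float(ri[:4].sum()), float(ri[4:].sum())
```

Summing the importance values of the inputs that belong to one modality gives a per-modality total, and the totals over all modalities add up to 1.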
We compared different multi-modal fusion methods on the same training/test set. We re-implemented several baseline methods, for instance, CNN [
34], CNN + BOW [
91], DCN [
44]. The comparison results are shown in
Table 5. Average accuracies with standard deviations over three independent runs are reported. The following observations can be made: (i) The re-implemented baselines, namely the CNN network with early fusion of audio and Bag-of-Words (BOW) text features in [
91], the fully symmetric DCN architecture [
44], and the three fusion strategies (feature concatenation, decision weighting, and hybrid fusion) in [
21], validate that modality fusion can improve the performance. (ii) The multi-modal DKNs obtain an average accuracy of 85.11% ± 0.30; when they are converted into the initial DM2KMN, the average RE value is relatively small (0.820%), but the classification accuracy drops to 83.03% ± 0.40 because of the loss of feature information. (iii) When the initial DM2KMN is further updated in a supervised fashion, although the RE value becomes larger (3.051%), the accuracy reaches 88.74% ± 0.54, because joint learning can further optimize toward a better solution. The proposed DM2KMN therefore clearly improves the performance. (iv) We also compare the average forward time of multi-modal DKNs and the DM2KMN in
Table 5, whereby the forward time of the DM2KMN on the test set (12.27 s) is less than half that of the multi-modal DKNs (31.18 s), since the complexity of the DM2KMN is linear in the training size whereas that of the multi-modal DKNs is quadratic; this validates the efficiency of the DM2KMN.
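To make the baseline fusion strategies concrete, the following sketch contrasts feature concatenation (early fusion) with decision weighting (late fusion). The function names, the weight `w_audio`, and the array shapes are hypothetical and are not taken from the re-implemented baselines. Hybrid fusion is not sketched because its exact form in the cited work is not specified here.

```python
import numpy as np


def concat_fusion(audio_feat, text_feat):
    """Early fusion: join per-sample feature vectors along the feature axis."""
    return np.concatenate([audio_feat, text_feat], axis=1)


def decision_fusion(audio_prob, text_prob, w_audio=0.5):
    """Late fusion: weighted average of per-modality class probabilities.

    w_audio is an illustrative weight in [0, 1]; the text weight is 1 - w_audio.
    """
    audio_prob = np.asarray(audio_prob, dtype=float)
    text_prob = np.asarray(text_prob, dtype=float)
    return w_audio * audio_prob + (1.0 - w_audio) * text_prob
```

In the early-fusion case a single classifier is trained on the concatenated vector, whereas in the late-fusion case each modality has its own classifier and only their outputs are combined.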
The relative importance values of the learned multi-modal DKN and its corresponding DM2KMN are shown in
Figure 2. For the learned multi-modal DKN, different kernel mappings from audio and text modalities exert vastly disparate impacts, resulting in large standard deviations of RI values across multi-modal kernels. It is interesting that although the polynomial kernel from the audio shows the worst performance in
Table 3, it still has the highest impact on performance; a possible reason is that the polynomial kernel map from the audio has a larger feature size, which dominates the other kernel maps. After joint learning of the proposed DM2KMN, the standard deviation of RI values across kernel maps is reduced: the RI values of the audio kernels decrease, while the importance of kernels from symbolic features is boosted, which is consistent with the observation that symbolic features are more discriminative. To verify the impact of each modality, we sum the RI values of the four elementary kernel maps of the audio features and of the semantic features, respectively. For the learned multi-modal DKN, the total importance values of the audio and text modalities are 52.56% and 47.44%, so the impact of the audio is slightly higher than that of the text. For the learned DM2KMN, however, the totals become 49.41% and 50.59%, so the text has slightly more influence than the audio.
Figure 3 shows the confusion matrices of piano genre classification for different methods: the histogram intersection kernel of symbolic features, the deep kernel network with symbolic features, the multi-modal DKN, and the DM2KMN. We observe the following: (i) For the SVM with a single histogram intersection kernel of symbolic features (top left), 21% of Romantic piano pieces are misclassified as Modern, in accordance with the fact that the boundary between Romantic and Modern is ambiguous and the two styles overlap considerably. (ii) For the deep kernel network with symbolic features (top right), the performance is slightly improved, but 10% of Classical piano pieces are still misclassified as Baroque and 23% of Romantic piano pieces are misclassified as Modern, while the classification accuracies for Baroque and Modern improve. (iii) For the multi-modal deep kernel network (bottom left) and the proposed DM2KMN (bottom right), the average performance is boosted, especially for the Classical, Romantic, and Modern piano pieces, which validates the effectiveness of the proposed method.
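The percentages quoted above read as row-normalised confusion-matrix entries, i.e., the share of each true genre assigned to each predicted genre. A minimal sketch of that computation is given below, assuming integer class labels. The function name `confusion_percent` is hypothetical, and the row normalisation is an assumption about how the figure was produced.

```python
import numpy as np


def confusion_percent(y_true, y_pred, n_classes):
    """Row-normalised confusion matrix in percent.

    Entry [i, j] is the percentage of samples with true class i
    that were predicted as class j.
    """
    counts = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        counts[t, p] += 1
    row_totals = counts.sum(axis=1, keepdims=True)
    # Guard against division by zero for classes absent from y_true.
    return 100.0 * counts / np.maximum(row_totals, 1)
```

With this convention each row sums to 100%, so an off-diagonal entry such as "21% of Romantic misclassified as Modern" is read from the Romantic row and the Modern column.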
4.4. Results on the 4MuLA Dataset
We further evaluated the proposed deep multi-modal kernel map network on the public 4MuLA dataset [
80], a multi-modal music genre database with five genres: Rock, Indie, Pop, Hip Hop, and Heavy Metal. It contains 5980 music tracks, with each track represented by a Mel spectrogram of the audio and by its lyric data. There are 1578 samples for Rock, 1491 samples for Indie, 1449 samples for Pop, 786 for Hip Hop, and 668 for Heavy Metal. As suggested in [
90], we randomly partitioned the data into training, validation, and test subsets at a ratio of 8:1:1, where the validation set is used to select the best model and the performance is evaluated on the test set. The weights and learning rate were initialized with the same settings as those used on the multi-modal piano genre dataset. To extract the features, we applied a frozen pre-trained ResNet-101 to the Mel spectrogram image of the audio and a frozen pre-trained RoBERTa model to the lyric data. For the proposed DM2KMN method, we adopted the same network architecture used in the multi-modal piano genre dataset experiments, and the initialization of all learning hyperparameters remains unchanged. The only difference is that
is set to 4784, corresponding to the total number of training samples in the 4MuLA dataset.
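The figure of 4784 follows directly from the 8:1:1 partition of 5980 tracks. The sketch below reproduces the subset sizes with a plain random split. The function name `split_indices` and the seed are hypothetical, and the sketch does not claim to reproduce the exact partition used in the experiments.

```python
import random


def split_indices(n_samples, ratios=(8, 1, 1), seed=0):
    """Randomly partition sample indices into train/validation/test subsets."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)  # seeded for reproducibility
    total = sum(ratios)
    n_train = n_samples * ratios[0] // total
    n_val = n_samples * ratios[1] // total
    train_idx = idx[:n_train]
    val_idx = idx[n_train:n_train + n_val]
    test_idx = idx[n_train + n_val:]  # remainder goes to the test subset
    return train_idx, val_idx, test_idx


train_idx, val_idx, test_idx = split_indices(5980)
print(len(train_idx), len(val_idx), len(test_idx))  # 4784 598 598
```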
The results of different methods on the same test set in the 4MuLA dataset are shown in
Table 6, where the average accuracies with standard deviations are reported. The following observations can be made: (i) The audio MFCCs alone perform poorly, owing to variance in the dataset. (ii) The RoBERTa model on the lyrics performs much better than the audio MFCCs because of the stronger representation capability of the semantic features. (iii) Simple concatenation of the MFCCs from the audio and the semantic features from the lyrics largely improves the discrimination performance, and their hybrid fusion boosts it further. (iv) In comparison with other deep fusion methods, the proposed DM2KMN obtains further gains because it considers different kernel maps for the MFCCs and semantic features, rather than only two kernel maps corresponding to linear kernels, and because it learns better fusion patterns between the audio and text features of the music, leading to significant improvements in classification.
The confusion matrices of different methods are shown in
Figure 4. It can be observed from the audio MFCCs (top left) results that 23% of the Pop, 47% of the Heavy Metal, 25% of the Indie, and 14% of the Hip Hop samples are misclassified as Rock, consistent with the common view that rock musical elements widely permeate other genres. Meanwhile, 28% of Rock, 32% of Pop, and 31% of Hip Hop samples are wrongly categorized as Indie, as independent music (Indie) incorporates independent rock and independent pop subgenres with highly similar stylistic traits. The RoBERTa model leveraging lyric data achieves higher classification accuracy owing to abundant semantic information. By using both the audio and lyric data (bottom left), the average accuracy is significantly boosted because of the complementarity between audio and lyric features. The proposed DM2KMN further elevates classification performance across all genres, verifying its efficacy in music genre recognition.