Deep Multi-Modal Kernel Map Network for Music Genre Classification

Wang, Qun; Jiu, Mingyuan

doi:10.3390/a19060467


by Qun Wang ¹ and Mingyuan Jiu ²,*

¹ School of Music and Dance, Zhongyuan Institute of Science and Technology, Zhengzhou 450046, China

² School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou 450001, China

* Author to whom correspondence should be addressed.

Algorithms 2026, 19(6), 467; https://doi.org/10.3390/a19060467 (registering DOI)

Submission received: 11 March 2026 / Revised: 26 May 2026 / Accepted: 3 June 2026 / Published: 8 June 2026

(This article belongs to the Special Issue Machine Learning Algorithms for Signal Processing)


Abstract

Music genre classification is an important task in the music information retrieval community that aims to categorize music samples by genre; it can help to retrieve music more easily and efficiently from huge digital music resources. There is an extensive literature on music genre classification, and in this study, we solve the problem using multi-modal information, especially based on music audio and text. We propose a deep multi-modal kernel map network that learns discriminative features in a high-dimensional kernel Hilbert space by fusing the multi-modal features. For the music audio, Mel Frequency Cepstral Coefficients (MFCCs) are extracted and a pre-trained ResNet is applied to extract the features. For the texts, the pre-trained RoBERTa model is applied to extract the semantic symbolic features. In the network’s input layer, we calculate four exact/approximated elementary kernel maps from the audio and text features; in the intermediate and final layer, we progressively compute the nonlinear combination of preceding kernel maps of different modalities, followed by a fully connected layer for classification. The network can be trained end-to-end to jointly learn the combination weights between modalities and classifier parameters. We apply the proposed network on the public GTZAN dataset, multi-modal piano genre dataset, and 4MuLA dataset, and the experimental results validate the effectiveness of the proposed deep multi-modal kernel map network for music genre classification.

Keywords:

music genre classification; deep kernel map network; kernel fusion; music information retrieval

1. Introduction

With the rapid growth of digital music resources across social networks, music information retrieval (MIR) [1,2] has become vital for improving accessibility within music recommendation systems. Music retrieval is broadly similar to image retrieval. Instead of content-based retrieval, music is retrieved based on descriptive tags such as music genre (e.g., rock, jazz, classical), instrument type (e.g., piano, violin, symphony), and composer [3]. Music genre classification [4] is therefore crucial for boosting retrieval efficiency. Traditionally, music genre classification only considers audio information, whereas recent work collects and utilizes multi-modal information—e.g., texts, scores, and cover images—to increase classification performance. In this study, we address multi-modal music genre classification by proposing a deep fusion network operating in the kernel map space of features derived from different modalities.

Regarding music genre classification [5,6,7,8], music information retrieval research traces back to Lee and Downie’s [9] survey for online shops and streaming services, as well as composer classification [3,10,11]. Most of these methods adopt a traditional classification pipeline, i.e., feature extraction followed by a machine learning algorithm. Feature extraction can be divided into two categories: methods based on the audio and those based on the score/lyrics [12]. On the one hand, since audio is a one-dimensional time-series signal, Fourier transform-based techniques [13] are usually adopted to extract frequency or time–frequency features. The representation capability of such features is relatively limited because they are manually designed. More recently, deep neural networks (e.g., LSTM [14], VGGish [15], and Transformer [16]) have been designed to extract more discriminative high-level audio features. On the other hand, music scores and lyrics are further kinds of music representation that express a rich symbolic abstraction of the audio. Some methods have been proposed to automatically transcribe audio to symbolic notes, or to generate audio from symbolic notes, using deep neural networks [7,17,18]. In this work, we aim to combine the audio and text/lyric information to improve music genre classification performance.

Recently, multi-modal music genre classification has attracted increasing research attention for two main reasons. On the one hand, large-scale music collections naturally contain diverse modal information, including audio signals, album cover images, and textual lyrics, all of which can be exploited to support genre recognition. On the other hand, the classification performance of single-modal methods has approached a bottleneck, leaving little room for further improvement. In this context, exploiting complementary multi-modal information is an effective way to achieve substantial performance gains, and multi-modal fusion plays a crucial role in doing so. In the literature, there are two ways of working with multi-modal information [19]: (i) early fusion at the feature level, where features from different modalities are fused inside the network, which then outputs a decision; (ii) late fusion at the decision level, where an initial decision confidence is obtained independently for each modality and these confidences are then fused to deliver a final decision. Compared to late fusion, early fusion allows the model to learn the intrinsic interactions between multi-modal signals. Oramas et al. [20] proposed a multi-modal network that maximizes the similarities between different modalities. The authors first trained a network for each modality, and the normalized features were concatenated in the last layer for classification, which can be regarded as a simple fusion strategy. Li et al. [21] investigated feature concatenation, decision weighting, and hybrid fusion for the Mel-spectrogram features from audio data and semantic features from lyric data. Oguike and Primus [22] studied multi-modal Sotho-Tswana music genre classification by extracting features from the visual modality, audio modality, and lyric data and by employing a late fusion strategy.
In this work, we also consider music audio and text description data for multi-modal music genre classification. Our method differs from the aforementioned models in that features extracted from audio and text data are first embedded into distinct Hilbert kernel spaces via separate elementary kernels, and are then further fused within a unified kernel map network to generate highly nonlinear and discriminative representations, which effectively boost overall music genre classification performance.

Regarding classification methods, kernel learning is a classical approach to pattern classification [23,24], for instance, support vector machines (SVMs). It can be formulated as a quadratic optimization problem whose solution is guaranteed to be optimal. The common framework for classification is first to extract the features from the data, then calculate the kernel similarity between them, and finally classify them using SVMs; kernel similarity between the data is therefore a crucial operation. Multiple kernel learning (MKL) learns a linear combination of multiple elementary kernels, and deep kernel networks (DKNs) [25] are further designed to capture complex nonlinear similarity between the data, thus exhibiting excellent classification performance, especially when insufficient data are available. However, on large-scale datasets, optimization becomes infeasible due to the quadratic complexity of computing the kernel matrices. According to the Representer Theorem [26], any positive semi-definite kernel can be written as the inner product of the corresponding kernel map in a high-dimensional Hilbert space. Based on this property, a deep kernel map network (DMN) [27,28] has also been proposed to approximate its DKN counterpart and avoid the heavy computation of deep kernels. However, traditional MKL, DKN, and DMN methods only consider the elementary kernels of features from a single modality, and they do not study the learning adaptability in the multi-modal scenario. For multi-modal data, the similarity between the data can be calculated as a combination of elementary kernels from different modalities, which is regarded as multi-modal fusion. Recently, Wang et al. [29] proposed a non-sparse multi-kernel combination for multi-modal data fusion by imposing a regularized label softening term. Liu et al. [30] designed a multi-modal fusion model based on a multiple kernel learning algorithm with convolution margin-dimension constraints for sentiment analysis. These kernel-based multi-modal fusion methods usually treat the elementary kernels as independent components and learn the interaction weights between the kernels. However, the elementary kernels across different modalities are potentially redundant and coarse-grained, and thus fail to capture precise feature correlations. In this work, we further advance kernel-based multi-modal fusion: the proposed method learns not only the interactions between the kernel features within a single modality, but also the complex interactions between the kernel features across different modalities through end-to-end supervised learning, thereby achieving superior discrimination performance.

In this work, we study the deep nonlinear fusion strategy for audio and text features in a high-dimensional Hilbert space through the deep multi-modal kernel map network for music genre classification. Specifically, our proposed model not only captures the correlations among various elementary kernels from single-modality features, but also explicitly models cross-modal interactions between different modalities at the feature level, rather than at the kernel level. The fusion weights for different features from different modalities can automatically be learned. The study contributions are therefore as follows:

(i) The deep multi-modal kernel map network (DM2KMN) is proposed, jointly learning the combination weights between the modalities and classifier parameters in an end-to-end fashion;
(ii) A multi-modal piano genre dataset (the dataset is available at https://github.com/IntelligentSystemGroup-ZZU/Multi-modal-Piano-Genre-Dataset (accessed on 2 June 2026)) is collected, containing audio recordings of classical piano pieces and corresponding humdrum files. To the best of our knowledge, we are the first to build a multi-modal piano genre dataset with audio and humdrum files;
(iii) Extensive experiments on the GTZAN dataset, multi-modal piano genre dataset, and 4MuLA dataset are conducted, and the results validate the effectiveness of the proposed network.

The remainder of this paper is organized as follows: related works are discussed in Section 2; the deep kernel network and the proposed DM2KMN are presented in Section 3; the experimental results are reported in Section 4; and Section 5 concludes the paper.

2. Related Works

Here we discuss related works on music genre classification and kernel learning methods.

For the music genre classification task [5], music audio representation is usually the first step. There are two different types of features. The first type comprises sophisticated manually designed features, for instance, fast Fourier transform-based techniques [13] and Mel Frequency Cepstral Coefficients (MFCCs) [31]; however, the discriminative ability of these features is usually relatively poor. The second type comprises high-level audio features learned by deep neural networks: in the last decade, extensive research has been conducted on deep neural networks (e.g., convolutional neural networks (CNNs) and long short-term memory (LSTM) [14]) to extract such features for music genre classification [32,33,34,35]. Singh and Biswas [32] performed a deep analysis of the robustness of commonly used musical and non-musical features against deep learning models and found that Mel-Scale-based features and Swaragram features showed high robustness across different datasets. They further introduced a lightweight CNN [36] incorporating a genetic algorithm-based approach with stochastic hyperparameter selection for music genre classification. Ba et al. [35] compared different deep neural networks, such as CNN, LSTM, gated recurrent units (GRUs), and capsule neural networks (CSNs), and found that CSNs with a Mel spectrogram produced excellent results. Yu et al. [37] proposed a deep attention model based on a bidirectional recurrent neural network for music genre classification. Chen et al. [38] proposed a capsule neural network with an upgraded version of the ideal gas molecular movement optimization algorithm. Zhang and Li [18] treated the CNN model as an ensemble system that takes discrete wavelet transform, MFCC, and short-time Fourier transform (STFT) characteristics as input, with the capuchin search algorithm adopted to search each model’s hyperparameters, leading to excellent performance for music genre classification.

Another line of music genre classification focuses on the utilization of symbolic features. Experiments conducted in early research also show that audio features typically perform no better than symbolic features from music scores [12], which are a kind of symbolic abstraction. Different music properties can be obtained from the symbolic files, for instance, rhythm [39], pitch [40], harmony [41], and melody [42]. These features are usually reshaped as a histogram vector. Recent work has focused on automatically transcribing audio to symbolic notes or generating audio from symbolic notes [17]. There are several other noteworthy works on multi-modal music genre classification, especially those employing LLMs to utilize lyrics. Oramas et al. [43] collected a multi-modal music genre dataset (MuMu dataset) that included cover images, text reviews, and audio tracks, and combined several feature embeddings learned from state-of-the-art deep learning networks for classification. They further proposed to apply dimensionality reduction [20] to the target labels, leading to major improvements in multi-label classification regarding not only the accuracy but also the diversity of predicted genres. Wadhwa and Mukherjee [44] proposed a multi-modal fusion network approach and a multiframe convolutional recurrent neural network that utilize both textual (lyrics) and musical (Mel spectrogram) features for music genre classification. Vatolkin and McKay [45] also analyzed the performance of six modalities (audio signals, semantic tags inferred from the audio, symbolic MIDI representations, album cover images, playlist co-occurrences, and lyric texts) for music classification, and showed the relative significance of the different modalities. The Music4All A+A multi-modal dataset with music artists and albums was also collected for experimentation via the CLIP network in [46]. Christodoulou et al. [47] discussed in depth the definition of multi-modality across music disciplines and provided a task-based categorization of multi-modal music datasets, highlighting the direction for multi-modal music processing. It can be seen that the methods proposed for multi-modal music genre datasets are usually simple because the multi-modal music data are heterogeneous and not aligned; it is therefore challenging to fuse them in a unified framework. In this work, we make full use of the audio and text features of the music, and learn their interaction relationships in the kernel map space to obtain better fused features for multi-modal classification.

Our work is also closely related to image–text cross-modal retrieval [48], which aims to achieve information retrieval across heterogeneous modalities, including text, images, and audio, by adopting queries from a single modality. A rich body of studies has focused on text–image cross-modal retrieval. Representative methods in this field can be summarized as follows: CNN-RNN-based frameworks [49,50] adopt convolutional neural networks (CNNs) for visual representation learning and recurrent neural networks (RNNs) for textual modeling. These methods extract uni-modal features independently and construct positive and negative sample pairs according to aligned image–text annotations. In recent years, advanced techniques, including residual learning [50], character-level convolution [51], and spatial attention mechanisms [52], have been introduced into image–text matching tasks to further strengthen the discriminative capability of multi-modal features. In addition, graph neural networks are widely utilized to construct modality-specific graph structures and implement cross-modal retrieval via graph feature matching [53]. With the rapid advancement of large language models (LLMs) [54], vision–language pre-training (VLP) models have enabled efficient extraction of high-quality semantic embeddings. Benefiting from powerful universal representation capabilities, these models have significantly promoted the overall performance of cross-modal retrieval [55,56,57]. Beyond mainstream image–text retrieval research, audio–text retrieval has emerged as a comparable task, which targets retrieving semantically matched audio samples given text-based descriptive queries. Early audio–text retrieval methods heavily relied on predefined category labels to establish cross-modal correspondence [58]. Recent studies, such as [59,60], have further incorporated audio captions and natural language supervision into model training to improve alignment robustness. 
For example, the audio–text retrieval (ATR) framework [59] leverages well-pre-trained audio backbones to extract discriminative acoustic features from large-scale audio datasets, and integrates NetRVLAD pooling [61] to generate comprehensive joint audio–text embeddings. Moreover, optimal metric learning (OML) [60] employs CNNs to capture robust acoustic characteristics and introduces adaptive metric learning constraints to strengthen fine-grained semantic alignment between audio and textual embeddings. It is clear that although cross-modal retrieval and multi-modal music genre classification have different objectives, both must extract features from different modalities and represent them in a common latent feature space.

Kernel-based methods are an extension of multiple kernel learning (MKL) [62,63,64], which aims to learn a convex linear combination of elementary kernels for better pattern representation. Although different optimization algorithms (e.g., constrained quadratic programming [62] and “simpleMKL” based on mixed-norm regularization [64]) have been proposed to guarantee a theoretically optimal solution, the main limitations of MKL are as follows: (i) a convex linear combination can hardly express more complex patterns; (ii) the capability of a shallow architecture is limited. Inspired by the success of deep learning, deep nonlinear kernel networks have therefore been proposed, for instance, acyclic directed graphs between kernels [65], nonlinear combinations of polynomial kernels [66], and arc-cosine kernels [67], which can simulate the forward pass of a large network. Recently, multiple nonlinear layers of MKL [68,69] have been proposed and several nonlinear activation functions investigated, delivering better discrimination performance. Our work also relates to kernel approximation. The main computational cost of kernel-based methods is the calculation of the kernel Gram matrix between the data, which can be accelerated through the inner product between the kernel maps in a high-dimensional Hilbert space according to kernel theory [26]. Different algorithms have been proposed to obtain the approximated maps for different kernels, for instance, Nyström expansion from uniform random samples without replacement [70], random Fourier sampling [71] for stationary kernels and its extension to group-invariant kernels [72], convolutional kernel networks that approximate Gaussian kernels [73,74], and deep hybrid neural–kernel networks [75] that combine features and kernels. Recently, the deep kernel map network [76] has been proposed to handle any deep nonlinear kernel.

3. Methodology

In this section, we first present the deep kernel network, then reformulate it as a deep kernel map counterpart in the kernel map feature space, and finally introduce the neural architecture of the proposed deep multi-modal kernel map network for music genre classification.

3.1. Deep Kernel Networks

Here, we briefly revisit the deep kernel network and illustrate its principle. A deep kernel network comprises a multilayered network that delivers a deep nonlinear kernel from a combination of several elementary kernels. In the network, the unit $p$ in the $l$-th layer (corresponding to a kernel $\{\kappa_{p}^{(l)}\}_{l,p}$) is calculated over the $n_{l-1}$ kernels in the preceding $(l-1)$-th layer:

$$\kappa_{p}^{(l)}(\cdot,\cdot) = g\Big(\sum_{q=1}^{n_{l-1}} w_{p,q}^{(l-1)}\, \kappa_{q}^{(l-1)}(\cdot,\cdot)\Big), \tag{1}$$

with $p \in \{1, \dots, n_{l}\}$ and $q \in \{1, \dots, n_{l-1}\}$, where $g$ refers to a nonlinear activation function (for instance, the hyperbolic tangent for the intermediate layers and the exponential function for the output layer), and $w_{p,q}^{(l-1)}$ are the linear weights between the two layers. Given that $\kappa_{q}^{(l-1)}$ is a positive semi-definite (p.s.d.) kernel, it is guaranteed that the kernel $\{\kappa_{p}^{(l)}\}_{l,p}$ is also p.s.d. when $w_{p,q}^{(l-1)}$ is positive, according to the closure property of p.s.d. kernels w.r.t. algebraic operations such as sum and product. More proof details can be found in [25]. A set of SVMs is then implemented at the output layer for classification.
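As a minimal sketch of the layer rule in Equation (1), the following NumPy code propagates a stack of elementary Gram matrices through two layers. The helper names (`linear_kernel`, `rbf_kernel`, `deep_kernel_layer`), the random data, the layer sizes, and the weight ranges are illustrative assumptions rather than the configuration used in this work.

```python
import numpy as np

def linear_kernel(X, Y):
    # Elementary linear kernel: <x, y>
    return X @ Y.T

def rbf_kernel(X, Y, gamma=0.5):
    # Elementary RBF kernel: exp(-gamma * ||x - y||^2); gamma is an assumed value
    sq = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def deep_kernel_layer(kernels, W, g):
    """One layer of Eq. (1).

    kernels: (n_prev, N, N) Gram matrices of the preceding layer
    W:       (n_cur, n_prev) positive combination weights
    g:       element-wise activation (e.g. np.tanh or np.exp)
    """
    if np.any(W <= 0):
        raise ValueError("Eq. (1) assumes positive weights")
    # Weighted sum over the preceding kernels, then the activation
    return g(np.einsum("pq,qij->pij", W, kernels))

rng = np.random.default_rng(0)
X = 0.5 * rng.normal(size=(6, 4))                      # toy features (assumption)
K0 = np.stack([linear_kernel(X, X), rbf_kernel(X, X)])  # input layer: 2 kernels
W1 = rng.uniform(0.1, 1.0, size=(3, 2))                 # 3 hidden units
W2 = rng.uniform(0.1, 1.0, size=(1, 3))                 # 1 output unit
K1 = deep_kernel_layer(K0, W1, np.tanh)                 # intermediate layer
K2 = deep_kernel_layer(K1, W2, np.exp)                  # output layer
```

The output `K2[0]` is an N × N similarity matrix that could be handed to a kernel SVM; its size shows why the cost grows quadratically with the number of training samples.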

The deep kernel network aims to learn a better similarity function between the samples, which is not directly used for classification. In the training procedure, we therefore first calculate the different elementary kernel Gram matrices between the training data and then feed them forward through the network to calculate the deep kernels, which are integrated into the SVM learning framework via the kernel trick. It is therefore infeasible to train the deep kernel network in an end-to-end fashion, so it is trained in an alternating fashion: first, the network parameters are fixed to train the SVM classifier; then, the SVM is fixed and the gradient of the kernels in the output layer is calculated using the SVM dual formulation and back-propagated through the network.

The deep kernel network provides a convenient way to obtain a better kernel design for different classification tasks. Its drawbacks are as follows: (i) the computational complexity is quadratic in the number of training samples, so it becomes impractical when the training set becomes large; (ii) the kernel fusion and classifier weights are learned independently, leading to a sub-optimal solution; (iii) the current deep kernel network does not consider elementary kernels from different modalities. In the following, we present the deep multi-modal kernel map network, which can effectively address these limitations.

3.2. Deep Kernel Map Network

According to the Representer Theorem, any positive semi-definite kernel can be written as the inner product of its corresponding kernel maps. We can therefore transform the deep kernel network into its corresponding deep kernel map network. Suppose that $\kappa_{p}^{(l-1)}$ has an explicit exact or approximate kernel map $\hat{\phi}_{p}^{(l-1)}(\cdot)$. From Equation (1), the approximate kernel map $\hat{\phi}_{p}^{(l)}(x)$ of the data $x$ for the $p$-th unit in the $l$-th layer can be recursively computed in two steps (a module of a three-layer DMN is shown in Figure 1). First, we concatenate the weighted kernel maps from the units in the $(l-1)$-th layer:

$$\hat{\phi}_{p}^{(l),c}(x) = \Big(\sqrt{w_{p,1}^{(l-1)}}\, \hat{\phi}_{1}^{(l-1)}(x)^{\top} \;\dots\; \sqrt{w_{p,n_{l-1}}^{(l-1)}}\, \hat{\phi}_{n_{l-1}}^{(l-1)}(x)^{\top}\Big)^{\top}. \tag{2}$$

Second, for the nonlinear activation function $g(\cdot)$, we solve the following eigenproblem on a training set $S = \{x_{i}\}_{i=1}^{N}$ to obtain the projections:

$$K_{p}^{(l)} V^{(l)} = V^{(l)} \Lambda^{(l)}, \tag{3}$$

where $K_{p}^{(l)}$ is the kernel matrix on $S$ in the $l$-th layer, $\Lambda^{(l)}$ and $V^{(l)}$ are the eigenvalues and the associated eigenvectors of $K_{p}^{(l)}$, and $U_{p}^{(l)} = V^{(l)} / \sqrt{\Lambda^{(l)}}$ is a normalized linear transformation matrix. The approximate kernel maps are

$$\hat{\phi}_{p}^{(l)}(x)^{\top} = \Big(g\big(\langle \hat{\phi}_{p}^{(l),c}(x), \hat{\phi}_{p}^{(l),c}(x_{1}) \rangle\big) \;\dots\; g\big(\langle \hat{\phi}_{p}^{(l),c}(x), \hat{\phi}_{p}^{(l),c}(x_{N}) \rangle\big)\Big)\, U_{p}^{(l)}, \tag{4}$$

where $g\big(\langle \hat{\phi}_{p}^{(l),c}(x), \hat{\phi}_{p}^{(l),c}(x_{1}) \rangle\big) \approx \kappa_{p}^{(l)}(x, x_{1})$; more precisely, $g(\cdot)$ is the hyperbolic tangent function for the intermediate layers and the exponential function for the output layer. It is guaranteed that the approximate kernel $\hat{\kappa}_{p}^{(l)} = \langle \hat{\phi}_{p}^{(l)}(x), \hat{\phi}_{p}^{(l)}(x) \rangle$ equals the original kernel $\kappa_{p}^{(l)}$ when the data $x$ belongs to $S$; otherwise, the approximation approaches the original kernel as $N$ becomes sufficiently large (see [76] for the proof).
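The two steps in Equations (2)–(4) can be illustrated with a small NumPy sketch. It assumes two preceding units whose maps are given directly as random arrays, uses the exponential activation, and discards eigenvalues below a small threshold `eps` for numerical stability; the function names, sizes, and threshold are illustrative assumptions and not part of the method description.

```python
import numpy as np

def concat_weighted_maps(maps, w):
    # Eq. (2): concatenate preceding-layer maps, each scaled by sqrt(weight)
    return np.hstack([np.sqrt(wq) * phi for wq, phi in zip(w, maps)])

def fit_projection(phi_c_train, g, eps=1e-10):
    # Eq. (3): eigendecomposition of the layer kernel matrix on the training set S
    K = g(phi_c_train @ phi_c_train.T)
    lam, V = np.linalg.eigh(K)
    keep = lam > eps                       # drop (near-)null directions (assumption)
    return V[:, keep] / np.sqrt(lam[keep])  # U = V / sqrt(Lambda), column-wise

def kernel_map(phi_c, phi_c_train, U, g):
    # Eq. (4): activated similarities to the N training samples, projected by U
    return g(phi_c @ phi_c_train.T) @ U

rng = np.random.default_rng(0)
N = 20
# Toy preceding-layer maps for two units (assumed, e.g. one per modality)
maps = [0.3 * rng.normal(size=(N, 4)), 0.3 * rng.normal(size=(N, 6))]
w = [0.7, 0.3]                             # positive combination weights

phi_c = concat_weighted_maps(maps, w)      # (N, 10)
U = fit_projection(phi_c, np.exp)
Phi = kernel_map(phi_c, phi_c, U, np.exp)  # maps of the training samples

K = np.exp(phi_c @ phi_c.T)                # layer kernel on S
max_err = np.abs(Phi @ Phi.T - K).max()    # reconstruction error on S
```

On the training set the inner products of the maps reproduce the layer kernel up to numerical precision, which matches the guarantee stated for samples in $S$; a new sample would be mapped by calling `kernel_map` with its own concatenated map as the first argument.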

From Equation (4), the parameters of the $l$-th layer in the DMN include $\{\{x_{i}\}_{i=1}^{N}, U^{(l)}, \sqrt{w^{(l-1)}}\}$, where the first two sets of parameters are kernel map approximation parameters and $\sqrt{w^{(l-1)}}$ are the discrimination weights from a pre-learned DKN. The initial DMN is built in an unsupervised way so that it approximates the original DKN as closely as possible, and a supervised learning algorithm then jointly learns all the parameters in an end-to-end fashion using the squared hinge loss function in the LIBLINEAR toolbox [77]. Since Jiu and Sahbi [27,76] apply DKNs and DMNs to image annotation, where the labels are not mutually exclusive, it is feasible to learn a set of independent SVMs on the network, one for each class, and back-propagate the sum of the gradients from all classes. However, in the classical music genre classification task the labels are mutually exclusive, so we replace the final layer with a logistic regression layer trained with the cross-entropy loss function.

3.3. Deep Multi-Modal Kernel Map Network

In the deep kernel map network, the elementary kernels in the input layer are calculated from a single modality of the data. In this work, we extend the deep kernel map network to the deep multi-modal kernel map network. Assume that we have $M$ modalities of signals (such as audio and text features), represented as $\{X^{m}\}_{m=1}^{M}$. The overall pipeline of the framework comprises three stages: (1) in the first stage, for each modality $X^{m}$, the extracted features are written as $\{x_{1}^{m}, \dots, x_{N}^{m}\}$; here, we extract the audio features and the semantic features from the texts, and more details are provided in Section 4.1; (2) in the second stage, a set of elementary kernels $\{\kappa_{p}^{(1)}\}$ (i.e., linear, polynomial, RBF, and histogram intersection kernels) are pre-computed for the features from the different modalities, and their corresponding kernel maps are calculated according to Appendix A.1; (3) finally, the kernel maps are propagated through the proposed deep multi-modal kernel map network and the network parameters are learned by end-to-end supervised learning, as described in the following. For brevity, the superscripts and subscripts of all notations are omitted unless otherwise specified.
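The second stage can be sketched as follows: for each modality, the four elementary kernels named above are evaluated on the training features, giving one Gram matrix per (modality, kernel) pair. The random non-negative arrays stand in for the extracted audio and text features, and the feature dimensions, kernel hyperparameters, and the choice to apply all four kernels to both modalities are assumptions made for illustration only.

```python
import numpy as np

def linear(X, Y):
    return X @ Y.T

def polynomial(X, Y, degree=2, c=1.0):
    # (<x, y> + c)^degree; degree and c are assumed values
    return (X @ Y.T + c) ** degree

def rbf(X, Y, gamma=0.1):
    sq = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def hist_intersection(X, Y):
    # sum_k min(x_k, y_k); assumes non-negative, histogram-like features
    return np.minimum(X[:, None, :], Y[None, :, :]).sum(-1)

ELEMENTARY = {"linear": linear, "poly": polynomial,
              "rbf": rbf, "hik": hist_intersection}

def input_layer_kernels(modalities):
    """Return {(modality, kernel): (N, N) Gram matrix} over all pairs."""
    return {(m, k): fn(X, X)
            for m, X in modalities.items()
            for k, fn in ELEMENTARY.items()}

rng = np.random.default_rng(0)
N = 8
# Stand-ins for per-sample audio and text feature vectors (assumption)
modalities = {"audio": rng.random((N, 16)), "text": rng.random((N, 32))}
grams = input_layer_kernels(modalities)
```

Each Gram matrix in `grams` would then be turned into an exact or approximate kernel map before entering the network, so that kernels from different modalities can be combined unit by unit.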

Compared to other common networks, whose input is the features themselves, the DM2KMN maps the data from the raw space to a high-dimensional Hilbert space. The pre-computed exact/approximated kernel maps of the elementary kernels from the different modalities form the input layer, the kernel maps in the intermediate layers are then recursively computed via Equations (2) and (4), and finally a logistic regression layer with weights $w^{LR}$ is added to the network as the last layer for classification. The framework of the DM2KMN is shown in Figure 1.

3.4. End-to-End Supervised Learning

The deep multi-modal kernel map network is designed to not only fit its multi-modal DKN counterpart as closely as possible, but also to improve the calculation efficiency on large-scale datasets. To enhance the discrimination ability, the weights $\{w_{p,q}^{(l)}\}_{l,p,q}$ can be learned jointly with $\{\hat{\phi}_{p}^{(l),c}(x_{i})\}_{i,l,p}$ and $\{U_{p}^{(l)}\}_{l,p}$ in Equations (2) and (4) in an end-to-end fashion.

Let $T = \{(X_{i}, y_{i}^{c})\}_{i=1}^{N}$ be a multi-modal training set with $N$ samples, $M$ modalities, and $C$ classes; here, $X_{i} = \{x_{i}^{1}, \dots, x_{i}^{M}\}$ is the multi-modal data of the $i$-th sample and $y_{i}^{c}$ is its class label, with $y_{i}^{c} = +1$ if the data $x_{i}^{m}$ belongs to class $c$ and $y_{i}^{c} = -1$ otherwise. In the logistic regression layer, a final layer with weights $w^{LR}$ is used to calculate the decision score via:

$$f_{i}(X_{i}) = {w_{\cdot i}^{LR}}^{\top}\, \hat{\phi}_{1}^{(L)}(X_{i}). \tag{5}$$

The Softmax operator is used to obtain the confidence probabilities: $p_{i}^{c}(X_{i}) = \frac{\exp(f_{i}^{c})}{\sum_{k=1}^{C} \exp(f_{i}^{k})}$. The label $c$ with maximum confidence is assigned to the sample. For the loss function, we minimize a cross-entropy loss criterion $E$ with $\ell_{2}$-norm penalization:

$$\min_{\hat{\phi}_{p}^{(l),c},\, U_{p}^{(l)},\, w_{p,q}^{(l)},\, w^{LR}} E = -\sum_{i=1}^{N} \sum_{c=1}^{C} y_{i}^{c} \log(p_{i}^{c}) + \|w^{LR}\|^{2} + \sum_{k=1}^{K} \big\|\sqrt{w^{(k)}}\big\|^{2}, \tag{6}$$

where the first term is the cross-entropy loss, and the remaining terms are $\ell_{2}$ regularization terms for the network parameters and discrimination weights. Since the whole network is an end-to-end framework, we can adopt back-propagation and the gradient descent algorithm to minimize Equation (6). The derivative of $E$ w.r.t. $p_{i}^{c}$ is computed by:

$$\frac{\partial E}{\partial p_{i}^{c}} = p_{i}^{c} - y_{i}^{c}. \tag{7}$$
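The form of Equation (7) can be checked numerically with a short sketch. It makes two assumptions that should be read with care: the targets are one-hot vectors in {0, 1} (the text above encodes labels as ±1), and the derivative is taken with respect to the scores $f_{i}^{c}$ that enter the Softmax, which is the setting in which the expression $p_{i}^{c} - y_{i}^{c}$ is the standard cross-entropy gradient. The regularization terms of Equation (6) are left out.

```python
import numpy as np

def softmax(F):
    # Row-wise Softmax with the usual max-shift for numerical stability
    Z = F - F.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def cross_entropy(F, Y):
    # First term of Eq. (6), summed over samples and classes
    return -(Y * np.log(softmax(F))).sum()

rng = np.random.default_rng(0)
N, C = 5, 4
F = rng.normal(size=(N, C))                 # decision scores f_i^c (toy values)
Y = np.eye(C)[rng.integers(0, C, size=N)]   # one-hot targets (assumption)

analytic = softmax(F) - Y                   # p - y

# Central finite differences w.r.t. each score
numeric = np.zeros_like(F)
h = 1e-6
for i in range(N):
    for c in range(C):
        D = np.zeros_like(F)
        D[i, c] = h
        numeric[i, c] = (cross_entropy(F + D, Y) - cross_entropy(F - D, Y)) / (2 * h)

max_diff = np.abs(analytic - numeric).max()
```

Under these assumptions the analytic and finite-difference gradients agree closely, so `analytic` can serve as the error signal that is propagated into the output-layer kernel map.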

Then the derivative of $E$ w.r.t. the kernel map $\hat{\phi}_{1}^{(L)}(X_{i})$ in the output layer is:

$$\frac{\partial E}{\partial \hat{\phi}_{1}^{(L)}(X_{i})} = \frac{\partial E}{\partial p_{i}^{c}}\, w^{LR}. \tag{8}$$

The gradient of the loss function w.r.t. $w^{LR}$ is calculated as

$$\Delta w^{LR} = \hat{\phi}_{1}^{(L)}(X_{i})^{\top}\, \frac{\partial E}{\partial p_{i}^{c}}. \tag{9}$$

Next, we recursively obtain the other gradients in the preceding layers $l = L-1, \dots, 1$ layer-wise. Here, we show the exact calculation form for the parameters in one module based on a back-propagation procedure. Assuming that the derivative of $E$ w.r.t. $\hat{\phi}_{p}^{(l)}(X_{i})$ in layer $l$ is already known, it is back-propagated to $\hat{\kappa}_{p}^{(l)}$ according to Equation (4) by

$$\frac{\partial E}{\partial \hat{\kappa}_{p}^{(l)}(X, X_{i})} = \Big(\frac{\partial E}{\partial \hat{\phi}_{p}^{(l)}(X)}\Big)^{\top} \big[U_{p}^{(l)}\big]_{i}^{\top}, \tag{10}$$

where $[\cdot]_{i}$ denotes the $i$-th row of a matrix. Since $\hat{\kappa}_{p}^{(l)}(X, X_{i})$ is the nonlinear activation $g(\cdot)$ over the approximated kernel $f_{p}^{(l)}(X, X_{i}) = \langle \hat{\phi}_{p}^{(l),c}(X), \hat{\phi}_{p}^{(l),c}(X_{i}) \rangle$, the derivative w.r.t. the approximated kernel $f_{p}^{(l)}(X, X_{i})$ is:

$$\frac{\partial E}{\partial f_{p}^{(l)}(X, X_{i})} = g'\big(f_{p}^{(l)}(X, X_{i})\big)\, \frac{\partial E}{\partial \hat{\kappa}_{p}^{(l)}(X, X_{i})}, \tag{11}$$

where

g^{'} (\cdot)

is the derivative of the activation function; in our case, for the hyperbolic tangent function

g^{'} (\cdot) = 1 - \tanh {(\cdot)}^{2}

and for the exponential function

g^{'} (\cdot) = g (\cdot)

. The derivative w.r.t.

{\hat{ϕ}}_{p}^{l, c} (X)

is then calculated by summing all the associated terms

f_{p}^{(l)} (X, X_{i})

:

\frac{\partial E}{\partial {\hat{ϕ}}_{p}^{l, c} (X)} = \sum_{i = 1}^{N} {\hat{ϕ}}_{p}^{l, c} (X_{i}) \frac{\partial E}{\partial f_{p}^{(l)} (X, X_{i})} .

(12)
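Equations (11) and (12) amount to the chain rule through the activation g and the inner product f. The following sketch illustrates this for a single unit, assuming the combined maps of the reference samples X_1, ..., X_N are stacked row-wise in a matrix; all names are hypothetical.

```python
import numpy as np

def backprop_kernel_unit(a, B, dE_dkappa, activation="tanh"):
    """a: (D,) combined kernel map of X; B: (N, D) combined maps of X_1..X_N;
    dE_dkappa: (N,) upstream gradients dE/d kappa(X, X_i)."""
    f = B @ a                                  # f(X, X_i) = <phi(X), phi(X_i)>
    if activation == "tanh":
        g_prime = 1.0 - np.tanh(f) ** 2        # derivative of tanh
    elif activation == "exp":
        g_prime = np.exp(f)                    # derivative of exp is exp itself
    else:
        raise ValueError(f"unknown activation: {activation}")
    dE_df = g_prime * dE_dkappa                # Eq. (11)
    dE_da = B.T @ dE_df                        # Eq. (12): sum over all X_i
    return dE_df, dE_da
```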

According to Equation (2), the derivatives w.r.t.

{\hat{ϕ}}_{q}^{(l - 1)} (X)

at the

(l - 1)

-th layer can thus be obtained by

\frac{\partial E}{\partial {\hat{ϕ}}_{q}^{(l - 1)} (X)} = \sqrt{w_{p, q}^{(l - 1)}} Frag {(\frac{\partial E}{\partial {\hat{ϕ}}_{p}^{l, c} (X)})}_{q},

(13)

where

Frag {(\cdot)}_{q}

is the corresponding fragment of kernel maps with respect to the unit q at the

(l - 1)

-th layer of the multi-modal DKN.

Finally, the gradients of the loss function w.r.t.

U_{p}^{(l)}

,

{\hat{ϕ}}_{p}^{l, c} (X_{i})

and

\sqrt{w_{p, q}^{(l - 1)}}

can be calculated as:

Δ U_{p}^{(l)} = {({\hat{κ}}_{p}^{(l)} (X, X_{1}) \dots {\hat{κ}}_{p}^{(l)} (X, X_{N}))}^{⊤} {(\frac{\partial E}{\partial {\hat{ϕ}}_{p}^{(l)} (X)})}^{⊤}

(14)

Δ {\hat{ϕ}}_{p}^{l, c} (X_{i}) = \frac{\partial E}{\partial f_{p}^{(l)} (X, X_{i})} {\hat{ϕ}}_{p}^{l, c} (X) .

(15)

Δ \sqrt{w_{p, q}^{(l - 1)}} = Frag {(\frac{\partial E}{\partial {\hat{ϕ}}_{p}^{l, c} (X)})}_{q} {({\hat{ϕ}}_{q}^{(l - 1)} (X))}^{⊤} + 2 \sqrt{w_{p, q}^{(l - 1)}} .

(16)

Here we treat

\sqrt{w_{p, q}^{(l - 1)}}

as an independent parameter. Once all the gradients of the deep multi-modal kernel map network are obtained, the parameters can be updated by gradient descent with a fixed learning rate

η_{d m n}

as follows:

\{\begin{matrix} U_{p}^{(l)} & \leftarrow U_{p}^{(l)} - η_{d m n} Δ U_{p}^{(l)} \\ {\hat{ϕ}}_{p}^{l, c} (X_{i}) & \leftarrow {\hat{ϕ}}_{p}^{l, c} (X_{i}) - η_{d m n} Δ {\hat{ϕ}}_{p}^{l, c} (X_{i}) \\ \sqrt{w_{p, q}^{(l)}} & \leftarrow \sqrt{w_{p, q}^{(l)}} - η_{d m n} Δ \sqrt{w_{p, q}^{(l)}} \\ w^{L R} & \leftarrow w^{L R} - η_{d m n} Δ w^{L R} \end{matrix}

(17)
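The update rule in Equation (17) is plain gradient descent applied to every parameter group. The sketch below is illustrative only; the dictionary layout and function names are hypothetical. It also shows the ℓ_2 contribution to the gradient of the square-root weights from Equation (16). One consequence of updating the square root rather than the weight itself is that the effective weight, obtained by squaring, remains non-negative after every step.

```python
import numpy as np

def sqrt_w_gradient(data_grad, sqrt_w):
    """Eq. (16): data term plus the gradient 2*sqrt(w) of the l2 regulariser."""
    return data_grad + 2.0 * sqrt_w

def sgd_step(params, grads, lr=1e-4):
    """Eq. (17): theta <- theta - lr * grad for every parameter array.
    Returns new arrays and leaves the inputs untouched."""
    return {name: value - lr * grads[name] for name, value in params.items()}
```

For example, `sgd_step({"sqrt_w": v, "U": U}, {"sqrt_w": gv, "U": gU})` returns the updated arrays, and `new["sqrt_w"] ** 2` gives the effective combination weights.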

The whole supervised learning procedure is shown in Algorithm 1.

Algorithm 1: End-to-end supervised learning for the proposed DM2KMN

4. Experiments

There are several widely used benchmark datasets for music genre classification, for instance, the GTZAN [78] and Extended Ballroom [79] datasets. However, these datasets only contain audio files and are therefore not suited to multi-modal music genre classification. Recently, multi-modal music datasets have been explored for cross-modal music retrieval, for instance, the 4MuLA dataset [80] with audio and lyrics, the MuMu dataset [20], which integrates audio, images, and genre tags, and the LMD-Aligned dataset [45] with six modalities: audio, lyrics, symbolic data, model-based data (e.g., semantic descriptors), album cover images, and playlists. With the help of a professional piano musician, we also compiled a multi-modal piano genre dataset in which each sample consists of an audio file and its corresponding humdrum file, which provides a hierarchical symbolic description of the music score.

In the following, to fully investigate the performance of the proposed network, we first evaluate the proposed method in a single-modality context on the GTZAN dataset, and then study its adaptability on multi-modal music datasets (i.e., the multi-modal piano genre dataset and the 4MuLA dataset). All the experiments were performed on a workstation with a 4-core 3.20 GHz Intel Xeon(R) W-2104 CPU and an NVIDIA GeForce RTX 3090 GPU. The random seeds for all the experiments were generated from the system runtime.

4.1. Data Preprocessing

Feature extraction from audio file: Each audio piece is first divided into a set of short overlapping windows of length 0.03 s. MFCCs are then extracted for each window as a set of coefficients obtained by a linear cosine transform of the log power spectrum on the nonlinear Mel frequency scale:

Mel (f) = 2595 \log_{10} (1 + \frac{f}{700}),

(18)

where f is the frequency value. We can obtain a spectrogram image for each audio file, and then a pre-trained ResNet-101 model is used to extract the features to represent the audio data.
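As a small illustration of the framing and Mel mapping described above, the sketch below converts frequencies to the Mel scale using the widely used constants 2595 and 700 with a base-10 logarithm, and splits a signal into overlapping 0.03 s windows. The 0.015 s hop size and the function names are assumptions; the text only states that the windows overlap.

```python
import numpy as np

def hz_to_mel(f_hz):
    """Map frequency in Hz to the Mel scale (common 2595 * log10 form)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

def frame_signal(signal, sample_rate, win_s=0.03, hop_s=0.015):
    """Split a 1-D signal into overlapping windows of win_s seconds.
    hop_s is an assumed value; the signal must be at least one window long."""
    win = int(round(win_s * sample_rate))
    hop = int(round(hop_s * sample_rate))
    n_frames = 1 + (len(signal) - win) // hop
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[idx]                      # shape (n_frames, win)
```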

Symbolic features from text file: For the multi-modal piano genre dataset, the text comes from humdrum files; humdrum is a robust metadata format that covers a wide range of symbolic features and is widely used in computational musical analysis. For the 4MuLA dataset, the text information comes from the lyrics. For both datasets, we extract the symbolic features by using the pre-trained RoBERTa model, which tokenizes the input text sequence, extracts the embeddings, and uses the [CLS] token to represent the entire text sequence. RoBERTa is a robustly optimized variant of BERT; the version used here follows the BERT-Large architecture, with 16 attention heads, 24 hidden layers, and 1024 hidden units. Compared to the original BERT-Large model, it was pre-trained on a larger dataset (approximately 160 GB) and adopts a dynamic masking strategy. In our experiments, we first segment each text file into a set of clips of 512 tokens, then employ a frozen pre-trained RoBERTa model to extract a 1024-dimensional semantic feature for each clip, and finally perform max pooling over the clips to represent the text file.
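The segment-then-pool step can be sketched independently of the language model. In the sketch below, `encode_clip` stands for any function that maps a clip of token ids to a fixed-length feature vector (in the paper, the frozen RoBERTa [CLS] representation); `dummy_encoder` is a purely hypothetical stand-in so that the example runs without downloading a model. How the final, shorter clip is padded is left to the encoder and is an assumption here.

```python
import numpy as np

def split_into_clips(token_ids, clip_len=512):
    """Cut a token-id sequence into consecutive clips of at most clip_len tokens."""
    return [token_ids[i:i + clip_len] for i in range(0, len(token_ids), clip_len)]

def pool_text_features(token_ids, encode_clip, clip_len=512):
    """Encode every clip and max-pool the clip features element-wise."""
    clips = split_into_clips(token_ids, clip_len)
    feats = np.stack([encode_clip(c) for c in clips])   # (n_clips, dim)
    return feats.max(axis=0)                             # (dim,)

def dummy_encoder(clip, dim=1024):
    """Hypothetical stand-in for a frozen text encoder: deterministic pseudo-features."""
    rng = np.random.default_rng(sum(clip) % (2 ** 32))
    return rng.normal(size=dim)
```

Max pooling keeps, for every feature dimension, the strongest response over all clips, so the file-level feature has the same dimensionality regardless of text length.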

4.2. Results on the GTZAN Dataset

The GTZAN dataset, one of the first and most widely used benchmarks for music genre classification, comprises 1000 music clips, each with a duration of 30 s and sampled at 22.05 kHz. All audio files are recorded in WAV format. The dataset encompasses ten distinct music genres: Pop, Reggae, Rock, Hip Hop, Jazz, Blues, Country, Disco, Classical, and Metal. Each genre has 100 audio samples. Since this dataset has no lyric information, we first validated the performance of the deep kernel map network on audio features alone, rather than on a combination of different modalities. Following [34], the data is randomly split into 70% for training and 30% for testing. The performance was measured by classification accuracy on the test set.

We computed the MFCCs for each audio file, and a pre-trained ResNet-101 with frozen parameters was used to produce deep features of dimensionality 1000. For comparison with the MFCC features, we also applied a frozen pre-trained VGGish [15] model to obtain a second kind of deep feature of dimensionality 128. For both kinds of features, we calculated four different elementary kernels (i.e., linear, second-order polynomial, RBF, and histogram intersection kernels) as well as their exact/approximated kernel maps. The standard deviation of the RBF kernel is calculated over all the samples in the dataset. In the kernel PCA used for the RBF kernel map approximation, the eigenvalues accounting for 99% of the energy are preserved. For the histogram intersection kernel map approximation, the maximum quantization level Q is set to 10. We first built deep kernel networks for the MFCC and VGGish [15] features, respectively, where the depth of the network is chosen empirically to be 3 and the number of units in the hidden layer is twice the number of input units, following [25]. The learned deep kernel network can then be reformulated as a corresponding deep kernel map network according to Section 3.2, where the energy threshold is also set to 99% and

S

is set to all the training samples when the intermediate deep kernel maps are computed. The weights

w^{L R}

are initialized with a Gaussian distribution

0.01 \times N (0, 1)

, the learning rate is set to be

10^{- 4}

, and the maximum number of training epochs is set to 400,000. The stochastic gradient descent algorithm with a constant learning rate is applied to update the weights. The model is selected via five-fold cross-validation on the training set. To further improve the performance, the kernels of both features can be combined to build a deep multi-modal kernel network and a deep multi-modal kernel map network.
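The 99% energy rule for the approximated RBF kernel map can be illustrated with a standard kernel PCA construction. This is a sketch under stated assumptions rather than the exact form of Equation (3): the Gram matrix over the reference set is eigendecomposed, the leading eigenpairs covering 99% of the spectrum are kept, and a sample is mapped through its kernel values against the reference set. The bandwidth `sigma` is passed in explicitly, and all names are hypothetical.

```python
import numpy as np

def rbf_gram(X, Z, sigma):
    """RBF kernel values exp(-||x - z||^2 / (2 sigma^2)) for all pairs."""
    d2 = (X ** 2).sum(1)[:, None] + (Z ** 2).sum(1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))

def fit_kernel_map(K, energy=0.99):
    """Return U of shape (N, r) such that phi(x) = k(x, S) @ U, where the r
    leading eigenvalues of the Gram matrix K hold `energy` of the spectrum."""
    vals, vecs = np.linalg.eigh(K)                 # ascending order
    vals, vecs = vals[::-1], vecs[:, ::-1]         # make it descending
    vals = np.clip(vals, 0.0, None)                # remove tiny negative noise
    ratio = np.cumsum(vals) / vals.sum()
    r = min(int(np.searchsorted(ratio, energy)) + 1, len(vals))
    return vecs[:, :r] / np.sqrt(vals[:r])
```

With `phi = K @ U`, the product `phi @ phi.T` reproduces `K` up to the discarded part of the spectrum, so a tighter energy threshold trades a larger map dimension for a smaller approximation error.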

Table 1 shows the results of the different methods on the GTZAN dataset, where the accuracy values of the comparison methods are taken directly from the references, except for the three-layer DKN and the three-layer DM2KMN. All the experiments were conducted on the same training/test splits. For the three-layer DKN and the three-layer DM2KMN, we performed three independent runs and report the mean accuracies with their standard deviations. The following can be observed: (i) The methods based on deep neural networks (i.e., CNN, Bi-LSTM and PCNN) usually deliver better performance, demonstrating that the deep learned features are discriminative. (ii) The performance of the deep kernel network on the MFCC and VGGish features is competitive, and the counterpart deep kernel map networks deliver slightly better performance. (iii) The performance on the VGGish features is slightly worse than that on the MFCCs, which is in accordance with the empirical results in [32]. (iv) The deep multi-modal kernel network built from the MFCC and VGGish features exhibits better performance, and its kernel map counterpart also obtains impressive results. These results empirically validate that the proposed deep multi-modal kernel map network is effective for multiple types of features from a single modality. In the following section, we show the performance of the proposed network for multiple modalities, in particular for the feature maps from audio files and text information.

4.3. Results on the Multi-Modal Piano Genre Dataset

We now apply the proposed deep multi-modal kernel map network to the multi-modal piano genre dataset, which contains 985 piano pieces from four genres (i.e., Baroque, 243 pieces; Classical, 236 pieces; Romantic, 258 pieces; Modern, 248 pieces) used in piano music education, including works by well-known composers such as Johann Sebastian Bach and Ludwig van Beethoven (all the piano pieces are out of the copyright protection period). The details of the selected piano samples for each genre are shown in Table A1. For each sample, we collected the audio file (MID format) and its corresponding Kern file (humdrum format), which is derived from the score file and contains hierarchical description information about key signature, dynamics, tempos, notes, etc. A large number of audio files and their humdrum-format score files are available in the KernScores library at http://kern.humdrum.org/ (accessed on 2 June 2026). The piano audio and humdrum data cannot be well aligned, since the Kern file is a text file encoding the high-level abstract semantics of piano scores, rather than a conventional score composed of sequential note symbols. When the corresponding humdrum file was not available, we replaced it with the MusicXML file downloaded from the Musescore website (https://musescore.com/ (accessed on 2 June 2026)), which can be easily converted into a humdrum file by using the Verovio Humdrum Viewer software. The genre label of each piano sample was cross-checked by three professional piano professors, because some piano works were created in the intermediate period spanning two genres.

In the experiments, we randomly split the dataset into two subsets: 600 samples for training and the remaining 385 for testing. The sample number of each genre in the training and test sets is shown in Table 2. The number of training samples differs slightly across genres; therefore, a set of class weights inversely proportional to the number of training samples is assigned to mitigate data imbalance. For each sample, MFCC and VGGish features are extracted from the audio, and symbolic features are extracted from the humdrum files by using the pre-trained RoBERTa model. The same four elementary kernels (i.e., linear, second-order polynomial, RBF, and histogram intersection kernels) and their exact/approximated kernel maps are calculated as in the experiments on the GTZAN dataset. Similarly, the weight initialization and the learning rate follow the same settings as those on the GTZAN dataset. The discrimination model is selected by five-fold cross-validation on the training subset. The performance (i.e., accuracy) is evaluated on the test set.
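Class weights that are inversely proportional to the number of training samples can be computed as in the following sketch. The normalisation to a mean weight of 1 is an assumption (the text does not specify one), and the per-genre counts in the usage note are hypothetical placeholders rather than the values in Table 2.

```python
import numpy as np

def inverse_frequency_weights(labels, n_classes):
    """Per-class weights proportional to 1 / (number of training samples),
    rescaled so that the weights average to 1. Every class must be present."""
    counts = np.bincount(labels, minlength=n_classes).astype(float)
    weights = 1.0 / counts
    return weights * n_classes / weights.sum()
```

For instance, with hypothetical counts of 150, 140, 160, and 150 training samples, the second class receives the largest weight and the third the smallest, and each class contributes the same total weight to the loss.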

4.3.1. Ablation Study

We first investigate the performance of different elementary kernels from different modalities. In addition to the MFCC features, we also compute deep features via VGGish [15]. Their performance is shown in Table 3. The following can be seen: (i) the performance of the VGGish features is worse than that of the MFCC features because the VGGish features are better suited to describing audio scenes; (ii) the symbolic features clearly outperform the audio features by 15% for all four elementary kernels, which validates that the semantic features of symbolic notes from the humdrum format are more discriminative.

We further investigate the performance of multi-modal deep kernel networks. It is empirically found that the 3-layer multi-modal deep kernel network achieves the best trade-off between performance and computational complexity [25]. The elementary kernel settings remain the same as those in the GTZAN dataset, such that there are in total eight elementary kernels for both modalities in the input layer. In Table 4, different hidden unit numbers are investigated, and it is observed that 16 hidden units exhibit better performance, possibly due to overfitting when the number of hidden units increases. A deep multi-modal kernel network with three layers and 16 hidden units is therefore chosen to construct the deep multi-modal kernel map network. In addition, we also study the performance of audio and text respectively in the DKNs, where the number of hidden units is twice that of the input units. The results are shown in Table 5. For the audio modality, the average accuracy is 72.03%, while that of the text modality reaches 83.46%, indicating a stronger representation capability of the semantic features.

4.3.2. Performance Comparison

According to Section 3.2, an initial DM2KMN is built using the learned 3-layer multi-modal kernel network. The hidden unit number is set to 16 according to Table 4. For the intermediate layer, all the training samples are initialized via kernel PCA computation, as shown in Equation (4), to maximize the approximation ability. We compare the multi-modal DKNs, the initial DM2KMN, and the learned DM2KMN based on three aspects: (i) accuracy; (ii) relative approximation error (RE) between the learned DM2KMNs w.r.t. their counterpart multi-modal DKN, which is defined on the given data

T

as

RE = \frac{1}{{| T |}^{2}} \sum_{x, x^{'} \in T} \frac{| 〈 {\hat{ϕ}}_{1}^{(3)} (x), {\hat{ϕ}}_{1}^{(3)} (x^{'}) 〉 - κ_{1}^{(3)} (x, x^{'}) |}{| 〈 {\hat{ϕ}}_{1}^{(3)} (x), {\hat{ϕ}}_{1}^{(3)} (x^{'}) 〉 | + | κ_{1}^{(3)} (x, x^{'}) |} \times 100 %,

(19)

where

{\hat{ϕ}}_{1}^{(3)} (x)

is the output kernel map of the learned DM2KMN, and

κ_{1}^{(3)} (x, x^{'})

is the original kernel value of the DKNs; (iii) relative importance (RI) of each kernel map in the input layer w.r.t. the output for the learned multi-modal DKN and DM2KMN, which is defined as

{RI}_{q o} = \frac{\sum_{p = 1}^{n} (w_{q, p}^{(1)} w_{p, o}^{(2)} / \sum_{k = 1}^{m} w_{k, p}^{(1)})}{\sum_{q^{'} = 1}^{m} \sum_{p = 1}^{n} (w_{q^{'}, p}^{(1)} w_{p, o}^{(2)} / \sum_{k = 1}^{m} w_{k, p}^{(1)})} \times 100 %

(20)

where the subscripts q,

q^{'}

and k refer to input units, p is a hidden unit, o is an output unit, and

w_{q, p}^{(l)}

is the weight from unit q to unit p in the

(l)

-th layer. The relative importance of an input unit with respect to an output unit thus accumulates its contributions through every hidden unit.
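Both measures are straightforward to compute once the kernel values and the layer weights are available as arrays. The sketch below follows Equations (19) and (20) under a few assumptions: `K_map` holds the inner products of the output kernel maps and `K_dkn` the original kernel values on the same data, `W1[q, p]` and `W2[p, o]` store the first- and second-layer weights, and a small `eps` is added to the denominator of RE to avoid division by zero. The names are hypothetical.

```python
import numpy as np

def relative_error(K_map, K_dkn, eps=1e-12):
    """Eq. (19): mean relative difference between two Gram matrices, in percent."""
    num = np.abs(K_map - K_dkn)
    den = np.abs(K_map) + np.abs(K_dkn) + eps   # eps is an added safeguard
    return 100.0 * np.mean(num / den)

def relative_importance(W1, W2):
    """Eq. (20): importance of each input unit q for each output unit o, in percent.
    W1: (m, n) input-to-hidden weights, W2: (n, n_out) hidden-to-output weights."""
    share = W1 / W1.sum(axis=0, keepdims=True)  # share of input q in hidden unit p
    contrib = share @ W2                         # accumulate over hidden units
    return 100.0 * contrib / contrib.sum(axis=0, keepdims=True)
```

By construction, the RI values of all input units sum to 100% for each output unit, which is what allows the per-modality totals to be compared directly.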

We compared different multi-modal fusion methods on the same training/test set. We re-implemented several baseline methods, for instance, CNN [34], CNN + BOW [91], and DCN [44]. The comparison results are shown in Table 5. Average accuracies with standard deviations over three independent runs are reported. The following observations can be made: (i) We re-implemented the CNN for audio combined with Bag-of-Words (BOW) text features via early fusion [91], the fully symmetric DCN architecture [44], and the three fusion strategies (feature concatenation, decision weighting, and hybrid fusion) in [21]; their results validate that modality fusion can improve the performance. (ii) The multi-modal DKNs obtain an average accuracy of 85.11% ± 0.30; when they are initialized into the DM2KMN, the average RE value is relatively small (0.820%), but the classification performance deteriorates to 83.03% ± 0.40 because of the feature information loss. (iii) When the initial DM2KMN is further updated in a supervised fashion, although the RE value becomes large (3.051%), the discrimination performance reaches 88.74% ± 0.54, because joint learning can further optimize toward a better solution. The proposed DM2KMN therefore clearly improves the performance. (iv) We also compare the average forward time of the multi-modal DKNs and the DM2KMN in Table 5. The forward time of the DM2KMN on the test set (12.27 s) is less than half that of the multi-modal DKNs (31.18 s), since the complexity of the DM2KMN is linear in the training size, whereas that of the multi-modal DKNs is quadratic, which validates the efficiency of the DM2KMN.

The relative importance values of the learned multi-modal DKN and its corresponding DM2KMN are shown in Figure 2. For the learned multi-modal DKN, the kernel maps from the audio and text modalities exert vastly different impacts, resulting in a large standard deviation of the RI values across the multi-modal kernels. Interestingly, although the polynomial kernel from the audio shows the worst performance in Table 3, it still has the highest impact on the output; a possible reason is that the polynomial kernel map from the audio has a larger feature size, which lets it dominate the other kernel maps. After joint learning of the proposed DM2KMN, the standard deviation of the RI values across kernel maps is reduced: the RI values of the audio kernels decrease, while the importance of the kernels from symbolic features is boosted, which is consistent with the observation that symbolic features are more discriminative. To verify the impact of the different modalities, we accumulate the RI values of the four elementary kernel maps of the audio features and of the semantic features, respectively. For the learned multi-modal DKN, the total importance values of the audio and text modalities are 52.56% and 47.44%, so the impact of the audio is slightly higher than that of the text. For the learned DM2KMN, however, the totals become 49.41% and 50.59%, so the text has slightly more influence than the audio.

Figure 3 shows the confusion matrices of piano genre classification for different methods: histogram intersection kernel of symbolic features, deep kernel network with symbolic features, multi-modal DKN, and DM2KMN. We observe the following: (i) For the SVM with a single histogram intersection kernel of symbolic features (top left), 21% of Romantic piano pieces are misclassified as Modern, which is in accordance with the fact that the boundary between Romantic and Modern is ambiguous and there are many overlaps. (ii) For the deep kernel network with symbolic features (top right), the performance is slightly improved, but 10% of Classical piano pieces are still misclassified as Baroque and 23% of Romantic piano pieces are misclassified as Modern. For Baroque and Modern, the classification accuracies improve. (iii) For the multi-modal deep kernel network (bottom left) and the proposed DM2KMN (bottom right), the average performance is boosted, especially for the Classical, Romantic and Modern piano pieces, which validates the effectiveness of the proposed method.

4.4. Results on the 4MuLA Dataset

In this study, we further employed the proposed deep multi-modal kernel map network on the public 4MuLA dataset [80], a multi-modal music genre database with five genres: Rock, Indie, Pop, Hip Hop, and Heavy Metal. It contains 5980 music tracks, with each track represented as a Mel spectrogram for the audio and lyric data. There are 1578 samples for Rock, 1491 samples for Indie, 1449 samples for Pop, 786 for Hip Hop, and 668 for Heavy Metal. As suggested in [90], we randomly partitioned the data into three subsets for training, validation, and testing data at a ratio of 8:1:1, where the validation set is used to select the best model and the performance is evaluated on the test set. The weights and learning rate were initialized with the same settings as those used on the multi-modal piano genre dataset. We also respectively applied a frozen pre-trained ResNet-101 for the Mel spectrogram image from the audio and a frozen pre-trained RoBERTa model for the lyric data to extract the features. For the proposed DM2KMN method, we adopted the same network architecture used in the multi-modal piano genre dataset experiments, and all the initialization of learning hyperparameters remains unchanged. The only difference is that

| S |

is set to 4784, corresponding to the total number of training samples in the 4MuLA dataset.
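The 8:1:1 partition and the resulting training size can be reproduced with a simple random permutation. The sketch below is illustrative; the fixed `seed` is a hypothetical choice for repeatability (the experiments in the paper draw seeds from the system runtime), and it does not stratify by genre.

```python
import numpy as np

def split_indices(n_samples, ratios=(0.8, 0.1, 0.1), seed=0):
    """Randomly partition sample indices into train/validation/test subsets."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_samples)
    n_train = int(round(ratios[0] * n_samples))
    n_val = int(round(ratios[1] * n_samples))
    return perm[:n_train], perm[n_train:n_train + n_val], perm[n_train + n_val:]
```

With 5980 tracks, `split_indices(5980)` yields 4784 training, 598 validation, and 598 test indices, matching the training size used for |S|.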

The results of different methods on the same test set in the 4MuLA dataset are shown in Table 6, where the average accuracies with the standard deviations are reported. The following observations can be made: (i) The performance of the audio MFCCs alone is poor due to variance in the dataset. (ii) The performance of the RoBERTa model on the lyrics is much better than that of the audio MFCCs because of the better representation capability of the semantic features. (iii) The simple feature concatenation of MFCCs from the audio and semantic features from the lyrics largely improves the discrimination performance, and hybrid fusion boosts it further. (iv) In comparison with other deep fusion methods, the proposed DM2KMN obtains further gains because it considers several different kernel maps for the MFCCs and semantic features, rather than only the two kernel maps corresponding to linear kernels, and because it learns better fusion patterns between the audio and text features of the music, leading to significant improvements in classification.

The confusion matrices of different methods are shown in Figure 4. It can be observed from the audio MFCC results (top left) that 23% of the Pop, 47% of the Heavy Metal, 25% of the Indie and 14% of the Hip Hop samples are misclassified as Rock, consistent with the common view that rock musical elements widely permeate other genres. Meanwhile, 28% of the Rock, 32% of the Pop and 31% of the Hip Hop samples are wrongly categorized as Indie, as independent music (Indie) incorporates independent rock and independent pop subgenres with highly similar stylistic traits. The RoBERTa model leveraging lyric data achieves higher classification accuracy owing to abundant semantic information. By using the audio and lyric data (bottom left), the average accuracy is significantly boosted because of the complementarity between audio and lyric features. The proposed DM2KMN further elevates classification performance across all genres, verifying its efficacy in music genre recognition.

5. Conclusions

In this work, we propose a deep multi-modal kernel map network (DM2KMN) for music genre classification that uses both audio and text information. The proposed method facilitates deep nonlinear fusion in a high-dimensional kernel feature space between both modalities. The MFCCs are calculated from the audio file and a pre-trained ResNet is used to extract the features from the image of MFCCs, and a pre-trained RoBERTa model is employed to extract the symbolic features from the texts. The kernel maps of different elementary kernels from both modalities are first approximated and then fed forward in a deep multi-modal kernel map network. The network parameters are learned in an end-to-end supervised fashion, leading to deep full fusion between both modalities. The proposed method is evaluated on the GTZAN dataset, a multi-modal piano genre dataset, and the 4MuLA dataset, and the results demonstrate the effectiveness of the proposed deep multi-modal kernel map network for music genre classification.

However, the proposed model has the following limitations: (1) the pre-trained models are adopted for feature extraction of audio and text modalities, and they are not optimized in a joint learning framework in this work, so the performance can be further enhanced by jointly fine-tuning these pre-trained models together with the DM2KMN framework; (2) due to non-alignment between audio and text information, the proposed method merely conducts naive feature fusion across different modalities, which inevitably results in inconsistent interactions between the modalities, so how to automatically learn the alignment between the audio and the lyrics/scores for better fusion is a concern for future work; (3) since most kernels lack explicit exact mapping functions, how to obtain high-precision approximated kernel maps is therefore essential for the DM2KMN. The approximation approach in this work imposes substantial computational overhead. Hence, future research shall also focus on developing more efficient kernel approximation schemes.

Author Contributions

Conceptualization, Q.W. and M.J.; data curation, Q.W.; methodology, M.J.; validation, Q.W. and M.J.; writing – original draft preparation, Q.W.; writing – review and editing, M.J.; project administration and funding acquisition, M.J. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by grants from the National Natural Science Foundation of China (No. 61806180).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset will be made publicly available upon publication.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Appendix A

Appendix A.1. Kernel Map Calculation

In this section, we demonstrate how to calculate the kernel maps for several elementary kernels.

Linear kernel map. The linear kernel is calculated as

κ^{(1)} (x, x^{'}) = 〈 x, x^{'} 〉

, and it is clear to see that the kernel map function is the features themselves

ϕ (x) = x

.

Polynomial kernel map. Polynomial kernel with

(n + 1)

-order is calculated as

κ^{(1)} (x, x^{'}) = {〈 x, x^{'} 〉}^{n + 1}

. Similarly to the linear kernel, it also has an exact kernel map

ϕ (x) = x \otimes^{n} x

, where

\otimes^{n}

denotes the Kronecker tensor product applied n times.
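The exactness of the polynomial kernel map is easy to verify numerically. The sketch below builds the map as a repeated Kronecker product; `order` corresponds to n + 1 in the notation above, and the function name is hypothetical.

```python
import numpy as np
from functools import reduce

def poly_kernel_map(x, order):
    """Explicit map for the kernel <x, x'>**order: the Kronecker product of
    `order` copies of x (i.e., order - 1 Kronecker products)."""
    return reduce(np.kron, [x] * order)
```

For a second-order kernel on s-dimensional features, the map has s² entries, which illustrates why the polynomial kernel map grows quickly with the feature dimension.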

The histogram intersection (HI) kernel map. The histogram intersection kernel is calculated as

κ^{(1)} (x, x^{'}) = \sum_{d = 1}^{s} \min (x^{d}, x^{' d})

, where

x^{d}

is the

d^{th}

dimension of

x

. Considering

x = {(x^{1}, \dots, x^{s})}^{⊤} \in X

, we map each dimension

x^{d}

as a vector

ψ (x^{d})

via quantization [76,92]:

ψ (x^{d}) = 2^{0} + 2^{1} + \dots + 2^{k (x^{d})},

(A1)

where

k (x^{d}) = ⌊Q \frac{x^{d} - ℓ_{d}}{u_{d} - ℓ_{d}}⌋

and

⌊ a ⌋

denotes the largest integer not greater than

a \in R

, and

ℓ_{d}

and

u_{d}

are, respectively, the minimum and maximum values for the

d^{th}

dimension in the training dataset.

Q \in N^{+}

is a predefined maximum quantization level. Then

ψ (x^{d})

is written as a vector of length Q via “decimal-to-unary” mapping, with the first

k (x^{d})

values set to 1 and the rest set to 0. According to [92], the HI kernel map can be approximated as:

{\hat{ϕ}}_{p}^{(1)} (x) = {(ψ {(x^{1})}^{⊤} \sqrt{\frac{u_{1} - ℓ_{1}}{Q}}, \sqrt{ℓ_{1}}, \dots, ψ {(x^{s})}^{⊤} \sqrt{\frac{u_{s} - ℓ_{s}}{Q}}, \sqrt{ℓ_{s}})}^{⊤} .

(A2)

With a sufficiently large Q, the approximate kernel error is bounded by

\frac{1}{Q} \sum_{d = 1}^{s} (x^{d} - ℓ_{d})

. The proof is given in Proposition 1 of [76].
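The quantised HI kernel map of Equations (A1) and (A2) can be sketched as follows. The sketch assumes non-negative features (so that the square root of the per-dimension minimum is defined) and a strictly positive range in every dimension; inputs outside the training range are clipped, which is an added safeguard. All names are hypothetical.

```python
import numpy as np

def hi_kernel_map(X, lo, hi, Q=10):
    """Approximate histogram-intersection kernel map.
    X: (N, s) non-negative features; lo, hi: (s,) per-dimension training
    minima and maxima. Returns an array of shape (N, s * (Q + 1))."""
    X = np.clip(X, lo, hi)
    k = np.floor(Q * (X - lo) / (hi - lo)).astype(int)      # levels in 0..Q
    unary = np.arange(Q)[None, None, :] < k[:, :, None]     # first k entries are 1
    scaled = unary * np.sqrt((hi - lo) / Q)[None, :, None]  # (N, s, Q)
    offset = np.broadcast_to(np.sqrt(lo)[None, :, None], X.shape + (1,))
    return np.concatenate([scaled, offset], axis=2).reshape(X.shape[0], -1)
```

The inner product of two mapped vectors equals, per dimension, the smaller quantisation level times the step size plus the minimum value, so it never exceeds the exact intersection and falls short of it by less than one quantisation step in each dimension.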

RBF kernel map. The RBF kernel is calculated as

κ^{(1)} (x, x^{'}) = \exp (- \frac{\| x - x^{'} \|^{2}}{2 σ^{2}})

, where

σ

is the standard deviation. Since the explicit map of the RBF kernel is infinite-dimensional, we instead obtain its approximate kernel map through the kernel eigendecomposition defined by Equation (3).

Appendix A.2. Details of Multi-Modal Piano Genre Dataset

Table A1. Details of the piano works and composers across different music genres.

Period	Piano Works	Number
Baroque	—Well-Tempered Clavier I and II by Bach (48 preludes and 48 fugues)	243
	—15 Two-part Inventions by Bach
	—Other pieces by Bach including Partitas, French Suites,
	English Suites, and Toccatas
	—59 Sonatas by Scarlatti
	—Various pieces by Buxtehude, Couperin, Handel, and Rameau
Classical	—32 Sonatas by Beethoven (102 movements)	236
	—Other pieces by Beethoven such as Rondos and Variations
	—6 Sonatas by Clementi (17 movements)
	—Sonatas by Haydn (35 movements)
	—17 Sonatas by Mozart (51 movements)
	—Other pieces by Mozart including Variations, Fantasies, Rondos,
	Polonaises, and Sonatinas
	—Sonatas and Rondos by Carl Philipp Emanuel Bach
Romantic	—Several Etudes, Esquisses, Recueil de Chants, Minuets,	258
	Improvisations, and Andantes Romantiques by Alkan
	—52 Mazurkas by Chopin
	—24 Preludes by Chopin
	—Other pieces by Chopin including Waltzes, Ballades, Etudes,
	Nocturnes, and Scherzos
	—12 Transcendental Etudes by Liszt
	—3 Etudes de concert by Liszt
	—6 Grandes Etudes de Paganini by Liszt
	—Other pieces by Liszt such as Consolations, Hungarian Rhapsodies, Liebestraume
	—Individual pieces from Brahms, Field, Bizet, Glinka, Grieg,
	MacDowell, Mendelssohn, Mussorgsky, Saint-Saëns, Johann Strauss II, Schubert,
	Schumann, and Tchaikovsky
Modern	—Pieces by Debussy including Preludes, Etudes, Estampes, Arabesques, and Images	248
	—24 Preludes and 3 Fantastic Dances by Shostakovich
	—Pieces by Prokofiev including Visions Fugitives, 10 Pieces for Piano, Etudes,
	4 Pieces for Piano, Sonatas, and Sarcasms
	—Pieces by Satie including Gnossiennes, Nocturnes, Preludes and Valses
	—Pieces by Rachmaninoff including Preludes, Sonatas, Moments Musicaux,
	Etudes-Tableaux, and Romances
	—Individual pieces from Bartok, Gershwin, Joplin, Khachaturian, Messiaen,
	Poulenc, Ravel, Scriabin, Schoenberg, and Turpin

References

1. Kaminskas, M.; Ricci, F. Contextual music information retrieval and recommendation: State of the art and challenges. Comput. Sci. Rev. 2012, 6, 89–119.
2. Sturm, B.L. The State of the Art Ten Years After a State of the Art: Future Research in Music Information Retrieval. J. New Music. Res. 2014, 43, 147–172.
3. Alvarez, D.A.P.; Gelbukh, A.; Sidorov, G. Composer classification using melodic combinatorial n-grams. Expert Syst. Appl. 2024, 249, 123300.
4. Grout, D.J.; Palisca, C.V. A History of Western Music; W W Norton & Co. Inc.: New York, NY, USA, 1996.
5. Sturm, B.L. A Survey of Evaluation in Music Genre Recognition. In Proceedings of the International Workshop on Adaptive Multimedia Retrieval, Copenhagen, Denmark, 24–25 October 2012; pp. 29–66.
6. Weiss, C.; Mauch, M.; Dixon, S. Timbre-Invariant Audio Features for Style Analysis of Classical Music. In Proceedings of the Joint Conference 40th ICMC and 11th SMC, Athens, Greece, 14–20 September 2014; pp. 1461–1468.
7. Weiss, C.; Müller, M. Tonal complexity features for style classification of classical music. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); IEEE: Piscataway, NJ, USA, 2015; pp. 688–692.
8. Weiss, C.; Mauch, M.; Dixon, S.; Müller, M. Investigating style evolution of western classical music: A computational approach. Music. Sci. 2019, 23, 486–507.
9. Lee, J.H.; Downie, J.S. Survey of Music Information Needs, Uses, and Seeking Behaviours: Preliminary Findings. In Proceedings of the ISMIR, Barcelona, Spain, 10–15 October 2004; pp. 441–446.
10. Galajda, J.E.; Hua, K. Automated Thematic Composer Classification Using Segment Retrieval. In Proceedings of the 2024 IEEE 7th International Conference on Multimedia Information Processing and Retrieval (MIPR); IEEE: Piscataway, NJ, USA, 2024; pp. 162–168.
11. Deepaisarn, S.; Chokphantavee, S.; Chokphantavee, S.; Prathipasen, P.; Buaruk, S.; Sornlertlamvanich, V. NLP-based music processing for composer classification. Sci. Rep. 2023, 13, 13228.
12. Corrêa, D.C.; Rodrigues, F.A. A survey on symbolic data-based music genre classification. Expert Syst. Appl. 2016, 60, 190–210.
13. Fu, Z.; Lu, G.; Ting, K.M.; Zhang, D. A Survey of Audio-Based Music Classification and Annotation. IEEE Trans. Multimed. 2011, 13, 303–319.
14. Li, R.; Wu, Z.; Ning, Y.; Sun, L.; Meng, H.; Cai, L. Spectro-temporal modelling with time-frequency LSTM and structured output layer for voice conversion. In Proceedings of the INTERSPEECH, Stockholm, Sweden, 20–24 August 2017; pp. 3409–3413.
15. Gemmeke, J.F.; Ellis, D.P.W.; Freedman, D.; Jansen, A.; Lawrence, W.; Moore, R.C.; Plakal, M.; Ritter, M. Audio Set: An ontology and human-labeled dataset for audio events. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); IEEE: Piscataway, NJ, USA, 2017; pp. 776–780.
16. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008.
17. Hernandez-Olivan, C.; Beltrán, J.R. Music Composition with Deep Learning: A Review. In Advances in Speech and Music Technology: Computational Aspects and Applications; Springer International Publishing: Berlin/Heidelberg, Germany, 2023; pp. 25–50.
18. Zhang, Y.; Li, T. Music genre classification with parallel convolutional neural networks and capuchin search algorithm. Sci. Rep. 2025, 15, 9580.
19. Tang, Q.; Liang, J.; Zhu, F. A comparative review on multi-modal sensors fusion based on deep learning. Signal Process. 2023, 213, 109165.
20. Oramas, S.; Barbieri, F.; Nieto, O.; Serra, X. Multimodal deep learning for music genre classification. Trans. Int. Soc. Music. Inf. Retr. 2018, 1, 4–21.
21. Li, Y.; Zhang, Z.; Ding, H.; Liang, C. Music genre classification based on fusing audio and lyric information. Multimed. Tools Appl. 2023, 82, 20157–20176.
22. Oguike, O.E.; Primus, M. Multimodal Music Genre Classification of Sotho-Tswana Musical Videos. IEEE Access 2025, 13, 28799–28808.
23. Shawe-Taylor, J.; Cristianini, N. Kernel Methods for Pattern Analysis; Cambridge University Press: Cambridge, UK, 2004.
24. Chang, C.C.; Lin, C.J. LIBSVM: A library for support vector machines. ACM Trans. Intell. Syst. Technol. 2011, 2, 27.
25. Jiu, M.; Sahbi, H. Nonlinear Deep Kernel Learning for Image Annotation. IEEE Trans. Image Process. 2017, 26, 1820–1832.
26. Vapnik, V. Statistical Learning Theory; Wiley: New York, NY, USA, 1998.
27. Jiu, M.; Sahbi, H. Deep kernel map networks for image annotation. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); IEEE: Piscataway, NJ, USA, 2016; pp. 1571–1575.
28. Jiu, M.; Sahbi, H. End-to-End Deep Kernel Map Design for Image Annotation. In Proceedings of the 2020 IEEE International Conference on Image Processing (ICIP); IEEE: Piscataway, NJ, USA, 2020; pp. 1546–1550.
29. Wang, P.; Qiu, C.; Wang, J.; Wang, Y.; Tang, J.; Huang, B.; Su, J.; Zhang, Y. Multimodal Data Fusion Using Non-Sparse Multi-Kernel Learning With Regularized Label Softening. IEEE J. Sel. Top. Appl. Earth Obs. Remote. Sens. 2021, 14, 6244–6252.
30. Liu, J.; Wang, Z.; Wan, G.; Liu, J. A Novel Multi-modal Sentiment Analysis Based on Multiple Kernel Learning with Margin-Dimension Constraint. Int. J. Comput. Intell. Syst. 2024, 17, 207.
31. Abdul, Z.K.; Al-Talabani, A.K. Mel Frequency Cepstral Coefficient and its Applications: A Review. IEEE Access 2022, 10, 122136–122158.
32. Singh, Y.; Biswas, A. Robustness of musical features on deep learning models for music genre classification. Expert Syst. Appl. 2022, 199, 116879.
33. Prabhakar, S.K.; Lee, S.W. Holistic Approaches to Music Genre Classification using Efficient Transfer and Deep Learning Techniques. Expert Syst. Appl. 2023, 211, 118636.
34. Ahmed, M.; Rozario, U.; Kabir, M.M.; Aung, Z.; Shin, J.; Mridha, M.F. Musical Genre Classification Using Advanced Audio Analysis and Deep Learning Techniques. IEEE Open J. Comput. Soc. 2024, 5, 457–467.
35. Ba, T.C.; Le, T.D.T.; Van, L.T. Music genre classification using deep neural networks and data augmentation. Entertain. Comput. 2025, 53, 100929.
36. Singh, Y.; Biswas, A. Lightweight convolutional neural network architecture design for music genre classification using evolutionary stochastic hyperparameter selection. Expert Syst. 2023, 40, e13241.
37. Yu, Y.; Luo, S.; Liu, S.; Qiao, H.; Liu, Y.; Feng, L. Deep attention based music genre classification. Neurocomputing 2020, 372, 84–91.
38. Chen, P.; Zhang, J.; Mashhadi, A. Advanced music classification using a combination of capsule neural network by upgraded ideal gas molecular movement algorithm. Sci. Rep. 2024, 14, 30863.
39. Lev, F.; Groult, R.; Arnaud, G.; Cyril, S.; Picardie, D.; Verne, J.; Giraud, M. Rhythm Extraction from Polyphonic Symbolic Music. In Proceedings of the ISMIR, Miami, FL, USA, 24–28 October 2011; pp. 375–380.
40. Tzanetakis, G.; Ermolinskyi, A.; Cook, P. Pitch Histograms in Audio and Symbolic Music Information Retrieval. J. New Music. Res. 2003, 32, 143–152.
41. Ferkova, E.; Ždimal, M.; Šidlik, P. Chordal Evaluation in MIDI-Based Harmonic Analysis: Mozart, Schubert, and Brahms. Comput. Musicol. 2007, 15, 172–186.
42. Müllensiefen, D.; Frieler, K. Evaluating Different Approaches to Measuring the Similarity of Melodies. In Data Science and Classification; Springer: Berlin/Heidelberg, Germany, 2006; pp. 299–306.
43. Oramas, S.; Nieto, O.; Barbieri, F.; Serra, X. Multi-label Music Genre Classification from Audio, Text, and Images Using Deep Features. In Proceedings of the 18th International Society for Music Information Retrieval Conference, Suzhou, China, 23–27 October 2017; pp. 1–8.
44. Wadhwa, L.; Mukherjee, P. Music genre classification using multi-modal deep learning based fusion. In Proceedings of the 2021 Grace Hopper Celebration India (GHCI), Virtually, 19 February–12 March 2021; pp. 1–5.
45. Vatolkin, I.; McKay, C. Multi-objective investigation of six feature source types for multi-modal music classification. Trans. Int. Soc. Music. Inf. Retr. 2022, 5, 1–19.
46. Geiger, J.; Moscati, M.; Nawaz, S.; Schedl, M. Music4All A+A: A Multimodal Dataset for Music Information Retrieval Tasks. In Proceedings of the 2025 International Conference on Content-Based Multimedia Indexing (CBMI), Dublin, Ireland, 22–24 October 2025.
47. Christodoulou, A.M.; Lartillot, O.; Jensenius, A.R. Multimodal music datasets? Challenges and future goals in music processing. Int. J. Multimed. Inf. Retr. 2024, 13, 37.
48. Wang, T.; Li, F.; Zhu, L.; Li, J.; Zhang, Z.; Shen, H.T. Cross-Modal Retrieval: A Systematic Review of Methods and Future Directions. Proc. IEEE 2024, 112, 1716–1754.
49. Wang, J.; He, Y.; Kang, C.; Xiang, S.; Pan, P.C. Image-Text Cross-Modal Retrieval via Modality-Specific Feature Learning. In Proceedings of the 5th ACM on International Conference on Multimedia Retrieval (ICMR 2015), Shanghai, China, 23–26 June 2015; pp. 347–354.
50. Liu, Y.; Guo, Y.; Bakker, E.M.; Lew, M.S. Learning a Recurrent Residual Fusion Network for Multimodal Matching. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV); IEEE: Piscataway, NJ, USA, 2017; pp. 4127–4136.
51. Wehrmann, J.; Barros, R.C. Bidirectional Retrieval Made Simple. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2018; pp. 7718–7726.
52. Zhang, Q.; Lei, Z.; Zhang, Z.; Li, S.Z. Context-Aware Attention Network for Image-Text Retrieval. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: Piscataway, NJ, USA, 2020; pp. 3533–3542.
53. Wang, H.; He, D.; Wu, W.; Xia, B.; Yang, M.; Li, F.; Yu, Y.; Ji, Z.; Ding, E.; Wang, J. CODER: Coupled Diversity-Sensitive Momentum Contrastive Learning for Image-Text Retrieval. In Proceedings of the ECCV 2022, Tel Aviv, Israel, 23–27 October 2022; pp. 700–716.
54. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; Volume 1.
55. Xie, C.W.; Wu, J.; Zheng, Y.; Pan, P.; Hua, X.S. Token Embeddings Alignment for Cross-Modal Retrieval. In Proceedings of the 30th ACM International Conference on Multimedia, Lisbon, Portugal, 10–14 October 2022; pp. 4555–4563.
56. Fu, Z.; Zhang, L.; Xia, H.; Mao, Z. Linguistic-Aware Patch Slimming Framework for Fine-Grained Cross-Modal Alignment. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: Piscataway, NJ, USA, 2024; pp. 26297–26306.
57. Wang, X.; Li, L.; Li, Z.; Wang, X.; Zhu, X.; Wang, C.; Huang, J.; Xiao, Y. AGREE: Aligning Cross-Modal Entities for Image-Text Retrieval Upon Vision-Language Pre-trained Models. In Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining, Singapore, 27 February–3 March 2023; pp. 456–464.
58. Chechik, G.; Ie, E.; Rehn, M.; Bengio, S.; Lyon, D. Large-scale content-based audio retrieval from text queries. In Proceedings of the 1st ACM International Conference on Multimedia Information Retrieval, MIR ’08, Vancouver, BC, Canada, 30–31 October 2008; pp. 105–112.
59. Lou, S.; Xu, X.; Wu, M.; Yu, K. Audio-Text Retrieval in Context. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); IEEE: Piscataway, NJ, USA, 2022; pp. 4793–4797.
60. Mei, X.; Liu, X.; Sun, J.; Plumbley, M.D.; Wang, W. On Metric Learning for Audio-Text Cross-Modal Retrieval. arXiv 2022, arXiv:2203.15537.
61. Jégou, H.; Douze, M.; Schmid, C.; Pérez, P. Aggregating local descriptors into a compact image representation. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2010; pp. 3304–3311.
62. Lanckriet, G.; Cristianini, N.; Bartlett, P.; Ghaoui, L.E.; Jordan, M.I. Learning the Kernel Matrix with Semi-Definite Programming. J. Mach. Learn. Res. 2004, 5, 27–72.
63. Bach, F.; Lanckriet, G.; Jordan, M. Multiple kernel learning, conic duality, and the SMO algorithm. In Proceedings of the International Conference on Machine Learning (ICML), Banff, AB, Canada, 4–8 July 2004; pp. 1–6.
64. Rakotomamonjy, A.; Bach, F.R.; Canu, S.; Grandvalet, Y. SimpleMKL. J. Mach. Learn. Res. 2008, 9, 2491–2521.
65. Bach, F. Exploring large feature spaces with hierarchical multiple kernel learning. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), Vancouver, BC, Canada, 7–10 December 2009; pp. 1–9.
66. Cortes, C.; Mohri, M.; Rostamizadeh, A. Learning non-linear combinations of kernels. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), Vancouver, BC, Canada, 7–10 December 2009; pp. 1–9.
67. Cho, Y.; Saul, L. Kernel methods for deep learning. Proc. Adv. Neural Inf. Process. Syst. 2009, 28, 342–350.
68. Zhuang, J.; Tsang, I.; Hoi, S. Two-layer multiple kernel learning. In Proceedings of the International Conference on Machine Learning (ICML), Bellevue, WA, USA, 28 June–2 July 2011; pp. 909–917.
69. Jiu, M.; Sahbi, H. Semi supervised deep kernel design for image annotation. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); IEEE: Piscataway, NJ, USA, 2015; pp. 1156–1160.
70. Williams, C.; Seeger, M. Using the Nyström method to speed up kernel machines. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), Vancouver, BC, Canada, 4–6 December 2001; pp. 682–688.
71. Rahimi, A.; Recht, B. Random Features for Large-Scale Kernel Machines. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), Vancouver, BC, Canada, 3–6 December 2007; Volume 20, pp. 1177–1184.
72. Li, F.; Ionescu, C.; Sminchisescu, C. Random Fourier Approximations for Skewed Multiplicative Histogram Kernels. In Proceedings of the DAGM Conference Pattern Recognition, Darmstadt, Germany, 22–24 September 2010; pp. 262–271.
73. Mairal, J.; Koniusz, P.; Harchaoui, Z.; Schmid, C. Convolutional Kernel Networks. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), Montreal, QC, Canada, 8–13 December 2014; pp. 2627–2635.
74. Mairal, J. End-to-End Kernel Learning with Supervised Convolutional Kernel Networks. In Proceedings of the Advances in Neural Information Processing Systems 29, Barcelona, Spain, 5–10 December 2016; pp. 1399–1407.
75. Mehrkanoon, S.; Suykens, J.A.K. Deep hybrid neural-kernel networks using random Fourier features. Neurocomputing 2018, 298, 46–54.
76. Jiu, M.; Sahbi, H. Deep representation design from deep kernel networks. Pattern Recognit. 2019, 88, 447–457.
77. Fan, R.E.; Chang, K.W.; Hsieh, C.J.; Wang, X.R.; Lin, C.J. LIBLINEAR: A Library for Large Linear Classification. J. Mach. Learn. Res. 2008, 9, 1871–1874.
78. Cook, P.; Tzanetakis, G. Musical genre classification of audio signals. IEEE Trans. Speech Audio Proc. Publ. IEEE Signal Process. Soc. 2002, 10, 293–302.
79. Marchand, U.; Peeters, G. The Extended Ballroom Dataset. In Proceedings of the ISMIR 2016 Late-Breaking Session, New York, NY, USA, 7–11 August 2016.
80. da Silva, A.C.M.; Silva, D.F.; Marcacini, R.M. 4MuLA: A Multitask, Multimodal, and Multilingual Dataset of Music Lyrics and Audio Features. In Proceedings of the Brazilian Symposium on Multimedia and the Web, Virtually, 30 November–4 December 2020; pp. 145–148.
81. Ndou, N.; Ajoodha, R.; Jadhav, A. Music Genre Classification: A Review of Deep-Learning and Traditional Machine-Learning Approaches. In Proceedings of the 2021 IEEE International IOT, Electronics and Mechatronics Conference; IEEE: Piscataway, NJ, USA, 2021; pp. 1–6.
82. Zhang, C.; Evangelopoulos, G.; Voinea, S.; Rosasco, L.; Poggio, T. A deep representation for invariance and music classification. In Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); IEEE: Piscataway, NJ, USA, 2014; pp. 6984–6988.
83. Karunakaran, N.; Arya, A. A Scalable Hybrid Classifier for Music Genre Classification using Machine Learning Concepts and Spark. In Proceedings of the 2018 International Conference on Intelligent Autonomous Systems (ICoIAS), Singapore, 1–3 March 2018; pp. 128–135.
84. Sigtia, S.; Dixon, S. Improved music feature learning with deep neural networks. In Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); IEEE: Piscataway, NJ, USA, 2014; pp. 6959–6963.
85. Zhang, P.; Zheng, X.; Zhang, W.; Li, S.; Qian, S.; He, W.H.; Zhang, S.; Wang, Z. A deep neural network for modeling music. In Proceedings of the 5th ACM on International Conference on Multimedia Retrieval (ICMR), Shanghai, China, 23–26 June 2015; pp. 379–386.
86. Zhang, W.; Lei, W.; Xu, X.; Xing, X. Improved Music Genre Classification with Convolutional Neural Networks. In Proceedings of the Interspeech 2016, San Francisco, CA, USA, 8–12 September 2016; pp. 3304–3308.
87. Ashraf, M.; Abid, F.; Din, I.U.; Rasheed, J.; Yesiltepe, M.; Yeo, S.F.; Ersoy, M.T. A Hybrid CNN and RNN Variant Model for Music Classification. Appl. Sci. 2023, 13, 1476.
88. Li, T. Optimizing the configuration of deep learning models for music genre classification. Heliyon 2024, 10, e24892.
89. Cheng, Y.H.; Kuo, C.N. Machine Learning for Music Genre Classification Using Visual Mel Spectrum. Mathematics 2022, 10, 4427.
90. Jandoubi, B.; Akhloufi, M.A. Deep Multimodal Classification of Musical Genres. In Proceedings of the SoutheastCon 2025, Concord, NC, USA, 27–30 March 2025; pp. 384–389.
91. Kamtue, K.; Euchukanonchai, K.; Wanvarie, D.; Pratanwanich, N. Lukthung classification using neural networks on lyrics and audios. In Proceedings of the 2019 23rd International Computer Science and Engineering Conference (ICSEC), Phuket, Thailand, 30 October–1 November 2019; pp. 269–274.
92. Sahbi, H. ImageCLEF annotation with explicit context-aware kernel maps. Int. J. Multimed. Inf. Retr. 2015, 4, 113–128.

Figure 1. An example of a three-layer multi-modal kernel map network. The top-left area corresponds to the audio modality, whose input is the spectrogram image of the music; the bottom-left area shows the Kern files for the music, where “*” is the indicator symbol and “**kern” is a spine declaration containing Kern-format musical data. Boxes with the same color have the same function. “Lin” and “HI” stand for the linear kernel map and the histogram intersection kernel map, respectively. Each dashed blue rectangle denotes a sub-module for a nonlinear kernel map combination of multi-modal signals. On the right of the network, a fully connected logistic regression layer is used for classification.

Figure 2. Comparison of the relative importance (in %) of different kernel maps for audio and score features in the learned multi-modal DKN and DM2KMN. “A” and “S” denote audio and symbolic, respectively. “lin”, “poly”, “RBF”, and “HI” stand for the linear, polynomial, RBF, and histogram intersection kernel maps, respectively. The x-axis labels take the form “Modal-Kernel”, meaning a specific kernel map from one modality; for instance, “A-poly” denotes a polynomial kernel map over the MFCC features of the audio file, and “S-RBF” denotes an RBF kernel map over the symbolic features from the corresponding Humdrum file.

Figure 3. Confusion matrices (in %) of piano genre classification in one independent run: single histogram intersection kernel (top left), deep kernel network with symbolic features (top right), multi-modal deep kernel network (bottom left), and the proposed DM2KMN with the fully supervised learning algorithm (bottom right). “B”, “C”, “R”, and “M” stand for the Baroque, Classical, Romantic, and Modern periods, respectively. Each cell shows the classification percentage; the darker the cell, the larger the percentage.

Figure 4. Confusion matrices (in %) of genre classification on the 4MuLA dataset: ResNet with audio MFCCs (top left), RoBERTa with lyric data (top right), MFCCs + RoBERTa with feature concatenation (bottom left), and the proposed DM2KMN with the fully supervised learning algorithm (bottom right). “R”, “P”, “M”, “I”, and “H” stand for Rock, Pop, Heavy Metal, Indie, and Hip Hop, respectively. Each cell shows the classification percentage; the darker the cell, the larger the percentage.
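
Confusion matrices such as those in Figures 3 and 4 report percentages rather than raw counts. The following is a minimal sketch of one common way to obtain them, assuming each row (true genre) is normalized to sum to 100. The captions do not state the normalization, so this is an assumption, and the function name `confusion_matrix_percent` and the toy labels are hypothetical.

```python
import numpy as np

def confusion_matrix_percent(y_true, y_pred, n_classes):
    """Row-normalized confusion matrix in percent (rows: true class)."""
    counts = np.zeros((n_classes, n_classes), dtype=float)
    # Accumulate counts; np.add.at handles repeated index pairs correctly.
    np.add.at(counts, (np.asarray(y_true), np.asarray(y_pred)), 1.0)
    row_sums = counts.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0  # avoid division by zero for absent classes
    return 100.0 * counts / row_sums

# Toy example with 4 classes (e.g., B, C, R, M); the labels are made up.
y_true = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred = [0, 1, 1, 1, 2, 3, 3, 3]
cm = confusion_matrix_percent(y_true, y_pred, n_classes=4)
```

With row normalization, the diagonal entry of each row is the per-class recall in percent, which makes classes of different sizes directly comparable.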

Table 1. The classification accuracy of different methods on the GTZAN dataset (in %). For the DKN and the proposed DM2KMN, average accuracies with the standard deviations over three independent runs are reported.

Reference	Method	Accuracy (%)
Ndou et al. [81]	SVM	79.7
Zhang et al. [82]	Multilayer representation + STFT	82.0
Karunakaran et al. [83]	Hybrid classifier on Spark	82.4
Sigtia and Dixon [84]	RELU + SGD + Dropout + FFT	83.0
Zhang et al. [85]	KCNN ($k = 5$) + SVM	83.9
Zhang et al. [86]	nnet2 + STFT	87.4
Yu et al. [37]	BRNN + PCNNA with STFT	90.0
Ashraf et al. [87]	Bi-LSTM with Bi-GRU	89.3
Prabhakar et al. [33]	BAG	93.5
Ahmed et al. [34]	CNN	92.7
Li [88]	CNN with MFCC + STFT	95.2
Cheng and Kuo [89]	VMS-YOLO	95.4
Zhang and Li [18]	PCNN + CapSA	96.1
Jiu and Sahbi [25]	3-layer DKN with MFCCs	93.3 ± 0.8
Ours	3-layer DM2KMN with MFCCs	94.0 ± 0.6
Jiu and Sahbi [25]	3-layer DKN with VGGish	85.7 ± 0.7
Ours	3-layer DM2KMN with VGGish	86.5 ± 0.9
Jiu and Sahbi [25]	3-layer DKN with MFCCs + VGGish	95.6 ± 0.3
Ours	3-layer DM2KMN with MFCCs + VGGish	96.0 ± 0.5

Table 2. The number of samples of each genre in the training and test sets of the multi-modal piano genre dataset.

Subset	Baroque	Classical	Romantic	Modern
Training set	159	135	153	153
Test set	84	101	105	95

Table 3. Baseline performance of elementary kernels (in %).

Modality	Linear	Polynomial	RBF	HI
Audio-MFCC	70.13	68.83	69.87	71.69
Audio-VGGish	61.56	60.26	63.38	66.49
Symbolic features	77.14	77.14	78.96	82.60
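
The four elementary kernels compared in Table 3 can be written compactly. The following is a minimal NumPy sketch and not the authors' implementation; the function names and the hyperparameters `degree`, `coef0`, and `gamma` are assumptions chosen for illustration, and the toy feature vectors are made up.

```python
import numpy as np

def linear_kernel(X, Y):
    # k(x, y) = <x, y>
    return X @ Y.T

def polynomial_kernel(X, Y, degree=2, coef0=1.0):
    # k(x, y) = (<x, y> + coef0) ** degree  (degree and coef0 are assumed values)
    return (X @ Y.T + coef0) ** degree

def rbf_kernel(X, Y, gamma=1.0):
    # k(x, y) = exp(-gamma * ||x - y||^2)  (gamma is an assumed value)
    sq_dist = (X ** 2).sum(axis=1)[:, None] + (Y ** 2).sum(axis=1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))  # clip tiny negative round-off

def histogram_intersection_kernel(X, Y):
    # k(x, y) = sum_i min(x_i, y_i); intended for non-negative (histogram-like) features
    return np.minimum(X[:, None, :], Y[None, :, :]).sum(axis=2)

# Two toy 2-bin histograms; values are illustrative only.
X = np.array([[0.2, 0.8], [0.5, 0.5]])
K_lin = linear_kernel(X, X)
K_poly = polynomial_kernel(X, X)
K_rbf = rbf_kernel(X, X)
K_hi = histogram_intersection_kernel(X, X)
```

Each function returns an n × m Gram matrix, which could then be passed to a kernel classifier such as an SVM with a precomputed kernel.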

Table 4. Performance of the 3-layer multi-modal DKN with different numbers of hidden units.

Number of Hidden Units	8	16	24	32
Accuracy (in %)	84.42	84.94	84.68	84.68

Table 5. Performance comparison between different methods on the multi-modal piano genre dataset in terms of average accuracy, average RE and average forward time. Average accuracies with the standard deviations over three independent runs are reported.

Modality	Method	Ave. Accuracy (in %)	Ave. RE (in %)	Ave. Time (in s)
Audio	Histogram kernel + SVM	71.60 ± 0.40	-	-
	CNN [34]	79.05 ± 0.30	-	-
	DKNs	72.03 ± 0.40	-	-
Text	Histogram kernel + SVM	82.42 ± 0.30	-	-
	RoBERTa + MLP [90]	83.12 ± 0.26	-	-
	DKNs	83.46 ± 0.40	-	-
Both	CNN + BOW [91]	83.81 ± 0.54	-	-
	DCN [44]	83.98 ± 0.15	-	-
	Feature concatenation [21]	84.24 ± 0.30	-	-
	Decision weighting [21]	84.07 ± 0.30	-	-
	Hybrid fusion [21]	86.93 ± 0.15	-	-
	Multi-modal DKNs	85.11 ± 0.30	-	31.18
	Initial DM2KMN	83.03 ± 0.40	0.820	-
	Learned DM2KMN	88.74 ± 0.54	3.051	12.27

Table 6. The test accuracy of different methods on the 4MuLA dataset (in %). Average accuracies with the standard deviations over three independent runs are reported.

Modality	Model	Ave. Accuracy
Audio MFCCs	ResNet + MLP [90]	42.6 ± 0.3
Lyrics	RoBERTa + MLP [90]	57.4 ± 0.3
Audio MFCCs + Lyrics	CNN + BOW [91]	86.7 ± 0.3
Audio MFCCs + Lyrics	DCN [44]	87.6 ± 0.2
Audio MFCCs + Lyrics	Feature concatenation [90]	86.1 ± 0.3
Audio MFCCs + Lyrics	Decision weighting [21]	88.8 ± 0.2
Audio MFCCs + Lyrics	Hybrid fusion [21]	90.3 ± 0.2
Audio MFCCs + Lyrics	The proposed DM2KMN	90.4 ± 0.3


Share and Cite

MDPI and ACS Style

Wang, Q.; Jiu, M. Deep Multi-Modal Kernel Map Network for Music Genre Classification. Algorithms 2026, 19, 467. https://doi.org/10.3390/a19060467

