Article

Multi-Instance Multi-Scale Graph Attention Neural Net with Label Semantic Embeddings for Instrument Recognition

1 College of Information and Electrical Engineering, Jiangsu Vocational Institute of Architectural Technology, Xuzhou 221116, China
2 School of Computer Science and Technology, China University of Mining and Technology, Xuzhou 221116, China
* Authors to whom correspondence should be addressed.
Signals 2025, 6(3), 30; https://doi.org/10.3390/signals6030030
Submission received: 22 April 2025 / Revised: 30 May 2025 / Accepted: 5 June 2025 / Published: 24 June 2025

Abstract

Instrument recognition is a crucial aspect of music information retrieval, and in recent years, machine learning-based methods have become the primary approach to addressing this challenge. However, existing models often struggle to accurately identify multiple instruments within music tracks that vary in length and quality. One key issue is that the instruments of interest may not appear in every clip of the audio sample, and when they do, they are often unevenly distributed across different sections of the track. Additionally, in polyphonic music, multiple instruments are often played simultaneously, leading to signal overlap. Using the same overlapping audio signals as partial classification features for different instruments will reduce the distinguishability of features between instruments, thereby affecting the performance of instrument recognition. These complexities present significant challenges for current instrument recognition models. Therefore, this paper proposes a multi-instance multi-scale graph attention neural network (MMGAT) with label semantic embeddings for instrument recognition. MMGAT designs an instance correlation graph to model the presence and quantitative timbre similarity of instruments at different positions from the perspective of multi-instance learning. Then, to enhance the distinguishability of signals after the overlap of different instruments and improve classification accuracy, MMGAT learns semantic information from the labels of different instruments as embeddings and incorporates them into the overlapping audio signal features, thereby enhancing the differentiability of audio features for various instruments. MMGAT then designs an instance-based multi-instance multi-scale graph attention neural network to recognize different instruments based on the instance correlation graphs and label semantic embeddings. The effectiveness of MMGAT is validated through experiments and compared to commonly used instrument recognition models. The experimental results demonstrate that MMGAT outperforms existing approaches in instrument recognition tasks.

1. Introduction

Instrument recognition plays a vital role in music information retrieval, particularly in identifying multiple instruments within a musical piece. In recent years, machine learning-based approaches have become the dominant methods for tackling this task [1,2]. Traditionally, commonly used acoustic features such as zero-crossing rate and Mel-frequency cepstral coefficients are extracted from the music data. These features are then processed by classifiers such as Decision Trees or Naive Bayes. However, traditional machine learning methods face significant challenges in achieving high recognition accuracy. They often overlook subtle differences between musical instruments, resulting in suboptimal recognition performance. Moreover, the effectiveness of these methods heavily depends on the manually extracted features, which may not adequately capture the complex and dynamic characteristics of musical instruments [2].
Deep neural networks (DNNs) have emerged as the dominant approach for musical instrument recognition tasks [2,3], largely due to their ability to model complex audio patterns. However, a significant challenge arises when processing raw signals from recordings of varying quality. This limitation has led to a critical shift towards 2D spectral representations [4], where time–frequency analysis through Convolutional Neural Networks (CNNs) and Transformers marks a substantial performance improvement over traditional methods. Building on this spectral transformation framework, researchers have addressed data scarcity by adopting two complementary strategies: (1) knowledge transfer through pre-training on auxiliary music datasets (such as VGGish, Passt, Dy-CNN) [5,6,7], and (2) the generation of synthetic data using Wasserstein Generative Adversarial Networks (WGANs) to overcome the constraints posed by low-quality recordings [8,9]. In addition to these advances, the field has progressed to leverage label semantics, incorporating techniques like multi-instance learning [5,6] and label augmentation [10]. Multi-task architectures with auxiliary classifiers contribute to a feedback loop, where enhanced feature learning from expanded labels leads to improved recognition, which in turn refines the modeling of label correlations. This chain of innovation (from signal representation to data augmentation and semantic enhancement) demonstrates a systematic progression in overcoming the limitations of earlier approaches while building on their strengths.
While the machine learning models discussed earlier enhance instrument recognition through effective model design and data (and label) augmentation, they do not fully address the challenges posed by the presence and correlation of multiple instruments within musical pieces of varying lengths. In many cases, the instruments of interest are present only in specific sections of the audio, rather than throughout the entire sample. Furthermore, the instruments that do appear may be scattered across different parts of the music, making it difficult to capture them consistently. Additionally, instruments with similar timbres often share comparable acoustic features, complicating the task of distinguishing them. As a result, it becomes crucial to highlight the audio segments containing the relevant instruments while also weighting instruments with similar timbres for comparison. Moreover, in polyphonic music, where multiple instruments are played simultaneously, their signals overlap. Introducing the same overlapping audio signals into the specific features for different instruments will diminish the distinctiveness of their characteristics, ultimately impacting the accuracy of instrument recognition.
Based on the above situations, this paper proposes a label correlation-based multi-instance multi-scale graph attention neural network (MMGAT) for instrument recognition. MMGAT converts the one-dimensional sequence of music audio into a two-dimensional graph structure by constructing an instance correlation graph. In this graph, each node represents a time slice of the audio, with its features derived from the extracted time–frequency characteristics. The instance correlation graph forms connections between these nodes by creating edges based on the similarity of their features. To model the label correlation, MMGAT constructs label semantic embeddings by first using the CLIP text encoder to generate semantic vectors for instrument name labels (such as “piano” and “violin”). Based on this, MMGAT introduces learnable mask vectors into the semantic vectors to decrease noisy label effects. It then builds a multi-layer decoder to calculate the embedding features of the semantic vectors as enhanced features for distinguishing different instruments. To capture the timbre similarities of instruments across different positions within the music, MMGAT designs an instance-based multi-scale graph attention mechanism. This mechanism prioritizes instances containing the target instrument and those with instruments of similar timbres by assigning higher attention weights, facilitating instrument recognition through the instance correlation graph and label semantic embeddings within a multi-instance multi-label learning framework for the final instrument classification. To verify the effectiveness of the proposed MMGAT, experiments are conducted using a custom-built music dataset from OpenMIC and IRMAS.
The main contributions are listed as follows:
(1)
The proposal of an instance correlation graph structure for instrument representation: This graph transforms the sequential linear structure of music audio into the 2-dimensional graph structure to capture the similarity of instances containing similar instruments and the similarity of unrelated instances.
(2)
The proposal of a multi-instance multi-scale graph attention neural network (MMGAT) with label semantic embeddings based on the instance correlation graphs and label correlations: MMGAT designs a graph attention network with different scales on the instance correlation graphs and a masked label semantic autoencoder for instrument recognition.
This paper is structured as follows: Section 2 provides a review of existing approaches for instrument recognition. Section 3 introduces the proposed MMGAT model. Section 4 presents experimental results, comparing the performance of the MMGAT model with other commonly used instrument recognition models. Lastly, Section 5 concludes the paper and outlines potential directions for future research.

2. Related Works

Instrument recognition using machine learning has emerged as a dynamic and evolving research area, marked by a variety of approaches ranging from traditional feature engineering to modern deep learning architectures. Early methods focused on extracting acoustic descriptors, such as those developed by Essid et al. [11], who introduced hybrid MFCC-PCA features for timbral modeling. Building on this, Duan et al. [12] advanced cepstral analysis by utilizing UDC/MelUDC representations, combined with RBF-SVM classification for improved recognition. In a similar vein, Eggink et al. [13] pioneered harmonic spectral peak detection to isolate instruments within polyphonic audio contexts. The shift towards deep learning brought about significant innovations, including Gururani et al.’s [14] temporal max-pooling networks designed for detecting polyphonic activity and Han et al.’s [15] implementation of ConvGRU-enhanced CNNs applied to Mel-spectrograms for more accurate instrument identification. As the challenge of audio degradation and polyphonic interference persisted, modern approaches incorporated adaptive data augmentation techniques. For instance, Yu et al. [10] introduced auxiliary classification networks for generating synthetic training data, while Hung et al. [16] leveraged WaveGAN-powered multi-task frameworks to simultaneously address pitch and instrument recognition. The field continues to progress with the application of transfer learning strategies [17], which utilize pre-trained auditory models to boost the generalization capabilities of new models. In parallel, Multi-instance multi-label (MIML) formulations [6] have reframed polyphonic analysis as multi-label segmentation over temporal instances, providing a more comprehensive solution. This methodological convergence reflects a systematic effort to tackle the core challenges of instrument recognition by combining feature optimization, architectural advancements, and data-centric innovations, all aimed at improving robustness in complex auditory environments.
Instrument recognition research has increasingly incorporated transfer learning and multi-instance multi-label frameworks while also exploring various generative neural networks for data augmentation beyond the use of WaveGAN. Although generative adversarial networks (GANs) and their variants continue to be prominent in the field, their practical deployment in pattern recognition systems is limited by training instability. Alternative approaches, such as diffusion models [18] and flow-based architectures [19], have shown superior image generation quality but come with substantial computational costs, making them less feasible for instrument recognition tasks. In contrast, variational autoencoders (VAEs) [20] provide greater architectural flexibility and training stability, allowing for their effective integration as embedded subsystems tailored to specific tasks. This advantage is further extended by Conditional VAEs (CVAEs) [21], which enable label-conditioned data synthesis, offering a more controlled approach to data generation [22].

3. The Proposed MMGAT Methods

3.1. Building Instance Correlation Graph for MMGAT

MMGAT begins by constructing an instance correlation graph to represent musical instruments. While music audio is initially converted into a two-dimensional spectrogram, the relationship among these spectrograms remains a linear temporal connection. To directly capture the relationships between instruments within a musical piece, we create a graph structure, which allows for the identification of similarities between instances with similar instruments, as well as distinguishing unrelated instances.
To build this instance correlation graph, we extract features from the spectrograms using the pre-trained deep neural network VGGish, which produces a 128-dimensional feature vector for every 1 s music clip. An n-second audio sample is therefore mapped into an n × 128 feature matrix, and this matrix is paired with a label vector in which multiple label components may be activated. We use this graph structure to represent the correlations between audio segments: each row of the feature matrix is treated as an instance, each instance becomes a node of the instance correlation graph with its 128-dimensional feature vector as the node feature, and a graph built from an n-second sample therefore has n nodes, equal to the number of clips in the corresponding music audio. The edge between nodes i and j is weighted by the similarity of the corresponding rows of the feature matrix; this similarity can be calculated as follows:
$$\mathrm{Similarity}_{cosine}(f_i, f_j) = \frac{f_i \cdot f_j}{\|f_i\|\,\|f_j\|} = \frac{\sum_{k=1}^{128} f_{ik} f_{jk}}{\sqrt{\sum_{k=1}^{128} f_{ik}^{2}}\,\sqrt{\sum_{k=1}^{128} f_{jk}^{2}}}$$
$$\mathrm{Weight\_Similarity}(i, j) = \frac{\mathrm{Similarity}(i, j)}{\sum_{(i, j)} \mathrm{Similarity}(i, j)}$$
where we use cosine similarity in this paper; the effectiveness of different types of similarity measures is discussed in the experiments. Based on the constructed feature matrix and the corresponding weighted similarities used as edges, an instance correlation graph is constructed for every music audio sample in the dataset, as sketched below. The next subsection presents the structure of the proposed MMGAT.
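A minimal sketch of this construction is given below, assuming the VGGish features of one piece are already available as an n × 128 NumPy array; the function names and the removal of self-loops are our own assumptions, not specifications from the paper.

```python
import numpy as np

def cosine_similarity_matrix(features: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the n clip-level VGGish vectors."""
    norms = np.linalg.norm(features, axis=1, keepdims=True) + 1e-12
    unit = features / norms
    return unit @ unit.T

def build_instance_correlation_graph(features: np.ndarray):
    """Return (node_features, weighted_adjacency) for one piece of music.

    `features` is the n x 128 matrix produced by VGGish (one row per 1 s clip);
    edge weights are the pairwise cosine similarities normalized over all pairs,
    i.e. the Weight_Similarity values defined above.
    """
    sim = cosine_similarity_matrix(features)
    np.fill_diagonal(sim, 0.0)            # drop self-loops (an assumption, not stated in the text)
    weights = sim / (sim.sum() + 1e-12)   # Weight_Similarity(i, j)
    return features, weights

# Usage with random stand-in features for a 12 s piece of music:
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    vggish_features = rng.normal(size=(12, 128)).astype(np.float32)
    nodes, adj = build_instance_correlation_graph(vggish_features)
    print(nodes.shape, adj.shape)  # (12, 128) (12, 12)
```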

3.2. The Structure of the Proposed MMGAT

Since different musical pieces have varying lengths, the corresponding feature matrices will have a different number of rows. This results in the instance correlation graphs having a variable number of nodes for each piece of music. To address this issue, we employ an attention mechanism as the primary feature extractor in MMGAT.
Focusing solely on the similarity of instrument features at smaller scales leads to more precise differentiation, which can effectively identify different instruments within the same family. However, this may also result in misidentification due to the varying characteristics of the same instrument across different pitches and qualities. Conversely, if the focus is only on larger scales, this can reduce misidentification issues associated with different pitches of the same instrument, but it may make it harder to distinguish between different instruments within the same family. To address this challenge, MMGAT incorporates multiple attention scales, allowing attention features to be computed at various levels. As a result, a multi-scale self-attention layer is employed to extract features. This layer consists of three key components: a similarity function, attention coefficients, and multi-scale attention features. The similarity function, in particular, uses a hybrid measure to assess the similarity between pairs of node features, evaluating how closely two input vectors resemble one another.
$$\mathrm{Sim}(z_a, z_b) = \left[\mathrm{Cosine}(z_a, z_b),\ \mathrm{L1}(z_a, z_b)\right], \quad \text{where } \mathrm{Cosine}(z_a, z_b) = \frac{z_a \cdot z_b}{\|z_a\|\,\|z_b\|} = \frac{\sum_{i=1}^{n} z_{ai} z_{bi}}{\sqrt{\sum_{i=1}^{n} z_{ai}^{2}}\,\sqrt{\sum_{i=1}^{n} z_{bi}^{2}}}, \quad \mathrm{L1}(z_a, z_b) = \sum_{i} \left| z_{ai} - z_{bi} \right|$$
where i indexes the components of the input vectors, and za and zb are two input music features. In order to capture diverse levels of influence arising from node similarity and tailor this influence to the instrument recognition task, this paper builds a learnable similarity on top of the cosine and L1 measures as follows:
$$\mathrm{Sim}_{ij}^{adj\_1} = NN\left(Hz_i, Hz_j\right) = \mathrm{ReLU}\left(W \times \left[\mathrm{Cosine}\left(Hz_i, Hz_j\right),\ \mathrm{L1}\left(Hz_i, Hz_j\right)\right]\right)$$
where W and H are learnable weight matrices, zi and zj are the i-th and j-th input feature vectors, adj_1 denotes an attention scale, and NN(·) is a single-layer neural network whose ReLU activation alleviates the vanishing gradients of the raw cosine similarity. Building upon this similarity, we design the attention coefficient in the following manner:
$$\alpha_{ij}^{adj\_1} = \mathrm{softmax}\left(\mathrm{Sim}_{ij}^{adj\_1}\right) = \frac{\exp\left(\mathrm{Sim}_{ij}^{adj\_1}\right)}{\sum_{j} \exp\left(\mathrm{Sim}_{ij}^{adj\_1}\right)}$$
where αij is the attention coefficient at scale adj_1. According to these attention coefficients, the extracted attention feature hi of the current audio segment i is calculated as
$$h_i = \mathrm{ReLU}\left(NN\left(\sum_{j} \alpha_{ij}^{adj\_1} z_j\right)\right)$$
Our objective is to embed attention features based on the designed similarity function. By utilizing these extracted features, we incorporate a multi-head attention mechanism to form a multi-head attention block as follows:
$$h_i = \mathrm{ReLU}\left(NN\left(\sum_{j} \alpha_{ij}^{adj\_1} z_j\right)\right)$$
After computing the attention features for a neighborhood adj_1 at a small scale, we then calculate attention features for two larger neighborhood scales. Specifically, this study examines three distinct scales: 1, 2, and 3.
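A minimal PyTorch sketch of this multi-scale attention layer is given below. It follows the similarity and attention formulas above, but several details are our own assumptions: the attention coefficients are normalized over each node's neighbourhood, the three scales are realized as 1-, 2-, and 3-hop reachability on the instance correlation graph, the per-scale outputs are concatenated, and the multi-head extension is omitted; all names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleGraphAttention(nn.Module):
    """Sketch of the instance-based multi-scale attention layer of MMGAT.

    The hybrid [cosine, L1] similarity is passed through a single-layer ReLU network,
    turned into attention coefficients with a softmax over each neighbourhood, and the
    aggregation is repeated for growing neighbourhood scales of the instance
    correlation graph.
    """

    def __init__(self, in_dim: int = 128, out_dim: int = 128, scales: int = 3):
        super().__init__()
        self.scales = scales
        self.proj_h = nn.Linear(in_dim, out_dim, bias=False)   # the H projection
        self.sim_nn = nn.Linear(2, 1)                           # W applied to [cosine, L1]
        self.out_nn = nn.Linear(in_dim, out_dim)                # the NN() in the aggregation step

    def hybrid_similarity(self, h: torch.Tensor) -> torch.Tensor:
        unit = F.normalize(h, dim=-1)
        cos = unit @ unit.t()                                   # cosine term
        l1 = torch.cdist(h, h, p=1)                             # L1 term
        pair = torch.stack([cos, l1], dim=-1)                   # (n, n, 2)
        return F.relu(self.sim_nn(pair)).squeeze(-1)            # Sim_ij

    def forward(self, z: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        """z: (n, in_dim) instance features; adj: (n, n) weighted adjacency."""
        sim = self.hybrid_similarity(self.proj_h(z))
        reach = (adj > 0).float()
        hop = reach.clone()
        outputs = []
        for _ in range(self.scales):
            masked = sim.masked_fill(hop == 0, float("-inf"))   # restrict to the current scale
            alpha = torch.nan_to_num(torch.softmax(masked, dim=-1))
            outputs.append(F.relu(self.out_nn(alpha @ z)))      # h_i at this scale
            hop = ((hop + hop @ reach) > 0).float()             # grow to the next neighbourhood scale
        return torch.cat(outputs, dim=-1)

# Usage with a toy sparse graph of 12 instances:
if __name__ == "__main__":
    z = torch.randn(12, 128)
    adj = (torch.rand(12, 12) > 0.7).float()
    print(MultiScaleGraphAttention()(z, adj).shape)   # torch.Size([12, 384])
```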

3.2.1. The Designed Label Correlation Embeddings

MMGAT constructs label semantic embeddings that learn semantic interdependencies between musical instrument labels, formally representing instrument classes as embeddings L where each row lc corresponds to the c-th instrument. Specifically, we first initialize these embeddings by projecting class labels into CLIP-compatible text prompts processed through CLIP’s frozen text encoder [17].
Then, it introduces learnable masking vectors for label masking
$$S = \mathrm{expand}\left(\mathrm{Vector},\ \left(\mathrm{num\_labels},\ \mathrm{embedding\_dim}\right)\right)$$
where num_labels denotes the number of labels, embedding_dim is the dimension of the label embeddings, and Vector is a learnable parameter vector. S is then used to mask the pre-trained semantic embedding vectors, and the masked pre-trained semantic vectors are transferred into feature embeddings using an encoder:
$$\mathrm{Feature} = \mathrm{Encoder}\left(S \odot L\right)$$
It then builds a multi-layer decoder to calculate the embedding features of the semantic vectors to capture the label correlations. The loss function of this part is denoted as follows:
$$\mathrm{Loss\_label} = \mathrm{MSE}\left(S \odot L,\ \mathrm{reconstruction}\right)$$
where reconstruction is the reconstructed output of the decoder. The extracted label correlation features are then embedded into the attention features H in Formula (7) as the recognition features.
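The label-semantic branch can be sketched as follows. The CLIP text embeddings are assumed to be precomputed (random stand-ins are used in the example), and the element-wise masking, the two-layer encoder/decoder sizes, and the 512-dimensional embedding are our own assumptions; the paper only specifies a learnable mask applied to the CLIP label embeddings and a multi-layer autoencoder trained with an MSE reconstruction loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LabelSemanticEmbedding(nn.Module):
    """Sketch of the masked label-semantic autoencoder of Section 3.2.1.

    L is the (num_labels x dim) matrix of CLIP text embeddings of the instrument
    names; a learnable vector is expanded to the shape of L and used as the mask S.
    """

    def __init__(self, dim: int = 512, hidden: int = 128):
        super().__init__()
        self.mask_vector = nn.Parameter(torch.ones(1, dim))     # expanded to (num_labels, dim)
        self.encoder = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, hidden))
        self.decoder = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                     nn.Linear(hidden, dim))

    def forward(self, label_embeddings: torch.Tensor):
        S = self.mask_vector.expand_as(label_embeddings)        # S = expand(Vector, (num_labels, dim))
        masked = S * label_embeddings                           # S ⊙ L
        feature = self.encoder(masked)                          # label-correlation features
        reconstruction = self.decoder(feature)
        loss_label = F.mse_loss(reconstruction, masked)         # Loss_label = MSE(S ⊙ L, reconstruction)
        return feature, loss_label

# Usage with stand-in CLIP text embeddings for the 20 instrument labels:
if __name__ == "__main__":
    L = torch.randn(20, 512)
    feature, loss_label = LabelSemanticEmbedding()(L)
    print(feature.shape, float(loss_label))   # torch.Size([20, 128]) ...
```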

3.2.2. The Introduction of the Auxiliary Classifier

MMGAT integrates label augmentation into its training process. To enhance both recognition accuracy and generalization performance, the network concurrently performs the auxiliary classification of instrument groups and primary classification of instrument categories. The grouping strategies, based on onset types, are applied for the auxiliary classification task. By simultaneously carrying out both the auxiliary and primary tasks, our network operates in a multitask learning framework. These two tasks leverage shared information to produce different outputs. The structure of MMGAT is shown in Figure 1.
In MMGAT, the backbone is the commonly used VGGish network.
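A minimal sketch of the two classification heads is shown below. The feature dimension, the max pooling used to aggregate instance-level features into a bag-level feature, and the linear head structure are assumptions on our part; the class counts (20 instruments, 3 onset groups) follow Tables 1 and 2.

```python
import torch
import torch.nn as nn

class MultiTaskHeads(nn.Module):
    """Sketch of the shared-feature principal/auxiliary classifiers of Section 3.2.2.

    Instance features are aggregated into a bag-level feature by max pooling
    (a common multi-instance choice, assumed here) before the two linear heads.
    """

    def __init__(self, feat_dim: int = 384, num_instruments: int = 20, num_groups: int = 3):
        super().__init__()
        self.principal = nn.Linear(feat_dim, num_instruments)   # instrument logits (multi-label)
        self.auxiliary = nn.Linear(feat_dim, num_groups)        # onset-group logits (Table 2)

    def forward(self, node_features: torch.Tensor):
        # node_features: (n_instances, feat_dim) output of the multi-scale attention layers
        bag_feature, _ = node_features.max(dim=0)
        return self.principal(bag_feature), self.auxiliary(bag_feature), bag_feature

# Usage:
if __name__ == "__main__":
    principal_logits, auxiliary_logits, bag = MultiTaskHeads()(torch.randn(12, 384))
    print(principal_logits.shape, auxiliary_logits.shape)   # torch.Size([20]) torch.Size([3])
```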

3.3. MMGAT Loss Function

To simultaneously optimize training efficiency and enhance feature discriminability, we develop a dual-objective framework integrating a centroid-driven regularization term. This geometric constraint explicitly enforces inter-class separation by penalizing deviations from class-specific feature centroids, thereby systematically maximizing margin distances in the latent embedding space. MMGAT’s loss function is denoted as Formula (8):
$$\mathrm{Loss} = \mathrm{Loss}_{principal} + \mathrm{Loss}_{auxiliary} + \mathrm{Loss\_label} + \alpha\left(\mathrm{Loss}_{center\_loss1} + \mathrm{Loss}_{center\_loss2}\right)$$
Loss_principal and Loss_auxiliary are the cross-entropy classification losses for the principal and auxiliary classification tasks. Loss_center_loss1 is the center loss introduced for the principal classification [10], and Loss_center_loss2 is the center loss for the auxiliary classification. α is a hyper-parameter (set to 0.015 in this paper). By incorporating the center loss, the features associated with different instrument types in the primary classification become more distinct and well-separated.
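A compact sketch of this combined objective is given below. The paper specifies cross-entropy losses, the center losses of [10], and α = 0.015; treating the principal task as multi-label with BCE-with-logits and using each sample's most active label for the principal center loss are our own assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CenterLoss(nn.Module):
    """Minimal center loss: mean squared distance of each feature to its class centroid [10]."""

    def __init__(self, num_classes: int, feat_dim: int):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, features: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        return ((features - self.centers[labels]) ** 2).sum(dim=1).mean()

def mmgat_loss(principal_logits, principal_targets,      # (B, 20) logits and 0/1 targets
               auxiliary_logits, auxiliary_targets,      # (B, 3) logits and class indices
               loss_label, features,
               center_principal, center_auxiliary, alpha: float = 0.015):
    """Combined objective of Formula (8)."""
    loss_principal = F.binary_cross_entropy_with_logits(principal_logits,
                                                        principal_targets.float())
    loss_auxiliary = F.cross_entropy(auxiliary_logits, auxiliary_targets)
    center1 = center_principal(features, principal_targets.argmax(dim=1))  # most active label
    center2 = center_auxiliary(features, auxiliary_targets)
    return loss_principal + loss_auxiliary + loss_label + alpha * (center1 + center2)

# Usage with stand-in tensors for a batch of 8 samples:
if __name__ == "__main__":
    B, D = 8, 384
    loss = mmgat_loss(torch.randn(B, 20), torch.randint(0, 2, (B, 20)),
                      torch.randn(B, 3), torch.randint(0, 3, (B,)),
                      torch.tensor(0.1), torch.randn(B, D),
                      CenterLoss(20, D), CenterLoss(3, D))
    print(float(loss))
```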
In this paper, we directly use VGGish to map audio with a duration of k seconds into a k×128-dimensional matrix and then construct the corresponding instance graph using the k 128-dimensional vectors. CLIP is used to map labels to word vectors. In the recognition process that follows, our attention model consists of three layers with three scales, and each layer has only 128-dimensional hidden features. Label correlation encoding is performed using an autoencoder built with a multi-layer neural network, where both the encoder and decoder consist of two layers. As a result, the total number of parameters in the network is not large, leading to relatively fast training.
Next, we assess the effectiveness of the proposed MMGAT through experimental evaluation.

4. Experiments

The experimental framework is structured around two objectives: (1) to empirically validate the performance superiority of MMGAT in polyphonic acoustic recognition against benchmarked state-of-the-art baselines and (2) to quantitatively assess the structural efficacy of its multi-relational graph architecture in modeling instrument co-occurrence patterns and learned label semantic embeddings. Following standard evaluation protocols, this section details the dataset configuration and preprocessing pipeline before systematically dissecting performance metrics across ablation studies and comparative trials.

4.1. Dataset

In this paper, we design a combined dataset consisting of IRMAS [23] and OpenMIC. It is important to note that the IRMAS training set is single-labeled, the IRMAS testing set is multi-labeled, and the samples in the IRMAS dataset vary in length. The music audio segments in the IRMAS dataset are stereo with a 44.1 kHz sampling rate and feature different musical styles. The recordings span several decades, resulting in varying audio qualities. The goal of IRMAS is to recognize instruments in audio segments, including cello, clarinet, flute, acoustic guitar, electric guitar, organ, piano, saxophone, trumpet, violin, and human voice. However, although the testing set reflects the performance of instrument recognition models on music samples of varying lengths, the IRMAS training set only contains single-instrument samples. To verify the effectiveness of the proposed MMGAT, we introduce the OpenMIC [24] dataset to create a training dataset that includes multiple instruments with different lengths. The resulting OpenMIC-IRMAS dataset consists of both training and testing datasets containing multi-labeled samples of varying lengths. The OpenMIC-IRMAS training set is a combination of the training datasets from OpenMIC and IRMAS. Specifically, we incorporate instances from the IRMAS dataset into the OpenMIC samples, creating music segments that range from 10 to 19 s in duration. In the experiments, the OpenMIC-IRMAS dataset is split into training and testing sets. We use the training set of OpenMIC-IRMAS to train the network, with 15% of the testing set reserved for validation. The attributes of the OpenMIC-IRMAS dataset are shown in Table 1.
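The exact procedure for assembling these 10-19 s training segments is described only at a high level; the sketch below is an illustrative assumption in which single-label IRMAS excerpts are appended to a 10 s OpenMIC clip and the label sets are merged. The use of concatenation (rather than signal mixing) is our own assumption.

```python
import numpy as np

def build_openmic_irmas_segment(openmic_audio: np.ndarray, openmic_labels: set,
                                irmas_excerpts: list) -> tuple:
    """Illustrative construction of one OpenMIC-IRMAS training segment.

    openmic_audio   : waveform of a 10 s OpenMIC clip
    openmic_labels  : its set of instrument labels
    irmas_excerpts  : list of (waveform, label) pairs taken from the IRMAS training set
    """
    audio_parts = [openmic_audio]
    labels = set(openmic_labels)
    for excerpt_audio, excerpt_label in irmas_excerpts:   # each IRMAS excerpt is single-labeled
        audio_parts.append(excerpt_audio)
        labels.add(excerpt_label)
    mixed = np.concatenate(audio_parts)                   # total duration falls in the 10-19 s range
    return mixed, labels
```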
As Table 1 shows, the dataset is imbalanced, and the auxiliary classifier contains three classes. The relations of the auxiliary classes and principal classes are listed in Table 2.
As evidenced in Table 2, the auxiliary classifier operates on higher taxonomic granularity compared to its principal counterpart. To holistically assess MMGAT’s classification efficacy under this hierarchical paradigm, we employ rigorous multi-perspective evaluation:
(1)
Class-imbalance-resistant micro-averaging for instrument-level fidelity;
(2)
Macro-averaging sensitive to rare classes for taxonomy-level fairness.
Following standard evaluation protocols for multi-granular systems, these metrics are formally defined as per standard convention, ensuring direct comparability with prior hierarchical recognition frameworks.
$$F1_{micro} = \frac{2 P_{micro} R_{micro}}{P_{micro} + R_{micro}}, \quad F1_{macro} = \frac{2 P_{macro} R_{macro}}{P_{macro} + R_{macro}}$$
$$P_{micro} = \frac{\sum_{l=1}^{L} tp_l}{\sum_{l=1}^{L} \left(tp_l + fp_l\right)}, \quad R_{micro} = \frac{\sum_{l=1}^{L} tp_l}{\sum_{l=1}^{L} \left(tp_l + fn_l\right)}$$
$$P_{macro} = \frac{1}{L} \sum_{l=1}^{L} \frac{tp_l}{tp_l + fp_l}, \quad R_{macro} = \frac{1}{L} \sum_{l=1}^{L} \frac{tp_l}{tp_l + fn_l}$$
where L is the number of classes, and tp_l, fp_l, and fn_l are the numbers of true positives, false positives, and false negatives for label l, respectively.
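For clarity, the micro- and macro-averaged scores defined above can be computed directly from per-label counts, as in the short sketch below (the per-label counts in the usage example are arbitrary).

```python
import numpy as np

def micro_macro_f1(tp: np.ndarray, fp: np.ndarray, fn: np.ndarray):
    """Micro- and macro-averaged F1 from per-label counts, following the definitions above."""
    eps = 1e-12
    p_micro = tp.sum() / (tp.sum() + fp.sum() + eps)
    r_micro = tp.sum() / (tp.sum() + fn.sum() + eps)
    p_macro = np.mean(tp / (tp + fp + eps))
    r_macro = np.mean(tp / (tp + fn + eps))
    f1_micro = 2 * p_micro * r_micro / (p_micro + r_micro + eps)
    f1_macro = 2 * p_macro * r_macro / (p_macro + r_macro + eps)
    return f1_micro, f1_macro

# Example with arbitrary counts for three labels:
tp = np.array([40.0, 10.0, 5.0]); fp = np.array([10.0, 5.0, 5.0]); fn = np.array([10.0, 15.0, 10.0])
print(micro_macro_f1(tp, fp, fn))
```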

4.2. Experiment Analysis

To verify the effectiveness of the proposed MMGAT model, we compare it with commonly used instrument recognition methods on the OpenMIC-IRMAS dataset; the instrument recognition results are shown in Table 3.
Table 3 systematically benchmarks instrument recognition methodologies across three paradigms: feature-engineered systems, spectrogram-driven neural architectures, and synthetic data-augmented frameworks. Traditional approaches include Bosch et al.’s SVM classifier with handcrafted features and MTF-DNN’s shallow neural network applied to engineered features, while modern architectures like Audio DNN demonstrate the superiority of raw audio CNNs over conventional methods.
The mel-spectrogram-optimized ConvNet introduces temporal max-pooling, later extended by multi-task ConvNet with auxiliary classification to refine labels. These two models use spectrograms and enhanced spectrograms as inputs and train the model using VGG neural networks. The complexity of these models is related to the structure of the neural networks, specifically the complexity of the VGG network. Data augmentation strategies diverge into generative and latent space approaches: WaveGAN ConvNet and Voting-Swin-T both synthesize training data via WaveGAN but diverge in using CNNs versus transformer-based recognition, while VAE augmentation ConvNet leverages variational autoencoders for Melody-solos-DB latent space interpolation. Advanced hybrid systems include the staged-trained ConvNet’s cross-dataset curriculum learning and AEDCN’s adversarial ACEVAE architecture embedded in fully connected layers for joint data–label augmentation. These models all incorporate generative models, and their complexity is not only related to the discriminative model but also to the generative model, which typically uses multi-layer VAE and GAN structures. The discriminative models utilize VGG convolutional networks and attention-based neural networks. The input for these models is either spectrograms or enhanced spectrograms.
Notably, the proposed MMGAT achieves state-of-the-art performance, surpassing all neural and augmentation-enhanced baselines in Table 3 through multi-modal polyphonic interaction modeling. MMGAT directly utilizes the VGGish model to extract features from each audio data input. The process involves loading the pre-trained VGGish model, providing the local path of our audio data, and the pre-trained VGGish model directly outputs the features corresponding to the audio (for a 10 s audio clip, a 10*128 feature matrix is generated). Therefore, the input to the MMGAT model consists of the features extracted by VGGish. VGGish maps an audio clip of a length of k seconds to a k*128-dimensional matrix and then constructs the corresponding instance graph from k*128-dimensional vectors. CLIP is used to map labels to word vectors. In the subsequent recognition process, we have built an attention model with three layers and three scales, each with only 128-dimensional hidden features. Label semantic embeddings are achieved using an autoencoder constructed with a multi-layer neural network, where both the encoder and decoder have two layers each. As a result, the overall parameter count of the network is relatively small, enabling faster training.
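For reference, the VGGish feature-extraction step can be reproduced with a few lines of Python. The sketch below assumes the community PyTorch port of VGGish published on torch.hub ("harritaylor/torchvggish"); the original experiments may instead rely on the official TensorFlow release, and the file path is a placeholder.

```python
import torch

# Load the community PyTorch port of VGGish from torch.hub (an assumption about the
# exact distribution; the official TensorFlow release could be used instead).
model = torch.hub.load("harritaylor/torchvggish", "vggish")
model.eval()

# The model returns one 128-dimensional embedding per second of audio, so a k-second
# clip yields a (k, 128) matrix, matching the k x 128 feature matrix described above.
with torch.no_grad():
    embeddings = model.forward("example_clip.wav")   # placeholder path to a local audio file
print(embeddings.shape)                              # e.g. torch.Size([10, 128]) for a 10 s clip
```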
As this dataset is imbalanced, we next evaluate MMGAT on each instrument class in Table 4.
The results listed in Table 4 show that MMGAT achieves improvements over convolution-based and attention-based instrument recognition methods across different instrument categories. This demonstrates that the constructed instance correlation graph and label correlation embeddings capture learnable correlation information relevant to the music data. In order to visually demonstrate the experimental results, we have provided the following heatmap (Figure 2):

4.3. Ablation Experiment

The MMGAT architecture integrates three core innovations: an instance correlation graph for modeling polyphonic interaction patterns, label semantic embeddings, and a multi-scale graph attention network to hierarchically aggregate spectral–temporal features. In this ablation experiment, we first discuss the role of each component, followed by a discussion on the similarity metrics used in the experiments, and finally, we conduct a visualization analysis of the features obtained by MMGAT. MMGAT defined on mel-spectrogram is denoted as MMGAT-mel-spectrogram (the same 158-dimensional features with the input of Reference [10]), and the MMGAT without multi-scale attention is named MMGAT-single. For comparison, we also use MMGAT on the constructed instance correlation graph without using auxiliary classification in the experiments; this model is denoted as MMGAT-principal. MMGAT without using the label semantic embeddings is denoted as MMGAT_graph. MMGAT without the center loss is denoted as MMGAT_without_center. The results are shown in Table 5.
As Table 5 shows, the constructed instance correlation graph, the introduced label semantic embeddings, and the proposed instance-based multi-scale graph attention neural network of MMGAT are effective. Furthermore, we verify in Table 6 that the constructed similarity measure in the instance correlation graph is effective for the instrument recognition task.
As shown in Table 6, MMGAT-Euc is an MMGAT model that uses Euclidean distance for constructing the instance correlation graph, MMGAT-L1 is an MMGAT model that uses the L1 distance function for constructing the instance correlation graph, and MMGAT-EMD is an MMGAT model that uses the Earth Mover’s Distance function for constructing the instance correlation graph. As Table 6 shows, cosine similarity reaches higher performances than the other commonly used similarity measures.
To further understand the recognition results of MMGAT, we conduct a visual analysis based on t-distributed stochastic neighbor embedding in Figure 3.
This figure shows a visualization of the features obtained from layer L of MMGAT, prior to the classification layer, after dimensionality reduction. We first extracted the features corresponding to a batch of samples from layer L; because their dimensionality is too high to visualize directly, we applied the t-distributed stochastic neighbor embedding (t-SNE) [29] algorithm, a popular technique for reducing the dimensionality of high-dimensional data. The t-SNE projection reveals that the features can be effectively clustered into 20 distinct classes in a two-dimensional space, which suggests that the proposed MMGAT model is capable of generating relatively clear classification boundaries. As illustrated in Figure 3, the clusters produced by MMGAT exhibit greater separation compared to the other models.
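A sketch of this visualization step, using scikit-learn's t-SNE implementation, is shown below; the layer-L features and labels are stand-in arrays, and the perplexity and figure settings are arbitrary choices rather than the paper's.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_tsne(features: np.ndarray, labels: np.ndarray, title: str = "MMGAT layer-L features"):
    """Project high-dimensional layer-L features to 2-D with t-SNE [29] and colour by class."""
    coords = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(features)
    plt.figure(figsize=(6, 5))
    plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="tab20", s=8)
    plt.title(title)
    plt.tight_layout()
    plt.show()

# Usage with stand-in features for a mini-batch of 512 samples and 20 classes:
rng = np.random.default_rng(0)
plot_tsne(rng.normal(size=(512, 384)), rng.integers(0, 20, size=512))
```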

5. Conclusions

In this paper, we have addressed the challenge of instrument recognition in polyphonic music by proposing a novel model called the multi-instance multi-scale graph attention neural network (MMGAT) with label semantic embeddings. Traditional models struggle with issues such as the uneven distribution of instruments across tracks and signal overlap in polyphonic music, which reduces the distinguishability of features and impacts classification accuracy. MMGAT overcomes these challenges by utilizing instance correlation graphs to model the timbre similarities of instruments and by incorporating label semantic embeddings into the feature set. The experimental results demonstrate that MMGAT significantly outperforms existing instrument recognition models, offering a more robust and accurate solution for identifying instruments in complex music tracks.
Looking ahead, future work will focus on further enhancing MMGAT’s performance by integrating advanced feature extraction methods, such as deep audio embeddings, to capture even more subtle characteristics of instruments. Additionally, the model could be expanded to handle a broader range of musical compositions, including those with more simultaneous instruments and varying audio qualities. Another promising direction is to apply MMGAT to real-time instrument recognition in live music settings, potentially opening up applications in music production and performance analysis. Moreover, incorporating contextual information, such as musical scores, lyrics, or even the broader cultural context of the music, could further refine the model’s accuracy and applicability across diverse musical genres.

Author Contributions

Conceptualization, N.B., Z.W. and J.Z.; Methodology, N.B. and J.Z.; Software, N.B., Z.W. and J.Z.; Formal analysis, Z.W.; Data curation, N.B.; Writing—original draft, Z.W. and J.Z.; Writing—review & editing, J.Z.; Visualization, Z.W.; Supervision, Z.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the General Project of Basic Science Research in Higher Education Institutions in Jiangsu Province (23KJB520008) and the National Natural Science Foundation of China (No. 62206297).

Data Availability Statement

The original data presented in the study are openly available at https://zenodo.org/records/1432913 (accessed on 23 September 2018).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Barbedo, J.G.A.; Tzanetakis, G. Musical instrument classification using individual partials. IEEE Trans. Audio Speech Lang. Process. 2010, 19, 111–122. [Google Scholar] [CrossRef]
  2. Wang, D.L.; Chen, J. Supervised speech separation based on deep learning: An overview. IEEE/ACM Trans. Audio Speech Lang. Process. 2018, 26, 1702–1726. [Google Scholar] [CrossRef] [PubMed]
  3. Kratimenos, A.; Avramidis, K.; Garoufis, C.; Zlatintsi, A.; Maragos, P. Augmentation methods on monophonic audio for instrument classification in polyphonic music. In Proceedings of the 28th European Signal Processing Conference (EUSIPCO), Amsterdam, The Netherlands, 18–21 January 2021; pp. 156–160. [Google Scholar]
  4. Tiemeijer, P.; Shahsavari, M.; Fazlali, M. Towards Music Instrument Classification using Convolutional Neural Networks. In Proceedings of the IEEE International Conference on Omni-layer Intelligent Systems (COINS), London, UK, 29–31 July 2024; pp. 1–6. [Google Scholar]
  5. Szeliga, D.; Tarasiuk, P.; Stasiak, B.; Szczepaniak, P.S. Musical Instrument Recognition with a Convolutional Neural Network and Staged Training. Procedia Comput. Sci. 2022, 207, 2493–2502. [Google Scholar] [CrossRef]
  6. Zhang, J.; Wei, T.; Zhang, M.-L. Label-specific time-frequency energy-based neural network for instrument recognition. IEEE Trans. Cybern. 2024, 54, 7080–7093. [Google Scholar] [CrossRef] [PubMed]
  7. Banerjee, S.; Strand, R. Lifelong learning with dynamic convolutions for glioma segmentation from multi-modal MRI. In Medical Imaging 2023: Image Processing; SPIE: Bellingham, WA, USA, 2023; Volume 12464, pp. 821–824. [Google Scholar]
  8. Reghunath, L.C.; Rajan, R. Transformer-based ensemble method for multiple predominant instruments recognition in polyphonic music. EURASIP J. Audio Speech Music Process. 2022, 1, 11. [Google Scholar] [CrossRef]
  9. Lekshmi, C.R.; Rajeev, R. Multiple Predominant Instruments Recognition in Polyphonic Music Using Spectro/Modgd-gram Fusion. Circuits Syst. Signal Process. 2023, 42, 3464–3484. [Google Scholar] [CrossRef]
  10. Yu, D.; Duan, H.; Fang, J.; Zeng, B. Predominant Instrument Recognition Based on Deep Neural Network With Auxiliary Classification. IEEE/ACM Trans. Audio Speech Lang. Process. 2020, 28, 852–861. [Google Scholar] [CrossRef]
  11. Joder, C.; Essid, S.; Richard, G. Temporal Integration for Audio Classification With Application to Musical Instrument Classification. IEEE Trans. Audio Speech Lang. Process. 2009, 17, 174–186. [Google Scholar] [CrossRef]
  12. Duan, Z.; Pardo, B.; Daudet, L. A novel cepstral representation for timbre modeling of sound sources in polyphonic mixtures. In Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing, Florence, Italy, 4–9 May 2014; pp. 7495–7499. [Google Scholar]
  13. Eggink, J.; Brown, G. Using instrument recognition for melody extraction from polyphonic audio. J. Acoust. Soc. Amer. 2005, 118, 2032. [Google Scholar] [CrossRef]
  14. Gururani, S.; Summers, C.; Lerch, A. Instrument activity detection in polyphonic music using deep neural networks. In Proceedings of the International Society for Music Information Retrieval Conference, Paris, France, 23–27 September 2018; pp. 577–584. [Google Scholar]
  15. Han, Y.; Kim, J.; Lee, K. Deep convolutional neural networks for predominant instrument recognition in polyphonic music. IEEE/ACM Trans. Audio Speech Lang. Process. 2016, 25, 208–221. [Google Scholar] [CrossRef]
  16. Hung, Y.N.; Chen, Y.A.; Yang, Y.H. Multitask learning for frame-level instrument recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 381–385. [Google Scholar]
  17. Mesaros, A.; Heittola, T.; Virtanen, T.; Plumbley, M.D. Sound event detection: A tutorial. IEEE Signal Process. Mag. 2021, 38, 67–83. [Google Scholar] [CrossRef]
  18. Pavl, S.; Craus, M. Reaction-diffusion model applied to enhancing U-Net accuracy for semantic image segmentation. Discret. Contin. Dyn. Syst.-S 2023, 16, 54–74. [Google Scholar] [CrossRef]
  19. Yang, J.; Zhou, K.; Li, Y.; Liu, Z. Generalized out-of-distribution detection: A survey. Int. J. Comput. Vis. 2024, 132, 5635–5662. [Google Scholar] [CrossRef]
  20. Girin, L.; Leglaive, S.; Bie, X.; Diard, J.; Hueber, T.; Alameda-Pineda, X. Dynamical variational autoencoders: A comprehensive review. arXiv 2020, arXiv:2008.12595. [Google Scholar]
  21. Kim, J.; Kong, J.; Son, J. Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech. In Proceedings of the International Conference on Machine Learning, Vienna, Austria, 18–24 July 2021; pp. 5530–5540. [Google Scholar]
  22. Grosche, P.; Mu, M.; Kurth, F. Cyclic tempogram—A mid-level tempo representation for music signals. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Dallas, TX, USA, 14–19 March 2010. [Google Scholar]
  23. Nam, J.; Herrera, J.; Slaney, M.; Smith, J.O., III. Learning sparse feature representations for music annotation and retrieval. In Proceedings of the 13th International Society for Music Information Retrieval Conference, Porto, Portugal, 8–12 October 2012; pp. 565–570. [Google Scholar]
  24. Humphrey, E.; Durand, S.; McFee, B. OpenMIC-2018: An Open Data-set for Multiple Instrument Recognition. In Proceedings of the ISMIR 2018, Paris, France, 23–27 September 2018; pp. 438–444. [Google Scholar]
  25. Essid, S.; Richard, G.; David, B. Musical instrument recognition by pairwise classification strategies. IEEE Trans. Audio Speech Lang. Process. 2006, 14, 1401–1412. [Google Scholar] [CrossRef]
  26. Bosch, J.J.; Janer, J.; Fuhrmann, F.; Herrera, P. A comparison of sound segregation techniques for predominant instrument recognition in musical audio signals. In Proceedings of the 13th International Society for Music Information Retrieval Conference, Porto, Portugal, 8–12 October 2012; pp. 559–564. [Google Scholar]
  27. Plchot, O.; Burget, L.; Aronowitz, H.; Matëjka, P. Audio enhancing with DNN autoencoder for speaker recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Shanghai, China, 20–25 March 2016; pp. 5090–5094. [Google Scholar]
  28. Zhang, J.; Bai, N. Augmentation Embedded Deep Convolutional Neural Network for Predominant Instrument Recognition. Appl. Sci. 2023, 13, 10189. [Google Scholar] [CrossRef]
  29. Van der Maaten, L.; Hinton, G. Visualizing Data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. [Google Scholar]
Figure 1. The MMGAT structure.
Figure 2. A heatmap for the OpenMIC-IRMAS dataset.
Figure 3. A mini-batch feature visual analysis of generated clusters from MMGAT and AEDCN.
Table 1. Instruments in OpenMIC-IRMAS.

Instruments          Abbreviations   Training
accordion            acc             374
banjo                ban             592
bass                 bas             415
cello                cel             598
clarinet             cla             396
cymbals              cym             814
drums                dru             828
flute                flu             472
guitar               gur             1066
mallet_percussion    mal             522
mandolin             mad             652
organ                org             482
piano                pia             885
saxophone            sax             830
synthesizer          syn             823
trombone             tro             635
trumpet              tru             828
ukulele              ukulele         1127
violin               vio             779
voice                voi             764
Table 2. The relation of the auxiliary classes and principal classes in OpenMIC-IRMAS.

Instruments          Abbreviations   Auxiliary Classes
accordion            acc             Soft onset
banjo                ban             Hard onset
bass                 bas             Soft onset
cello                cel             Soft onset
clarinet             cla             Soft onset
cymbals              cym             Hard onset
drums                dru             Hard onset
flute                flu             Soft onset
guitar               gur             Hard onset
mallet_percussion    mal             Hard onset
mandolin             mad             Hard onset
organ                org             Soft onset
piano                pia             Hard onset
saxophone            sax             Soft onset
synthesizer          syn             Hard onset
trombone             tro             Hard onset
trumpet              tru             Hard onset
ukulele              ukulele         Hard onset
violin               vio             Soft onset
voice                voi             Other
Table 3. Performance of MMGAT and baselines.

Model                              F1 Micro   F1 Macro
SVM [25]                           0.33       0.24
Bosch et al. [26]                  0.46       0.40
MTF-DNN (2018) [14]                0.30       0.26
Audio DNN [27]                     0.52       0.50
ConvNet (2017) [15]                0.59       0.51
Multi-task ConvNet (2020) [10]     0.61       0.56
Kratimenos et al. (2022) [8]       0.60       0.53
WaveGAN ConvNet (2021)             0.60       0.60
Voting-Swin-T (2022)               0.60       0.60
Staged trained ConvNet             0.60       0.60
VAE augmentation ConvNet           0.61       0.60
AEDCN (2023) [28]                  0.61       0.61
MMGAT                              0.63       0.62
Table 4. Performance of MMGAT on each instrument class.

Model                 acc  ban  bas  cel  cla  cym  dru  flu  gur  mal  mad  org  pia  sax  syn  tru  tro  uku  vio  voi  F1
SVM                   0.19 0.21 0.13 0.17 0.27 0.46 0.41 0.33 0.31 0.25 0.29 0.16 0.32 0.14 0.17 0.18 0.21 0.14 0.21 0.29 0.24
MTF-DNN               0.17 0.23 0.22 0.19 0.29 0.43 0.45 0.36 0.28 0.32 0.21 0.26 0.33 0.18 0.19 0.21 0.20 0.17 0.23 0.32 0.26
ConvNet               0.37 0.43 0.33 0.51 0.41 0.64 0.61 0.45 0.43 0.50 0.62 0.53 0.66 0.49 0.55 0.55 0.56 0.42 0.54 0.62 0.51
Multi-task ConvNet    0.41 0.44 0.41 0.54 0.51 0.69 0.65 0.57 0.47 0.55 0.64 0.52 0.72 0.56 0.66 0.54 0.51 0.40 0.66 0.68 0.56
WaveGAN ConvNet       0.57 0.52 0.51 0.55 0.52 0.71 0.73 0.61 0.47 0.59 0.67 0.53 0.67 0.62 0.69 0.54 0.55 0.43 0.71 0.73 0.60
Voting-Swin-T         0.55 0.51 0.45 0.62 0.49 0.75 0.75 0.63 0.49 0.60 0.62 0.46 0.78 0.61 0.71 0.54 0.57 0.47 0.74 0.73 0.60
AEDCN                 0.57 0.53 0.51 0.57 0.54 0.72 0.74 0.66 0.50 0.61 0.66 0.56 0.71 0.64 0.70 0.57 0.56 0.43 0.71 0.71 0.61
MMGAT                 0.56 0.54 0.49 0.64 0.52 0.76 0.76 0.64 0.55 0.64 0.64 0.43 0.79 0.68 0.72 0.55 0.57 0.52 0.73 0.74 0.62
Table 5. The ablation experiment of MMGAT on different models.

Model                     F1 Micro   F1 Macro
MMGAT-mel-spectrogram     0.60       0.60
MMGAT-single              0.62       0.60
MMGAT-principal           0.62       0.60
MMGAT_without_center      0.62       0.61
MMGAT_graph               0.62       0.62
MMGAT                     0.63       0.62
Table 6. The ablation experiment on different similarity measures in MMGAT.

Model        F1 Micro   F1 Macro
MMGAT-Euc    0.60       0.60
MMGAT-L1     0.61       0.59
MMGAT-EMD    0.62       0.61
MMGAT        0.63       0.62
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
