Multi-Instance Multi-Scale Graph Attention Neural Net with Label Semantic Embeddings for Instrument Recognition
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
I would like to thank the authors for the quality of this work. The paper reads well and brings something new to the field of audio signal processing. The paper proposes a new model called MMGAT for instrument recognition in music. However, I have some comments and questions that need to be considered in order to improve the quality of your draft:
- The integration of graph structures and multi-scale attention effectively addresses the challenge of uneven instrument distribution in audio.
- The ablation studies and comparisons with diverse baselines (e.g., SVMs, CNNs, GAN/VAE-augmented models) provide robust empirical validation of MMGAT’s superiority.
- Why did you choose STFCF? Might CWT improve classification results because it provides better resolution than STFCF? Please plot the spectrograms!
- Onset types (soft/hard) are used for auxiliary labels. How were these labels assigned?
- Why was VGGish chosen over more recent models (e.g., PaSST, BEATs)?
- How does MMGAT manage graphs with varying node counts during training?
- For the t-SNE plots (Figure 2), were hyperparameters (perplexity, learning rate) kept consistent between MMGAT and AEDCN? How sensitive are the visual results to these choices?
Thank you,
Author Response
Dear reviewer,
Thank you for allowing a resubmission of our manuscript, with an opportunity to address the reviewers’ comments.
We are uploading (a) our point-by-point response to the comments (below) (response to the reviewer), (b) a clean updated manuscript without highlights (PDF main document).
Best regards,
< Jian Zhang > et al.
Comment 1: Why did you choose STFCF? Might CWT improve classification results because it provides better resolution than STFCF? Please plot the spectrograms!
Author response and actions:
Thank you very much for your valuable suggestion. After carefully reviewing all of your feedback, we realized that the ambiguity and inaccuracy in the expression of our paper were caused by the inappropriate organization and description in Section 3.1. Please allow us to provide an explanation and make the necessary revisions.
In this paper, when processing the audio data, we directly used the pre-trained VGGish model to extract features from each audio clip. The process involves loading the pre-trained VGGish model, providing the local path to our audio data, and the VGGish model directly outputs the features corresponding to the audio (for example, a 10-second audio clip results in a 10*128 feature matrix). We did not ourselves perform the conversion of the audio into a spectrogram. That material was only intended to explain how VGGish works internally, but it inadvertently caused confusion, so we have removed it in the revised version of the manuscript.
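For reference, a minimal sketch of this extraction step, assuming the publicly available torchvggish hub wrapper and a hypothetical file name; the exact loader used in our experiments may differ:

```python
import torch

# Load a pre-trained VGGish model (assumes the torchvggish hub wrapper is available).
model = torch.hub.load('harritaylor/torchvggish', 'vggish')
model.eval()

# The wrapper accepts a path to a local audio file and returns roughly one
# 128-d embedding per second of audio, e.g. ~10x128 for a 10-second clip.
embeddings = model.forward('example_clip.wav')  # hypothetical path
print(embeddings.shape)
```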
We greatly appreciate your suggestion. The method for constructing time-frequency features has been discussed in our previous work, and we kindly refer you to reference [6] for more details.
Comment 2: Onset types (soft/hard) are used for auxiliary labels. How were these labels assigned?
Author response and actions:
Thank you very much for your valuable suggestion. The introduction of auxiliary classification can be seen as an enhancement of the supervision available to the original classification model. Originally, the supervision only included the specific category of each instrument; the auxiliary classification adds extra category information to assist the classification process. This idea is inspired by the work in reference [10]. The auxiliary labels can be organized in various forms: one option is to use instrument families (such as percussion, strings, etc.) to assist classification and improve performance, while the approach adopted in this paper uses the onset type as the additional classification information. Reference [10] validated the effectiveness of the onset-based approach relative to family information. The auxiliary classification information is exploited within a multi-task framework.
Based on the classification method outlined in the literature, we divided the instrument onset types into hard onset and soft onset, as shown in the table below:
Instruments | Abbreviations | Auxiliary classes
accordion | acc | Soft onset
banjo | ban | Hard onset
bass | bas | Soft onset
cello | cel | Soft onset
clarinet | cla | Soft onset
cymbals | cym | Hard onset
drums | dru | Hard onset
flute | flu | Soft onset
guitar | gur | Hard onset
mallet_percussion | mal | Hard onset
mandolin | mad | Hard onset
organ | org | Soft onset
piano | pia | Hard onset
saxophone | sax | Soft onset
synthesizer | syn | Hard onset
trombone | tro | Hard onset
trumpet | tru | Hard onset
ukulele | ukulele | Hard onset
violin | vio | Soft onset
voice | voi | Other
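For illustration only, a minimal multi-task sketch of how such auxiliary onset-type labels can be combined with the principal multi-label instrument head; the dimensions and the loss weight below are assumptions, not the exact configuration used in the paper:

```python
import torch
import torch.nn as nn

class MultiTaskHead(nn.Module):
    """Shared trunk with a principal (instrument) head and an auxiliary (onset-type) head."""
    def __init__(self, in_dim=128, n_instruments=20, n_onset_types=3):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU())
        self.principal = nn.Linear(128, n_instruments)   # multi-label instrument logits
        self.auxiliary = nn.Linear(128, n_onset_types)   # soft / hard / other onset logits

    def forward(self, x):
        h = self.trunk(x)
        return self.principal(h), self.auxiliary(h)

model = MultiTaskHead()
bce, ce = nn.BCEWithLogitsLoss(), nn.CrossEntropyLoss()
feats = torch.randn(8, 128)                    # dummy clip-level features
y_inst = torch.randint(0, 2, (8, 20)).float()  # multi-label instrument targets
y_onset = torch.randint(0, 3, (8,))            # auxiliary onset-type targets
p_logits, a_logits = model(feats)
loss = bce(p_logits, y_inst) + 0.5 * ce(a_logits, y_onset)  # 0.5 is an assumed weight
```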
Comment 3: Why was VGGish chosen over more recent models (e.g., PaSST, BEATs)?
Author response and actions:
Thank you very much for your valuable suggestion. There are currently several pre-trained models available for use. The main reason we chose to use VGGish is for the sake of comparison, as many of the algorithms involved in our comparison are based on the VGG model. If we were to use a more effective pre-trained model, it would be difficult to determine whether the observed performance improvement is due to the pre-trained model itself or the algorithms we designed. We greatly appreciate your insightful suggestion.
Comment 4: How does MMGAT manage graphs with varying node counts during training?
Author response and actions:
Thank you very much for your valuable suggestion. In the revised paper, MMGAT directly calls the VGGish model to extract features from each audio clip. The process involves loading the pre-trained VGGish model, providing the local path to our audio data, and the pre-trained VGGish model directly outputs the corresponding features (for instance, a 10-second audio clip results in a 10*128 feature matrix). Therefore, the input to the MMGAT model is the set of features extracted by VGGish.
We directly use VGGish to map an audio clip of k seconds into a k*128-dimensional matrix, and then construct the corresponding instance graph using the k 128-dimensional vectors. The labels are mapped into word vectors using CLIP. In the subsequent recognition process, our attention model consists of 3 layers with 3 scales, where each layer has only 128-dimensional hidden features. Label correlation encoding is performed using an autoencoder constructed with a multi-layer neural network, with both the encoder and decoder consisting of two layers.
For the instance-related graph, we use an attention mechanism for computation. Each time, we calculate the features of the nodes, and the feature update for each node depends only on its neighboring set. The information aggregation weights are automatically adjusted through attention coefficients. As a result, this approach is not restricted by the structure of the entire graph, which is a key advantage brought by the attention mechanism.
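As an illustration of this point, a small sketch using PyTorch Geometric (an assumption for illustration; our implementation uses its own attention layers) showing that graphs with different node counts can be batched without padding, since attention is computed only over each node's neighbours:

```python
import torch
from torch_geometric.data import Data, Batch
from torch_geometric.nn import GATConv

# Two instance graphs of different sizes (e.g. a 10-second and a 7-second clip);
# the random edges stand in for the similarity-based edges described above.
g1 = Data(x=torch.randn(10, 128), edge_index=torch.randint(0, 10, (2, 40)))
g2 = Data(x=torch.randn(7, 128), edge_index=torch.randint(0, 7, (2, 28)))

# Batching keeps the graphs disjoint, so each node only aggregates its own
# neighbours; no padding to a fixed node count is needed.
batch = Batch.from_data_list([g1, g2])
gat = GATConv(in_channels=128, out_channels=128, heads=1)
out = gat(batch.x, batch.edge_index)   # shape: [17, 128]
```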
Comment 5: For the t-SNE plots (Figure 2), were hyperparameters (perplexity, learning rate) kept consistent between MMGAT and AEDCN? How sensitive are the visual results to these choices?
Author response and actions:
Thank you very much for your valuable suggestion. The comparison in Figure 2 was conducted primarily because both the MMGAT and AEDCN models utilize center loss, which aims to bring data points within the same class closer together while pushing apart data points from different classes. However, the effectiveness of this loss function is closely related to the separability of the features extracted by the model.
To make this comparison, we passed the same batch of data through the trained MMGAT and AEDCN models, extracted the features used for classification from both models, applied the t-SNE method for dimensionality reduction, and visualized the results, which produced Figure 2. Throughout this process, we made every effort to keep the comparison consistent.
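A minimal sketch of how such a like-for-like comparison can be set up with scikit-learn; the perplexity and learning-rate values below are placeholders rather than the exact settings used for Figure 2:

```python
import numpy as np
from sklearn.manifold import TSNE

def project(features, seed=0):
    # The same hyperparameters are reused for both models so that the visual
    # comparison is not driven by t-SNE settings.
    tsne = TSNE(n_components=2, perplexity=30, learning_rate=200,
                init='pca', random_state=seed)
    return tsne.fit_transform(features)

feats_mmgat = np.random.randn(500, 128)   # placeholder for MMGAT penultimate features
feats_aedcn = np.random.randn(500, 128)   # placeholder for AEDCN penultimate features
emb_mmgat, emb_aedcn = project(feats_mmgat), project(feats_aedcn)
```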
Finally, thank you very much for your valuable suggestion.
Author Response File: Author Response.doc
Reviewer 2 Report
Comments and Suggestions for Authors
The paper proposes a multi-instance multi-scale graph attention neural network for instrument recognition and music information retrieval. The NN designs an instance correlation graph to model the existence and qualitative timbre similarity of instruments in different positions from the viewpoint of multi-instance learning. According to the instance correlation graphs, the NN recognizes the instruments and features. The effectiveness is verified through comparison with other commonly used instrument recognition models. Results show that MMGAT outperforms existing approaches.
The paper needs substantial improvements for its publication:
1) The data preprocessing stage described in Section 3.1 might be enhanced with a block diagram. A visual representation of these procedures will clarify the description paragraphs.
2) It is not clearly defined what an instance correlation graph is. Perhaps Section 3.2 can present some application cases showing outputs of these graphs for different instruments, in order to know the patterns to be found by the NN.
3) The features used as input for the VGGish model have not been defined. This model is a well-known structure for transfer learning in audio classification frameworks. Perhaps a couple of references might be enough instead of Table 1.
4) The auxiliary classification stage of Figure 1 has not been clearly defined. Please indicate the main motivation for using this stage.
5) Perhaps a heatmap could enhance Table 5, in order to check the highest scores rather than the boldface types.
6) The introduction of the ablation experiment must be enhanced to explain what this stage does and how it differs from the previous classification task.
7) Figure 2 presents really poor quality, and no legend is present to indicate the labels represented by the colors. It requires an enhancement.
8) The t-SNE algorithm must be explained, including how dimensionality was reduced. Perhaps a diagram or an input-output illustration could be enough.
9) The discussion of the algorithms compared in Table 4 must be extended in terms of the features, level of complexity, number of hyperparameters (optional), and input features, in order to support the obtained results.
10) Perhaps joining similar instrument classes, such as percussion, strings, etc., can reduce the extension of Table 5 and show higher classification scores.
Author Response
Dear reviewer,
Thank you for allowing a resubmission of our manuscript, with an opportunity to address the reviewers’ comments.
We are uploading (a) our point-by-point response to the comments (below) (response to the reviewer), (b) a clean updated manuscript without highlights (PDF main document).
Best regards,
< Jian Zhang > et al.
Comment 1: Data preprocessing stage described in section 3.1 might be enhanced with a block diagram. The visual representation of these procedures will clarify the description paragraphs.
Author response and actions:
Thank you very much for your valuable suggestions. After carefully reviewing all of your feedback, we realized that due to inappropriate organization and description in Section 3.1, the paper contained ambiguities and inaccuracies. Please allow us to provide an explanation and make the necessary revisions.
In this paper, when processing the audio data, we directly used the VGGish model to extract the features for each audio sample. Specifically, we loaded the pre-trained VGGish model, provided the local paths of our audio data, and the pre-trained VGGish model directly output the features corresponding to the audio (for example, a 10-second audio clip generates a 10x128 feature matrix). We did not directly go through the process of converting audio data into spectrograms. The intention behind mentioning this was to explain the workings of VGGish, but this led to ambiguity in the content of the paper. Therefore, we have removed this section in the revised version. The method for constructing time-frequency features is not the focus of this paper, so we have simplified this part of the content. We appreciate your suggestion, and we have discussed the method for constructing time-frequency features in our previous work. Please refer to reference [6] for further details.
Comment 2: It is not clearly defined what is an instance correlation graph. Perhaps section 3.2 can present some application cases to show some outputs of this graphs from different instruments in order to know the patterns to be found by the NN
Author response and actions:
Thank you very much for your valuable suggestions. As mentioned earlier, during the instrument recognition process, we directly used the pre-trained VGGish model to extract features from the audio data. A 10-second audio clip, after feature extraction with VGGish, produces a 10x128 feature matrix, where 10 represents the 10 seconds, with each second being treated as an audio segment. This segment is mapped by VGGish into a 128-dimensional feature. These 10 128-dimensional feature vectors correspond to the instrument label information, but we do not know which specific feature among these 10 corresponds to the instrument that needs to be recognized. From this perspective, we can treat these 10 features as 10 instances in a multi-instance, multi-label learning framework. These instances have temporal dependencies, as the delay characteristics of the same instrument can vary, which may be reflected in the correlations between instances. Therefore, we treat each 128-dimensional audio feature as an instance and construct an instance correlation graph.
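As a small sketch of this construction (the similarity threshold and weighting scheme below are illustrative assumptions, not the exact values used in the paper):

```python
import numpy as np

def instance_graph(X, threshold=0.5):
    """Build an instance correlation graph from a k x 128 VGGish feature matrix.

    Nodes are the per-second instances; edge weights are cosine similarities,
    kept only when they exceed an assumed threshold.
    """
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-8)
    S = Xn @ Xn.T                      # pairwise cosine similarity, shape (k, k)
    A = np.where(S >= threshold, S, 0.0)
    np.fill_diagonal(A, 0.0)           # no self-loops in this sketch
    return A

X = np.random.randn(10, 128)           # stand-in for VGGish features of a 10-s clip
adjacency = instance_graph(X)
```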
Comment 3: The features used as input for the VGGish model have not been defined. This model is a well-known structure for using transfer-learning in audio classification frameworks. Perhaps a couple of references might be enough instead of table 1.
Author response and actions:
Thank you very much for your valuable suggestions. As mentioned earlier, when processing the audio data, we directly used the VGGish model to extract features for each audio sample. In this process, we loaded the pre-trained VGGish model, with the input being each segment of audio data and the output being the corresponding feature of the audio. In the revised version, we have removed Table 1.
Comment 4: The auxiliary classification stage of Figure 1 has not been clearly defined. Please indicate what is the main motivation for using this stage.
Author response and actions:
Thank you very much for your valuable suggestions. Essentially, the introduction of auxiliary classification can be seen as an enhancement of the original classification model by incorporating additional supervisory information. Initially, the supervision only contained the specific category of each instrument, but with the introduction of auxiliary classification, extra category information is added to assist with the classification process. This idea is derived from the work in reference [10]. The labels for auxiliary classification can be organized in various ways. For example, one approach is to use the instrument family, where family information (such as percussion instruments, orchestral instruments, etc.) is introduced to aid classification and improve performance. Another approach, which we adopted in this paper, is to use the onset type (i.e., the excitation method) as the auxiliary classification information. Reference [10] validated the effectiveness of this latter approach relative to family information. The use of auxiliary classification information is implemented within a multi-task learning framework.
Comment 5: Perhaps a heatmap could enhance table 5, in order to check the highest scores rather than the boldface types
Author response and actions:
Thank you very much for your valuable suggestion. We have provided the corresponding heatmap after Table 5 to more intuitively display the classification results.
Comment 6: The introduction of the ablation experiment must be enhanced to explain what this stage does and how it differs from the previous classification task.
Author response and actions:
Thank you very much for your valuable suggestion. We apologize for the lack of clarity in our previous explanation. In this revised version, we have rewritten the introduction to the ablation experiment as follows:
The MMGAT architecture integrates three core innovations: an instance correlation graph for modeling polyphonic interaction patterns, a label correlation encoding, and a multi-scale graph attention network for hierarchically aggregating spectral-temporal features. In this ablation experiment, we first discuss the role of each component, followed by a discussion of the similarity metrics used in the experiments. Finally, we conduct a visualization analysis of the features obtained by MMGAT. The MMGAT defined on mel-spectrogram is referred to as MMGAT-mel-spectrogram (which uses the same 158-dimensional features as the input in Reference [10]), and the MMGAT without multi-scale attention is referred to as MMGAT-single. For comparison, we also use MMGAT on the constructed instance correlation graph without the auxiliary classification in the experiments. This model is denoted as MMGAT-principal. The MMGAT that does not use label correlation embeddings is denoted as MMGAT_graph.
Comment 7: Figure 2 presents really poor quality, and no legend is present to indicate the labels represented by the colors. It requires an enhancement.
Author response and actions:
Thank you very much for your valuable suggestion. We apologize for not clearly explaining the role and origin of Figure 2 in our paper. This figure shows the visualization of the features obtained from layer L of MMGAT, prior to the classification layer, after dimensionality reduction. First, we extracted the features corresponding to a batch of samples from layer L. However, the feature dimensions are quite high, making it difficult to visualize directly, so we used the t-SNE method. t-SNE works by preserving the relationships between similar data points in high-dimensional space and mapping them into a lower-dimensional space as effectively as possible. The reason we introduced center loss is to encourage the class clusters to be as far apart as possible during classification, which helps to make the classification process easier. In this paper, we introduced two center losses, and compared to the AEDCN model, which also uses two center losses, our proposed MMGAT shows more significant results.
Comment 8: The t-SNE algorithm must be explained, including how dimensionality was reduced. Perhaps a diagram or an input-output illustration could be enough.
Author response and actions:
Thank you very much for your valuable suggestion. We have added the following explanation:
t-SNE (t-Distributed Stochastic Neighbor Embedding) is a dimensionality reduction technique suitable for visualizing high-dimensional data. Its core idea is to map high-dimensional data points to a lower-dimensional space while preserving the relationships between similar data points as much as possible. In t-SNE, the similarity between data points in high-dimensional space is first calculated, and then a distribution is sought in the lower-dimensional space that matches the similarity between data points as closely as possible.
The specific process of t-SNE involves two main steps:
- Calculating similarity in high-dimensional space: The similarity between each pair of data points is measured by calculating the Euclidean distance, with a Gaussian distribution used to represent the similarity. Similar data points will have smaller distances.
- Mapping to lower-dimensional space: t-SNE arranges points in the lower-dimensional space so that similar points remain close together, while points that are far apart in high-dimensional space are kept at a distance. This arrangement is optimized by minimizing the Kullback–Leibler (KL) divergence between the two similarity distributions.
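For completeness, the standard t-SNE quantities involved in these two steps, where x denotes a high-dimensional point, y its low-dimensional image, and p_ij the symmetrized pairwise similarity:

```latex
% High-dimensional similarities (Gaussian kernel) and their symmetrization:
p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j\rVert^2 / 2\sigma_i^2\right)}
               {\sum_{k \neq i} \exp\left(-\lVert x_i - x_k\rVert^2 / 2\sigma_i^2\right)},
\qquad
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}

% Low-dimensional similarities (Student-t kernel) and the KL objective minimized by t-SNE:
q_{ij} = \frac{\left(1 + \lVert y_i - y_j\rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l\rVert^2\right)^{-1}},
\qquad
C = \mathrm{KL}(P \,\Vert\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
```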
Comment 9: The discussion of the algorithms compared in Table 4 must be extended in terms of the features, level of complexity, number of hyperparameters (optional), and input features, in order to support the obtained results.
Author response and actions:
Thank you very much for your valuable suggestions. Table 3 systematically benchmarks instrument recognition methodologies across three paradigms: feature-engineered systems, spectrogram-driven neural architectures, and synthetic data-augmented frameworks. Traditional approaches include Bosch et al.'s SVM classifier with handcrafted features and MTF-DNN's shallow neural network applied to engineered features, while modern architectures such as Audio DNN demonstrate the superiority of raw audio CNNs over conventional methods. These instrument recognition models, based on the aforementioned techniques, utilize spectrograms as inputs for deep neural networks. These models are relatively simple, but their performance in instrument recognition is limited.
The mel-spectrogram-optimized ConvNet introduces temporal max-pooling, later extended by Multi-task ConvNet with auxiliary classification to refine labels. These two models use spectrograms and enhanced spectrograms as inputs and train the model using VGG neural networks. The complexity of these models is related to the structure of the neural networks, specifically the complexity of the VGG network. Data augmentation strategies diverge into generative and latent space approaches: WaveGAN ConvNet and Voting-Swin-T both synthesize training data via WaveGAN but diverge in using CNNs versus transformer-based recognition, while VAE-Augmentation ConvNet leverages variational autoencoders for Melody-solos-DB latent space interpolation. Advanced hybrid systems include the Staged-trained ConvNet’s cross-dataset curriculum learning and AEDCN’s adversarial ACEVAE architecture embedded in fully connected layers for joint data-label augmentation. These models all incorporate generative models, and their complexity is not only related to the discriminative model but also to the generative model, which typically uses multi-layer VAE and GAN structures. The discriminative models utilize VGG convolutional networks and attention-based neural networks. The input for these models is either spectrograms or enhanced spectrograms.
Notably, the proposed MMGAT achieves state-of-the-art performance, surpassing all neural and augmentation-enhanced baselines in Table 3 through multi-modal polyphonic interaction modeling. MMGAT directly utilizes the VGGish model to extract features from each audio input: we load the pre-trained VGGish model, provide the local path of our audio data, and the pre-trained VGGish model directly outputs the features corresponding to the audio (for a 10-second audio clip, a 10*128 feature matrix is generated). Therefore, the input to the MMGAT model consists of the features extracted by VGGish. VGGish maps an audio clip of length k seconds to a k*128-dimensional matrix, and the corresponding instance graph is then constructed from the k 128-dimensional vectors. CLIP is used to map labels to word vectors. In the subsequent recognition process, we build an attention model with 3 layers and 3 scales, each with only 128-dimensional hidden features. Label correlation encoding is achieved using an autoencoder constructed with a multi-layer neural network, where the encoder and decoder have two layers each. As a result, the overall parameter count of the network is relatively small, enabling faster training.
Comment 10: Perhaps joining similar instrument classes, such as percussions, strings, etc can reduce the extension of table 5 to show higher classification scores
Author response and actions:
Thank you very much for your valuable suggestion. In our view, the task of instrument recognition aims to identify the specific instrument categories present in an audio clip, so we tested the model's performance in recognizing each category. However, in polyphonic music, the simultaneous playing of different instruments causes their signals to overlap, making it challenging to distinguish which instruments are playing, whether in the time domain or the frequency domain. As a result, the recognition performance is not ideal. We attempted to incorporate instrument family information to improve recognition, but our experiments showed that the performance gain from family information was smaller than the improvement achieved by introducing onset (excitation) information. Therefore, we introduced the auxiliary and principal classes, as shown in Table 2, in an attempt to enhance the instrument recognition performance. Thank you again for your suggestion.
Author Response File: Author Response.doc
Reviewer 3 Report
Comments and Suggestions for Authors
This paper introduces an approach to musical instrument recognition using a Multi-Instance Multi-scale Graph Attention Neural Network (MMGAT). The model is designed to address the challenges posed by polyphonic music, where multiple instruments may overlap in time and exhibit similar timbral characteristics. The authors propose a graph-based representation of temporal audio segments combined with multi-scale attention mechanisms to better capture relationships among instruments.
The methodology is innovative in its use of instance correlation graphs and hybrid similarity metrics, such as cosine similarity, to model complex interactions between instruments. The empirical results show consistent improvements over established baselines including WaveGAN-ConvNet, VAE-based CNNs, and AEDCN. The authors’ attempts to handle variable-length audio inputs, overlapping instruments, and subtle timbral differences are particularly noteworthy.
Despite these strengths, several limitations need to be addressed. The related work section, although comprehensive, lacks critical synthesis; it reads more like a catalog of existing methods than an analytical discussion. Additionally, the writing occasionally suffers from redundancy, and while the ablation studies are informative, the overall novelty of the contribution appears somewhat incremental, as the proposed architecture builds upon well-established models such as GAT and VGGish.
Further areas for improvement include:
- A lack of discussion on computational efficiency and model complexity.
- Limited exploration of model robustness across diverse musical genres and recording conditions.
- Insufficient analysis of attention mechanism interpretability.
- An opaque description of dataset construction, particularly the integration of the OpenMIC and IRMAS datasets, which would benefit from clearer justification and methodological transparency.
In summary, this is a promising submission with solid experimental validation and relevance to the field of music information retrieval (MIR). However, it requires major revisions, including a clearer articulation of contributions, stronger justification of novelty, and tighter organization, before it can be considered for acceptance.
Questions for the authors
1. How does your model mitigate the risk of overfitting introduced by synthetic combinations in the OpenMIC-IRMAS dataset? A more detailed explanation of how the training strategy ensures generalization—especially when dealing with artificially mixed data—is needed.
2. Can you provide a more explicit justification of the novelty of MMGAT compared to existing multi-instance attention and graph neural network approaches? Particularly in relation to transformer-based architectures, what specific design choices make MMGAT distinct or superior?
3. What is the computational cost of MMGAT relative to other models in terms of training time, inference speed, or number of parameters? Including such comparisons would enhance the practical value of the proposed method.
4. How does the model perform across different music genres or under varying levels of noise? Assessing performance on diverse or degraded audio could help establish its real-world applicability and robustness.
5. To what extent are the learned attention weights interpretable in identifying key instrumental cues? Can the model highlight which time frames or frequency bands were most influential in recognizing specific instruments?
6. What motivated the choice of specific similarity metrics (e.g., cosine similarity, EMD) in the graph construction phase? What trade-offs were observed across different metrics, and were alternatives explored and discarded?
Author Response
Dear reviewer,
Thank you for allowing a resubmission of our manuscript, with an opportunity to address the reviewers’ comments.
We are uploading (a) our point-by-point response to the comments (below) (response to the reviewer), (b) a clean updated manuscript without highlights (PDF main document).
Best regards,
< Jian Zhang > et al.
Comment 1: How does your model mitigate the risk of overfitting introduced by synthetic combinations in the OpenMIC-IRMAS dataset? A more detailed explanation of how the training strategy ensures generalization—especially when dealing with artificially mixed data—is needed.
Author response and actions:
Thank you very much for your valuable suggestions. In constructing this dataset, we adopted a strategy of combining the training sets and the test sets separately. The OpenMIC dataset itself consists of 10-second music clips, with both its training and testing sets being multi-label. However, since all audio clips in OpenMIC have a fixed length, it is difficult to directly assess the algorithm's adaptability to music data of varying lengths. Therefore, we aimed to create a more comprehensive dataset in which each sample has a variable length. The IRMAS dataset is a commonly used instrument recognition dataset, with single-label training data and multi-label test data. When combining the training sets, to avoid overfitting, we only combined IRMAS training samples with OpenMIC training samples of the same category. For instance, training samples with instrument label A from IRMAS were only combined with samples carrying the same instrument label A from the OpenMIC training set. Each combination was done by appending to the original data, which, from a musical perspective, is like adding a new passage of the same instrument to an existing audio clip. This operation did not introduce noise into the training set. Our test set is a combination of the OpenMIC and IRMAS test data; here we did not perform any concatenation, but rather directly pooled the test samples from both datasets. As a result, a model trained on this dataset is effectively tested for its generalization ability across both the OpenMIC and IRMAS test sets.
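A minimal sketch of this label-matched concatenation, with hypothetical file names; it assumes mono clips at a common sample rate:

```python
import numpy as np
import soundfile as sf

def concat_same_label(openmic_path, irmas_path, out_path):
    """Append an IRMAS training clip to an OpenMIC clip that shares the same
    instrument label, i.e. extend the clip with another passage of the same
    instrument rather than mixing the two signals."""
    y1, sr1 = sf.read(openmic_path)
    y2, sr2 = sf.read(irmas_path)
    assert sr1 == sr2, "resample first if sample rates differ"
    sf.write(out_path, np.concatenate([y1, y2]), sr1)

# Hypothetical paths; pairing is done per instrument label (e.g. both labelled 'cello').
concat_same_label('openmic_cello_001.wav', 'irmas_cel_014.wav', 'combined_cello.wav')
```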
Comment 2: Can you provide a more explicit justification of the novelty of MMGAT compared to existing multi-instance attention and graph neural network approaches? Particularly in relation to transformer-based architectures, what specific design choices make MMGAT distinct or superior?
Author response and actions:
Thank you very much for your valuable suggestions. In the revised manuscript, we focus on the instrument recognition task for polyphonic music, where the music is multi-label and the labels exhibit certain correlations. This is because musical performances exhibit relatively fixed instrument collaboration patterns under different styles and themes, which manifest as correlations between instrument labels. Therefore, this paper proposes a label correlation-based multi-instance multi-scale graph attention neural network (MMGAT) for instrument recognition. MMGAT designs an instance correlation graph to model the existence and quantitative timbre similarity of instruments in different positions from the perspective of multi-instance learning. To capture the collaborative patterns of instrument performance, MMGAT constructs a correlation embedding of instrument labels and designs an instance-based multi-instance multi-scale graph attention neural network to recognize different instruments based on the instance correlation graphs and the features of label correlations. The MMGAT proposed in this paper differs from transformer-based architectures. MMGAT uses a pre-trained VGGish model to extract audio features and CLIP to extract semantic features of the labels. As a result, thanks to the pre-trained models, the number of multi-scale attention layers used in this paper is much smaller than in transformer-based architectures. This leads to a lower computational complexity in our approach. The key to improving MMGAT's performance in instrument recognition tasks lies in the introduction of the instance graph, label correlation, and multi-scale attention.
Comment 3: What is the computational cost of MMGAT relative to other models in terms of training time, inference speed, or number of parameters? Including such comparisons would enhance the practical value of the proposed method.
Author response and actions:
Thank you very much for your valuable suggestions. In fact, if the training process for both audio data to audio feature extraction and the embedding of label word vectors were included, the model's complexity would be extremely high. However, in this paper, we have used pre-trained VGGish and CLIP models for these two computationally intensive parts, eliminating the need for training. We directly use VGGish to map audio with a duration of k seconds into a k×128-dimensional matrix and then construct the corresponding instance graph using the k 128-dimensional vectors. CLIP is used to map labels to word vectors. In the recognition process that follows, our attention model consists of three layers with three scales, and each layer has only 128-dimensional hidden features. The label correlation encoding is performed using an autoencoder built with a multi-layer neural network, where both the encoder and decoder consist of two layers. As a result, the total number of parameters in the network is not large, leading to relatively fast training. However, the inference process requires calling pre-trained models for feature mapping and constructing the graph structure, which means that the inference speed does not have an advantage compared to other algorithms.
Comment 4: How does the model perform across different music genres or under varying levels of noise? Assessing performance on diverse or degraded audio could help establish its real-world applicability and robustness.
Author response and actions:
Thank you very much for your valuable suggestions. The dataset we used contains music from different styles, spanning a considerable period of time, which results in significant variations in recording quality. However, we did not initially consider the model's tolerance to noise. To address this, we decided to introduce noise and test its impact on model performance. We downloaded a segment of Gaussian white noise audio typically used for "HIFI burn-in" purposes and input this audio into VGGish to convert it into a 128-dimensional feature vector. After normalizing the feature vector, we added it to the dataset with probabilities of 10% and 20%, respectively, and tested the model's performance. For a fair comparison, we applied the same approach to introduce noise into the Voting-Swin-T and AEDCN models. For the convolution-based model Multi-task ConvNet, we mapped the Gaussian noise to the 158-dimensional time-frequency features mentioned in the paper, then normalized and introduced it into the time-frequency features of normal music data with probabilities of 10% and 20%. The model performance is as follows:
F1 score | 10% noise | 20% noise | No noise
Voting-Swin-T | 0.58 | 0.56 | 0.60
Multi-task ConvNet | 0.58 | 0.54 | 0.61
AEDCN | 0.60 | 0.56 | 0.61
MMGAT | 0.61 | 0.58 | 0.63
From the results in the table, we observe that although MMGAT outperforms the other comparison models under every noise condition, the drop in its F1 score caused by noise is similar to that of the baselines, so it does not demonstrate stronger resistance to noise.
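For clarity, a small sketch of the injection procedure described above; whether the noise embedding is added to or substituted for an instance is an implementation detail, and additive injection is assumed here:

```python
import numpy as np

def inject_noise(X, noise_vec, p=0.1, rng=None):
    """Add a normalised noise embedding to each instance with probability p.

    X is a k x 128 VGGish feature matrix for one clip; noise_vec is the 128-d
    embedding of the Gaussian white-noise recording.
    """
    rng = rng or np.random.default_rng(0)
    n = noise_vec / (np.linalg.norm(noise_vec) + 1e-8)
    mask = rng.random(X.shape[0]) < p
    X_noisy = X.copy()
    X_noisy[mask] += n
    return X_noisy

X = np.random.randn(10, 128)   # stand-in for clean VGGish features
noise = np.random.randn(128)   # stand-in for the white-noise embedding
X_10 = inject_noise(X, noise, p=0.10)
X_20 = inject_noise(X, noise, p=0.20)
```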
Comment 5: To what extent are the learned attention weights interpretable in identifying key instrumental cues? Can the model highlight which time frames or frequency bands were most influential in recognizing specific instruments?
Author response and actions:
Thank you very much for your valuable suggestions. In fact, the frequency-domain information is primarily contained in the 128-dimensional vectors mapped from VGGish. Therefore, the method proposed in this paper cannot effectively highlight which frequency bands are critical for instrument recognition. Each piece of audio is divided in chronological order, with a k-second audio segment being mapped by VGGish into k 128-dimensional features. If we introduced a mask matrix into the processing of these k 128-dimensional features and retained only the most strongly activated vectors during computation, we could identify which time slices are the key instances for instrument recognition. However, this approach reduced the overall performance of the instrument recognition task, which is why we did not report this result in the revised version.
Comment 6: What motivated the choice of specific similarity metrics (e.g., cosine similarity, EMD) in the graph construction phase? What trade-offs were observed across different metrics, and were alternatives explored and discarded?
Author response and actions:
Thank you very much for your valuable suggestions. To be frank, we determined which similarity metric works better through experience and experimentation. Intuitively, our approach to processing audio data involves transforming the audio into 128-dimensional vectors using VGGish. Since the cosine similarity function is the most commonly used method to measure the similarity between vectors, we initially chose cosine similarity. However, when using the feature vectors obtained from VGGish for classification tasks, these vectors can also be viewed as modeling the data distribution from a probabilistic perspective. The Earth Mover's Distance (EMD) is an effective way to measure the distance between probabilities, so we introduced EMD as a metric for measurement and computation. Ultimately, the choice of cosine similarity was validated through experimental results.
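As a rough sketch of the two metrics we compared; the step that turns an embedding into a non-negative histogram for the EMD computation is an assumption made for illustration and may differ from the formulation in the paper:

```python
import numpy as np
from scipy.stats import wasserstein_distance

def cosine_similarity(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def emd_distance(u, v):
    # Treat each embedding as a histogram over its 128 dimensions: take absolute
    # values, normalise to sum to one, then compute the 1-D earth mover's distance.
    pu = np.abs(u); pu = pu / pu.sum()
    pv = np.abs(v); pv = pv / pv.sum()
    bins = np.arange(len(u))
    return wasserstein_distance(bins, bins, u_weights=pu, v_weights=pv)

u, v = np.random.randn(128), np.random.randn(128)
print(cosine_similarity(u, v))   # higher means more similar
print(emd_distance(u, v))        # lower means more similar
```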
Author Response File: Author Response.doc
Round 2
Reviewer 2 Report
Comments and Suggestions for Authors
Thanks to the authors for answering all the provided comments. I still have some concerns about the instance correlation map. Although in the point-by-point responses the authors commented that the features are directly extracted as embeddings from the VGGish network, it is still not clear in the manuscript what exactly an instance correlation map is. Perhaps a little explanation might be enough (Section 3.1). Perhaps the second paragraph could start differently to describe this idea.
Author Response
Comment:
In this paper, we use graph structures to represent the correlations between audio segments. Specifically, we use VGGish to map an audio clip into multiple audio features. Suppose an audio clip is 10 seconds long; VGGish converts the audio for each second into a 128-dimensional feature vector. Thus, a 10-second audio clip is mapped into 10*128-dimensional features, with each feature representing an instance. We then construct an instance correlation graph based on these instances. Each node in the graph represents an instance, and the node features are the 128-dimensional features of the instance. The edges of the instance correlation graph are computed based on the similarity between instances, thus forming the instance correlation graph.
Res:
Thank you for your comments.
In this version, we have made some adjustments to the phrasing of the paper. This paper uses label information as embeddings to improve instrument recognition performance. The label information used not only contains the correlation information of the labels but also includes the semantic features of the labels themselves. Therefore, we have changed "label correlation embedding" to "label semantic embedding" to more accurately describe the algorithm presented in this paper. Based on this, we have also revised sections such as the Abstract.
Regarding the construction of the instance correlation graph, we have made the following adjustments:
To build this instance correlation graph, we extract features from the spectrograms using a pre-trained deep neural network called VGGish, which produces a 128-dimensional feature representation for every 1-second music clip. Based on this VGGish model, every music audio is transformed into an n*128-dimensional feature matrix, where n is the duration of the music in seconds, and this matrix has a corresponding label vector with multiple activated label components. We use a graph structure to represent the correlations between these audio segments: each of the n rows of the feature matrix is treated as an instance, and each instance becomes a node of the instance correlation graph, with the 128-dimensional feature of the instance as the node feature, so every graph has n nodes, equal to the number of clips in the corresponding music audio. The edge between nodes (i, j) is built from the corresponding weighted similarity of instances i and j in the feature matrix, thus forming the instance correlation graph.
Reviewer 3 Report
Comments and Suggestions for Authors
- Make a separate section called Discussion
- The conclusion is too short and should be enriched
Author Response
Comment:
Make a separate section called Discussion
The conclusion is too short and should be enriched
Res:
Thank you for your comments.
In this version, we have made some adjustments to the phrasing of the paper. This paper uses label information as embeddings to improve instrument recognition performance. The label information used not only contains the correlation information of the labels but also includes the semantic features of the labels themselves. Therefore, we have changed "label correlation embedding" to "label semantic embedding" to more accurately describe the algorithm presented in this paper. Based on this, we have also revised sections such as the Abstract.
For the conclusions and outlooks, we have made the following adjustments:
In this paper, we have addressed the challenge of instrument recognition in polyphonic music by proposing a novel model called the multi-instance multi-scale graph attention neural network (MMGAT) with label semantic embeddings. Traditional models struggle with issues such as the uneven distribution of instruments across tracks and signal overlap in polyphonic music, which reduces the distinguishability of features and impacts classification accuracy. MMGAT overcomes these challenges by utilizing instance correlation graphs to model the timbre similarities of instruments and by incorporating label semantic embeddings into the feature set. The experimental results demonstrate that MMGAT significantly outperforms existing instrument recognition models, offering a more robust and accurate solution for identifying instruments in complex music tracks.
Looking ahead, future work will focus on further enhancing MMGAT’s performance by integrating advanced feature extraction methods, such as deep audio embeddings, to capture even more subtle characteristics of instruments. Additionally, the model could be expanded to handle a broader range of musical compositions, including those with more simultaneous instruments and varying audio qualities. Another promising direction is to apply MMGAT to real-time instrument recognition in live music settings, potentially opening up applications in music production and performance analysis. Moreover, incorporating contextual information, such as musical scores, lyrics, or even the broader cultural context of the music, could further refine the model’s accuracy and applicability across diverse musical genres.