Review Reports
- Lekshmi Chandrika Reghunath1,
- Rajeev Rajan2 and
- Cristian Randieri3,4,*
- et al.
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Reviewer 4: Anonymous
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
(1) List the musical devices that need to be classified in a table.
(2) The position of Figure 1 is reversed.
(3) In line 105 on page 4, there is a reference citation problem about [xxx].
(4) The datasets used are not clearly introduced in the paper. They should be complemented with illustrations.
(5) The computational complexity of the model used should be analyzed for comparison.
(6) Most of the abbreviations have no definitions. It is suggested to provide a list of abbreviations.
Author Response
Comment 1: List the musical devices that need to be classified in a table.
Response: Thank you for this suggestion. We have now added a comprehensive table (Table 1, page 6, line 144) listing all instrument classes considered in this study, corresponding to the IRMAS dataset categories. The table includes the instrument name, abbreviation, and number of samples per class, providing a clearer overview of the classification task.
Comment 2: The position of Figure 1 is reversed.
Response: We appreciate the reviewer’s observation. The figure orientation was intentionally retained to preserve visual clarity. When arranged in the conventional left-to-right order, the functional modules and directional flow became visually congested, making it difficult to identify each block in the dual-branch architecture. Therefore, the reversed layout was deliberately chosen to make the interaction between the Mel-spectrogram and binary-mask branches easier to interpret. Nevertheless, in the revised manuscript, Figure 1 has been redrawn with higher resolution and improved visual clarity. The text labels, block boundaries, and directional arrows have been enlarged and color-adjusted for better readability.
Comment 3: In line 105 on page 4, there is a reference citation problem about [xxx].
Response : We thank the reviewer for noticing this. The citation issue was due to a formatting oversight and has been corrected in the revised manuscript. The correct reference is now properly linked, and all other citations have been cross-checked for accuracy and consistency.
Comment 4: The datasets used are not clearly introduced in the paper. They should be complemented with illustrations.
Response: We agree with this observation. In the revised Section 3.1 (Dataset and Preprocessing), page 5, lines 138-145, we have expanded the dataset explanation to include:
- Structure and instrument categories of the IRMAS dataset,
- Number of training and testing samples per class,
- Sampling rate and audio duration specifications, and
- Example Mel-spectrogram illustrations for selected instruments.
These additions enhance the clarity of our experimental setup and provide a complete understanding of the dataset characteristics.
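For readers who wish to reproduce the preprocessing, a minimal sketch of the Mel-spectrogram extraction is given below. It assumes a Python/librosa pipeline; the parameter values (number of Mel bands, FFT size, hop length) and the file path are illustrative assumptions rather than the exact settings reported in the manuscript.

```python
# Minimal sketch of Mel-spectrogram extraction for IRMAS excerpts.
# Parameter values and the example path are illustrative assumptions.
import librosa
import numpy as np

def mel_spectrogram(path, sr=44100, n_mels=128, n_fft=2048, hop_length=512):
    """Load an audio excerpt and return a log-scaled Mel-spectrogram."""
    y, sr = librosa.load(path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)  # log compression for CNN input

# Example (hypothetical file path):
# spec = mel_spectrogram("IRMAS-TrainingData/vio/example.wav")
# print(spec.shape)  # (n_mels, time_frames)
```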
Comment 5: The computational complexity of the model used should be analyzed for comparison.
Response: Thank you for highlighting this important aspect. We have now updated Table 7 (page 16, lines 424-428) to include both the model parameter count and inference time for the proposed model, as well as for other architectures including CNN and Transformer-based models. Additionally, a detailed computational analysis has been incorporated in Section 6.5, where we compare the efficiency and complexity of our proposed architecture against these baseline models. This addition provides a clearer understanding of the trade-offs between model performance and computational cost.
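For context, the two quantities added to Table 7 can be measured with a short script such as the sketch below, assuming a PyTorch implementation; `model` and the input shapes are placeholders, not the authors' actual code.

```python
# Sketch of measuring parameter count and average inference time for a
# dual-branch model; `model`, input shapes, and run count are placeholders.
import time
import torch

def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

@torch.no_grad()
def average_inference_time(model, mel, mask, n_runs=100):
    model.eval()
    start = time.perf_counter()
    for _ in range(n_runs):
        _ = model(mel, mask)  # assumed dual-branch forward pass
    return (time.perf_counter() - start) / n_runs

# Example with dummy inputs (batch=1, 1 channel, 128 Mel bins, 130 frames):
# mel  = torch.randn(1, 1, 128, 130)
# mask = torch.randn(1, 1, 128, 130)
# print(count_parameters(model) / 1e6, "M params")
# print(average_inference_time(model, mel, mask) * 1e3, "ms per clip")
```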
Comment 6: Most of the abbreviations have no definitions. It is suggested to provide a list of abbreviations.
Response : We appreciate the reviewer’s valuable suggestion. In the revised manuscript, all abbreviations are clearly defined upon their first occurrence in the text. Additionally, a comprehensive “List of Abbreviations” has been included at the beginning of the paper to enhance readability and ensure clarity for the readers.
Reviewer 2 Report
Comments and Suggestions for Authors
This paper proposes a method for recognizing the predominant instrument in multi-instrument polyphonic music, based on a dual-branch convolutional neural network and a cross-attention fusion mechanism. By jointly learning Mel-spectrogram and pitch-mask features, efficient and high-precision classification performance is achieved. However, the following issues remain:
- Although the paper mentions 'cross-attention fusion', there is no systematic comparison of its differences from and innovations over traditional attention mechanisms such as self-attention and multimodal attention. Therefore, the theoretical improvement points should be clearly defined in the methodology section.
- Figure 1 is not clear enough.
- The core cross-attention fusion process is only described in text, lacking the mathematical formulas for the Q, K, V mappings and the weighted attention computation, which makes it difficult to verify the reproducibility of the model mechanism.
- The experiment is based only on the IRMAS dataset and lacks transfer validation on other publicly available music datasets, so the model's generalization ability cannot be demonstrated.
- The code and training script have not been made public, and the hyperparameter settings have not been specified, making it difficult to ensure the reproducibility of the experiment.
- Table 4 mentions the advantage in parameter count, but does not provide engineering indicators such as training time, GPU memory usage, or inference latency, so the lightweight value of the model cannot be demonstrated.
- The outlook for future work is vague. The conclusion's suggested direction of "combining Transformer or RNN" is too broad; further guidance could be provided on how to introduce cross-modal attention optimization in temporal modeling.
Author Response
Comment 1: Although the paper mentions 'cross-attention fusion', there is no systematic comparison of its differences from and innovations over traditional attention mechanisms such as self-attention and multimodal attention. Therefore, the theoretical improvement points should be clearly defined in the methodology section.
Response: We sincerely thank the reviewer for this insightful comment. In the revised version, we have expanded the Methodology section (Section 4.2, page 8, line 238) to include a clear theoretical comparison between the proposed cross-attention fusion and traditional attention mechanisms, namely self-attention and multimodal attention.
Specifically:
- Self-attention operates within a single feature domain, modeling intra-modal dependencies.
- Multimodal attention aligns features between modalities but typically uses concatenation or co-attention.
- In contrast, our cross-attention fusion explicitly learns mutual feature relevance between the Mel-spectrogram and pitch-mask branches, allowing the model to focus on pitch-salient spectral regions and improving discriminability for overlapping instruments.
This theoretical clarification highlights the novelty and motivation behind our design and differentiates our method from existing attention-based fusion strategies.
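As an illustration of this distinction, a minimal sketch of a unidirectional cross-attention fusion layer is shown below, with Mel-branch features as queries and pitch-mask-branch features as keys and values, as described in the paper. The layer sizes, single-head design, and final concatenation step are assumptions for illustration, not the exact architecture used in the manuscript.

```python
# Sketch of unidirectional cross-attention fusion: Mel features act as
# queries, pitch-mask features as keys/values. Dimensions are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttentionFusion(nn.Module):
    def __init__(self, dim_mel, dim_mask, dim_attn):
        super().__init__()
        self.q = nn.Linear(dim_mel, dim_attn)    # queries from the Mel branch
        self.k = nn.Linear(dim_mask, dim_attn)   # keys from the pitch-mask branch
        self.v = nn.Linear(dim_mask, dim_attn)   # values from the pitch-mask branch
        self.scale = dim_attn ** -0.5

    def forward(self, mel_feats, mask_feats):
        # mel_feats: (batch, T, dim_mel); mask_feats: (batch, T, dim_mask)
        Q, K, V = self.q(mel_feats), self.k(mask_feats), self.v(mask_feats)
        attn = F.softmax(Q @ K.transpose(-2, -1) * self.scale, dim=-1)
        fused = attn @ V                               # pitch-conditioned spectral features
        return torch.cat([mel_feats, fused], dim=-1)   # simple fusion of both streams

# fusion = CrossAttentionFusion(dim_mel=256, dim_mask=128, dim_attn=128)
# out = fusion(torch.randn(4, 130, 256), torch.randn(4, 130, 128))
```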
Comment 2: Figure 1 is not clear enough.
Response: We appreciate the reviewer’s observation. In the revised manuscript, Figure 1 has been redrawn with higher resolution and improved visual clarity. The text labels, block boundaries, and directional arrows have been enlarged and color-adjusted for better readability.
Comment 3: The core cross-attention fusion process is only described in text, lacking the mathematical formulas for the Q, K, V mappings and the weighted attention computation, which makes it difficult to verify the reproducibility of the model mechanism.
Response: We thank the reviewer for this valuable suggestion. In the revised manuscript, we have expanded Subsection 4.2.2 (Cross-Attention Fusion), page 9, line 264, to include a mathematical formulation of the proposed fusion mechanism. The updated section now explicitly defines the Query (Q), Key (K), and Value (V) mappings, as well as the attention weight computation and fusion output equations. These additions clarify how feature interactions between the Mel-spectrogram and pitch-mask branches are established and optimized during training. Furthermore, the revised explanation ensures better reproducibility and theoretical transparency of the cross-attention module.
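For reference, the standard scaled dot-product form of such a cross-attention block, written with the roles described above (Mel features as queries, mask features as keys and values), is sketched below; the notation is illustrative and may differ from the equations added in the revised manuscript.

```latex
% Standard scaled dot-product cross-attention with Mel features as queries
% and pitch-mask features as keys/values; notation is illustrative.
\begin{align}
  Q &= X_{\mathrm{mel}} W_Q, \qquad
  K  = X_{\mathrm{mask}} W_K, \qquad
  V  = X_{\mathrm{mask}} W_V, \\
  A &= \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right), \qquad
  F_{\mathrm{fused}} = A V .
\end{align}
```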
Comment 4: The experiment is based only on the IRMAS dataset and lacks transfer validation on other publicly available music datasets, so the model's generalization ability cannot be demonstrated.
Response: We thank the reviewer for this valuable comment. We fully acknowledge the importance of cross-dataset validation; however, the IRMAS dataset was deliberately selected due to its unique structure and relevance to predominant instrument recognition tasks. Specifically, the training subset of IRMAS contains fixed-length audio clips with single predominant instruments, while the test subset comprises variable-length samples with multiple predominant instruments. This design inherently provides a natural generalization challenge within the dataset itself, as the model must adapt from single-instrument learning to multi-instrument recognition during testing. In our preliminary experiments using 1,305 single-predominant samples, the model achieved a high F1-score of 0.79, demonstrating strong recognition under controlled conditions.
To ensure fairness and comparability with existing works, we subsequently expanded the study to include multiple predominant instrument recognition, aligning with the standard evaluation setup adopted in prior IRMAS-based studies. Moreover, compared to other available datasets such as MedleyDB and OpenMIC, IRMAS is specifically curated for predominant instrument detection and remains the benchmark dataset for this task in the literature. The other datasets are fully polyphonic for both training and testing and are not directly suited for predominant-instrument-specific evaluation. Hence, IRMAS provides the most controlled and interpretable environment to validate the proposed model’s capability and to enable fair comparison with existing architectures reported in prior studies.
Comment 5: The code and training script have not been made public, and the hyperparameter settings have not been specified, making it difficult to ensure the reproducibility of the experiment.
Response: We appreciate the reviewer’s comment regarding reproducibility. A detailed description of the hyperparameter settings has now been included in the revised manuscript (page 10, line 291) to enhance transparency. Furthermore, the complete code and training scripts will be made publicly available via GitHub upon acceptance of the paper.
Comment 6: Table 4 mentions the advantage in parameter count, but does not provide engineering indicators such as training time, GPU memory usage, or inference latency, so the lightweight value of the model cannot be demonstrated.
Response: We thank the reviewer for this valuable suggestion. Table 7 has been updated in the revised manuscript (page 17, line 424) to include the total number of model parameters and the inference time, providing a clearer demonstration of the model’s lightweight efficiency.
Comment 7: The outlook for future work is vague. The conclusion's suggested direction of "combining Transformer or RNN" is too broad; further guidance could be provided on how to introduce cross-modal attention optimization in temporal modeling.
Response: We sincerely thank the reviewer for this valuable suggestion. In the revised manuscript, we have expanded the Conclusion section (page 17, line 442) to clearly explain how cross-modal attention optimization can be incorporated into temporal modeling. Specifically, our current framework employs cross-attention for frame-level (spatial) fusion between spectral and pitch-based features. As a future direction, we plan to extend this mechanism along the temporal axis to model evolving dependencies across time.
This can be achieved by:
- Extending cross-attention along the time axis, allowing the network to align spectral and pitch cues not only per frame but also across consecutive time steps, thereby capturing their temporal evolution.
- Integrating temporal modeling blocks, such as Transformer encoders or bidirectional LSTMs, after cross-attention fusion to learn sequential dependencies and refine the fused representations (see the sketch below).
- Optimizing cross-modal attention dynamically, where attention weights vary across time to emphasize frames with strong inter-modal interactions (e.g., note onsets or harmonic overlaps).
These additions will enable the model to learn richer temporal relationships and enhance the robustness of instrument recognition in complex polyphonic music.
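A minimal sketch of the second point, assuming a PyTorch implementation, is given below; the layer sizes and number of heads are hypothetical choices rather than a planned configuration.

```python
# Sketch of a small Transformer encoder placed after cross-attention fusion
# to model temporal dependencies; sizes are hypothetical.
import torch
import torch.nn as nn

class TemporalRefiner(nn.Module):
    def __init__(self, dim_fused, n_heads=4, n_layers=2):
        super().__init__()
        # d_model must be divisible by nhead (e.g., 384 / 4)
        layer = nn.TransformerEncoderLayer(d_model=dim_fused, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, fused_seq):
        # fused_seq: (batch, time_frames, dim_fused) from the fusion stage
        return self.encoder(fused_seq)  # sequence refined with temporal context

# refiner = TemporalRefiner(dim_fused=384)
# refined = refiner(torch.randn(4, 130, 384))
```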
Reviewer 3 Report
Comments and Suggestions for Authors
- With identical F1 scores to the concatenation baseline (Table 3), what specific metric—beyond t-SNE—proves cross-attention's necessity over a simpler, cheaper fusion method?
- Your pitch branch depends entirely on the pYIN algorithm. How does your model handle inevitable pYIN errors, and what evidence shows it doesn't simply learn to ignore this branch when the input is unreliable?
- The cross-attention module constitutes ~60% of your parameters. Justify this significant complexity and cost, given the marginal performance gain over the MelCNN-only model.
- Your Abstract (Micro F1: 0.64) and Results (Micro F1: 0.79) directly contradict. Which is correct, and what error caused this fundamental discrepancy in your primary metric?
- Why omit augmentation or transfer learning despite IRMAS imbalance, unlike [14], and how does this impact class bias?
- Why single-dataset (IRMAS) evaluation, ignoring generalization to MedleyDB/OpenMIC despite robustness claims?
Author Response
Comment 1: With identical F1 scores to the concatenation baseline (Table 3), what specific metric—beyond t-SNE—proves cross-attention's necessity over a simpler, cheaper fusion method?
Response : We thank the reviewer for the insightful observation. The previously reported identical Micro F1 values were due to a typographical error. The corrected results are now presented in the revised Table 4, where the Mel + Mask (Concat) baseline achieves a Micro F1 of 75.0 and Macro F1 of 70.0, while the Cross-Attention model achieves 79.0 (Micro F1) and 73.0 (Macro F1).
To further demonstrate the necessity and effectiveness of the proposed cross-attention fusion beyond t-SNE visualization, we have included additional quantitative analyses in the revised manuscript:
- Instrument-wise Precision, Recall, and F1 Scores: Figure 4 presents class-level performance metrics, showing that the cross-attention model achieves higher accuracy for harmonically overlapping instruments such as violin, cello, and guitar.
- Computational Analysis: Table 6 has been updated to include the total parameter count, training time, and inference time for all compared models, thereby providing a clearer understanding of computational efficiency.
Comment 2: Your pitch branch depends entirely on the pYIN algorithm. How does your model handle inevitable pYIN errors, and what evidence shows it doesn't simply learn to ignore this branch when the input is unreliable?
Response : We thank the reviewer for this insightful observation. We acknowledge that the pYIN algorithm may introduce pitch estimation errors, particularly in complex polyphonic regions. To mitigate this, the proposed model incorporates several design and validation strategies to ensure robustness and effective utilization of the pitch branch:
- Cross-Attention Fusion as a Reliability Filter: The cross-attention mechanism adaptively weighs the contribution of the pitch branch (mask features) based on its relevance to the spectral features. When pYIN-derived pitch masks are noisy or unreliable, the attention weights corresponding to those regions are automatically reduced, preventing the model from overfitting to erroneous cues.
- Noise-Augmented Training: During training, we intentionally introduce perturbations to the pitch mask (such as random dropout and temporal masking). This regularization teaches the model to rely on both spectral and pitch cues dynamically, rather than depending solely on the pYIN output (see the sketch below).
- Ablation Evidence: The ablation study (Table 4) shows that the MaskCNN-only model performs significantly lower (Micro F1 = 70.0) than the MelCNN-only model (Micro F1 = 75.0), but combining both through cross-attention improves performance to 79.0 (Micro F1). This demonstrates that while the pitch branch alone is insufficient, its complementary information meaningfully enhances recognition when fused via attention.
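A minimal sketch of the noise-augmented training perturbation described above is given below, assuming a PyTorch tensor pipeline; the dropout probability and maximum temporal-mask width are illustrative assumptions.

```python
# Sketch of pitch-mask perturbation for noise-augmented training:
# random dropout of active bins plus a random temporal mask.
import torch

def perturb_pitch_mask(mask, drop_prob=0.1, max_time_mask=10):
    """mask: (freq_bins, time_frames) binary pitch mask."""
    noisy = mask.float().clone()
    # Random dropout: zero out a fraction of active bins.
    dropout = (torch.rand_like(noisy) < drop_prob) & (noisy > 0)
    noisy[dropout] = 0.0
    # Temporal masking: zero a random contiguous block of frames.
    T = noisy.shape[1]
    width = int(torch.randint(1, max_time_mask + 1, (1,)))
    start = int(torch.randint(0, max(T - width, 1), (1,)))
    noisy[:, start:start + width] = 0.0
    return noisy

# noisy_mask = perturb_pitch_mask(torch.randint(0, 2, (128, 130)).float())
```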
Comment 3: The cross-attention module constitutes ~60% of your parameters. Justify this significant complexity and cost, given the marginal performance gain over the MelCNN-only model.
Response : We acknowledge the reviewer’s observation regarding the parameter contribution of the cross-attention module. The design intentionally allocates a higher proportion of parameters (~60%) to this module, as it serves as the core fusion mechanism that enables effective interaction between the Mel-spectrogram and binary pitch-mask representations. While the MelCNN-only model captures timbral and spectral cues, it lacks explicit conditioning on harmonic or pitch-related regions. The cross-attention layer learns contextual dependencies between these complementary feature spaces, allowing the network to selectively attend to pitch-relevant spectral structures while suppressing irrelevant or noisy regions.
Although the performance gain over the MelCNN-only baseline appears modest in overall accuracy (approximately 2–3%), the improvement is more pronounced in challenging polyphonic segments and under variable pitch reliability conditions, as shown in the confusion and robustness analyses. Furthermore, the cross-attention module contributes to better generalization on unseen instrument combinations without requiring additional data augmentation. Han et al. [8] employed a deep convolutional neural network that required approximately 1.446 million parameters. In contrast, our proposed dual-branch Mel–Mask Cross-Attention model achieves comparable or superior performance using only 0.878 million parameters, representing a significant reduction in model complexity by nearly 40%. Despite its lower parameter count, the cross-attention fusion mechanism enables effective interaction between spectral and pitch-based cues, thereby enhancing discriminative power without the need for deeper convolutional layers. This demonstrates that the proposed model offers an optimal balance between computational efficiency and recognition accuracy.
Comment 4 : Your Abstract (Micro F1: 0.64) and Results (Micro F1: 0.79) directly contradict. Which is correct, and what error caused this fundamental discrepancy in your primary metric?
Response : We sincerely apologize for the confusion. The difference arises because two separate experimental setups were conducted on the IRMAS dataset. The IRMAS dataset comprises single-predominant training samples and variable-length testing clips that may contain multiple predominant instruments. In the initial stage, we evaluated the model on 1305 single-predominant test files, achieving a Micro F1 score of 0.79. For the final benchmarking and fair comparison with existing state-of-the-art methods, we expanded the evaluation to the complete IRMAS test set of 2874 files, which includes multiple predominant instruments and longer excerpts, resulting in a Micro F1 score of 0.64.
Comment 5 : Why omit augmentation or transfer learning despite IRMAS imbalance, unlike [14], and how does this impact class bias?
Response : We appreciate this valuable comment. All baseline models considered for comparison were evaluated on the IRMAS dataset without augmentation or transfer learning; therefore, to ensure a fair and direct comparison, our cross-attention model was also trained under the same conditions. While augmentation techniques such as those in [14] can further improve performance, the primary objective of this study is to demonstrate the effectiveness of cross-attention fusion between Mel-spectrogram and binary mask features for predominant instrument recognition. Future work will explore augmentation-based enhancements to extend this framework.
Comment 6 : Why single-dataset (IRMAS) evaluation, ignoring generalization to MedleyDB/OpenMIC despite robustness claims?
Response: We appreciate the reviewer’s valuable comment. The IRMAS dataset was intentionally chosen as it is the standard benchmark for predominant instrument recognition, providing single-predominant fixed-length training samples and variable-length multi-predominant testing samples. This structure uniquely supports the evaluation of generalization from clean single-instrument to complex polyphonic mixtures. In contrast, MedleyDB and OpenMIC contain fixed-length, fully polyphonic audio for both training and testing, making them less suitable for predominant instrument recognition. Furthermore, all baseline models listed in Table 7 were also evaluated exclusively on IRMAS to ensure a fair comparison. Future work will extend this study to other datasets to explore cross-dataset generalization.
Reviewer 4 Report
Comments and Suggestions for Authors
1) Is the information exchange in the Cross-Attention module unidirectional or bidirectional?
The paper states, “Mel-spectrogram features serve as queries, and binary mask features as keys and values.”
This implies that attention flows only from the Mel branch to the Mask branch, suggesting a unidirectional cross-attention design.
- Suggestion
Using only unidirectional attention may limit the pitch branch’s ability to fully incorporate information from the Mel branch.
It would be beneficial to experiment with bidirectional cross-attention (mutual attention) or modify the transformer encoder–decoder structure to enable two-way feature fusion.
Such reciprocal interaction could be particularly advantageous in polyphonic audio scenarios where complex interdependencies exist between spectral and pitch cues.
2) Has the impact of pYIN algorithm errors on the generated binary pitch mask been quantitatively evaluated?
Pitch estimation is inherently uncertain, and pYIN tends to produce higher false detection rates in complex sounds such as vocals and guitars.
However, the paper does not report any assessment of the mask’s accuracy or uncertainty.
- Suggestion
It is recommended to conduct an ablation study to evaluate how the quality of the pitch mask affects model performance.
For example, comparing results using a “noisy pitch mask” versus a “ground-truth pitch mask (synthetic dataset)” could demonstrate how robust the cross-attention mechanism is to noise.
Alternatively, incorporating mask confidence as an attention weight could evolve the model toward an uncertainty-aware attention framework.
3) How was the class imbalance in the IRMAS dataset handled?
The paper only describes the validation split and optimizer settings, without detailing any strategies for addressing class imbalance (e.g., oversampling, class weighting, or focal loss).
Since IRMAS exhibits significant variation in sample counts across instrument classes, additional balancing measures are likely necessary to improve macro F1 performance.
- Suggestion
Applying a class-balanced loss (e.g., focal loss or effective number of samples reweighting) could enhance the macro F1 score.
This approach would particularly help improve recall for underrepresented instruments such as trumpet, saxophone, and cello.
Moreover, employing data augmentation techniques based on pitch shifting or time stretching could further increase data diversity and model robustness.
Author Response
Comment 1: Is the information exchange in the Cross-Attention module unidirectional or bidirectional? The paper states, “Mel-spectrogram features serve as queries, and binary mask features as keys and values.”
This implies that attention flows only from the Mel branch to the Mask branch, suggesting a unidirectional cross-attention design.
Response : We sincerely thank the reviewer for this constructive suggestion. The proposed model currently employs a unidirectional cross-attention mechanism, where the Mel-spectrogram features function as queries, and the binary mask features act as keys and values. This design was chosen to enable the model to focus on pitch-salient spectral regions that are most relevant to predominant instrument recognition.
We fully agree that introducing bidirectional or mutual cross-attention could enhance the interaction between the spectral and pitch modalities. Accordingly, this aspect has been acknowledged and added in the Future Directions section of the revised manuscript, highlighting our plan to explore a two-way attention or encoder–decoder fusion mechanism in future work to strengthen cross-modal representation and generalization.
Comment 2 : Has the impact of pYIN algorithm errors on the generated binary pitch mask been quantitatively evaluated?
Response : We thank the reviewer for this valuable suggestion. In response, we have added a mask robustness evaluation to analyze the model’s sensitivity to pYIN-related pitch estimation errors. Since IRMAS does not provide ground-truth F0 annotations, we derived a pseudo–ground-truth mask by applying median filtering to the pYIN contour and retaining only high-confidence frames (voiced probability ≥ 0.75). A noisy mask variant was then generated by introducing controlled random dropouts, mel-bin jitter, and spurious activations to simulate realistic pYIN inaccuracies. The model was evaluated using both mask types, and the results (included as Table 6 in the revised manuscript) show only a marginal drop in Micro/Macro F1 scores (≈2–3%), demonstrating that the proposed cross-attention fusion remains robust under moderate pitch estimation noise.
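For illustration, the construction of the pseudo-ground-truth and noisy mask variants could look like the following sketch, assuming a librosa/NumPy pipeline. The 0.75 voiced-probability threshold follows the description above; the median-filter length, pitch range, and noise magnitudes are illustrative assumptions.

```python
# Sketch of building a pseudo-ground-truth pitch mask from a pYIN contour
# and a noisy variant with dropouts, mel-bin jitter, and spurious activations.
import librosa
import numpy as np
from scipy.signal import medfilt

def pseudo_gt_mask(y, sr=44100, n_mels=128, hop_length=512):
    f0, _, voiced_prob = librosa.pyin(y, fmin=65.0, fmax=2093.0,
                                      sr=sr, hop_length=hop_length)
    f0 = medfilt(np.nan_to_num(f0), kernel_size=5)        # smooth the contour
    mask = np.zeros((n_mels, len(f0)), dtype=np.float32)
    mel_freqs = librosa.mel_frequencies(n_mels=n_mels, fmax=sr / 2)
    for t, (freq, prob) in enumerate(zip(f0, voiced_prob)):
        if freq > 0 and prob >= 0.75:                      # keep high-confidence frames
            mask[np.argmin(np.abs(mel_freqs - freq)), t] = 1.0
    return mask

def noisy_variant(mask, drop_prob=0.1, jitter=1, spurious_prob=0.02, rng=None):
    rng = rng or np.random.default_rng(0)
    noisy = mask * (rng.random(mask.shape) > drop_prob)    # random dropouts
    noisy = np.roll(noisy, rng.integers(-jitter, jitter + 1), axis=0)  # mel-bin jitter
    noisy += (rng.random(mask.shape) < spurious_prob)      # spurious activations
    return np.clip(noisy, 0.0, 1.0)
```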
Comment 3 : How was the class imbalance in the IRMAS dataset handled?
Response : We appreciate the reviewer’s insightful observation. We acknowledge that the IRMAS dataset is indeed imbalanced, with notable disparities among instrument classes such as trumpet, cello, and saxophone. In the current study, our primary objective was to evaluate the effectiveness of cross-attentive fusion between spectral and pitch features, rather than optimize the classifier through data-level balancing techniques. Hence, to ensure a fair comparison with existing baseline models (e.g., Han et al. [8], Pons et al. [9]), which were also trained on unaugmented IRMAS data, we maintained the same data distribution without oversampling or class reweighting.
We agree that incorporating class-balanced loss functions (e.g., focal loss or inverse-frequency weighting) and data augmentation methods such as pitch shifting and time stretching could further enhance macro F1 performance, especially for underrepresented instruments. Accordingly, we have noted this as a potential future extension in the revised manuscript.
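For completeness, a minimal sketch of the class-balanced focal loss mentioned as a future extension is given below, written for the multi-label (sigmoid-output) setting; the focusing parameter and per-class weights are hypothetical values.

```python
# Sketch of a focal loss for multi-label instrument recognition;
# gamma and the optional per-class alpha weights are hypothetical.
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=None, gamma=2.0):
    """logits, targets: (batch, n_classes); targets are multi-hot labels."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-bce)                 # probability assigned to the true label
    loss = (1.0 - p_t) ** gamma * bce     # down-weight easy examples
    if alpha is not None:                 # e.g., inverse-frequency class weights
        loss = loss * alpha
    return loss.mean()

# loss = focal_loss(model_outputs, labels, alpha=torch.tensor(class_weights))
```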
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
The authors have completed the paper revisions. It is suggested to accept it for publication.
Author Response
We sincerely thank the reviewer for the positive feedback and for recommending our manuscript for publication. We appreciate the time and effort invested in evaluating our work. We are grateful for the constructive comments provided during the review process, which helped us improve the quality and clarity of the manuscript.
Thank you once again for your kind recommendation.
Reviewer 2 Report
Comments and Suggestions for Authors
The authors have completed most of the modifications as required, but the following issues remain:
1. The titles and color-code (legend) explanations of some figures are placed too tightly against the main text. For example, the color-code legend of the t-SNE visualization in Figure 3 is too close to the body of the image. It is recommended to add line spacing or reduce the size of the legend in the layout to make the reading path smoother. This adjustment does not affect the content, but it can enhance the professional aesthetic of the entire manuscript.
2. In the Training Strategy and Optimizer Comparison section, the list and settings of optimizers are somewhat lengthy. For example, the optimizer comparison description on pages 13-14 could be moderately compressed, while ensuring that hyperparameters appearing for the first time are explained concisely rather than at length, in order to maintain a tight pace and moderate information density in the paper.
Author Response
We sincerely thank the reviewer for the constructive and helpful suggestions. We have carefully addressed all comments, and the corresponding revisions have been incorporated into the manuscript as detailed below.
Comment 1: The titles and color-code (legend) explanations of some figures are placed too tightly against the main text. For example, the color-code legend of the t-SNE visualization in Figure 3 is too close to the body of the image. It is recommended to add line spacing or reduce the size of the legend in the layout to make the reading path smoother. This adjustment does not affect the content, but it can enhance the professional aesthetic of the entire manuscript.
Response :
Thank you for this valuable observation. We have revised Figure 3 by increasing the vertical spacing between the t-SNE subplots and the legend section. Additional padding has been added to ensure that the legend is visually separated from the plots, resulting in a clearer reading flow and improved aesthetic presentation. This adjustment enhances the graphical layout without altering the content of the figure.
Comment 2: In the Training Strategy and Optimizer Comparison section, the list and settings of optimizers are somewhat lengthy. For example, the optimizer comparison description on pages 13-14 could be moderately compressed, while ensuring that hyperparameters appearing for the first time are explained concisely rather than at length, in order to maintain a tight pace and moderate information density in the paper.
Response:
Thank you for this helpful suggestion. The optimizer discussion has been rewritten to a more concise form while retaining all essential details about the compared methods and their settings. Repeated explanations of hyperparameters have been removed, and the narrative has been tightened to maintain a clearer and more focused presentation.
Reviewer 3 Report
Comments and Suggestions for Authors
The authors have addressed my comments.
Author Response
We sincerely thank the reviewer for the positive feedback and for recommending our manuscript for publication. We appreciate the time and effort invested in evaluating our work. We are grateful for the constructive comments provided during the review process, which helped us improve the quality and clarity of the manuscript.
Thank you once again for your kind recommendation.