Article
Peer-Review Record

Attention-Driven Time-Domain Convolutional Network for Source Separation of Vocal and Accompaniment

Electronics 2025, 14(20), 3982; https://doi.org/10.3390/electronics14203982
by Zhili Zhao 1, Min Luo 2,*, Xiaoman Qiao 3, Changheng Shao 1 and Rencheng Sun 1
Reviewer 1: Anonymous
Reviewer 2:
Reviewer 3: Anonymous
Reviewer 4: Anonymous
Submission received: 18 August 2025 / Revised: 3 October 2025 / Accepted: 9 October 2025 / Published: 11 October 2025
(This article belongs to the Section Artificial Intelligence)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

(1) In Equation (8), the math symbol is not defined.

(2) Explain the differences between GSNR and SNR, GSIR and SIR, and GSAR and SAR.

(3) The separation metrics are based entirely on numerical data; signal separation figures are suggested for performance evaluation.

(4) The computational complexity of the proposed method should be analyzed.

(5) Recent references are scarce.
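For context on comment (2): in the BSS-eval framework the estimated source is decomposed as $\hat{s} = s_\mathrm{target} + e_\mathrm{interf} + e_\mathrm{artif}$, and the per-track ratios are defined as below. In much of the singing-voice-separation literature, the G-prefixed "global" variants are length-weighted averages of the per-track scores over the test set; the exact weighting used in the manuscript may differ.

```latex
\begin{aligned}
\mathrm{SIR} &= 10\log_{10}\frac{\lVert s_\mathrm{target}\rVert^{2}}{\lVert e_\mathrm{interf}\rVert^{2}}, &
\mathrm{SAR} &= 10\log_{10}\frac{\lVert s_\mathrm{target}+e_\mathrm{interf}\rVert^{2}}{\lVert e_\mathrm{artif}\rVert^{2}},\\[4pt]
\mathrm{GSIR} &= \frac{\sum_{k}\ell_{k}\,\mathrm{SIR}_{k}}{\sum_{k}\ell_{k}}, &
\mathrm{GSAR} &= \frac{\sum_{k}\ell_{k}\,\mathrm{SAR}_{k}}{\sum_{k}\ell_{k}},
\end{aligned}
```

where $\ell_k$ is the length of test track $k$.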

Author Response

Dear Reviewer,

Thank you very much for your valuable suggestions and careful review of our manuscript (Manuscript ID: electronics-3853912, "Attention-Driven Time-Domain Convolutional Network for Source Separation of Vocal and Accompaniment"). We greatly appreciate the time and effort you have devoted to improving our work.

We have carefully addressed each of your suggestions, made the corresponding revisions to the manuscript, and prepared detailed responses to each comment.

For the specific revisions and our responses to your comments, please refer to the attached document.

We hope the revised version meets your expectations and look forward to your further feedback.

Sincerely,
Min Luo (Corresponding Author)
On behalf of all co-authors

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors
  • First, the theoretical contribution of the study must be demonstrated. While the paper describes what the attention modules do, a more in-depth theoretical explanation of why these specific attention mechanisms (EAM and E-CBAM) are particularly effective in the context of music source separation, beyond empirical results, would strengthen the paper. For instance, how do these attention mechanisms specifically address the "high acoustic similarity and synchronous temporal evolution" mentioned in the introduction?
  •  The ablation study focuses on the placement of attention modules. While it mentions the superiority of ECA over CAM, a more detailed breakdown comparing E-CBAM against its constituents (ECA alone, SAM alone, or even a standard CBAM adapted for 1D) in more diverse settings could offer deeper insights into which specific component of E-CBAM contributes most to its success.
  •  The description of EAM mentions "learning fusion weights through a lightweight attention network." More specifics on the architecture and complexity of this internal "lightweight attention network" would be beneficial for reproducibility and understanding its computational overhead.
  • The discussion around AP4 and AP5 hinting at overfitting risks with "too many attention mechanism modules" could be elaborated. Providing quantitative results or a more detailed analysis of this overfitting phenomenon would make the argument stronger.
  •  The paper discusses how attention helps emphasize relevant features. Visualizations of the attention maps generated by EAM or E-CBAM across different stages or for different sources would significantly enhance the interpretability of the model and provide direct evidence of its claimed "attention-driven" behavior.
  • While promising results are shown for vocal and accompaniment separation, the future work section only briefly mentions "multi-instrument scenarios." A preliminary discussion or acknowledgment of potential challenges and strategies for scaling ADTDCN to more complex, multi-instrument separation tasks would be useful.
  • While the paper mentions that E-CBAM aims to be efficient, a more explicit discussion of the computational complexity (e.g., FLOPs, parameter count) of ADTDCN compared to baselines, especially considering the added attention modules, would be valuable for practical applications.
  • The algorithm and hyperparameter choices throughout the article must be explained.
  • The abbreviations table should be expanded.

Author Response

Dear Reviewer,

Thank you very much for your valuable suggestions and careful review of our manuscript (Manuscript ID: electronics-3853912, "Attention-Driven Time-Domain Convolutional Network for Source Separation of Vocal and Accompaniment"). We greatly appreciate the time and effort you have devoted to improving our work.

We have carefully addressed each of your suggestions, made the corresponding revisions to the manuscript, and prepared detailed responses to each comment.

For the specific revisions and our responses to your comments, please refer to the attached document.

We hope the revised version meets your expectations and look forward to your further feedback.

Sincerely,
Min Luo (Corresponding Author)
On behalf of all co-authors

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

Contributions:

This paper proposes an attention-driven time-domain convolutional network for vocal and accompaniment source separation. My comments are as follows:

Major comments:

  1. (Page 14) The major contribution is signal separation; performance is evaluated using SI-SNR. Measures of accuracy, precision, recall, and F1 score are missing from the experiments.
  2. (Page 3) A brief introduction of the related works is missing.
  3. (Page 5) How can the embedding attention module be integrated into Fig. 1?
  4. (Page 5) The input signal should start from X in Fig. 1, as shown in Eq. (1).
  5. (Page 6) The dimensions for the input and features should be provided in Fig. 2.
  6. (Page 6) How can the accompaniment embedding be obtained?
  7. (Line 232 on page 6) The symbol Feature is inconsistent with eq. (2).
  8. (Line 234 on page 7) The symbol “i” should be the index rather than the embedding.
  9. (Line 276 on page 8) The definition of Mi is incorrect.
  10. (Page 8) The dimensions of Ei and Mi should be provided in the equation. (7).
  11. (Page 8) How can the V work in eq. (8)? In addition, the operator was not defined.
  12. (Lines 305 and 306 on page 8) How can the phase and waveform vary continuously?
  13. (Page 8) The detailed network structure for ECA and SAM should be provided in Fig. 3.
  14. The feature sizes and the operator definition were also missing in Fig. 3.
  15. (Page 9) Is SI-SNR a good cost function?
  16. (Page 9) etarget, einterf, and eartif in eq. (10) should be presented by equations to improve the clarity.
  17. (Page 10) The denominator is expressed incorrectly.
  18. Some state-of-the-art references were missing.

Minor comments:

  1. (Page 6) The "Feature" can be presented by a symbol in eq. (2). Equation (3) also has the same problem.
  2. (Lines 238 and 246) “Where” should be revised as “where”. Please check the usage throughout this paper.
  3. (Page 13) The sub-grid lines should be removed in Fig. 5. In addition, the x-label was missing.
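Regarding comments 15 and 16 above: the scale-invariant SNR widely used as a time-domain separation objective can be written compactly. The sketch below follows the standard definition (mean-removed signals, projection of the estimate onto the reference) and is not necessarily the exact formulation used in the manuscript.

```python
import numpy as np

def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant SNR (dB): project the zero-mean estimate onto the
    zero-mean reference, then compare the target component with the residual."""
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    # Orthogonal projection of the estimate onto the reference direction.
    s_target = (np.dot(estimate, reference) /
                (np.dot(reference, reference) + eps)) * reference
    e_noise = estimate - s_target
    return 10.0 * np.log10((np.sum(s_target**2) + eps) /
                           (np.sum(e_noise**2) + eps))
```

Because both the target term and the residual scale linearly with the estimate, the metric is invariant to the estimate's overall gain, which is one reason it is often preferred over plain SNR as a training loss.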

 

Comments on the Quality of English Language

The quality of the English language should be improved.

Author Response

Dear Reviewer,

Thank you very much for your valuable suggestions and careful review of our manuscript (Manuscript ID: electronics-3853912, "Attention-Driven Time-Domain Convolutional Network for Source Separation of Vocal and Accompaniment"). We greatly appreciate the time and effort you have devoted to improving our work.

We have carefully addressed each of your suggestions, made the corresponding revisions to the manuscript, and prepared detailed responses to each comment.

For the specific revisions and our responses to your comments, please refer to the attached document.

We hope the revised version meets your expectations and look forward to your further feedback.

Sincerely,
Min Luo (Corresponding Author)
On behalf of all co-authors

Author Response File: Author Response.pdf

Reviewer 4 Report

Comments and Suggestions for Authors

The paper presents an attention-driven time-domain convolutional network designed for single-channel music source separation, specifically targeting the separation of vocals from accompaniment. The authors introduce two main contributions: an Embedding Attention Module (EAM) for adaptive source weighting and an Efficient Convolutional Block Attention Module (ECBAM) to improve local feature extraction. Results on public music datasets indicate significant performance improvements compared to existing approaches.

The paper addresses a relevant and challenging problem, and the methodological innovations are of interest. However, there are several points where the quality, clarity, and impact of the work could be improved:

  • The importance of feature extraction in music signal processing should be better emphasized. Explain why extracting discriminative features is essential for source separation tasks and how it supports downstream applications.
  • Clarify why separating sources from music pieces matters. Providing real-world motivations and potential use cases will help readers outside the specialist community understand the significance of the work.
  • Figures: Figure 1 and the other figures should be placed after their first reference in the text to improve readability. This issue recurs throughout the manuscript, so I will not repeat the advice for each occurrence.
  • Terminology: In lines 181–182, when introducing EAM and ECBAM, provide sufficient context. Describe the underlying principles of these modules, their typical limitations, and why their combination is advantageous.
  • VAT-SNet architecture: introduce the topic adequately.
  • Provide a more thorough introduction before presenting equations.
  • For each equation, list all variables in a bulleted list with definitions to make it easier for readers to follow.
  • Define GSIR (Global Source-to-Interference Ratio) and GSAR (Global Source-to-Artifacts Ratio). Provide intuitive explanations of what they measure and why they are suitable performance indicators in this context.
  • The description of ML-based approaches should be expanded. Summarize the essential characteristics of the methods employed.
  • Justify the choice of methods and highlight their strengths and weaknesses. This will help readers understand the rationale behind your design choices.
  • A detailed description of the dataset and experimental setup is necessary: How many audio tracks were used? How were they split into training, validation, and test sets? What preprocessing steps were applied? Were hyperparameters optimized (e.g., learning rate, batch size, number of layers)? If yes, how?
  • Clearly interpret the results: what do the observed improvements in GSIR and GSAR mean in terms of perceptual audio quality?
  • Discuss possible artifacts that still remain despite the improvements. Are there cases (e.g., highly polyphonic or percussive music) where the model struggles?
  • The paper would benefit from a section on practical applications.
  • Outline future goals, such as: extending the model to multi-instrument separation; real-time or low-latency implementation; domain adaptation for different music genres; integration with generative models to reconstruct missing components.
  • Improve the flow of the manuscript by ensuring each concept (EAM, ECBAM, VAT-SNet, metrics) is introduced before being applied.
  • Ensure consistency in notation and terminology throughout the text.

The paper should explicitly acknowledge the limitations of the proposed approach, such as:

  • Generalization across genres: Performance may vary significantly depending on musical style (e.g., classical vs. electronic).
  • Computational complexity: The introduction of multiple attention mechanisms may increase training and inference costs. How feasible is the model for real-time applications?
  • Overfitting risk: If the dataset is not sufficiently large or diverse, the model may fail to generalize.
  • Interpretability: Attention mechanisms improve results but can make the model harder to interpret. Readers would benefit from a discussion of what the attention layers are focusing on in practice.
  • Single-channel constraint: The method is designed for single-channel input, which might limit applicability in scenarios where multi-channel (stereo or spatial) information is available and could be leveraged for better performance.
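On the computational-complexity concern raised above: a quick way to sanity-check the overhead of added attention blocks is to count parameters analytically. The sketch below uses the standard parameter-count formulas for 1-D convolutions and ECA-style channel attention; the channel sizes are hypothetical and not taken from the manuscript.

```python
def conv1d_params(c_in, c_out, kernel, bias=True):
    """Parameters of a 1-D convolution: weights (c_out * c_in * kernel)
    plus an optional bias per output channel."""
    return c_out * c_in * kernel + (c_out if bias else 0)

def eca_params(kernel=3):
    """ECA-style channel attention applies a single bias-free 1-D conv over
    the pooled channel descriptor, so it adds only `kernel` parameters."""
    return kernel

# Hypothetical separable conv block, with and without an ECA layer appended.
base = conv1d_params(256, 512, 1) + conv1d_params(512, 512, 3)
with_eca = base + eca_params(3)
overhead = with_eca - base  # attention overhead in parameters
```

Under these (assumed) dimensions the channel-attention overhead is a handful of parameters against roughly a million in the convolutions, which illustrates why ECA-style modules are usually described as lightweight; FLOP counts would need a similar per-layer accounting.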

Author Response

Dear Reviewer,

Thank you very much for your valuable suggestions and careful review of our manuscript (Manuscript ID: electronics-3853912, "Attention-Driven Time-Domain Convolutional Network for Source Separation of Vocal and Accompaniment"). We greatly appreciate the time and effort you have devoted to improving our work.

We have carefully addressed each of your suggestions, made the corresponding revisions to the manuscript, and prepared detailed responses to each comment.

For the specific revisions and our responses to your comments, please refer to the attached document.

We hope the revised version meets your expectations and look forward to your further feedback.

Sincerely,
Min Luo (Corresponding Author)
On behalf of all co-authors

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

The authors have completed the paper revisions according to the reviewers' comments.

Author Response

Dear Reviewers,

We sincerely appreciate your second-round review of our manuscript, "Attention-Driven Time-Domain Convolutional Network for Source Separation of Vocal and Accompaniment" (Manuscript ID: electronics-3853912). We have carefully considered all of the second-round comments: the confirmation that the requested revisions were completed, the specific suggestion to sort the abbreviation table alphabetically, the recognition of the improvements in both presentation and content, and the recommendation to accept the manuscript for publication. This feedback is greatly encouraging.

In response to the specific second-round request, we have sorted the abbreviation table alphabetically, and we have rechecked all earlier revisions to confirm that they follow the guidance of each reviewer. The rigor and detail of your feedback throughout the review process have helped us raise the manuscript to publication standard.

Thank you again for your guidance and support. Should any further steps or supplementary explanations be needed during publication, we will respond promptly.

Sincerely,
Min Luo (Corresponding Author)
On behalf of all co-authors

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

Sort the abbreviation table alphabetically.

Author Response

Dear Reviewers,

We sincerely appreciate your second-round review of our manuscript, "Attention-Driven Time-Domain Convolutional Network for Source Separation of Vocal and Accompaniment" (Manuscript ID: electronics-3853912). We have carefully considered all of the second-round comments: the confirmation that the requested revisions were completed, the specific suggestion to sort the abbreviation table alphabetically, the recognition of the improvements in both presentation and content, and the recommendation to accept the manuscript for publication. This feedback is greatly encouraging.

In response to the specific second-round request, we have sorted the abbreviation table alphabetically, and we have rechecked all earlier revisions to confirm that they follow the guidance of each reviewer. The rigor and detail of your feedback throughout the review process have helped us raise the manuscript to publication standard.

Thank you again for your guidance and support. Should any further steps or supplementary explanations be needed during publication, we will respond promptly.

Sincerely,
Min Luo (Corresponding Author)
On behalf of all co-authors

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

The authors have significantly improved the quality of this paper. It can be accepted for publication.

Comments on the Quality of English Language

The quality of the English language is acceptable.

Author Response

Dear Reviewers,

We sincerely appreciate your second-round review of our manuscript, "Attention-Driven Time-Domain Convolutional Network for Source Separation of Vocal and Accompaniment" (Manuscript ID: electronics-3853912). We have carefully considered all of the second-round comments: the confirmation that the requested revisions were completed, the specific suggestion to sort the abbreviation table alphabetically, the recognition of the improvements in both presentation and content, and the recommendation to accept the manuscript for publication. This feedback is greatly encouraging.

In response to the specific second-round request, we have sorted the abbreviation table alphabetically, and we have rechecked all earlier revisions to confirm that they follow the guidance of each reviewer. The rigor and detail of your feedback throughout the review process have helped us raise the manuscript to publication standard.

Thank you again for your guidance and support. Should any further steps or supplementary explanations be needed during publication, we will respond promptly.

Sincerely,
Min Luo (Corresponding Author)
On behalf of all co-authors

Author Response File: Author Response.pdf

Reviewer 4 Report

Comments and Suggestions for Authors

The authors addressed the reviewer's comments with attention and modified the paper according to the suggestions provided. The new version of the paper has improved in both presentation and content.

Author Response

Dear Reviewers,

We sincerely appreciate your second-round review of our manuscript, "Attention-Driven Time-Domain Convolutional Network for Source Separation of Vocal and Accompaniment" (Manuscript ID: electronics-3853912). We have carefully considered all of the second-round comments: the confirmation that the requested revisions were completed, the specific suggestion to sort the abbreviation table alphabetically, the recognition of the improvements in both presentation and content, and the recommendation to accept the manuscript for publication. This feedback is greatly encouraging.

In response to the specific second-round request, we have sorted the abbreviation table alphabetically, and we have rechecked all earlier revisions to confirm that they follow the guidance of each reviewer. The rigor and detail of your feedback throughout the review process have helped us raise the manuscript to publication standard.

Thank you again for your guidance and support. Should any further steps or supplementary explanations be needed during publication, we will respond promptly.

Sincerely,
Min Luo (Corresponding Author)
On behalf of all co-authors

Author Response File: Author Response.pdf
