Article
Peer-Review Record

Enhancing Subband Speech Processing: Integrating Multi-View Attention Module into Inter-SubNet for Superior Speech Enhancement

Electronics 2025, 14(8), 1640; https://doi.org/10.3390/electronics14081640
by Jeih-Weih Hung *, Tsung-Jung Li and Bo-Yu Su
Reviewer 1:
Reviewer 2:
Reviewer 3: Anonymous
Submission received: 23 February 2025 / Revised: 10 April 2025 / Accepted: 16 April 2025 / Published: 18 April 2025
(This article belongs to the Special Issue IoT Security in the Age of AI: Innovative Approaches and Technologies)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

Review of “Enhancing Subband Speech Processing: Integrating …….”

The manuscript primarily explores the effectiveness of the Inter-SubNet and multi-view attention (MVA) modules in comparison with existing methods. These components are then combined to propose a new architecture for speech enhancement. However, a major concern is that the proposed network does not show any significant improvement over the baseline models.

Major Concerns:

  1. Lack of Discussion on Complex-Valued Speech Enhancement Networks
    • The introduction should also cover complex-valued speech enhancement (SE) networks such as DCCTN, CFTNet, and AERO, which have demonstrated advantages in phase reconstruction. This would provide a more comprehensive background on existing advancements in the field.
  2. Clarification of SIL Block Output
    • In Fig. 1, the input to the SIL block is the magnitude spectrum, but the output is represented as a complex spectrum (Mr + jMi). The authors should clearly explain how this transformation occurs within the block.
  3. Terminology in Attention Mechanisms
    • In Fig. 2, the terms “global” and “local” attention refer to operations applied along the time and frequency axes, respectively. To avoid ambiguity, it would be more appropriate to explicitly label them as time-axis attention and frequency-axis attention.
  4. Unclear Key Contributions
    • The authors should clearly define the key contributions of the paper, as both Inter-subNet and MVA appear to be adaptations of existing baseline networks rather than novel innovations.
  5. Experimental Setup Limitations
    • The training and testing Signal-to-Noise Ratios (SNRs) used in the experiments are relatively high. To better validate the robustness of the proposed model, the authors should also test under negative SNR conditions, where speech enhancement techniques typically show greater impact (a minimal sketch of constructing such a low-SNR test condition is given after this list).
  6. Minimal Performance Improvement
    • Table 1 presents objective evaluation scores for both the baseline and the proposed architectures. However, the reported improvements are marginal (e.g., 0.16% in STOI), which is statistically insignificant.
    • The same issue persists in Table 2 and Table 3, where the results indicate almost no improvement over the baseline.
  7. Lack of Comparison with State-of-the-Art Models
    • The study does not compare the proposed network with state-of-the-art speech enhancement models. A comparison with leading SE architectures would provide better insight into the model’s effectiveness and practical applicability.


While the paper presents an interesting combination of Inter-subNet and MVA modules, the proposed architecture does not demonstrate substantial performance gains over the baseline. Additionally, key methodological aspects require further clarification, and experimental validation should be expanded to include more challenging noise conditions. The manuscript would benefit from a deeper discussion of complex-valued SE networks and a comparative evaluation against state-of-the-art methods.

Comments on the Quality of English Language

The quality of the paper must be improved.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

This study explores improvements to Inter-SubNet, a deep learning-based speech enhancement model, by incorporating a Multi-View Attention (MA) module to refine speech features. The goal is to enhance speech quality, intelligibility, and signal reconstruction accuracy by strategically positioning the MA module within the processing pipeline. The authors propose two variations: IS-IMA (Internal MA), where MA is embedded inside the first SubInter block to enhance features early, and IS-EMA (External MA), where MA is applied between two SIL blocks to refine extracted features before final enhancement. The effectiveness of the models was tested on the VoiceBank-DEMAND dataset, which contains speech corrupted by real-world noises at various signal-to-noise ratios (SNRs). The training set consists of 11,572 utterances from 28 speakers, while the test set includes 824 utterances from 2 speakers with five types of noise at SNRs of 2.5, 7.5, 12.5, and 17.5 dB. Spectrograms are generated using a 32 ms Hanning window with a 16 ms frame shift, and performance is assessed using PESQ (speech quality), STOI (speech intelligibility), SI-SNR (signal reconstruction accuracy), COVL (perceived quality), and CSIG (signal distortion control).
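As a point of reference for the front-end described above, the following is a minimal sketch of a 32 ms Hann-window, 16 ms frame-shift magnitude spectrogram. The 16 kHz sampling rate is an assumption (it is the rate commonly used with VoiceBank-DEMAND) and the code is illustrative, not the authors' implementation.

```python
# Minimal sketch of the spectrogram front-end: 32 ms Hann window, 16 ms shift.
# The 16 kHz sampling rate is an assumption, not stated in this report.
import torch

SR = 16_000
WIN = int(0.032 * SR)   # 512-sample (32 ms) window
HOP = int(0.016 * SR)   # 256-sample (16 ms) frame shift

def magnitude_spectrogram(waveform: torch.Tensor) -> torch.Tensor:
    """waveform: (batch, samples) -> magnitude spectrogram (batch, freq_bins, frames)."""
    spec = torch.stft(
        waveform,
        n_fft=WIN,
        hop_length=HOP,
        win_length=WIN,
        window=torch.hann_window(WIN),
        return_complex=True,
    )
    return spec.abs()
```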

The results indicate that both IS-IMA and IS-EMA outperform the baseline Inter-SubNet model, but their effectiveness varies depending on the enhancement goal. IS-EMA achieves the highest intelligibility (STOI = 0.946) and best signal reconstruction (SI-SNR = 18.782), suggesting that placing MA later in the processing pipeline allows for better refinement of extracted speech features. Conversely, IS-IMA yields higher overall quality (COVL = 3.469) and reduces signal distortion more effectively (CSIG = 4.058), indicating that early feature enhancement helps suppress noise but may limit later refinements. An ablation study further examines the impact of different attention mechanisms, showing that channel and local attention improve IS-EMA's performance, while global attention benefits IS-IMA. The findings emphasize that the placement of the MA module significantly affects speech enhancement outcomes, with IS-EMA excelling in intelligibility and signal reconstruction, while IS-IMA is more effective at reducing distortion and improving overall perceived quality. The choice between these two approaches depends on the specific requirements of a given speech enhancement task.

Overall, the paper is well-written and well-structured. I have a few suggestions that, if implemented, could provide further insights into the differences between the two approaches. Specifically, I think the authors should consider comparing IS-IMA and IS-EMA at different signal-to-noise ratios (SNRs) separately rather than averaging results across all conditions. The paper reports results using a range of SNRs (0, 5, 10, and 15 dB for training; 2.5, 7.5, 12.5, and 17.5 dB for testing), but it is unclear how each method performs at individual SNR levels. Breaking down results by SNR could clarify whether one approach is more effective in low-SNR versus high-SNR conditions, offering a more detailed understanding of their respective strengths in varying noise environments.
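A hypothetical sketch of such a per-SNR breakdown is shown below, assuming each scored test utterance is logged together with its mixing SNR; the column names and values are placeholders, not results from the manuscript.

```python
# Hypothetical per-SNR breakdown of evaluation scores (placeholder values).
import pandas as pd

results = pd.DataFrame({
    "model":  ["IS-IMA", "IS-EMA", "IS-IMA", "IS-EMA"],
    "snr_db": [2.5, 2.5, 17.5, 17.5],
    "stoi":   [0.921, 0.925, 0.968, 0.966],   # illustrative numbers only
})

# Average STOI per model and per test SNR, instead of one average over all conditions
per_snr = results.groupby(["model", "snr_db"])["stoi"].mean().unstack("snr_db")
print(per_snr)
```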

While the paper presents a thorough evaluation, I also believe the authors should provide a more detailed interpretation of their results. The reported improvements in speech quality, intelligibility, and reconstruction accuracy are valuable, but what do these results actually mean in practical terms? For example, why does IS-EMA perform better in intelligibility while IS-IMA is better at reducing distortion? How do the differences in Multi-View Attention placement affect specific aspects of speech perception? Explaining the underlying reasons behind these performance differences would strengthen the impact of the study and help readers understand the significance of each approach beyond the reported metrics.

Additionally, I wonder whether the authors have considered the possibility of incorporating Multi-View Attention (MA) in both early and late positions within the model. The study examines the effect of placing MA either inside the first SubInter block (IS-IMA) or between the two SIL blocks (IS-EMA), but it may be worth exploring whether a dual-MA setup—where one MA module enhances raw speech features before deep feature extraction and a second MA module refines extracted representations before final enhancement—could combine the advantages of both IS-IMA and IS-EMA. While this would introduce additional computational complexity, an ablation study could provide insight into whether this approach yields further improvements. I encourage the authors to discuss this possibility and its potential trade-offs.
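For concreteness, a hedged sketch of this dual-MA arrangement is given below. The MA and SIL modules are passed in as placeholders (here nn.Identity) standing in for the authors' actual components, which are not reproduced here; only the wiring order is illustrated.

```python
# Hypothetical wiring of a dual-MA variant: one MA module before deep feature
# extraction and a second one between the two SIL blocks.
import torch
import torch.nn as nn

class DualMASketch(nn.Module):
    def __init__(self, ma_early: nn.Module, sil1: nn.Module,
                 ma_late: nn.Module, sil2: nn.Module):
        super().__init__()
        self.ma_early = ma_early   # refines raw spectral features (IS-IMA-like role)
        self.sil1 = sil1           # first SIL block (placeholder)
        self.ma_late = ma_late     # refines extracted features (IS-EMA-like role)
        self.sil2 = sil2           # second SIL block (placeholder)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.ma_early(x)
        x = self.sil1(x)
        x = self.ma_late(x)
        return self.sil2(x)

# Wiring example with identity placeholders; real MA/SIL modules would replace them.
model = DualMASketch(nn.Identity(), nn.Identity(), nn.Identity(), nn.Identity())
```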

Finally, I am also curious about the role of individual attention mechanisms (channel, global, and local attention) at different layers. The ablation study investigates the effect of isolating each attention mechanism, but another interesting avenue for future research could be whether different attention types should be applied at different stages of the model rather than all together. For example, local attention might be more beneficial in early layers to capture fine-grained phonetic details, while global attention could be more effective in later layers to ensure speech coherence. Similarly, channel attention could be tested separately at different points to determine where it is most beneficial. A structured, layer-specific approach to attention could lead to a more targeted enhancement process. While this may be beyond the scope of the current study, discussing it in the future work section could provide meaningful directions for further research.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

This manuscript integrates Multi-view Attention (MA) and Inter-SubNet to enhance key features, ensuring comprehensive speech signal processing. However, the way the results are expressed needs improvement to clearly show the contribution of this manuscript. In particular, some important parts are not clearly explained and need to be improved, as follows:

1. Multi-view Attention (MA) is defined multiple times in the manuscript. After the first definition, "Multi-view Attention" can be referred to as "MA" throughout the manuscript.
2. There is no comparison with other research; one should be provided.
3. The proposed method is described only briefly in the manuscript; more details should be provided.
4. The results section needs to be rewritten. Comparing the results with those of other studies would make the contribution of this study easier to understand.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

1. Limited Improvement Over Baseline:

Although the manuscript presents experimental results, the performance improvement over the baseline is marginal. As a result, the novelty and significance of the proposed approach remain unclear. It would be beneficial if the authors could include a more in-depth discussion regarding the model’s behavior under various SNR conditions—particularly very low SNR levels, where the robustness of speech enhancement systems is most critical.

Additionally, the authors are encouraged to evaluate or consider comparisons with more recent and relevant models. For instance, a recent DCCTN-based model offers state-of-the-art performance and could serve as a stronger benchmark or inspiration:

Mamun, N., & Hansen, J. H. (2024). Speech enhancement for cochlear implant recipients using deep complex convolution transformer with frequency transformation. IEEE/ACM Transactions on Audio, Speech, and Language Processing.

Inclusion of this recent model (or a rationale for its exclusion) would greatly improve the contextual relevance of the work.


2. Lack of Ablation Study:

One major limitation of the current version of the manuscript is the absence of an ablation study. Such a study is essential to demonstrate the relative contribution of each component of the proposed framework. Without it, it is difficult to determine which parts of the model are most effective and whether all design choices are justified.

I strongly recommend that the authors include a comprehensive ablation study—e.g., by incrementally removing or modifying components of the proposed architecture and measuring the corresponding performance change.

While the manuscript has seen some improvement, the core contributions are still limited, and key experimental justifications (particularly in terms of performance significance and component-wise effectiveness) are lacking. Addressing the concerns above would substantially strengthen the paper’s impact and clarity.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

In the revised manuscript, the authors provide more details on the comparison criteria and comparisons with other studies, highlighting the improvements and contributions of this study. Integrating a Multi-view Attention module as a front-end or intermediate component in Inter-SubNet has been confirmed to enhance performance.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf
