Article
Peer-Review Record

Ensemble System of Deep Neural Networks for Single-Channel Audio Separation

Information 2023, 14(7), 352; https://doi.org/10.3390/info14070352
by Musab T. S. Al-Kaltakchi 1, Ahmad Saeed Mohammad 2,* and Wai Lok Woo 3
Reviewer 1:
Reviewer 2:
Reviewer 3: Anonymous
Submission received: 24 April 2023 / Revised: 13 June 2023 / Accepted: 16 June 2023 / Published: 21 June 2023
(This article belongs to the Topic Advances in Artificial Neural Networks)

Round 1

Reviewer 1 Report

1. Page 12, line 459: "Figures 3 (a) and 3 (b) depict the results (b)." Where is Figure 3 (b)?

2. The presentation of the paper is excellent, but it is still unclear what the work offers the research community going forward. The authors are requested to elaborate on this.

3. Apart from DNNs, what other methods could be used to conduct this study?

4. Is there a baseline solution? The authors are requested to provide a comparison against the baseline results.

The English language is good.

The use of complex and compound sentences could be reduced in favor of simpler sentences.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

The authors presented an ensemble-based speech enhancement technique that automatically selects features to classify the time-frequency bins of the input and so create an ideal binary mask. On the data used, the technique performed moderately well on mixtures of a single speech source and music.
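For readers unfamiliar with the term, the ideal binary mask mentioned above can be sketched as follows. This is the textbook definition computed from known sources, not the authors' DNN-based estimate of it. The function name, the use of magnitude spectrograms as nested lists, and the 0 dB threshold `lc_db` are assumptions made for illustration only.

```python
import math


def ideal_binary_mask(target_mag, interference_mag, lc_db=0.0):
    """Return a 0/1 mask over time-frequency bins.

    A bin is set to 1 when the local SNR (target over interference,
    in dB) exceeds the threshold lc_db, and to 0 otherwise.
    Inputs are magnitude spectrograms given as lists of rows.
    """
    eps = 1e-12  # avoids division by zero and log of zero
    mask = []
    for t_row, i_row in zip(target_mag, interference_mag):
        row = []
        for t, i in zip(t_row, i_row):
            local_snr_db = 20.0 * math.log10((t + eps) / (i + eps))
            row.append(1 if local_snr_db > lc_db else 0)
        mask.append(row)
    return mask


# Toy 2x2 magnitudes (hypothetical values, not data from the manuscript).
speech = [[1.0, 0.1], [0.5, 2.0]]
music = [[0.5, 1.0], [0.5, 0.2]]
mask = ideal_binary_mask(speech, music)
print(mask)
```

In a separation system the mask is not known at test time; a classifier predicts each bin's label from features of the mixture, and the predicted mask is applied to the mixture spectrogram.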

There are several major issues with the manuscript:

- The PESQ, STOI and SDR scores are not comparable to those of the current speech enhancement state of the art (such as Demucs-Denoiser, SuDoRM-RF or Conv-TasNet). This is important, since the proposed technique is quite complex in nature (given all the feature selection that is carried out), and there was no comparison with the state of the art. The authors should give the reader context on how their results compare to current techniques.

- Although the central contribution of the proposed technique is the automatic selection of features, the complete set of features is based on a gammatone filter bank (this is even confirmed in Figure 1). This creates a kind of bottleneck for the contribution, since all features are based on it. The authors should extensively justify this decision and/or compare with other models trained with other types of features.

- No inference response times are given, which is something that several state-of-the-art speech enhancement techniques are aiming to reduce. The authors should report this as well, in the context of the state of the art.

There are some minor issues that should be corrected:

- The Introduction section (from lines 28 to 135) provides a lengthy presentation of what could be considered related work, and it is not until the end of the third page that the main contribution is presented. It is recommended to provide a brief summary of this work to give the reader context on why the main contribution is important, and to move this lengthy presentation to a separate "Related Work" section.

- Figures 9, 10 and 11 are placed far from the text that explains them.

It is recommended that the authors submit the manuscript to a professional academic proofreading service to fix its considerable number of grammatical and style errors.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

The paper presents a pipeline based on acoustic features and stacked DNNs for single-channel audio separation. My impression of the paper is mixed. While the introduction is very rich and detailed, it fails to position the authors' work in the literature. What are the pros and cons of each existing method, and how does the proposed method fill in the "gaps" in the literature? I found that very difficult to determine.

The proposed approach is very well presented and very detailed, although some minor text editing is required. The experimental setup is very rich, including speaker-dependent and speaker-independent scenarios. However, some subjective evaluation, in parallel to the objective scores, should be carried out in order to determine the performance of the models.

Moreover, the discussion is somewhat short. While so many algorithms exist for this specific task, the authors fail to discuss, for example, the complexity of their approach compared to others. Finally, it would be GREAT to have a dedicated webpage with some samples to listen to (mixture, speech, music, recovered signals) from each algorithm. The authors have the results, so why not share them? I would love to listen to the samples and see for myself the merits of the proposed method.

Minors:

1. Intro: A reference to CASCA would be great.

2. Page 2, line 77: It's Cycle-GAN (I think)

3. Intro: A reference to MSE would also be good.

4. Page 6, line 262: What is u?

5. Page 7 (and elsewhere): P mj is in text form.

6. No need to bold the frame of each figure.

7. Figure 4 should be replaced with a higher-quality one.

8. Page 19, line 615 (sigma has a strange note next to it).

It seems that each author has contributed to the writing of different parts in the manuscript. While most parts of the paper are quite well-written, the introduction is not at the same level as the rest of the manuscript (for example, the first sentence does not make any sense). I would suggest proofreading the introduction. The rest of the paper is fine.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 2 Report

The authors responded adequately to this reviewer's minor issues and to some of the major issues. However, two major issues remain unresolved, although they can be considered minor at this point, given the authors' response:

- Although the central contribution of the proposed technique is the automatic selection of features, the complete set of features is based on a gammatone filter bank (this is even confirmed in Figure 1). This creates a kind of bottleneck for the contribution, since all features are based on it. The authors should extensively justify this decision and/or compare with other models trained with other types of features.

The authors' justification in their response, although adequate, is not found within the manuscript. It should be added.

- No inference response times are given, which is something that several state-of-the-art speech enhancement techniques are aiming to reduce. The authors should report this as well, in the context of the state of the art.

Although the authors' response is appreciated, including the training time as part of the reported results, the reviewer asked for "response time during inference". This is the time the system takes to produce an output for a given input, which shows whether it is able to run in real time.
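One common way to report this is the real-time factor: processing time divided by the duration of the audio processed. The sketch below illustrates the measurement only. The function `real_time_factor`, the placeholder `dummy_separator`, and the 16 kHz sample rate are hypothetical and stand in for the authors' trained system and data.

```python
import time


def real_time_factor(separate, signal, sample_rate, repeats=5):
    """Return processing time divided by audio duration.

    `separate` is any callable that takes the mixture samples and
    returns the separated output. The fastest of `repeats` runs is
    used to reduce the influence of timing noise.
    """
    if len(signal) == 0:
        raise ValueError("signal must not be empty")
    audio_seconds = len(signal) / sample_rate
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        separate(signal)
        best = min(best, time.perf_counter() - start)
    return best / audio_seconds


def dummy_separator(signal):
    """Hypothetical stand-in for the separation model (simple scaling)."""
    return [0.5 * x for x in signal]


sample_rate = 16000                   # assumed sample rate in Hz
mixture = [0.0] * (2 * sample_rate)   # two seconds of silence as a placeholder
rtf = real_time_factor(dummy_separator, mixture, sample_rate)
print(f"real-time factor: {rtf:.4f}")
```

A real-time factor below 1 means the input is processed faster than it plays back. That is necessary for real-time use but not sufficient on its own, since the algorithmic latency (for example, frame and look-ahead length) also matters.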

The authors have improved the quality of the English language since the last revision.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf
