Article
Peer-Review Record

Enhanced Multiple Speakers’ Separation and Identification for VOIP Applications Using Deep Learning

Appl. Sci. 2023, 13(7), 4261; https://doi.org/10.3390/app13074261
by Amira A. Mohamed 1,2,*, Amira Eltokhy 3 and Abdelhalim A. Zekry 2
Reviewer 1:
Reviewer 2:
Submission received: 18 February 2023 / Revised: 20 March 2023 / Accepted: 24 March 2023 / Published: 28 March 2023
(This article belongs to the Special Issue Audio and Acoustic Signal Processing)

Round 1

Reviewer 1 Report

Many thanks for sharing your interesting research. The paper introduced an approach for improving real-time speech quality based on speaker recognition and speech separation algorithms.

Although the proposed method combines two existing methods, the novelty is still unclear. It is necessary to justify why these existing methods were selected, based on their pros and cons in comparison to the state of the art.

The structure of the paper can be improved. At the moment, the literature review is mixed with the methodology section, which makes it difficult for readers to understand the originality of the paper. Here is my humble suggestion:

1. Introduction:

- the problem,
- the difficulties in solving the problem,
- the author's considerations and the benefits of the proposed method,
- the proposed method.

2. Literature review:

- Speaker recognition methods with their pros and cons: what are the advantages of the used features, and how will these advantages benefit the proposed method?
- Speech separation methods with their pros and cons, etc.

The figure quality can be improved. In the PDF, the figures are very rough and hard to read. In addition, clearer figure captions would help readers. For example, in Figure 1, explaining the colors marking the signal (target speech) and the noise (unwanted signal) would help.

The results need to be compared with the state of the art to show the importance and original contribution of the proposed work.

 

Line 35-36: “robots” is confusing. Better to use a more specific description, e.g., “speech processing system”. Also, the sentence structure can be improved.

Line 61-62: why use a sample rate of “8000 bit/s”? If the resolution of the ADC is 8 bits, there would be only 1000 samples per second. Also, the sampling frequency should be given in units of Hz/kHz.
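The arithmetic behind this concern can be sketched as follows. This is only an illustration of the unit mismatch the reviewer points out; the 8-bit resolution is the reviewer's hypothesis, not a value stated in the manuscript:

```python
# A bit rate in bit/s is not a sampling frequency in Hz: dividing the bit
# rate by the per-sample resolution gives the actual samples per second.
bit_rate = 8000        # bit/s, as quoted from the manuscript
bits_per_sample = 8    # assumed ADC resolution (the reviewer's hypothesis)

sample_rate_hz = bit_rate // bits_per_sample
print(sample_rate_hz)  # 1000 samples per second, far below a typical 8 kHz rate
```

Stating the sampling frequency in Hz (e.g., 8 kHz) and the codec resolution separately would remove the ambiguity.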

Line 102: “Thus a”-> “Thus, a”

Line 144: “a kind of deep learning”, please revise.

Line 171-172: this sentence is confusing, please revise.

Line 185-186: Please define “relative speakers” and “random speakers”

Line 189-190: I understand the authors used a customized dataset to test the reliability. But the details of the recorded speakers (i.e., their number and gender) need to be provided. Also, is there any overlap between the training and testing datasets?

Line 197: “Excellent results”: better to use a figure or numbers to show how high the accuracy is.

Line 216-217: better to compare the proposed algorithm with other published works.

Line 222: Please define each of the terms used in equation (4).

Line 230-231: “the used model train faster for smaller number of speakers” please revise.

Line 243-244: the setup details (e.g., dataset size) of the tests with different numbers of speakers need to be provided. For example, if the 4-speaker training/testing dataset is larger than the 2-speaker one, the processing time will certainly increase.

Line 248: Please explain or cite a reference for the quantification of the CPU performance.

Line 251: maybe it is my oversight, but I cannot find the definitions of “BFS, SSSP, CC, BC”. I don't really understand Figure 8.

Author Response

Please see the attachment.

Author Response File: Author Response.docx

Reviewer 2 Report

The manuscript needs some grammar and style corrections (variables used in equations and in paragraphs do not match, expressions are incomplete, abbreviations are incorrect). Also, some acronyms are used before their definition or are not defined (RNN, MFCC, GMM, BC, BFS, CC, DOBFS, PR, SSSP, etc.).

The figures are in low resolution, which makes them difficult to read. The authors state that "The time for these audio files is 7 seconds for each mix file, with a sampling frequency 8000 bit/s". What is the sampling frequency in Hz, and what is the resolution of the codec?

In Section 2, it is unclear which parts are the authors' proposal and which are related work.

Please extend and clarify the explanation for the proposed method.

Please add the WSJ0-2mix dataset reference. A comparison of the proposed and existing methods should be performed in terms of SI-SNRi.
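For reference, SI-SNRi (scale-invariant signal-to-noise ratio improvement) is the standard separation metric reported on WSJ0-2mix. A minimal sketch of the underlying SI-SNR computation follows; this is the generic textbook definition, not the authors' code:

```python
import math

def si_snr(est, ref):
    """Scale-invariant SNR in dB between an estimated and a reference signal."""
    n = len(ref)
    mean_e, mean_r = sum(est) / n, sum(ref) / n
    est = [x - mean_e for x in est]          # remove means, as is conventional
    ref = [x - mean_r for x in ref]
    # Project the estimate onto the reference: the "target" component.
    scale = sum(e * r for e, r in zip(est, ref)) / sum(r * r for r in ref)
    target = [scale * r for r in ref]
    noise = [e - t for e, t in zip(est, target)]
    return 10 * math.log10(
        sum(t * t for t in target) / sum(x * x for x in noise)
    )
```

SI-SNRi is then `si_snr(separated, ref) - si_snr(mixture, ref)`, i.e., the gain over the unprocessed mixture; because the target is a scaled projection, the metric is invariant to rescaling the estimate.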

Author Response

Please see the attachment.

Author Response File: Author Response.docx

Round 2

Reviewer 1 Report

Thanks for the modifications. Now the manuscript looks much better. 

However, I have further concerns about the updated version.

 

Line 213-215: please check the text font format.

 

Line 265 and line 265, please check the section titles.

 

Figure 6, the figure quality can be improved.

 

Figure 8: it may not be a good idea to show the results using raw program output. I suggest removing it, or plotting accuracy as a function of the number of test speakers (i.e., a histogram of accuracy with 2, 3, and 4 speakers).

Author Response

Please see the attachment.

Author Response File: Author Response.docx

Reviewer 2 Report

The authors have addressed all of my suggestions. I recommend this paper for publication.

Author Response

Dear Reviewer 2,

Thank you very much for recommending our manuscript for publication.

Round 3

Reviewer 1 Report

Thanks for the updates. 
