Article
Peer-Review Record

Speech Enhancement for Secure Communication Using Coupled Spectral Subtraction and Wiener Filter

Electronics 2019, 8(8), 897; https://doi.org/10.3390/electronics8080897
by Hilman Pardede 1, Kalamullah Ramli 2,*, Yohan Suryanto 2, Nur Hayati 2 and Alfan Presekal 2
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 28 June 2019 / Revised: 5 August 2019 / Accepted: 13 August 2019 / Published: 14 August 2019
(This article belongs to the Section Computer Science & Engineering)

Round 1

Reviewer 1 Report

This paper describes a speech enhancement algorithm based on spectral subtraction. It is a variant of the spectral subtraction method in which the noise subtracted from the signal is obtained through a Wiener filter.
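For orientation, the baseline being modified can be sketched as follows. This is the generic textbook form of power spectral subtraction for a single frame, not the manuscript's algorithm; the function name, the per-bin power inputs, and the spectral floor value of 0.01 are assumptions made only for illustration.

```python
def power_spectral_subtraction(noisy_power, noise_power, floor=0.01):
    """Subtract a noise power estimate from a noisy power spectrum, bin by bin.

    noisy_power: per-bin power |Y|^2 of one frame (hypothetical input).
    noise_power: per-bin noise power estimate for the same frame.
    floor: assumed spectral floor, as a fraction of the noisy power,
           used so the result never becomes negative.
    """
    if len(noisy_power) != len(noise_power):
        raise ValueError("spectra must have the same number of bins")
    enhanced = []
    for y, d in zip(noisy_power, noise_power):
        # Plain subtraction, clamped to a small fraction of the noisy power.
        enhanced.append(max(y - d, floor * y))
    return enhanced


# Two bins: one where speech dominates, one where the noise estimate
# exceeds the observed power and the floor takes over.
result = power_spectral_subtraction([4.0, 1.0], [1.0, 2.0])
print(result)
```

In the variant under review, the noise power passed to such a routine would come from the Wiener-filter-based estimator rather than from a conventional noise tracker.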

The issue is indeed important for electronic devices. Unfortunately, the manuscript contains so many inaccuracies that it is impossible even to say whether the algorithm works or not.


General comments


First of all, the naming of the variables: the authors use 'm' to indicate frequency (why not 'f'?). However, in many places they refer to the previous frame as 'm-1'. The explanation of what they mean by 'm' and 'k' is given at the beginning of Section 4, while it should be given at the beginning of Section 2.

In eq.(6) they write X=H*Y, while in (15) they write D=H*Y, where X is the recovered signal and D the noise. This deserves a much longer explanation.
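One possible reading that would reconcile the two expressions is sketched below. It is an assumption, since eqs. (6) and (15) are not reproduced in this record: the two filters would be complementary gains, with $\xi(m,k)$ denoting the a priori SNR, $m$ the frame index, and $k$ the frequency index. The subscripts on $H$ are introduced here only to tell the two gains apart.

```latex
\begin{align}
Y(m,k) &= X(m,k) + D(m,k), \\
\hat{X}(m,k) &= H_X(m,k)\, Y(m,k), & H_X(m,k) &= \frac{\xi(m,k)}{1 + \xi(m,k)}, \\
\hat{D}(m,k) &= H_D(m,k)\, Y(m,k), & H_D(m,k) &= 1 - H_X(m,k) = \frac{1}{1 + \xi(m,k)}.
\end{align}
```

If this is what is intended, using the same symbol $H$ for both gains hides the fact that they are different filters, which is why the point needs to be spelled out in the text.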

In (5) they use 'N' to represent noise, which is denoted 'D' in what follows.

Why do they use the constant 10^-6 in (12)? Is it an empirical constant? Does it come from previous knowledge or published research? Please explain. In general, keep in mind that every constant must be very well justified, otherwise it may seem that it is suggested by God.

Twenty sentences is an unacceptably small amount of data to convince anyone. The right number could be the entire dataset, with the results averaged over the entire set. What is the problem with running the code on more data?

The list of utterances in Fig. 2 is useless for the general reader; a phonetic transcription would be much better.

When the authors use other methods such as SS, KLT, Martin, Hirsch, etc., do they use publicly available code (downloaded from where? do they use Matlab?) or did they program the algorithms themselves?

Figures 6-8 are useless and serve only to convince the reader that the authors ran some code. The completely missing part is: how does the algorithm work at different SNRs? They do not even say what the SNR of the noisy speech is with the different coders.

It seems that the authors use only coders. Where is the encryption? What type of encryption do they use?

The results reported in Table 3 show that the quality with SS is worse than that obtained with noisy speech. This is absolutely impossible. There is something wrong in the algorithm chain. You must check the algorithm.

An interesting result could be the computational complexity of the algorithm.


Particular comments


Many phrases in the text are not written in standard English. For example, what does '... are generally not hold...' mean? Please have the text checked by a native-speaker proofreader, not just Google Translate.

At line 138 there is an error in the noise updating formula.

Be careful to always denote the noise from the Wiener filter as Dn and the other noise as D. The marks over D, such as hat (^), tilde, or bar, are not distinguishable.

There are other language errors, but if you check carefully you may correct them all.


Author Response

Thank you very much for the suggestions. Attached is the reply to the comments.

Author Response File: Author Response.pdf

Reviewer 2 Report

Contributions:

This paper presents a speech enhancement method for secure communication that combines a Wiener filter and spectral subtraction for noise estimation. I think this paper requires major revisions. My comments are given below:

1.     (Line 131 in page 4) The statement “… clean signal y(t)” is not correct. It should be “x(t)”.


2.     (Page 4) The expression of eq. (6) is not adequate. X(m,k) should be the estimated spectrum of clean speech. A hat for the symbol X(m,k) is required.


3.     (Page 5) The marked area should be presented by a dotted line in Fig. 1.


4.     (Line 1 after Fig. 1 of Page 5) The statement “where m is the frequency index and k is the frame index.” is not correct. It should be revised as “where m and k denote the frame and frequency indices, respectively.”


5.     (Page 5) The expression of eq. (8) is wrong.


6.     (Page 6) The expression of eq. (11) is not correct. The value of q for the second condition is missing.


7.     (Page 6) The authors use the a posteriori SNR given in eqs. (13) and (14) to estimate the a priori SNR psi given in eq. (12). I think this method is not good. Please give an utterance as an example and show the performance by comparing the true a priori SNR with the estimated one.


8.     (Page 10) It is not possible to tell which of the compared methods is better by observing Fig. 5. Please provide another example using an utterance that contains speech-pause regions.


9.     (Line 1 after eq.(13) in page 6) The symbol Φn is a typo.


10.  (Line 2 after eq. (9) in page 5) The expression of midY(m,k) is unclear. I think it is a typo.


11.  The presentation of the caption for each sub-figure in Figs. 6-8 is not correct. Please prepare them according to the authors’ guide of the journal.

   

12.   (Line 58 in page 2) The abbreviation SS is re-defined. It has been defined at line 54.


13.  Sections 2 and 3 should be combined into one section.


14.  Sections 5 and 6 also can be combined into one section.


15.  (Line 118 in page 4) The sentence beginning with “Usually, …” should be moved to the next paragraph.


16.   (Page 7) The font size in Fig. 2 should be smaller.


17.  The sub-grid lines in Tables 1-3 should be removed.


18.  The presentation of caption for Table 2 is too redundant. Some statements can be moved to text. Please revise it.


19.  The English usage is not satisfactory. It is strongly recommended that the paper be proofread by a native English speaker to improve the writing quality.

A.     (Line 94 in page 3) The word during is repeated.

B.     (Line 3 of page 4) “… with limited window’s length.” can be revised as “… with limited window length.”

C.     (Lines 125 to 126 in page 4) The descriptions “Usually, they are determined heuristically based on the estimate of the signal-to-noise ratio (SNR). Their examples are [19,21,22].” should be revised as “They are determined heuristically based on the estimate of the signal-to-noise ratio (SNR) [19,21,22].”.

D.     (The last line of page 4) “…fourier transform…” should be revised as “…Fourier transform…”.

E.     (Line 1 after Fig. 1 of Page 5) “… is fed into the our noise estimator…” should be revised as “… is fed into the proposed noise estimator…”.

F.      (Line 206 in page 9) The statement “Figure 5 compares the noise estimates of our method with other noise estimators:” should be revised as “Figure 5 shows the comparisons of the noise estimates for our method with other noise estimators:”.
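Comment 7 above asks for the true and the estimated a priori SNR to be shown side by side. A minimal sketch of such a comparison for one frequency bin is given below. Because eqs. (12) to (14) are not reproduced in this record, the estimator used here is the textbook maximum-likelihood form, max(gamma - 1, 0), standing in for the authors' own; the function names and the per-frame powers are hypothetical and are not data from the paper.

```python
def ml_a_priori_snr(gamma):
    """Maximum-likelihood a priori SNR estimate from the a posteriori SNR gamma."""
    return max(gamma - 1.0, 0.0)


def compare_snr(clean_power, noisy_power, noise_power):
    """Return (true, estimated) a priori SNR pairs, one pair per frame."""
    pairs = []
    for x2, y2 in zip(clean_power, noisy_power):
        true_xi = x2 / noise_power   # true a priori SNR: |X|^2 / noise power
        gamma = y2 / noise_power     # a posteriori SNR: |Y|^2 / noise power
        pairs.append((true_xi, ml_a_priori_snr(gamma)))
    return pairs


# Hypothetical per-frame powers for one frequency bin (assumed values).
pairs = compare_snr(clean_power=[8.0, 0.0, 2.0],
                    noisy_power=[10.0, 1.0, 6.0],
                    noise_power=2.0)
for true_xi, est_xi in pairs:
    print(true_xi, est_xi)
```

With real data, the clean power would come from the clean recording before noise or coding is applied, and the two tracks could be plotted over time for a single utterance.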


Author Response

Attached is the response to the reviewers' comments.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

The manuscript has improved since the first submission.

Several critical points I found are now quite clear, and I think that this contribution could be useful for many journal readers.


The only further point is the following. My request to give the testing utterances in phonetic transcription is due to my interest in the pronunciation of the utterances. However, I still do not know whether the word cepat, which you transcribe as c-e-p-a-t, is pronounced as /tj/ /e/ /p/ /a/ /t/ or /ch/ /e/ /p/ /a/ /t/, or what type of vowels are used. In my opinion, it is useful to know the consonant/vowel content of the testing material in order to have more details on how your approach works. This can be done simply by using IPA symbols or by stating what kind of symbols are used.



Author Response

Thank you for the suggestions.

We have revised Table 1 to show the phonetic pronunciation of each utterance using IPA symbols. We hope the presentation of Table 1 suffices to give readers a clear view of their pronunciation.

Reviewer 2 Report

The revised version of this paper has been much improved by the authors. It requires minor revision before publication.


In my previous comment 7: “The presentation of the caption for each sub-figure in Figs. 6-8 is not correct. Please prepare them according to the authors’ guide of the journal.” The authors still did not prepare the figure captions according to the authors’ guide of the journal. For example, the sub-caption “(a) Original clean speech” should be revised as “(a)”. The other statements should be moved to the caption. In addition, the sub-caption should be positioned at the bottom of each figure.


Author Response

Thank you for pointing out our mistakes on the caption for multiple figures. We have modified the captions of Figs. 4-9 accordingly to meet the standard for MDPI journals.
