Auditory Device Voice Activity Detection Based on Statistical Likelihood-Ratio Order Statistics
Round 1
Reviewer 1 Report
This manuscript presents a technique to reduce the false-rejection error of an LRT-based VAD. The method uses a DFT filter bank, and the experiments show a decrease in the false-rejection probability. This is interesting work. A few questions need to be addressed before publication can be considered.
1. Compared with the previous methods in the literature, what are the strengths of this work? Please clarify this in the introduction.
2. This technique reduces the false-rejection probability well for SNRs between 0 and 20 dB. How does it perform at SNRs above 20 dB?
Author Response
I would like to thank the reviewer for the constructive comments, which helped me improve the quality of the paper. I have tried my best to revise the original manuscript by taking into account all the comments raised by the reviewer, and I have responded to each comment individually in this letter. The responses are listed as follows.
Author Response File: Author Response.pdf
Reviewer 2 Report
Two sections have the same number: 2. Auditory Device VAD Implementation, 2. Conventional Statistical LRT-Based VAD.
Line 47. It is worth adding one more important VAD method, which detects the 4 Hz frequency component in speech. This frequency results from the average syllable duration of 0.25 seconds.
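As a rough illustration of the cue meant in this comment, the sketch below measures how much of a signal's envelope-modulation energy lies near 4 Hz. It is only a minimal sketch under stated assumptions, not a method taken from the manuscript or from a specific reference: the function name `modulation_energy_ratio`, the 10 ms frame length, the RMS envelope, and the 3-5 Hz band are all hypothetical choices made for illustration.

```python
import numpy as np

def modulation_energy_ratio(x, fs, frame_len=0.010, band=(3.0, 5.0)):
    """Fraction of envelope-modulation energy inside `band` (in Hz).

    Hypothetical helper: frame length and band limits are illustrative
    assumptions, not values from the reviewed manuscript.
    """
    x = np.asarray(x, dtype=float)
    hop = int(round(frame_len * fs))          # samples per envelope frame
    n_frames = len(x) // hop
    if n_frames < 2:
        return 0.0
    frames = np.reshape(x[:n_frames * hop], (n_frames, hop))
    envelope = np.sqrt(np.mean(frames ** 2, axis=1))   # short-time RMS envelope
    envelope = envelope - np.mean(envelope)            # remove the DC term
    spectrum = np.abs(np.fft.rfft(envelope)) ** 2      # modulation spectrum
    freqs = np.fft.rfftfreq(n_frames, d=hop / fs)      # modulation frequencies
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    total = np.sum(spectrum[1:])
    return float(np.sum(spectrum[in_band]) / total) if total > 0 else 0.0
```

A signal whose envelope fluctuates at the syllable rate should give a ratio close to 1, while stationary noise should give a small ratio; a detector built on this idea would compare the ratio with a threshold that has to be tuned on data.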
Lines 97-98. It would be much better to describe R=16 as “the frame shift length via STFT approach” rather than as the “down sample factor”.
Lines 99-101. There is a lack of uniqueness in the indexing of the vector components of x and . For R=16 and K=128, the indices have the form 16 l + k+1 where E.g. for frame l=1 and k=35 the index is equal to 66, the same as the index for frame l=2 and k=51.
Line 100. Why is the LPF called “the prototype” filter?
Line 101. The text in line 101 is difficult to understand, and the English should be improved. Probably x is the input of the LPF filter and is a filter output. This should be shown in Fig. 1.
The entire paragraph from line 97 to line 108 should be reworded. Too many facts are described too briefly in it.
In line 107, it's better to write 2q instead of k = 2q.
Lines 109-113 are difficult to understand, and it is inconvenient to have to seek an explanation in the cited literature [20-22, 26].
Lines 118-119. It is not true that X(l) is “noisy speech”. According to (1), X(l) can be noise only.
Formula (5) contradicts intuition. It is the ratio of the probability of speech presence to the probability of speech absence. The function on the right side of (5) is a strongly monotonically decreasing function of SNR. It follows that the stronger the speech signal is relative to the noise, the less likely speech presence is to be detected.
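The monotonicity concern can be checked numerically. The sketch below does not reproduce the manuscript's formula (5); it assumes the per-bin likelihood ratio commonly used in Gaussian statistical-model VADs, Λ = exp(γξ/(1+ξ)) / (1+ξ), where γ is the a posteriori SNR and ξ is the a priori SNR. The function names are hypothetical.

```python
import numpy as np

def log_likelihood_ratio(gamma, xi):
    """log of exp(gamma * xi / (1 + xi)) / (1 + xi).

    Assumed standard Gaussian-model form; formula (5) of the manuscript
    may differ from it.
    """
    return gamma * xi / (1.0 + xi) - np.log1p(xi)

def is_nondecreasing(values):
    """True if the sequence never decreases from one sample to the next."""
    return bool(np.all(np.diff(values) >= 0))

# Case A: both SNRs grow together (xi = gamma - 1, assumed here).
gammas = np.linspace(1.0, 100.0, 200)
case_a = log_likelihood_ratio(gammas, gammas - 1.0)

# Case B: gamma is held fixed while only xi grows.
xis = np.linspace(0.0, 50.0, 200)
case_b = log_likelihood_ratio(2.0, xis)

print(is_nondecreasing(case_a), is_nondecreasing(case_b))
```

Under this assumed form, the log-ratio grows with SNR when ξ and γ rise together, but for a fixed γ it decreases once ξ exceeds γ - 1. If (5) is written as a function of a single SNR variable with the other quantity held fixed, that may explain the decreasing behaviour, and the text should state which quantity is being varied.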
Line 123. What is DD?
In line 124 it is better to write “where 0 ≤ ? < 1” instead of “where ?(0 ≤ ? < 1)”.
In line 125, (7) “is expressed as” SNR. This name is strange because the SNR is always defined as the ratio of the signal power to the power of the background noise.
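For reference, the definition stated in this comment can be written as a short function. This is a minimal sketch, assuming the clean signal and the noise are available separately as sample arrays; the name `snr_db` is hypothetical, and the result is expressed in decibels.

```python
import numpy as np

def snr_db(signal, noise):
    """Ratio of mean signal power to mean noise power, in dB."""
    signal_power = np.mean(np.square(np.asarray(signal, dtype=float)))
    noise_power = np.mean(np.square(np.asarray(noise, dtype=float)))
    return 10.0 * np.log10(signal_power / noise_power)
```

A quantity that is not a ratio of these two powers should be given a different name in the manuscript, or its relation to this definition should be stated explicitly.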
Figures should be placed after the text in which they are introduced. This means that Figures 2-4 should be moved further down.
Line 196. It should read “by the author” instead of “by the first author” because the paper has no co-authors.
Line 203. It should be written “Figure 4” instead of “Figure 3”.
Author Response
I would like to thank the reviewer for the constructive comments, which helped me improve the quality of the paper. I have tried my best to revise the original manuscript by taking into account all the comments raised by the reviewer, and I have responded to each comment individually in this letter. The responses are listed as follows.
Author Response File: Author Response.pdf