Article
Peer-Review Record

An Analysis of the Short Utterance Problem for Speaker Characterization

Appl. Sci. 2019, 9(18), 3697; https://doi.org/10.3390/app9183697
by Ignacio Viñals, Alfonso Ortega, Antonio Miguel and Eduardo Lleida
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 11 July 2019 / Revised: 27 August 2019 / Accepted: 29 August 2019 / Published: 5 September 2019

Round 1

Reviewer 1 Report

This paper studies the effect of short utterances on the i-vector embedding representation in speaker recognition. It is argued that a mismatch in the distribution of phonemes between enrollment and test segments, in its most extreme case the complete absence of phonemes, is the main cause of the degradation of speaker recognition performance with short utterances. Although this is shown experimentally in a convincing way, no concrete solution is proposed to deal with the situation of poorly overlapping phonetic content between enrollment and test segments.

In the text leading to (1) it is merely posited that this holds for i-vectors, but I'd prefer a thorough derivation from the traditional theory of i-vectors.

In the low dimensional simulation (l221), it is not really clear what is done exactly.  I don't see how the two speakers come in and how they are parameterized, and why the blue speaker behaves differently from the red speaker. 

Formula (5) is definitely wrong, please fix. 

In 5.2, why did you decide to sample short utterances yourself, and not use the official 10sec excerpts defined in SRE10?

In 5.3, the process for finding short balanced segments needs to be detailed.  How, exactly, did you do this?

In studying the KL2 distance (line 377), with respect to what index are the probability distributions defined?  Is that the phonemic index found by the DNN?
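For readers following this question, a minimal sketch of what a symmetric KL (KL2) computation over such an index would look like, assuming discrete distributions over the DNN's phonetic units; the function name and toy distributions below are illustrative, not the authors' code:

    import numpy as np

    def kl2(p, q, eps=1e-12):
        # Symmetric Kullback-Leibler divergence between two discrete
        # distributions defined over the same index set.
        p = np.asarray(p, dtype=float) + eps
        q = np.asarray(q, dtype=float) + eps
        p, q = p / p.sum(), q / q.sum()
        return np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p))

    # Toy phoneme-occupancy distributions for an enrollment and a test
    # utterance, indexed by the DNN's phonetic units.
    enroll = np.array([0.30, 0.25, 0.20, 0.15, 0.10])
    test = np.array([0.05, 0.10, 0.20, 0.30, 0.35])
    print(kl2(enroll, test))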

In the paragraph at line 383, you ignore what happens to the KL2 distance for non-targets. This goes up too, but, if anything, that is going to help discrimination. You need to say something about it.

In the same paragraph, you refer to table 4, but that has not been introduced yet, so I don't see why it's referenced there. 

In line 400, the claim that target trials are the ones mainly responsible for the degradation does not follow from the analysis above, IMHO.  Figure 6 is more convincing towards that claim. 

In paragraph 454, I don't think you can argue that errors move from target (misses) to non-target (FAs). The operating point is chosen based on the minimum DCF, and that is, I assume, optimized per condition.  So the particular ratio of misses and FAs is, in a way, arbitrary.  You can, however, look at the distribution of llh-scores, and see if there is a shift in either targets or non-targets or both. 
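To make this point concrete, a minimal sketch of a minDCF scan, assuming SRE-style cost parameters (c_miss = 10, c_fa = 1, p_target = 0.01; these values and the synthetic scores are assumptions, not taken from the paper). Because the cost-minimizing threshold is recomputed per condition, the miss/false-alarm split it yields is a by-product of the score distributions rather than a stable property:

    import numpy as np

    def min_dcf(target_scores, nontarget_scores,
                p_target=0.01, c_miss=10.0, c_fa=1.0):
        # Scan all observed scores as candidate thresholds and return the
        # minimum detection cost with the miss/FA rates at that point.
        best = (np.inf, None, None)
        for t in np.sort(np.concatenate([target_scores, nontarget_scores])):
            p_miss = np.mean(target_scores < t)
            p_fa = np.mean(nontarget_scores >= t)
            dcf = c_miss * p_miss * p_target + c_fa * p_fa * (1 - p_target)
            if dcf < best[0]:
                best = (dcf, p_miss, p_fa)
        return best

    # Synthetic llh-scores for illustration only.
    rng = np.random.default_rng(0)
    print(min_dcf(rng.normal(2.0, 1.0, 1000), rng.normal(0.0, 1.0, 10000)))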

Some smaller remarks:

- l35, "is" -- "has been"

- l43, "voiceprint" -- please refrain from using this term in relation to speaker recognition.

- l90 "such as [19, 20]" -- please expand on what these techniques are

- l95 "the compensation ... are also balanced" -- I don't think this is what you mean

- l171 "are benefitted with" -- "benefit from" 

- l171 "Any" -- "Every"

- l172 "tends to" -- "becomes like"

- l174 "latest" -- "last"

- l174 "In" -- "As a"

- l180 "tends to" -- "is similar to"

- l182 "its" -- "their"

- l183 "its" -- "their"

- l185 "according to" -- "by"

- l187 "few" -- "little"

- l187 "raises" -- "increases"

- l190 "with" -- remove

- l239 "Thanks to" -- "Following" or "With"

- l406 "LDC" -- this is new here

- l420 "Discriminative" -- "Decision"

 

Author Response

Please see the attachment

Author Response File: Author Response.pdf

Reviewer 2 Report

The paper is very good and well written. It provides good explanations of i-vectors, but some of the explanations may be wrong and need clarification. In particular, Eq. 2 is over-simplified and may not agree with the equation for i-vector extraction. This leads to a problematic interpretation in Figs. 2-4. Note that the posterior mean (i-vector) of the latent factor w involves two summations over the Gaussians, one in the numerator and one in the denominator. In Eq. 6 of Dehak's paper, both N(u) and F(u) involve summations over Gaussians.
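For readers following this exchange, a minimal sketch of the standard i-vector posterior mean, w = (I + T'Σ⁻¹N(u)T)⁻¹ T'Σ⁻¹F(u), written per Gaussian so that both summations the reviewer refers to are visible. The variable names, the per-component layout of T, and the toy dimensions are assumptions of this sketch, not the paper's implementation:

    import numpy as np

    def extract_ivector(N, F, T, Sigma):
        # N     : (C,)      zeroth-order stats (posteriors summed over frames)
        # F     : (C, D)    centered first-order stats per Gaussian
        # T     : (C, D, R) total variability matrix, split per Gaussian
        # Sigma : (C, D)    diagonal UBM covariances
        C, D, R = T.shape
        precision = np.eye(R)   # term that ends up inverted ("denominator")
        linear = np.zeros(R)    # term it multiplies ("numerator")
        for c in range(C):      # both terms sum over the Gaussians
            TtSinv = T[c].T / Sigma[c]           # (R, D) = T_c' Sigma_c^{-1}
            precision += N[c] * TtSinv @ T[c]
            linear += TtSinv @ F[c]
        return np.linalg.solve(precision, linear)  # posterior mean of w

    # Toy dimensions: C=4 Gaussians, D=3 features, R=2 factors.
    rng = np.random.default_rng(0)
    print(extract_ivector(np.array([10.0, 5.0, 0.0, 2.0]),
                          rng.normal(size=(4, 3)),
                          rng.normal(size=(4, 3, 2)),
                          np.ones((4, 3))))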

 

- Eq. 2 is intuitive, but it is unclear how the i-vector extraction equation can be written in the form of Eq. 2. The authors are encouraged to write out the i-vector equation and explain how it can be brought into the form of Eq. 2.

 

- Lines 255-267: The authors state that \alpha_i is the (prior) probability of feature vectors being sampled from component i of the GMM. But the i-vector extraction equation (the posterior mean of w) is more complicated than Eq. 2. If we consider \Gamma_i(A) as the first-order sufficient statistics times the posterior covariance and \alpha_i as the i-th component of T times the inverse of the i-th UBM covariance, then \alpha_i is not the prior probability.

 

- Lines 263-269: The concept is problematic here because \alpha_i depends on the zeroth-order sufficient statistics. You may have non-uniform posteriors for the 4 Gaussians, but that does not mean that the same non-uniformity appears in \alpha_i.

 

- Fig. 4: If you want two phonemes to have no contribution, you should not set their corresponding \alpha_i to 0. Instead, you should set the posterior probability of the corresponding Gaussians to 0, i.e., \gamma_i(x) = Pr(mixture = i | x) = 0. Then, use these posteriors to compute \alpha_i, which may not equal 0.
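One reading of this suggestion, as a sketch (the helper name and toy posteriors are hypothetical): zero the masked Gaussians' frame posteriors, renormalize each frame, and only then accumulate the zeroth-order statistics, so the remaining components absorb the removed mass, which is not the same as zeroing the original \alpha_i directly:

    import numpy as np

    def alphas_with_masked_components(gamma, masked):
        # gamma  : (N_frames, M) frame posteriors Pr(mixture = i | x_t)
        # masked : indices of Gaussians whose posteriors are forced to zero
        g = gamma.copy()
        g[:, masked] = 0.0
        g /= g.sum(axis=1, keepdims=True)  # renormalize every frame
        N_i = g.sum(axis=0)                # zeroth-order statistics
        return N_i / N_i.sum()             # alpha_i = N_i(A) / N(A)

    # Toy posteriors: 5 frames, 4 Gaussians; components 2 and 3 masked.
    rng = np.random.default_rng(1)
    print(alphas_with_masked_components(rng.dirichlet(np.ones(4), 5), [2, 3]))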

 

- When the authors selected the frames randomly or in a balanced manner, did they perform VAD first or perform it after the selection?

 

Author Response

Please see the attachment

Author Response File: Author Response.pdf

Reviewer 3 Report

In the article the authors successfully defend the hypothesis that the quality of speaker recognition/verification depends more on the phonetic content of the enrollment/verification utterances than on their length.

Advantages of the article: 1) A clearly stated hypothesis. 2) A thorough verification of the hypothesis on speech corpus data.

Disadvantages of the article: 1) The hypothesis has not been used in any production speaker identification/verification system. 2) The scientific soundness of the article could be improved if more than 50% of the citations were less than 5 years old. 3) Some typographical errors should be corrected, e.g., the dual logarithm in Equation (5).

 

Author Response

The authors thank the reviewer for his/her effort and for the suggestions provided.

Regarding the suggestions, we accept the criticism and have modified the manuscript to fix the typos the reviewer mentions, including the errors in Equation (5).

Round 2

Reviewer 2 Report

The authors have addressed most of my concerns. But the main one (the definition of \alpha_i) is still problematic and requires further clarification. My comments and suggestions are as follows.

1) In Eqs. 4-7, $\alpha_i = \frac{N_i(A)}{N(A)}$. The authors need to define $N(A)$. According to the text, it seems that $N(A)$ is the number of frames in utterance $A$.
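A one-line check supporting that reading (assuming soft alignments whose posteriors $\gamma_i(a_t)$ sum to one over the $M$ components for every frame):

\[
N(A) = \sum_{i=1}^{M} N_i(A)
     = \sum_{t=1}^{N} \sum_{i=1}^{M} \gamma_i(a_t)
     = \sum_{t=1}^{N} 1 = N .
\]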

2) In Eq. 7, $\Sigma$ (without any subscript) is a function of both $A$ and $\{\alpha_i\}_{i=1}^{M}$. It is better not to use $\Sigma$, as it can be confused with $\Sigma_i$. The problematic part of Eq. 7 is that the authors ignore the dependence of the function $\Sigma$ on $\{\alpha_i\}_{i=1}^{M}$. Suppose we define $\alpha = \{\alpha_1,\ldots,\alpha_M\}$. Then, Eq. 7 could be written as

\[
   (\Sigma(A,\alpha))^{-1} \sum_{i=1}^{M} \alpha_i \Gamma_i(A)
   = \sum_{i=1}^{M} \alpha_i (\Sigma(A,\alpha))^{-1} \Gamma_i(A)
\]

Note that in this equation, $\alpha_i$ appears in the inverse as well as in the summation, which is what I meant in the previous report. The authors may argue that the linear combination weights are some $\beta_i$, as shown in the above equation. But in that case, $\beta_i$ will not be the posterior probability. Note also that $\alpha_i$ is not the posterior probability of each feature vector (as claimed by the authors), because $N_i(A)$ is the zeroth-order sufficient statistic: it is the sum of the posterior probabilities across all frames in the utterance.

3) Note that the authors define $A$ as $A=\{a_1,\ldots,a_N\}$, where $N$ is the number of frames. The use of $A_i$, where $i=1,\ldots,M$, will cause confusion.

 

Author Response

Please see the attachment

Author Response File: Author Response.pdf

Round 3

Reviewer 2 Report

My comments and suggestions have been well addressed. The paper is now ready for publication.

Author Response

The authors want to thank the reviewer for the time and effort spent reading the manuscript and for his/her recommendation. The suggestions were very valuable for improving our work.
