Class-GE2E: Speaker Verification Using Self-Attention and Transfer Learning with Loss Combination
Round 1
Reviewer 1 Report
This paper proposes a combination of the GE2E loss and the cross-entropy loss to improve speaker verification performance, generating speaker features with several attention-based techniques.
Speaker verification remains a timely and interesting topic. However, I have several recommendations for the authors to improve the presentation of their paper.
Major issue
- The authors propose a new method in Section 3, but the technique is very poorly presented at both the conceptual and the mathematical level. I suggest that the authors explain the technique in detail and highlight their contributions precisely, not only in a single paragraph/subsection. In its current form, the proposed technique appears to be treated superficially.
Minor issue
- Please remove the title of Figure 2.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 2 Report
- The paper states that the Class-GE2E loss was used. However, from Figure 1, it appears that the GE2E loss was used only in pre-training and that the final training was done using cross entropy. Is that the case? Please clarify.
- If (1) is true, then what is the purpose of using GE2E for pre-training? Why can't cross entropy be used directly, and how does the result compare with using cross entropy directly? Actually, I would expect some combination of GE2E and cross entropy.
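To make the reviewer's expectation concrete, the following is a minimal NumPy sketch of one way GE2E (softmax variant) and cross-entropy losses could be combined as a weighted sum. The mixing weight `alpha` and the helper names are hypothetical illustrations, not the paper's actual formulation:

```python
import numpy as np

def ge2e_softmax_loss(emb, w=10.0, b=-5.0):
    """Softmax variant of the GE2E loss.
    emb: (n_spk, n_utt, dim) batch of utterance embeddings."""
    n_spk, n_utt, _ = emb.shape
    emb = emb / np.linalg.norm(emb, axis=-1, keepdims=True)
    centroids = emb.mean(axis=1)
    centroids /= np.linalg.norm(centroids, axis=-1, keepdims=True)
    loss = 0.0
    for j in range(n_spk):
        for i in range(n_utt):
            # exclude e_ji from its own centroid, as in the GE2E paper
            own = (emb[j].sum(axis=0) - emb[j, i]) / (n_utt - 1)
            own /= np.linalg.norm(own)
            sims = w * emb[j, i] @ centroids.T + b
            sims[j] = w * emb[j, i] @ own + b
            # cross-entropy over speaker similarities
            loss += -sims[j] + np.log(np.exp(sims).sum())
    return loss / (n_spk * n_utt)

def cross_entropy_loss(logits, labels):
    """Standard multi-class cross entropy over speaker classes."""
    logits = logits - logits.max(axis=1, keepdims=True)
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(labels)), labels].mean()

def combined_loss(emb, logits, labels, alpha=0.5):
    # alpha is a hypothetical mixing weight between the two objectives
    return alpha * ge2e_softmax_loss(emb) + (1 - alpha) * cross_entropy_loss(logits, labels)
```

Both terms pull embeddings of the same speaker together while separating different speakers, so a weighted sum is the simplest combination a reader might anticipate from the paper's title.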
- I would like the authors to explain the rationale behind the sorted multi-head attention.
- The paper only compares the sorted multi-head attention with single-head attention and multi-head attention. However, according to paperswithcode.com and https://doi.org/10.48550/arXiv.2106.09650, wav2vec achieves an EER of 1.88 on the VoxCeleb1 dataset, which is much lower than the 4.39 achieved by the authors. You have to justify what is special about this sorted multi-head attention.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 3 Report
This paper, “Class-GE2E: Speaker Verification Using Self-Attention and Transfer Learning with Loss Combination”, proposes an advanced multi-head attention method that uses a sorted vector of the frame-level features to capture higher correlation.
Regarding the paper, I have the following comments.
- The introduction section, along with the related work, is insufficient.
- The authors should state their contributions in the introduction section, as readers are also interested in how this work is similar to and different from existing work. The manuscript should be revised with this in mind. Also, state the novelty of the work.
- The authors are advised to add subsections on the attention mechanism, self-attention, and other attention-based higher-order pooling methods. A comparative table could be added for reference.
- In section 3, you should discuss the existing methods before the proposed method.
- What type of pre-processing of the dataset is used? How many samples are used for training and testing? Such information should be presented in the paper.
- What features were selected? How are they selected?
- Give a block diagram for the complete speaker verification system for better understanding.
- In this paper, the authors use cosine similarity, though other similarity and distance measures exist, such as Euclidean distance, Manhattan distance, Jaccard similarity, and Minkowski distance. Why were these measures not used?
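For reference, the measures the reviewer lists can each be sketched in a few lines of NumPy. This is an illustrative comparison, not code from the paper; one common argument for cosine similarity with speaker embeddings is that it is invariant to embedding magnitude, whereas the distance metrics below are not:

```python
import numpy as np

def cosine_similarity(a, b):
    # scale-invariant: only the angle between vectors matters
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def euclidean_distance(a, b):
    return np.linalg.norm(a - b)

def manhattan_distance(a, b):
    return np.abs(a - b).sum()

def minkowski_distance(a, b, p=3):
    # generalizes Euclidean (p=2) and Manhattan (p=1)
    return (np.abs(a - b) ** p).sum() ** (1.0 / p)

def jaccard_similarity(a, b):
    # defined for binary (set-membership) vectors
    a, b = a.astype(bool), b.astype(bool)
    return (a & b).sum() / (a | b).sum()
```

Note that scaling an embedding by a constant leaves the cosine score unchanged but changes every distance above, which is one reason cosine scoring is conventional for length-normalized speaker embeddings.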
- How are the results validated?
- There are some grammatical and typographical errors; please fix them in the revised manuscript.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Round 2
Reviewer 1 Report
Good luck!
Reviewer 2 Report
Nil.
Reviewer 3 Report
Ok with the revised manuscript.