Class-GE2E: Speaker Verification Using Self-Attention and Transfer Learning with Loss Combination
Round 1
Reviewer 1 Report
This paper proposes a combination of the GE2E loss and the cross-entropy loss to improve speaker verification performance, generating speaker features with several attention-based techniques.
Speaker verification remains a timely and interesting topic. However, I have several recommendations for the authors to improve the presentation of their paper.
Major issue
- The authors propose a new method in Section 3, but the technique is very poorly presented at both the conceptual and the mathematical level. I suggest that the authors explain the technique in detail and highlight their contributions precisely, not only in a single paragraph/subsection. In its current form, the proposed technique appears to be treated superficially.
Minor issue
- Please remove the title of Figure 2.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 2 Report
- The paper states that the Class-GE2E loss was used. However, from Figure 1, it appears that the GE2E loss was used only in pre-training and that the final training was done using cross entropy. Is that the case? Please clarify.
- If (1) is true, then what is the purpose of using GE2E for pre-training? Why can't cross entropy be used directly, and how does the result compare with using cross entropy directly? Actually, I would expect some combination of GE2E and cross entropy.
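To make the reviewer's expectation concrete, the following is a minimal NumPy sketch of one way GE2E (softmax variant) and cross-entropy losses could be combined as a weighted sum. The mixing weight `alpha` and the helper names are hypothetical illustrations, not the paper's actual formulation:

```python
import numpy as np

def ge2e_softmax_loss(emb, w=10.0, b=-5.0):
    """Softmax variant of the GE2E loss.
    emb: (n_spk, n_utt, dim) batch of utterance embeddings."""
    n_spk, n_utt, _ = emb.shape
    emb = emb / np.linalg.norm(emb, axis=-1, keepdims=True)
    centroids = emb.mean(axis=1)
    centroids /= np.linalg.norm(centroids, axis=-1, keepdims=True)
    loss = 0.0
    for j in range(n_spk):
        for i in range(n_utt):
            # exclude e_ji from its own centroid, as in the GE2E paper
            own = (emb[j].sum(axis=0) - emb[j, i]) / (n_utt - 1)
            own /= np.linalg.norm(own)
            sims = w * emb[j, i] @ centroids.T + b
            sims[j] = w * emb[j, i] @ own + b
            # cross-entropy over speaker similarities
            loss += -sims[j] + np.log(np.exp(sims).sum())
    return loss / (n_spk * n_utt)

def cross_entropy_loss(logits, labels):
    """Standard multi-class cross entropy over speaker classes."""
    logits = logits - logits.max(axis=1, keepdims=True)
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(labels)), labels].mean()

def combined_loss(emb, logits, labels, alpha=0.5):
    # alpha is a hypothetical mixing weight between the two objectives
    return alpha * ge2e_softmax_loss(emb) + (1 - alpha) * cross_entropy_loss(logits, labels)
```

Both terms pull embeddings of the same speaker together while separating different speakers, so a weighted sum is the simplest combination a reader might anticipate from the paper's title.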
- I would like the authors to explain the rationale behind the sorted multi-head attention.
- The paper only compares the sorted multi-head attention with single-head attention and multi-head attention. However, according to paperswithcode.com and https://doi.org/10.48550/arXiv.2106.09650, wav2vec achieves an EER of 1.88 on the VoxCeleb1 dataset, which is much lower than the 4.39 achieved by the authors. You have to justify what is special about this sorted multi-head attention.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 3 Report
This paper, “Class-GE2E: Speaker Verification Using Self-Attention and Transfer Learning with Loss Combination”, proposes an advanced multi-head attention method that uses a sorted vector of the frame-level features to capture higher correlation.
Regarding the paper, I have the following comments.
- The introduction section, along with the related work, is insufficient.
- The authors should state their contributions in the introduction section, as readers are also interested in how this work is similar to and different from existing work. The manuscript should be revised with this in mind. Also, state the novelty of the work.
- The authors are advised to add subsections on the attention mechanism, self-attention, and other attention-based higher-order pooling methods. A comparative table could be added for reference.
- In section 3, you should discuss the existing methods before the proposed method.
- What type of pre-processing of the dataset is used? How many samples are used for training and testing? Such information should be presented in the paper.
- What features were selected? How are they selected?
- Give a block diagram for the complete speaker verification system for better understanding.
- In this paper, the authors use cosine similarity, though other similarity and distance measures exist, such as Euclidean distance, Manhattan distance, Jaccard similarity, and Minkowski distance. Why were these measures not used?
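For reference, the measures the reviewer lists can each be sketched in a few lines of NumPy. This is an illustrative comparison, not code from the paper; one common argument for cosine similarity with speaker embeddings is that it is invariant to embedding magnitude, whereas the distance metrics below are not:

```python
import numpy as np

def cosine_similarity(a, b):
    # scale-invariant: only the angle between vectors matters
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def euclidean_distance(a, b):
    return np.linalg.norm(a - b)

def manhattan_distance(a, b):
    return np.abs(a - b).sum()

def minkowski_distance(a, b, p=3):
    # generalizes Euclidean (p=2) and Manhattan (p=1)
    return (np.abs(a - b) ** p).sum() ** (1.0 / p)

def jaccard_similarity(a, b):
    # defined for binary (set-membership) vectors
    a, b = a.astype(bool), b.astype(bool)
    return (a & b).sum() / (a | b).sum()
```

Note that scaling an embedding by a constant leaves the cosine score unchanged but changes every distance above, which is one reason cosine scoring is conventional for length-normalized speaker embeddings.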
- How are the results validated?
- There are some grammatical and typographical errors; please fix them in the revised manuscript.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Round 2
Reviewer 1 Report
Good luck!
Reviewer 2 Report
Nil.
Reviewer 3 Report
Ok with the revised manuscript.