Class-GE2E: Speaker Verification Using Self-Attention and Transfer Learning with Loss Combination

Abstract: Recent studies show that speaker verification performance improves when an attention mechanism is employed instead of temporal or statistical pooling techniques. This paper proposes an advanced multi-head attention method that computes attention from a sorted vector of the frame-level features in order to exploit higher correlation among feature elements. We also propose a transfer learning scheme that maximizes the effectiveness of two loss functions, the classifier-based cross-entropy loss and the metric-based GE2E loss, for learning the distance between embeddings. The sorted multi-head attention (SMHA) method outperforms the conventional attention methods, achieving a 4.55% equal error rate (EER). The proposed transfer learning scheme with the Class-GE2E loss function significantly improved our attention-based systems; in particular, the EER of the SMHA decreased to 4.39%. The experimental results demonstrate that including greater correlation between frame-level features in multi-head attention processing, and combining the two different loss functions through transfer learning, are highly effective for improving speaker verification performance.


Introduction
Speaker verification determines whether a speaker is registered in the system. In the enrollment step, the speaker feature obtained by feeding the user's speech into the feature extractor is stored in the system. A similarity score is then calculated between the feature of a test user, obtained in the same way, and those of the enrolled users stored in the system. If the similarity score is above a threshold (θ), the test user is accepted; if not, the user is rejected. Figure 1 shows the speaker verification diagram.
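The accept/reject decision described above can be sketched as follows. This is a minimal illustration; the threshold value θ = 0.7 and the toy embeddings are assumptions for the example, not values from the paper:

```python
import numpy as np

def cosine_score(enrolled: np.ndarray, test: np.ndarray) -> float:
    """Cosine similarity between an enrolled embedding and a test embedding."""
    return float(np.dot(enrolled, test) /
                 (np.linalg.norm(enrolled) * np.linalg.norm(test)))

def verify(enrolled: np.ndarray, test: np.ndarray, theta: float = 0.7) -> bool:
    """Accept the test user if the similarity score is above the threshold."""
    return cosine_score(enrolled, test) >= theta

# toy embeddings: the second vector is close to the enrolled speaker's,
# the third points in a clearly different direction
enrolled = np.array([0.6, 0.8, 0.0])
same = np.array([0.6, 0.8, 0.1])
other = np.array([-0.8, 0.6, 0.0])
```

In a real system the embeddings would come from the front-end feature extractor, and θ would be tuned on a development set.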

In conventional methods, the i-vector was used as the speaker feature. As neural networks became widely used in research, new features that mimic the i-vector have been vigorously explored [1,2]. Recently, many neural-network-based methods of extracting speaker embeddings have been employed in speaker recognition systems [3][4][5][6][7][8][9]. In neural machine translation (NMT), an attention mechanism was introduced to assign large weights to the features that are useful for generating new domain features. The attention mechanism is currently used in various areas such as computer vision and speech recognition, including speaker recognition. In speaker recognition, the focus has mainly been on how to generate features containing more effective speaker information [10][11][12][13][14][15][16].
In neural-network-based speaker recognition systems, an important frame-level representation is captured from each speaker's utterance to generate a fixed-length speaker embedding. Temporal pooling [5], also known as average pooling, averages the frame-level representations extracted from neural networks along the time axis, while statistics pooling [6] computes both the mean and the standard deviation. The attention mechanism was introduced to emphasize the frames that are more important for creating the speaker embedding. Related works proved that, compared to temporal pooling, where all frames have the same importance, the attention mechanism extracts speaker characteristics that are more clearly distinguished from those of other speakers [12].
To train the feature extractor used as the front end, it is also important to select an effective objective function. A method that classifies the speaker ID by adding a softmax to a fully connected layer has been widely used so that the front-end model can fully reflect speaker identity [3,6,7]. Such classifiers typically use multi-class cross-entropy as the loss function. The triplet loss function [5] was proposed to learn characteristics that are more distinguishable from those of other speakers: training reduces the distance to a positive sample, another speech sample from the same speaker as the anchor, and increases the distance to a negative sample, a speech sample from a different speaker. However, triplet loss requires a long training time owing to the many combinations generated per utterance. To solve this problem, the GE2E loss function was suggested [17]: a mini-batch is formed from the utterances of several speakers, and the similarity matrix is trained so that each utterance has a higher similarity to its own speaker than to the other speakers. Since the GE2E loss function uses similarity scores computed within a mini-batch, it does not directly consider speaker characteristics in training.
In this paper, we propose an improved attention method to obtain more robust speaker embeddings. In the proposed attention technique, the multi-head attention weights are calculated from the sorted frame-level features, taking into account the similarity among the feature elements. We also propose a neural model that uses the classifier-based loss and the GE2E loss together through transfer learning, to obtain speaker characteristics and embeddings that are more effective for distinguishing speakers. Section 2 describes the attention layers with single-head and multi-head mechanisms based on LSTM as baseline systems. The proposed methods, including sorted multi-head attention and Class-GE2E with transfer learning, are presented in Section 3. The experimental settings related to the data, model structure, and results are discussed in Section 4. The paper is concluded in Section 5.

Related Works
A frame-level representation can be obtained by feeding an acoustic-level representation, such as MFCCs or a Mel spectrogram, into a neural network. In our system, we used the log-Mel spectrogram as the acoustic feature and an LSTM as the neural network model [9,11,17,18].
Speaker information can be generated by a simple average, with all frames considered to have equal weight. However, not every frame contains equally important information. The attention mechanism takes this into account and concentrates on the frames that are important for generating speaker information, producing a more effective speaker embedding. Single-head attention evaluates the importance of each frame from a single perspective [19], while multi-head attention evaluates it from multiple perspectives [20]. In this study, both methods were used as baseline systems.

Single-Head Attention (SHA) Layer
To measure the importance of each frame, the frame-level feature H ∈ R^(D×T) obtained from the neural network is nonlinearly transformed, and the score A is calculated, as shown in (1), through the trainable parameters W ∈ R^(D×D), b ∈ R^D, and V ∈ R^(D×D):

A = softmax(V^T σ(W·H + b)), where H = LSTM(x)   (1)

In (1), x is the log-Mel spectrogram, and σ(·) denotes an activation function, for which tanh(·) is used in our study. The attention weight A ∈ R^(T×1) is obtained by passing the score through the softmax function, turning it into a probability, and using it as the weight for each frame. The obtained attention weight is broadcast to the dimensional size of the frame, and the frame-level features are then weighted and summed to focus on important information, creating the utterance-level feature E ∈ R^(D×1). The process of generating the speaker embedding through the attention layer is expressed in (2):

E = Σ_t (H ⊙ broadcast(A))_t   (2)

Here, the operator ⊙ denotes element-wise multiplication.
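A minimal numpy sketch of this attentive pooling follows. For simplicity it uses a vector v in place of the matrix V, so that exactly one score is produced per frame (a simplifying assumption), and random values stand in for the LSTM output and trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def single_head_attention(H, W, b, v):
    """Attentive pooling: H (D, T) frame-level features -> (D,) embedding."""
    scores = v @ np.tanh(W @ H + b[:, None])  # (T,): one score per frame
    A = softmax(scores)                       # attention weights over frames
    return (H * A[None, :]).sum(axis=1)       # weighted sum along the time axis

D, T = 8, 20
H = rng.standard_normal((D, T))               # stand-in for the LSTM output
W = rng.standard_normal((D, D))
b = rng.standard_normal(D)
v = rng.standard_normal(D)
E = single_head_attention(H, W, b, v)         # utterance-level feature
```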

Multi-Head Attention (MHA) Layer
The overall process is the same as that of single-head attention; the largest difference is that the frame-level features are divided by the number of heads, with each part passing through its own attention layer, as expressed in (3). Thus, the number of generated speaker features equals the number of heads. As shown in (3), the divided frame-level features and attention weights are combined to form utterance-level features:

E_i = Σ_t (H_i ⊙ broadcast(A_i))_t   (3)

The obtained utterance-level features E_i are concatenated and used as the speaker embedding. Here, i = [1, 2, . . . , head] indicates the index of the head.
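The split-pool-concatenate structure can be sketched as follows, again with random stand-ins for the LSTM output and the per-head parameters:

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(H_i, W, b, v):
    """One head: pool a (d, T) sub-feature into a (d,) sub-embedding."""
    A = softmax(v @ np.tanh(W @ H_i + b[:, None]))
    return (H_i * A[None, :]).sum(axis=1)

def multi_head_attention(H, params):
    """Split (D, T) features along the feature axis, pool each part with its
    own attention parameters, and concatenate the sub-embeddings."""
    subs = np.split(H, len(params), axis=0)
    return np.concatenate([attention_pool(s, *p) for s, p in zip(subs, params)])

D, T, heads = 8, 20, 4
d = D // heads
H = rng.standard_normal((D, T))
params = [(rng.standard_normal((d, d)), rng.standard_normal(d),
           rng.standard_normal(d)) for _ in range(heads)]
E = multi_head_attention(H, params)           # (D,) concatenated embedding
```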

Generalized End-to-End Loss
To generate speaker embeddings representing speaker characteristics, a structure in which utterance-level representations are passed to a fully connected layer and classified into speaker IDs has mainly been used, generally with multi-class cross-entropy as the loss function. However, cross-entropy does not sufficiently separate the embeddings of different speakers, since it is not designed to directly optimize embedding similarity. Therefore, loss functions that learn embeddings by directly optimizing distance metrics were proposed. Contrastive loss and triplet loss demonstrated promising performance; however, they have the disadvantage of long training times. Recently, the generalized end-to-end (GE2E) loss has been widely employed in speaker recognition systems with improved performance. These systems construct a mini-batch by selecting multiple speakers and train the similarity matrix so that an utterance has a larger similarity to its own speaker than to the other speakers.
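The idea can be sketched with a simplified GE2E (softmax-variant) loss. This sketch fixes the scale parameters w and b, which are learned in the original formulation, and omits the leave-one-out centroid for the true speaker, so it is an illustration rather than the exact loss from [17]:

```python
import numpy as np

def ge2e_loss(emb, w=10.0, b=-5.0):
    """Simplified GE2E (softmax variant) for emb of shape (N_spk, M_utts, D).
    Each utterance is pushed toward its own speaker's centroid and away from
    the other speakers' centroids via a cross-entropy over similarities."""
    N, M, _ = emb.shape
    emb = emb / np.linalg.norm(emb, axis=-1, keepdims=True)
    cent = emb.mean(axis=1)
    cent = cent / np.linalg.norm(cent, axis=-1, keepdims=True)
    S = w * np.einsum('nmd,kd->nmk', emb, cent) + b   # scaled cosine similarities
    logsumexp = np.log(np.exp(S).sum(axis=-1))
    own = S[np.arange(N)[:, None], np.arange(M)[None, :], np.arange(N)[:, None]]
    return float((logsumexp - own).mean())            # cross-entropy vs. own speaker

rng = np.random.default_rng(2)
base = rng.standard_normal((4, 1, 16))
tight = np.repeat(base, 5, axis=1) + 0.01 * rng.standard_normal((4, 5, 16))
loose = rng.standard_normal((4, 5, 16))               # no speaker structure at all
```

A mini-batch whose utterances cluster tightly per speaker (`tight`) yields a much lower loss than one with no speaker structure (`loose`), which is the behavior the loss is designed to train toward.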

Sorted Multi-Head Attention (SMHA) Layer
Compared to simple average pooling, related studies have demonstrated that attention techniques are useful for generating features that hold speaker characteristics distinct from those of other speakers [18][19][20]. Therefore, in this study, single-head and multi-head attention were used as baselines, and the multi-head attention technique was modified to formulate more effective frame-level features. In this paper, we propose a method of calculating weights from multiple heads in which highly correlated elements are grouped within the same head.
Conventional multi-head attention splits the dimension of the LSTM output according to the number of heads to create sub-features, calculates attention weights for each sub-feature, and multiplies them by H_i to create sub-embeddings. In this splitting process, the correlation of the features is not considered. In this paper, we propose a sorted multi-head attention that generates sub-embeddings by dividing the ordered values of the frame-level features, considering the correlation between the features while computing the weights. After evenly dividing the sorted LSTM output sort(H), each divided element H^S_i is input into its attention layer to generate the embedding E_i, as presented in (4):

E_i = Σ_t (H^S_i ⊙ broadcast(A^S_i))_t   (4)

Here, A^S_i is the attention weight obtained using (1) with the divided element H^S_i. The proposed sorted multi-head attention (SMHA) allows similar values to be processed within the same head. We believe that the proposed method obtains more robust attention weights by focusing on the more important features among similar features within the same head.
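A sketch of the SMHA split follows. The paper does not fully specify how sort(H) is applied; this sketch assumes one plausible reading, sorting the D feature values of each frame along the feature axis before splitting, so values of similar magnitude fall into the same head:

```python
import numpy as np

rng = np.random.default_rng(3)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(H_i, W, b, v):
    """One head: pool a (d, T) sub-feature into a (d,) sub-embedding."""
    A = softmax(v @ np.tanh(W @ H_i + b[:, None]))
    return (H_i * A[None, :]).sum(axis=1)

def sorted_multi_head_attention(H, params):
    """Sort each frame's feature values before splitting into heads, so that
    values of similar magnitude are attended together within one head."""
    H_sorted = np.sort(H, axis=0)             # sort(H) along the feature axis
    subs = np.split(H_sorted, len(params), axis=0)
    return np.concatenate([attention_pool(s, *p) for s, p in zip(subs, params)])

D, T, heads = 8, 20, 4
d = D // heads
H = rng.standard_normal((D, T))
params = [(rng.standard_normal((d, d)), rng.standard_normal(d),
           rng.standard_normal(d)) for _ in range(heads)]
E = sorted_multi_head_attention(H, params)    # (D,) concatenated embedding
```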

Transfer Learning
GE2E loss does not model speaker characteristics directly, since it only considers distances through similarity scores within a mini-batch. In this paper, we introduce a training method that combines a classification loss and the GE2E loss to obtain speaker embeddings that strengthen not only the speaker characteristics but also the distinction from other speakers. However, using the combination of classification loss and GE2E loss from the start of training would not result in successful training, owing to the unbalanced performance between the two different loss functions [21].
In this paper, we propose a system utilizing transfer learning, in which the GE2E loss is used to generate a pre-trained model to which a classifier model is then added. In transfer learning, some or all of a pre-trained model is reused, and the model parameters are fine-tuned for a new task [22]. In this study, we pre-train the model to distinguish different speakers using the GE2E loss function, and add the classification layer with cross-entropy at the point when the equal error rate (EER) drops below 10% during training. The final loss function combines the GE2E loss (L_G) and the cross-entropy loss (L_CE), as shown in (5). Figure 2 shows the models using (a) classification loss (cross-entropy), (b) metric-based loss (GE2E), and (c) the model structure proposed in this paper. The speaker embedding is obtained as the output of the fully connected layer in system (a) and as the output of the attentive pooling block in system (b), respectively. In our system (c), after model training is completed, the output of the attentive pooling block is used as the speaker embedding, and cosine similarity is applied for score calculation.
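The loss combination can be sketched as follows. The equal weighting between the two terms (alpha = 1) is an assumption for illustration; the paper only states that L_G and L_CE are combined in (5):

```python
import numpy as np

def cross_entropy(logits, labels):
    """Multi-class cross-entropy over speaker-ID logits of shape (B, n_spk)."""
    z = logits - logits.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return float(-logp[np.arange(len(labels)), labels].mean())

def class_ge2e_loss(l_ge2e, logits, labels, alpha=1.0):
    """Combined Class-GE2E loss: L = L_G + alpha * L_CE.
    alpha = 1 (equal weighting) is an illustrative assumption."""
    return l_ge2e + alpha * cross_entropy(logits, labels)

# toy classifier outputs that strongly favor the correct speaker IDs
logits = np.array([[5.0, 0.0], [0.0, 5.0]])
labels = np.array([0, 1])
total = class_ge2e_loss(0.5, logits, labels)  # 0.5 stands in for the GE2E term
```

With alpha = 0 the combined loss reduces to the GE2E term alone, which corresponds to the pre-training phase before the classification layer is added.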

Dataset
This study conducted training and evaluation using VoxCeleb1, a set of audio-visual data extracted from interview videos uploaded to YouTube [23]. VoxCeleb1 contains 148,642 samples from 1251 speakers for training and 4874 samples from 40 speakers for testing. The dataset has an approximate male-to-female ratio of 6:4, a 16 kHz sampling rate, and speech lengths between 4 s and 20 s. In this study, 40-dimensional Mel-filterbank features were extracted with window and shift sizes of 25 ms and 12.5 ms, respectively. Samples whose extracted features were shorter than 80 frames were not selected for training. We also used 100 samples from the development part of VoxCeleb2 as validation data during training.
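The minimum-length filter above can be sketched from the stated framing parameters. The no-padding framing convention assumed here is an illustration, not necessarily the exact front-end implementation used in the paper:

```python
# Framing parameters from the paper: 16 kHz audio, 25 ms window, 12.5 ms shift,
# and a minimum of 80 frames per training sample.
SR = 16000
WIN = int(0.025 * SR)     # 400 samples per analysis window
HOP = int(0.0125 * SR)    # 200 samples between window starts

def n_frames(n_samples: int) -> int:
    """Number of full analysis frames in an utterance of n_samples samples
    (assuming no padding at the edges)."""
    if n_samples < WIN:
        return 0
    return 1 + (n_samples - WIN) // HOP

def keep_for_training(n_samples: int, min_frames: int = 80) -> bool:
    """Apply the minimum-length filter: fewer than 80 frames -> drop."""
    return n_frames(n_samples) >= min_frames
```

Under these parameters a 4 s utterance (64,000 samples) yields well over 80 frames and is kept, while a 1 s utterance falls just short and is dropped.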

Model Architecture
LSTM is a neural network model that adds multiple gates to a recurrent neural network (RNN) and uses the previous state as additional information to prevent information loss over time. It is therefore suitable for characterizing the speech signal, because it accumulates information from the first frame to the last and maintains what is important. In this paper, acoustic-level features from 80 frames, with an overlap of the previous 40 frames, were used as the model input. We built a baseline system consisting of a three-layer LSTM with 768 nodes and one linear layer projecting into 256 dimensions, used as the speaker feature. The pre-trained model is the LSTM network trained with the GE2E loss function, and the same neural network structure was used to extract the speaker embedding. A classifier model was added to the pre-trained model that takes the speaker embedding, passes it through a fully connected DNN, and outputs the speaker ID. The two losses, cross-entropy and GE2E, were computed from the ground-truth labels and the model outputs. We used the cosine similarity score (CSS), a simple measure of similarity between speakers, since this study mainly focused on how speaker verification performance changes with the attention layer used in the front end and with the proposed loss combination. Table 1 compares the proposed method with the baseline systems in terms of speaker verification performance. The model aggregating frame-level representations using single-head attention (SHA) with the GE2E loss function showed a 5.13% equal error rate (EER); when trained with the Class-GE2E loss function, the EER decreased to 4.49%. The EER of multi-head attention (MHA) was 5.85% when trained with GE2E loss and 4.43% when trained with Class-GE2E loss. We confirmed that performance improved when training with the proposed Class-GE2E loss for both existing attention mechanisms, SHA and MHA.
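The EER figures reported in Table 1 can be computed from target (same-speaker) and impostor (different-speaker) trial scores. This is a simple threshold-scan sketch, not necessarily the exact scoring tool used for the experiments, and the toy scores are illustrative:

```python
import numpy as np

def eer(target_scores, impostor_scores):
    """Equal error rate: scan candidate thresholds and return the operating
    point where false-accept and false-reject rates are closest to equal."""
    best_far, best_frr, best_gap = 1.0, 0.0, float('inf')
    for t in np.sort(np.concatenate([target_scores, impostor_scores])):
        far = float(np.mean(impostor_scores >= t))  # impostors accepted
        frr = float(np.mean(target_scores < t))     # true speakers rejected
        if abs(far - frr) < best_gap:
            best_far, best_frr, best_gap = far, frr, abs(far - frr)
    return (best_far + best_frr) / 2

# toy cosine scores: targets should generally score higher than impostors
targets = np.array([0.9, 0.8, 0.7, 0.6])
impostors = np.array([0.5, 0.4, 0.65, 0.3])
```

With one overlapping impostor among four trials of each kind, the scan settles at a threshold where one impostor is accepted and one target rejected, giving an EER of 25% for this toy data.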
Our proposed sorted multi-head attention (SMHA), which considers the correlation of the features when generating sub-embeddings, shows a higher performance than the conventional methods. The sorted multi-head attention model with the Class-GE2E loss showed the best performance in all experiments, with a 4.39% EER. The detection error trade-off (DET) curves for all systems are shown in Figure 3; the red star indicates the EER.

Conclusions
To improve speaker verification performance by generating speaker features with attention-based techniques, this paper proposed combining the GE2E loss function, which learns characteristics differentiated from those of other speakers, with a classification-based loss function, which learns diverse speaker characteristics. The proposed sorted multi-head attention (SMHA) method, which considers the correlation between the features, outperformed the conventional attention methods with a 4.55% EER. By employing Class-GE2E, the EER of the MHA improved from 5.85% to 4.43%, a 24.37% relative reduction, and the EER of the SMHA decreased to 4.39%, a 3.52% relative reduction. The experimental results demonstrated that considering the correlation between features with the SMHA improves speaker verification performance compared to the existing multi-head method. The results also confirmed that speaker embeddings representing speaker identities more distinctly from other speakers, obtained by using the two loss functions through the proposed transfer learning scheme, are highly effective for improving speaker verification performance.
