Article

Class-GE2E: Speaker Verification Using Self-Attention and Transfer Learning with Loss Combination

Department of Computer Science and Engineering, Incheon National University, Incheon 22012, Korea
* Author to whom correspondence should be addressed.
Electronics 2022, 11(6), 893; https://doi.org/10.3390/electronics11060893
Submission received: 9 February 2022 / Revised: 5 March 2022 / Accepted: 9 March 2022 / Published: 12 March 2022
(This article belongs to the Section Computer Science & Engineering)

Abstract

Recent studies show that speaker verification performance improves when an attention mechanism is employed instead of temporal or statistical pooling techniques. This paper proposes an advanced multi-head attention method that uses a sorted vector of the frame-level features so that more highly correlated elements are considered together. We also propose a transfer learning scheme that maximizes the effectiveness of two loss functions, the classifier-based cross-entropy loss and the metric-based GE2E loss, for learning the distance between embeddings. The proposed sorted multi-head attention (SMHA) method outperforms the conventional attention methods with a 4.55% equal error rate (EER). The proposed transfer learning scheme with the Class-GE2E loss significantly improved our attention-based systems; in particular, the EER of the SMHA decreased to 4.39% when employing transfer learning with the Class-GE2E loss. The experimental results demonstrate that accounting for greater correlation between frame-level features in multi-head attention, and combining the two different loss functions through transfer learning, are highly effective for improving speaker verification performance.

1. Introduction

Speaker verification determines whether a speaker is registered in the system. In the enrollment step, the speaker feature obtained by passing the user’s speech through the feature extractor is stored in the system. At test time, a feature is extracted from the test user’s speech in the same way, and similarity scores are calculated between this feature and those of the enrolled users stored in the system. If the similarity score is above a threshold ($\theta$), the test user is accepted; otherwise, it is rejected. Figure 1 shows the speaker verification diagram.
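As a minimal illustration of this decision rule, the sketch below compares an enrolled embedding with a test embedding using cosine similarity and a threshold θ. The embedding dimensionality, the threshold value, and the function names are assumptions for illustration only, not part of the paper.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify(enrolled_emb: np.ndarray, test_emb: np.ndarray, theta: float = 0.7) -> bool:
    """Accept the test user if the similarity score is above the threshold theta."""
    return cosine_similarity(enrolled_emb, test_emb) >= theta

# Toy usage with random 256-dimensional embeddings.
rng = np.random.default_rng(0)
print(verify(rng.standard_normal(256), rng.standard_normal(256)))
```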
In conventional methods, the i-vector was used as the speaker feature. As neural networks became widely used in research, new features that mimic the i-vector have been actively explored [1,2]. Recently, many neural-network-based methods of extracting speaker embeddings have been employed in speaker recognition systems [3,4,5,6,7,8,9]. In neural machine translation (NMT), an attention mechanism was introduced to assign large weights to the features that are useful for generating new domain features. The attention mechanism is currently used in various areas such as computer vision and speech recognition, including speaker recognition. In speaker recognition, the focus has mainly been on how to generate features containing more effective speaker information [10,11,12,13,14,15,16]. In neural-network-based speaker recognition systems, important frame-level representations are captured from each speaker’s utterance to generate a speaker embedding with a fixed length. Temporal pooling [5], also known as average pooling, averages the frame-level representations extracted from the neural network along the time axis, and statistics pooling [6] calculates both the average and the standard deviation; both methods generate a fixed-length speaker embedding. The attention mechanism was introduced to emphasize the frames that are more important for creating the speaker embedding. Related works showed that, compared to the temporal pooling technique, in which all frames have the same importance, the attention mechanism extracts target speaker characteristics that better distinguish the speaker from others [12].
To train the feature extractor model used as the front end, it is also important to select an effective objective function. The method of classifying the speaker ID by adding a softmax to the fully connected layer was widely used so that the front-end model could fully reflect speaker identity [3,6,7]. Many classifiers use multi-class cross-entropy as the loss function. More recently, the triplet loss function [5] was proposed to learn characteristics that better distinguish a speaker from others: training reduces the distance to the positive sample, another speech sample from the same speaker as the anchor input, and increases the distance to the negative sample, a speech sample from a different speaker. However, triplet loss requires a long training time owing to the many sample combinations generated for each utterance. To solve this problem, the GE2E loss function was proposed [17]: utterances from several speakers form a mini-batch, and a similarity matrix is trained so that each utterance has a larger similarity value to its own speaker than to the other speakers. Since the GE2E loss function uses similarity scores computed from a mini-batch, it does not directly consider speaker characteristics during training.
In this paper, we propose an improved attention method in order to obtain more robust speaker embedding. In the proposed attention technique, the multi-head attention weights are calculated from the sorted frame-level feature considering the similarity among the feature elements. We also propose a neural model that uses classifier-based loss and GE2E loss functions together in transfer learning to obtain more effective speaker characteristics and embedding for distinguishing speakers.
Section 2 describes the attention layers with single-head and multi-head mechanisms based on LSTM as baseline systems. The proposed methods, including sorted multi-head attention and Class-GE2E with transfer learning, are presented in Section 3. The experimental settings related to the data, model structure, and results are discussed in Section 4. The paper is concluded in Section 5.

2. Related Works

A frame-level representation can be obtained by using an acoustic-level representation such as MFCCs and a Mel spectrogram as an input to the neural network. In our system, we used the log-Mel spectrogram for acoustic features and the LSTM for the neural network model [9,11,17,18].
Speaker information can be generated by using a simple average, with all frames considered to have equal weights. However, some frames contain more important information than others. The attention mechanism takes this into account and concentrates on the frames that matter most for generating speaker information, producing a more effective speaker embedding. Single-head attention evaluates frame importance from a single perspective [19], whereas multi-head attention evaluates it from multiple perspectives [20]. In this study, both methods were used as baseline systems.

2.1. Single-Head Attention (SHA) Layer

To measure the importance of each frame, the frame-level feature $H \in \mathbb{R}^{D \times T}$ obtained from the neural network is nonlinearly transformed, and the score $A$ is calculated, as shown in (1), through the trainable parameters $W \in \mathbb{R}^{D \times D}$, $b \in \mathbb{R}^{D}$, and $V \in \mathbb{R}^{D \times D}$:
$$H = \mathrm{LSTM}(x), \qquad A = \mathrm{softmax}\big(\sigma(H^{T}W + b^{T})\,V\big) \qquad (1)$$
In (1), $x$ is the log-Mel spectrogram, and $\sigma(\cdot)$ denotes an activation function, for which $\tanh(\cdot)$ is used in our study. The attention weight $A \in \mathbb{R}^{T \times 1}$ is obtained by passing the score through the softmax function, turning it into a probability that serves as the weight for each frame. The obtained attention weight is broadcast to the dimension of the frame-level feature, and the weighted frame-level features are then summed over time, focusing on the important information, to create the utterance-level feature $E \in \mathbb{R}^{D \times 1}$. The process of generating the speaker embedding through the attention layer is expressed in (2):
$$E = \mathrm{Attention}(A, H) = \sum_{T} A \odot H \qquad (2)$$
Here, the operator $\odot$ denotes element-wise multiplication.
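The following PyTorch sketch illustrates (1) and (2) as a single-head attentive pooling layer. It is a minimal illustration, not the authors' implementation: the class name, the use of an `nn.Linear` layer for $W$ and $b$, and the final projection `v` that maps each frame score to a scalar are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SingleHeadAttentionPooling(nn.Module):
    """Attentive pooling over frame-level features, as in (1)-(2)."""
    def __init__(self, dim: int):
        super().__init__()
        self.W = nn.Linear(dim, dim)             # trainable W and b in (1)
        self.v = nn.Linear(dim, 1, bias=False)   # assumed projection to one score per frame

    def forward(self, H: torch.Tensor) -> torch.Tensor:
        # H: (batch, T, D) frame-level features from the LSTM.
        scores = self.v(torch.tanh(self.W(H)))   # (batch, T, 1)
        A = F.softmax(scores, dim=1)             # attention weights over the T frames
        return torch.sum(A * H, dim=1)           # weighted sum over time -> (batch, D)

# Toy usage: a 3-layer LSTM front end followed by attentive pooling.
lstm = nn.LSTM(input_size=40, hidden_size=768, num_layers=3, batch_first=True)
pool = SingleHeadAttentionPooling(768)
H, _ = lstm(torch.randn(2, 80, 40))              # (batch, frames, log-Mel bins)
print(pool(H).shape)                             # torch.Size([2, 768])
```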

2.2. Multi-Head Attention (MHA) Layer

The overall process is the same as that of single-head attention; the largest difference is that the frame-level features are divided by the number of heads, and each part passes through its own attention layer, as expressed in (3). Thus, the number of generated speaker features is equal to the number of heads. As shown in (3), the frame-level features and attention weights are combined by a weighted sum to form an utterance-level feature. The obtained utterance-level features $E_i$ are concatenated and used as the speaker embedding. Here, $i = 1, 2, \ldots, head$ indicates the index of the head:
$$E_i = \mathrm{Attention}(A_i, H_i), \qquad E = \mathrm{concat}(E_1, \ldots, E_i, \ldots, E_{head}) \qquad (3)$$
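A corresponding sketch of multi-head attentive pooling follows. As before, this is only an illustration of (3) under our assumptions about layer shapes: the frame-level feature dimension is split evenly across the heads and each sub-feature is pooled by its own attention layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttentionPooling(nn.Module):
    """Split the feature dimension into heads, pool each head, concatenate (3)."""
    def __init__(self, dim: int, heads: int):
        super().__init__()
        assert dim % heads == 0
        self.heads = heads
        sub = dim // heads
        self.W = nn.ModuleList([nn.Linear(sub, sub) for _ in range(heads)])
        self.v = nn.ModuleList([nn.Linear(sub, 1, bias=False) for _ in range(heads)])

    def forward(self, H: torch.Tensor) -> torch.Tensor:
        # H: (batch, T, D). Split into `heads` sub-features H_i along the feature axis.
        subs = torch.chunk(H, self.heads, dim=-1)
        outs = []
        for i, H_i in enumerate(subs):
            A_i = F.softmax(self.v[i](torch.tanh(self.W[i](H_i))), dim=1)  # (batch, T, 1)
            outs.append(torch.sum(A_i * H_i, dim=1))                       # E_i: (batch, D/heads)
        return torch.cat(outs, dim=-1)                                     # E: (batch, D)

pool = MultiHeadAttentionPooling(dim=768, heads=8)
print(pool(torch.randn(2, 80, 768)).shape)   # torch.Size([2, 768])
```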

2.3. Generalized End-to-End Loss

In order to generate speaker embeddings representing speaker characteristics, a structure in which utterance-level representations are fed into a fully connected layer and classified into speaker IDs has mainly been used. As a loss function, multi-class cross-entropy is generally employed. However, cross-entropy does not sufficiently separate the embeddings of different speakers, since it is not designed to directly optimize embedding similarity. Therefore, loss functions that directly optimize distance metrics between embeddings were proposed. Contrastive loss and triplet loss demonstrated promising performance; however, they have the disadvantage of long training times. Recently, the generalized end-to-end (GE2E) loss has been widely employed in speaker recognition systems, showing improved performance. A mini-batch is constructed by selecting multiple speakers, and the model is trained through a similarity score matrix so that each utterance has a larger similarity to its own speaker than to the other speakers.
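The sketch below shows one common way to implement the GE2E loss of [17] in PyTorch, assuming a mini-batch arranged as N speakers with M utterances each. The initial values of the learnable scale and bias, and the softmax variant of the loss, follow conventions from public implementations and are assumptions rather than details taken from this paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GE2ELoss(nn.Module):
    """Softmax variant of the generalized end-to-end loss [17]."""
    def __init__(self):
        super().__init__()
        self.w = nn.Parameter(torch.tensor(10.0))   # learnable similarity scale (assumed init)
        self.b = nn.Parameter(torch.tensor(-5.0))   # learnable similarity bias (assumed init)

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        # emb: (N speakers, M utterances, D) speaker embeddings.
        N, M, _ = emb.shape
        emb = F.normalize(emb, dim=-1)
        centroids = F.normalize(emb.mean(dim=1), dim=-1)                  # (N, D)
        # Exclusive centroid: leave each utterance out of its own speaker centroid.
        excl = (emb.sum(dim=1, keepdim=True) - emb) / (M - 1)             # (N, M, D)

        # Cosine similarity of every utterance to every speaker centroid: (N, M, N).
        sim = torch.einsum('nmd,kd->nmk', emb, centroids)
        own = F.cosine_similarity(emb, excl, dim=-1)                      # (N, M)
        mask = torch.eye(N, dtype=torch.bool).unsqueeze(1).expand(N, M, N)
        sim = torch.where(mask, own.unsqueeze(-1).expand(N, M, N), sim)
        sim = self.w * sim + self.b

        # Each utterance should be most similar to its own speaker's centroid.
        labels = torch.arange(N).repeat_interleave(M)
        return F.cross_entropy(sim.reshape(N * M, N), labels)

# Toy usage: 4 speakers x 5 utterances with 256-dimensional embeddings.
print(GE2ELoss()(torch.randn(4, 5, 256)))
```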

3. Proposed Methods

3.1. Sorted Multi-Head Attention (SMHA) Layer

Compared to simple average pooling, related studies have demonstrated that attention techniques are useful for generating features that hold speaker characteristics distinct from those of other speakers [18,19,20]. Therefore, in this study, single-head attention and multi-head attention were used as the baseline, and the multi-head attention technique was modified to formulate more effective frame-level features. In this paper, we propose a method of calculating weights from multiple heads in which highly correlated elements are grouped within the same head.
Conventional multi-head attention splits the dimension of the LSTM output according to the number of heads to create sub-features, calculates attention weights for each sub-feature, and then multiplies them by $H_i$ to create sub-embeddings. In this splitting process, the correlation between the feature elements is not considered. In this paper, we propose a sorted multi-head attention that generates sub-embeddings by dividing the ordered values of the frame-level features, so that the correlation between the features is considered while computing the weights. After evenly dividing the sorted LSTM output $\mathrm{sort}(H)$, each divided element $H_i^{S}$ is input into its own attention layer to generate the embedding $E_i$, as presented in (4):
$$H_i^{S} = \mathrm{sort}(H)_i, \qquad E_i = \mathrm{Attention}(A_i^{S}, H_i^{S}), \qquad E = \mathrm{concat}(E_1, \ldots, E_i, \ldots, E_{head}) \qquad (4)$$
Here, $A_i^{S}$ is the attention weight obtained using (1) with the divided element $H_i^{S}$. Our proposed sorted multi-head attention (SMHA) allows similar values to be handled within the same head. We believe that the proposed method is able to obtain more robust attention weights by focusing on the more important features among similar features within the same head.
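A sketch of the proposed SMHA is given below. The only change relative to the multi-head sketch in Section 2.2 is the sorting step of (4); we assume the sort is applied along the feature dimension of each frame before the split, which is our reading of the text rather than a confirmed implementation detail.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SortedMultiHeadAttentionPooling(nn.Module):
    """Sorted multi-head attentive pooling (SMHA), following (4)."""
    def __init__(self, dim: int, heads: int):
        super().__init__()
        assert dim % heads == 0
        self.heads = heads
        sub = dim // heads
        self.W = nn.ModuleList([nn.Linear(sub, sub) for _ in range(heads)])
        self.v = nn.ModuleList([nn.Linear(sub, 1, bias=False) for _ in range(heads)])

    def forward(self, H: torch.Tensor) -> torch.Tensor:
        # H: (batch, T, D). Sort each frame's feature values so that similar
        # (highly correlated) elements end up in the same head after splitting.
        H_sorted, _ = torch.sort(H, dim=-1)                  # sort(H) in (4)
        subs = torch.chunk(H_sorted, self.heads, dim=-1)     # H_i^S
        outs = []
        for i, H_i in enumerate(subs):
            A_i = F.softmax(self.v[i](torch.tanh(self.W[i](H_i))), dim=1)  # A_i^S
            outs.append(torch.sum(A_i * H_i, dim=1))                       # E_i
        return torch.cat(outs, dim=-1)                                     # E

pool = SortedMultiHeadAttentionPooling(dim=768, heads=8)
print(pool(torch.randn(2, 80, 768)).shape)   # torch.Size([2, 768])
```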

3.2. Transfer Learning

GE2E loss does not model speaker characteristics directly, since it only considers distances through similarity scores within a mini-batch. In this paper, we introduce a training method that combines a classification loss and the GE2E loss to obtain speaker embeddings that capture not only speaker characteristics but also the distinction from other speakers. However, using the combination of classification loss and GE2E loss from the beginning of training would not result in successful training, owing to the imbalance between the two different loss functions [21].
In this paper, we propose a system utilizing transfer learning, in which a model pre-trained with the GE2E loss is extended with a classifier. In transfer learning, some or all of a pre-trained model’s parameters are reused and fine-tuned for a new task [22]. In this study, we pre-train the model to distinguish different speakers using the GE2E loss function, and add the classification layer with cross-entropy at the point when the equal error rate (EER) drops below 10% during training. The final loss function combines the GE2E loss ($L_G$) and the cross-entropy loss ($L_{CE}$), as shown in (5):
$$L = L_G + L_{CE} \qquad (5)$$
Figure 2 shows the models using (a) classification loss (cross-entropy), (b) metric-based loss (GE2E), and (c) the model structure proposed in this paper. The speaker embedding is obtained as the output of the fully connected layer in the system (a) and the output of the attentive pooling block in the system (b), respectively. In our system (c), after the model training is completed, the output of the attentive pooling block is used as the speaker embedding and the cosine similarity is applied for a score calculation.
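The following sketch outlines one fine-tuning step with the combined Class-GE2E loss of (5), reusing the `GE2ELoss` module from the sketch in Section 2.3. The classifier width, the optimizer handling, and the helper names are assumptions for illustration; the trigger for attaching the classifier (the EER dropping below 10%) is handled outside this function.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerClassifier(nn.Module):
    """Speaker-ID classifier attached to the pre-trained front end (assumed single layer)."""
    def __init__(self, emb_dim: int, num_speakers: int):
        super().__init__()
        self.fc = nn.Linear(emb_dim, num_speakers)

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        return self.fc(emb)

def class_ge2e_step(front_end, classifier, ge2e_loss, batch, speaker_ids, optimizer):
    """One training step with L = L_G + L_CE, as in (5)."""
    # batch: (N speakers, M utterances, frames, mel bins); speaker_ids: (N,) class indices.
    N, M = batch.shape[:2]
    emb = front_end(batch.flatten(0, 1))              # (N*M, emb_dim) speaker embeddings
    loss_g = ge2e_loss(emb.view(N, M, -1))            # metric-based GE2E loss L_G
    logits = classifier(emb)                          # speaker-ID logits
    loss_ce = F.cross_entropy(logits, speaker_ids.repeat_interleave(M))  # L_CE
    loss = loss_g + loss_ce                           # Equation (5)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```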

4. Experiments

4.1. Dataset

This study conducted training and evaluation using VoxCeleb1, a set of audio-visual data extracted from interview videos uploaded to YouTube [23]. VoxCeleb1 has 148,642 samples from 1251 speakers for training and 4874 samples from 40 speakers for testing. The dataset has a male-to-female ratio of approximately 6:4, a 16 kHz sampling rate, and speech lengths between 4 s and 20 s. In this study, 40-dimensional Mel-filterbank features were extracted with window and shift sizes of 25 ms and 12.5 ms, respectively. Samples with fewer than 80 frames of extracted features were not used for training. We also used 100 samples from the development part of VoxCeleb2 as validation data during training.
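For reference, acoustic features with the settings described above (40-dimensional log-Mel filterbanks, 25 ms window, 12.5 ms shift, 16 kHz) could be extracted as in the sketch below. The torchaudio parameters (`n_fft`, the small epsilon before the log) are common defaults chosen by us, not the authors' exact configuration.

```python
import torch
import torchaudio

# 25 ms window and 12.5 ms shift at a 16 kHz sampling rate.
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000,
    n_fft=400,        # 25 ms window
    hop_length=200,   # 12.5 ms shift
    n_mels=40,        # 40 Mel-filterbank channels
)

def log_mel(waveform: torch.Tensor) -> torch.Tensor:
    """waveform: (1, samples) -> (frames, 40) log-Mel spectrogram."""
    feats = mel(waveform)                                    # (1, 40, frames)
    return torch.log(feats + 1e-6).squeeze(0).transpose(0, 1)

# Toy usage with one second of silence; real input would come from VoxCeleb1 audio.
print(log_mel(torch.zeros(1, 16000)).shape)                  # roughly (81, 40)
```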

4.2. Model Architecture

LSTM is a neural network model that adds multiple gates to a recurrent neural network (RNN) and uses the previous state as additional information to prevent information loss over time. It is therefore suitable for characterizing the speech signal, because it accumulates information from the first frame to the last while maintaining the important information. In this paper, acoustic-level features from 80 frames, overlapping the previous 40 frames, were used as the model input. We then built a baseline system consisting of a three-layer LSTM with 768 nodes and one linear layer projecting into 256 dimensions to be used as speaker features. The pre-trained model is the LSTM network trained with the GE2E loss function, and a similar neural network structure was used to extract the speaker embedding. A classifier was added to the pre-trained model that takes the speaker embedding, passes it through a fully connected DNN, and outputs the speaker ID. Two losses, the cross-entropy loss and the GE2E loss, were computed from the ground truth labels and the model output. We used the cosine similarity score (CSS), a simple measure of similarity between speakers, so that the evaluation focused mainly on the speaker verification performance with respect to the attention layer used in the front end and the proposed loss combination.
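A sketch of the baseline front end described above is given below: a three-layer LSTM with 768 nodes, a linear projection to 256 dimensions, and the attentive pooling block whose output is used as the speaker embedding. It reuses the `SortedMultiHeadAttentionPooling` class from the sketch in Section 3.1; the ordering of the projection and pooling layers and the number of heads are our assumptions based on the text.

```python
import torch
import torch.nn as nn

class SpeakerFrontEnd(nn.Module):
    """3-layer LSTM front end with a 256-dimensional projection and attentive pooling."""
    def __init__(self, n_mels: int = 40, hidden: int = 768, emb_dim: int = 256, heads: int = 8):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, hidden, num_layers=3, batch_first=True)
        self.proj = nn.Linear(hidden, emb_dim)
        # Attentive pooling block from Section 3.1; its output is the speaker embedding.
        self.pool = SortedMultiHeadAttentionPooling(emb_dim, heads)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 80 frames, 40 log-Mel bins) -> (batch, 256) speaker embedding.
        H, _ = self.lstm(x)
        return self.pool(self.proj(H))

model = SpeakerFrontEnd()
print(model(torch.randn(2, 80, 40)).shape)   # torch.Size([2, 256])
```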

4.3. Experimental Results

Table 1 presents a comparison of the proposed method with the baseline systems in terms of speaker verification performance. The model that aggregates frame-level representations using single-head attention (SHA) with the GE2E loss function showed a 5.13% equal error rate (EER). When trained with the Class-GE2E loss function, the EER decreased to 4.49%. The EER of multi-head attention (MHA) was 5.85% when trained with the GE2E loss and 4.43% when trained with the Class-GE2E loss. We confirmed that performance improved when training with the proposed Class-GE2E loss for both existing attention mechanisms, SHA and MHA. Our proposed sorted multi-head attention (SMHA), which considers the correlation of features when generating sub-embeddings, shows higher performance than the conventional methods. The sorted multi-head attention model with the Class-GE2E loss achieved the best performance in all experiments, with a 4.39% EER. The detection error trade-off (DET) curves for all systems are shown in Figure 3, where the red star indicates the EER.
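For completeness, the sketch below shows one simple way to estimate the EER reported in Table 1 from cosine similarity scores of target and impostor trials. The nearest-crossing approximation used here is a common convention and not the authors' evaluation script.

```python
import numpy as np

def compute_eer(target_scores: np.ndarray, impostor_scores: np.ndarray) -> float:
    """EER: the operating point where the false accept and false reject rates are equal."""
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    far = np.array([(impostor_scores >= t).mean() for t in thresholds])  # false accept rate
    frr = np.array([(target_scores < t).mean() for t in thresholds])     # false reject rate
    idx = np.argmin(np.abs(far - frr))
    return float((far[idx] + frr[idx]) / 2)

# Toy usage: well-separated target and impostor score distributions give a low EER.
rng = np.random.default_rng(0)
print(compute_eer(rng.normal(0.8, 0.1, 1000), rng.normal(0.3, 0.1, 1000)))
```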

5. Conclusions

To improve speaker verification performance by generating speaker features with attentive pooling techniques, this paper proposed combining the GE2E loss function, which learns characteristics differentiated from those of other speakers, with a classification-based loss function, which learns diverse speaker characteristics. The proposed sorted multi-head attention (SMHA) method, which considers the correlation between features, outperformed the conventional attention methods with a 4.55% EER. By employing Class-GE2E, the EER of the MHA improved from 5.85% to 4.43%, a 24.37% relative reduction, and the EER of the SMHA decreased to 4.39%, a 3.52% relative reduction. The experimental results demonstrated that considering the correlation between features in the SMHA improves speaker verification performance compared to the existing multi-head method. The results also confirmed that combining the two loss functions through the proposed transfer learning scheme, so that the generated speaker embeddings represent speaker identities more distinct from other speakers, is highly effective for improving speaker verification performance.

Author Contributions

Data curation, A.B.; Formal analysis, A.B.; Funding acquisition, W.K.; Investigation, A.B.; Methodology, A.B.; Project administration, W.K.; Writing—original draft, A.B.; Writing—review and editing, W.K. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean Government (MSIT) (No. NRF-2021R1F1A1063347).

Data Availability Statement

The data presented in this study are available at https://www.sciencedirect.com/science/article/pii/S0885230819302712 (accessed on 8 February 2022).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Lee, H.-S.; Lu, Y.-D.; Hsu, C.-C.; Tsao, Y.; Wang, H.-M.; Jeng, S.-K. Discriminative Autoencoders for Speaker Verification. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017.
  2. Rohdin, J.; Silnova, A.; Diez, M.; Plchot, O.; Matejka, P.; Burget, L. End-to-End DNN Based Speaker Recognition Inspired by I-Vector and PLDA. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018.
  3. Variani, E.; Lei, X.; McDermott, E.; Moreno, I.L.; Gonzalez-Dominguez, J. Deep Neural Networks for Small Footprint Text-Dependent Speaker Verification. In Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, 4–9 May 2014.
  4. Liu, Y.; Qian, Y.; Chen, N.; Fu, T.; Zhang, Y.; Yu, K. Deep Feature for Text-Dependent Speaker Verification. Speech Commun. 2015, 73, 1–13.
  5. Heigold, G.; Moreno, I.; Bengio, S.; Shazeer, N. End-to-End Text-Dependent Speaker Verification. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016.
  6. Snyder, D.; Garcia-Romero, D.; Povey, D.; Khudanpur, S. Deep Neural Network Embeddings for Text-Independent Speaker Verification. In Proceedings of the Interspeech 2017, Stockholm, Sweden, 20–24 August 2017.
  7. Snyder, D.; Garcia-Romero, D.; Sell, G.; Povey, D.; Khudanpur, S. X-Vectors: Robust DNN Embeddings for Speaker Recognition. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018.
  8. Wang, Q.; Downey, C.; Wan, L.; Mansfield, P.A.; Moreno, I.L. Speaker Diarization with LSTM. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018.
  9. Lin, Q.; Yin, R.; Li, M.; Bredin, H.; Barras, C. LSTM Based Similarity Measurement with Spectral Clustering for Speaker Diarization. In Proceedings of the Interspeech 2019, Graz, Austria, 15–19 September 2019.
  10. Zhu, Y.; Ko, T.; Snyder, D.; Mak, B.; Povey, D. Self-Attentive Speaker Embeddings for Text-Independent Speaker Verification. In Proceedings of the Interspeech 2018, Hyderabad, India, 2–6 September 2018.
  11. Rezaur, F.A.; Chowdhury, R.; Wang, Q.; Moreno, I.L.; Wan, L. Attention-Based Models for Text-Dependent Speaker Verification. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018.
  12. Okabe, K.; Koshinaka, T.; Shinoda, K. Attentive Statistics Pooling for Deep Speaker Embedding. In Proceedings of the Interspeech 2018, Hyderabad, India, 2–6 September 2018.
  13. You, L.; Guo, W.; Dai, L.-R.; Du, J. Deep Neural Network Embeddings with Gating Mechanisms for Text-Independent Speaker Verification. In Proceedings of the Interspeech 2019, Graz, Austria, 15–19 September 2019.
  14. India, M.; Safari, P.; Hernando, J. Self Multi-Head Attention for Speaker Recognition. In Proceedings of the Interspeech 2019, Graz, Austria, 15–19 September 2019.
  15. Fujita, Y.; Kanda, N.; Horiguchi, S.; Xue, Y.; Nagamatsu, K.; Watanabe, S. End-to-End Neural Speaker Diarization with Self-Attention. In Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) 2019, Singapore, 14–18 December 2019.
  16. Sankala, S.; Rafi, B.S.M.; Kodukula, S.R.M. Self Attentive Context Dependent Speaker Embedding for Speaker Verification. In Proceedings of the 2020 National Conference on Communications (NCC), Kharagpur, India, 21–23 February 2020.
  17. Wan, L.; Wang, Q.; Papir, A.; Moreno, I.L. Generalized End-to-End Loss for Speaker Verification. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018.
  18. Bae, A.; Kim, W. Speaker Verification Employing Combinations of Self-Attention Mechanisms. Electronics 2020, 9, 2201.
  19. Bahdanau, D.; Cho, K.; Bengio, Y. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv 2014, arXiv:1409.0473.
  20. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008.
  21. Chung, J.S.; Huh, J.; Mun, S.; Lee, M.; Heo, H.-S.; Choe, S.; Ham, C.; Jung, S.; Lee, B.-J.; Han, I. In Defence of Metric Learning for Speaker Recognition. In Proceedings of the Interspeech 2020, Shanghai, China, 25–28 October 2020.
  22. Yosinski, J.; Clune, J.; Bengio, Y.; Lipson, H. How Transferable Are Features in Deep Neural Networks? In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; pp. 3320–3328.
  23. Nagrani, A.; Chung, J.S.; Xie, W.; Zisserman, A. Voxceleb: Large-Scale Speaker Verification in the Wild. Comput. Speech Lang. 2020, 60, 101027.
Figure 1. Block diagram of speaker verification system.
Figure 2. (a) Classification model architecture using cross-entropy loss, (b) metric-based model architecture using GE2E loss, and (c) model architecture that re-trains by adding a classifier to the model pre-trained with GE2E loss, such as the model proposed in this paper.
Figure 3. Detection error tradeoff curve of models using multiple attention techniques.
Table 1. Equal error rate (%) of speaker verification systems.
Models | GE2E | Class-GE2E
Single-Head Attention (SHA) | 5.13 | 4.49
Multi-Head Attention (MHA) | 5.85 | 4.43
Sorted Multi-Head Attention (SMHA) | 4.55 | 4.39
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
