Article

Comparison of Modern Deep Learning Models for Speaker Verification

Vitalii Brydinskyi, Yuriy Khoma, Dmytro Sabodashko, Michal Podpora, Volodymyr Khoma, Alexander Konovalov and Maryna Kostiak
1 Institute of Computer Technologies, Automation and Metrology, Lviv Polytechnic National University, Bandery 12, 79013 Lviv, Ukraine
2 Vidby AG, Suurstoffi 8, 6343 Risch-Rotkreuz, Switzerland
3 Department of Computer Science, Opole University of Technology, Proszkowska 76, 45-758 Opole, Poland
4 Department of Control Engineering, Opole University of Technology, Proszkowska 76, 45-758 Opole, Poland
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(4), 1329; https://doi.org/10.3390/app14041329
Submission received: 29 December 2023 / Revised: 24 January 2024 / Accepted: 31 January 2024 / Published: 6 February 2024
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

This research presents an extensive comparative analysis of a selection of popular deep speaker embedding models, namely WavLM, TitaNet, ECAPA, and PyAnnote, applied to speaker verification tasks. The study employs a specially curated dataset designed to mirror the real-world operating conditions of voice models as closely as possible. The dataset consists of short, non-English statements gathered from interviews on a popular online video platform and features 50 unique speakers (33 male and 17 female) ranging in age from 20 to 70, providing a diverse basis for thoroughly testing speaker verification models. It contains 10 clips per speaker, each no longer than 10 s, for a total of 500 recordings; the combined duration is about 1 h and 30 min, or roughly 100 s per speaker, which makes the dataset particularly useful for research on speaker verification with short recordings. The performance of the models is evaluated using common biometric metrics such as the false acceptance rate (FAR), false rejection rate (FRR), equal error rate (EER) and detection cost function (DCF). The results reveal that the TitaNet and ECAPA models stand out, presenting the lowest EER (1.91% and 1.71%, respectively) and thus exhibiting more discriminative features that reduce the intra-class distance (the same speaker) while maximizing the distance between embeddings of different speakers. The analysis also highlights the ECAPA model’s advantageous balance of performance and efficiency, achieving an inference time of 69.43 ms, only slightly longer than that of the PyAnnote model. Beyond comparing overall performance, the study provides a comparative analysis of the respective model embeddings, offering insights into their strengths and weaknesses. The presented findings serve as a foundation for guiding future research in speaker verification, especially in the context of short audio samples or limited data, which is particularly relevant for applications requiring quick and accurate speaker identification from short voice clips.

1. Introduction

Speaker verification is an important research area within Speaker Recognition (SR), a fast-growing field of modern machine learning applications. Two other SR sub-fields are speaker identification and speaker diarization [1,2].
Speaker verification aims to determine whether the declared identity of the speaker is true or false. During speaker verification, the system typically compares the provided voice sample with a reference voiceprint (template) associated with the declared identity; this is known as a one-to-one mapping. Eventually, the system makes a decision indicating whether the speaker is verified (matches the provided identity) or not verified (does not match the provided identity). In contrast to verification, the task of speaker identification is to determine the identity of the speaker from a set of possible speakers. In the speaker identification process, the system compares a given voice sample with voiceprints from a database of known speakers (one-to-many mapping). Then, the system attempts to identify the speaker by selecting the closest match from the database.
The main goal of speaker diarization is to divide the input audio recording into homogeneous segments corresponding to individual speakers. Traditionally, speaker diarization involves both segmentation (detecting speaker changes in an audio recording) and clustering (grouping speech segments corresponding to specific speakers based on voice characteristics). Unlike speaker verification and identification, speaker diarization can be approached using unsupervised learning [3].
Although this paper is dedicated to studying several state-of-the-art deep speaker embedding models primarily in the context of verification, the results may also have implications for the broader field of speaker recognition.
Nowadays, speaker verification has become a crucial task in many applications, such as security, authentication, surveillance, forensics, multi-speaker tracking, personalized user interfaces, voice assistants, door access control systems and others [4,5,6,7]. Its core requirement is the ability to accurately distinguish between different speakers in order to reliably confirm the identity of a specific speaker.
Speech is a time-varying signal that conveys information on multiple levels, including acoustic, semantic, emotional and anatomical. The inherent variability of speech, coupled with the fact that even the same individual cannot say a phrase identically twice, makes its analysis a complex challenge. This complexity poses significant challenges in designing speaker recognition and verification systems.
Early speaker verification systems were constrained by the limitations of statistical methods, which resulted in restricted accuracy. It was only with the advent of machine learning that a significant improvement in performance was achieved [2].
In recent years, numerous new models have emerged in the field of NLP; however, the dominant trend in speech processing revolves around models based on speaker embedding technology. The essence of the speaker embeddings technique is to map the essential features of a speaker’s voice into a numerical (vector) representation of constant dimensions in a multidimensional space [8,9]. Over the subsequent years, the concept of speaker embeddings has been continuously developed, primarily aiming to measure and enhance performance under variable recording or processing conditions [10].
Speaker embeddings offer several advantages compared to traditional speech processing techniques, including greater compactness of voice representation, resilience to noise and recording condition changes and improved discrimination leading to enhanced performance in speech recognition tasks. Currently, speaker embedding technology can be applied to various tasks, including speaker recognition, verification, identification, diarization, tracking different speakers in a recording and detecting the presence of specific keywords [1,11].
Advanced characteristics of speaker embeddings are achieved through training models on extensive voice datasets, allowing the capture of unique voice features crucial for reliable speaker recognition and verification. Modern models vary in architecture, methods of calculating embeddings, length of embeddings and numerous parameters [1,12]. A key area of interest is comparing the performance of these models in real-world scenarios, including their adaptability and portability to other languages.
This study presents a comparison of state-of-the-art models for speaker verification tasks, focusing on models that utilize speaker embeddings. These models offer notable advantages, such as scalability [13], allowing the addition of new users without the need for retraining or fine-tuning. This feature is particularly valuable in applications with constantly growing and changing user bases, where the ability to seamlessly integrate new users without the computational overhead of updating the model is crucial. Moreover, this scalability ensures that the model’s performance remains consistent and reliable, making these models ideal for large-scale application [14]. The models’ performance was compared using a custom dataset comprising 50 speakers. The experiment’s source code and reproducibility details can be found in the following repository: [15].
The primary contributions of the paper include (1) an extensive comparison of state-of-the-art models such as WavLM, TitaNet, Ecapa and Pyannote, with an emphasis on their performance in speaker verification tasks; (2) the utilization of a newly compiled dataset, featuring non-English speech samples of Ukrainian politicians, for evaluating these models; and (3) a thorough analysis of the embeddings produced by each model, providing insights into their strengths, weaknesses and potential applications in real-world scenarios.
The organization of the paper is as follows: Section 1 introduces the context and significance of speaker verification, Section 2 reviews related works and establishes the background for the research, Section 3 outlines the aims of the research and the rationale behind the selection of the models and dataset, Section 4 provides a detailed overview of the models under study, Section 5 describes the experimental setup and methodology, and, finally, Section 6 presents the findings and conclusions.

2. Related Works

The number of publications on Automatic Speech Recognition (ASR), including speaker verification, is rapidly growing [11]. Real-world applications necessitate automatic speaker verification systems capable of handling short, arbitrary (text-independent) utterances that may be captured under diverse conditions, including different recording times and equipment. These variables intensify the challenge of identifying the unique voice characteristics of a speaker, which are inherently subject to variability [16,17,18].
Numerous studies have proposed innovative approaches to enhance speaker verification by exploring various neural network architectures and processing methods, in many cases utilizing open or self-generated datasets [19,20,21,22]. For instance, paper [20] presents a novel network with a hierarchical processing method, where information about intonation as a high-level function is combined with low-level embedding functions. This improves the accuracy of the system, which is confirmed by experimental results on the VoxCeleb1 test data. Paper [21] investigates the relationship between visual speech cues, like lip movements, and audio speech, leading to the introduction of a cross-modal speech co-learning paradigm.
Recognizing speakers based on non-English language utterances presents another significant challenge. Studies show that speaker verification performance declines when there is a language mismatch between training and testing data. To improve cross-language invariance, ref. [23] proposed a method of unsupervised adversarial discriminative domain adaptation. In this approach, the representative speech data from one language is adapted to align with the source domain of another language. Taking it a step further, the authors of [24] introduced the Siamese SpeakerNet network, able to verify speakers without being influenced by language, gender, or age differences and demonstrated superior performance compared to existing methods.
In this study, we concentrate on open and accessible pre-trained speech models for speaker verification. Following a thorough analysis, we selected the following four state-of-the-art speech models for further investigation: Pyannote, WavLM, TitaNet and ECAPA [25,26,27,28].
Several publications have discussed the results and capabilities of these speech models. For instance, ref. [26] describes the WavLM speaker embedding model as versatile and suitable for a range of tasks, including speaker verification, speech recognition and diarization. The authors of WavLM assert that their model outperforms others, specifically ECAPA, in speaker verification tasks on the VoxCeleb dataset.
Studies [3,29] present the results of utilizing the Pyannote and ECAPA models in diarization tasks rather than verification tasks.
The authors of [30] review recent progress in speaker embedding development and perform an experimental benchmark comparison of state-of-the-art speaker representations for a speaker verification task. Their evaluation uses the English-language VoxCeleb1 dataset. The study covers both training and evaluation and compares EER metrics across various speaker embedding types, such as x-vectors, d-vectors and r-vectors, but it considers neither inference time nor the reasons why particular embeddings perform well or poorly. Table 1 presents the results of the experiments in [30], including the EER scores.
In [31], the authors compare speaker verification performance for children’s and adults’ speech. To achieve this, they use a GMM-based speaker verification system and the PF-STAR dataset, which contains children’s English speech utterances.
Our study evaluates the effectiveness of current speaker verification models without additional modifications, emphasizing that the dataset size is not a critical factor for analysis [32,33,34]. This approach is distinct from scenarios where extensive data are essential for training purposes [35,36,37]. We also explore the challenges posed by using brief recordings from real-world environments and examine the performances of these models with Ukrainian speakers, despite the primary training of the models being on English utterances only.

3. Aim of Research

The objective of this study is to identify the most effective state-of-the-art speaker embedding models for verifying speakers, with a special focus on international voices. We assess the models’ ability to verify speakers using a limited set of short utterances, each no longer than 10 s and exclusively in Ukrainian. These utterances were collected from open sources that fall under the definition of “in the wild”. The findings aim to provide a comprehensive guide for speaker verification and offer insights for selecting appropriate speaker embedding models in both research and commercial environments.

4. Models Overview

The models used for speaker embedding extraction employ different approaches and different neural network architectures to achieve their results and performance. However, they all follow the same generalized architecture, which is presented in Figure 1.
This generalized speaker embedding model architecture consists of three main blocks: feature extraction, a deep neural network and a speaker label classifier. A speaker embedding is the feature vector extracted from the output of the deep neural network; during training, it serves as the input to the speaker classifier. The training stage is structured to ensure that the speaker embeddings of different speaker classes are positioned as far apart as possible within the vector space. Table 2 presents a comparison of the tested models.

4.1. WavLM

WavLM is a pre-trained model that leverages self-supervised learning within the HuBERT framework, prioritizing both spoken content modeling and speaker identity retention. Pre-trained on large-scale unlabeled data, the model enhances downstream task performance, diminishes the need for data labeling and eases task-specific adaptation. WavLM learns ASR-related information through masked speech prediction and, at the same time, knowledge relevant to non-ASR tasks through speech denoising modeling. The authors optimize the model structure and training data of HuBERT and wav2vec2. Adding a gated relative position bias to the Transformer backbone improves performance on ASR while keeping almost the same number of parameters and training speed. They also propose an utterance mixing training strategy, in which additional overlapped utterances are created without supervision and incorporated during training to improve speaker discrimination. WavLM was pre-trained on 94k hours of unlabeled public audio data (including the Libri-Light, GigaSpeech and VoxPopuli datasets). It achieves state-of-the-art performance on the SUPERB benchmark and significant improvements on the representative benchmarks of various speech processing tasks [26].
The WavLM architecture uses the Transformer model as a backbone: it contains a convolutional feature encoder followed by a Transformer encoder. To improve the model, a gated relative position bias [38] is encoded based on the offset between the “key” and “query” in the Transformer self-attention mechanism. Compared with the convolutional relative position embedding in wav2vec2 and HuBERT, the gates take the content into consideration and adaptively adjust the relative position bias by conditioning on the current speech content.
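As a concrete illustration, the snippet below sketches how a WavLM speaker embedding can be extracted with the Hugging Face transformers library. The checkpoint name ("microsoft/wavlm-base-plus-sv", corresponding to the WavLM-Base-Plus-SV variant in Table 3), the file name and the pre- and post-processing are assumptions for illustration, not the authors’ exact pipeline.

```python
# Illustrative sketch: WavLM speaker embedding extraction via Hugging Face
# transformers. Checkpoint and file names are assumptions for illustration.
import torch
import torchaudio
from transformers import Wav2Vec2FeatureExtractor, WavLMForXVector

checkpoint = "microsoft/wavlm-base-plus-sv"          # WavLM-Base-Plus-SV variant
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(checkpoint)
model = WavLMForXVector.from_pretrained(checkpoint).eval()

waveform, sample_rate = torchaudio.load("clip.wav")  # expected: mono, 16 kHz
inputs = feature_extractor(
    waveform.squeeze(0).numpy(), sampling_rate=sample_rate, return_tensors="pt"
)
with torch.no_grad():
    embedding = model(**inputs).embeddings           # fixed-length speaker embedding
embedding = torch.nn.functional.normalize(embedding, dim=-1)
```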

4.2. PyAnnote

PyAnnote is a framework which provides various modules for speaker diarization, including an embedding module used for speaker embedding extraction [25,39]. The PyAnnote speaker embedding model is based on the canonical x-vector TDNN architecture [40], with filter banks replaced by trainable SincNet features [41]. PyAnnote provides pre-trained PyTorch models which share the same generic PyanNet base architecture.
PyAnnote’s speaker embedding model uses a network that is 512 units wide and 3 recurrent layers deep, relies on x-vector-like statistical temporal pooling and was trained on short 500 ms audio chunks from the VoxCeleb dataset.
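A minimal sketch of extracting a whole-utterance embedding with pyannote.audio is shown below. It assumes the publicly documented Inference interface of pyannote.audio 2.x, the gated "pyannote/embedding" checkpoint and a valid Hugging Face access token; file names are placeholders.

```python
# Illustrative sketch: whole-utterance speaker embeddings with pyannote.audio.
# Assumes the gated "pyannote/embedding" checkpoint and a valid access token.
from pyannote.audio import Inference, Model
from scipy.spatial.distance import cosine

model = Model.from_pretrained("pyannote/embedding", use_auth_token="HF_TOKEN")
inference = Inference(model, window="whole")   # one embedding per audio file

emb_a = inference("speaker_a.wav")             # numpy vector (512-dimensional)
emb_b = inference("speaker_b.wav")
similarity = 1.0 - cosine(emb_a, emb_b)        # cosine similarity between the two
```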

4.3. TitaNet

TitaNet is a model for speaker representation extraction which employs 1D depth-wise convolutions with Squeeze-and-Excitation (SE) layers that integrate global context, and a channel attention-based statistics pooling layer that maps variable-length utterances to fixed-length embeddings (t-vectors) [42]. The model achieves state-of-the-art results in the speaker verification task [27,43]. The authors of TitaNet use the encoder of the ContextNet model as a top-level feature extractor and feed its output to an attentive pooling layer, which computes attention features across channel dimensions to capture time-independent, utterance-level speaker representations. The output speaker representation has a fixed size of 192. The training data for TitaNet come from VoxCeleb1 and VoxCeleb2, NIST SRE, Switchboard-Cellular1 and Switchboard-Cellular2, Fisher and LibriSpeech; combined, these datasets contain approximately 4.8 M utterances from 16.6 K speakers with a total duration of 3.3 K hours. Additionally, the training process is enhanced with room impulse response (RIR) corpora, speed perturbation and spectral augmentation.
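The sketch below shows one way to obtain TitaNet-Large embeddings through the NVIDIA NeMo toolkit [43]. The pretrained model name ("titanet_large") and the file names are assumptions for illustration; the authors’ exact inference setup may differ.

```python
# Illustrative sketch: TitaNet-Large speaker embeddings via the NVIDIA NeMo
# toolkit. The pretrained name and file names are assumptions for illustration.
import nemo.collections.asr as nemo_asr

speaker_model = nemo_asr.models.EncDecSpeakerLabelModel.from_pretrained("titanet_large")

embedding = speaker_model.get_embedding("clip.wav")              # 192-dimensional t-vector
same_speaker = speaker_model.verify_speakers("a.wav", "b.wav")   # boolean accept/reject
```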

4.4. ECAPA

The Ecapa-TDNN model improves the existing x-vector speaker representation extraction architecture with multiple enhancements inspired by recent advancements in face verification. The authors introduce Squeeze-and-Excitation (SE) blocks into the 1-D Res2Net modules to explicitly model channel interdependencies. An SE block expands the temporal context of the frame layer by rescaling the channels according to the global properties of the recording [42]. The authors also aggregate and propagate features from different hierarchical levels to leverage the ability of neural networks to learn hierarchical features, and they improve the statistics pooling module with channel-dependent frame attention, which enables the network to focus on different subsets of frames while estimating statistics. The proposed architecture outperforms state-of-the-art TDNN-based systems on the VoxCeleb test datasets [28]. The Ecapa-TDNN model is trained on the VoxCeleb2 dataset, which includes approximately 5.9 K unique speakers. The authors also apply augmentations to the training audio files and, as a result, generate six samples per utterance; the augmentations include babble, noise, reverberation, increased and decreased tempo and the effects of different Opus codec compression settings. The output speaker representation has a fixed size of 192 elements.
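For completeness, the following sketch extracts Ecapa-TDNN embeddings with the SpeechBrain toolkit, assuming the public "speechbrain/spkrec-ecapa-voxceleb" checkpoint and 16 kHz mono input; it is an illustrative sketch rather than the authors’ exact pipeline.

```python
# Illustrative sketch: Ecapa-TDNN speaker embeddings with SpeechBrain.
# Assumes the public "speechbrain/spkrec-ecapa-voxceleb" checkpoint.
import torch
import torchaudio
from speechbrain.pretrained import EncoderClassifier

classifier = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")

signal, fs = torchaudio.load("clip.wav")                 # expected: mono, 16 kHz
embedding = classifier.encode_batch(signal)              # shape (1, 1, 192)
embedding = torch.nn.functional.normalize(embedding.squeeze(), dim=-1)
```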

5. Experiment Setup

In this study, a custom-created dataset of non-English short utterances was used to evaluate speaker verification models based on speaker embeddings. Comprising speech recordings from 50 Ukrainian politicians (50 unique voices, 33 male and 17 female) sourced from YouTube interviews, the dataset features 10 clips per individual, each up to 10 s long. This amounts to a total of 500 recordings, collectively spanning 1 h and 30 min (about 100 s per speaker). Every entry in the dataset is single-channel (mono) audio sampled at 16 kHz and stored in .wav format, and the age of the speakers ranges from 20 to 70. The dataset is designed to be diverse, including speakers of different genders and ages, to provide a comprehensive evaluation of the speaker verification models, and it is a useful resource for speaker verification research, particularly with short-duration recordings.
In order to evaluate the performance of the speaker verification models, a test dataset of 4500 pairs of recordings was created. Half of these pairs consist of recordings of the same speaker, while the other half consists of recordings of different speakers. This allows for a comprehensive evaluation of the models’ ability to accurately identify a speaker and differentiate them from others.
The dataset is organized into 50 distinct classes, each corresponding to a different speaker. Within the designated directory for each class, there are 10 audio files containing the respective speaker’s voice. The structural layout of the dataset is illustrated in Figure 2.
The test set, derived from the dataset depicted in Figure 2, is specifically tailored for evaluating speaker verification performance. In this restructured dataset, each speaker class yields 90 audio file pairs. These pairs are evenly divided into two subsets: one containing 45 intra-class pairs, where each audio file is matched with another from the same speaker, and another with 45 inter-class pairs, formed by pairing a random audio file from the target class with one from a different, randomly chosen class. The composition and structure of the test set are detailed in Figure 3.
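The listing below is an illustrative sketch of how such a test set can be generated from the directory layout in Figure 2 (one folder per speaker containing 10 .wav clips); the 45 intra-class pairs correspond to all unordered pairs of a speaker’s 10 clips, and the folder and file names are hypothetical.

```python
# Illustrative sketch of the test-set construction (hypothetical paths):
# 45 intra-class pairs (all unordered pairs of a speaker's 10 clips) and
# 45 inter-class pairs (a random clip of the speaker vs. another speaker's clip).
import itertools
import random
from pathlib import Path

random.seed(0)
dataset_root = Path("dataset")                 # dataset/<speaker_id>/*.wav
speakers = {d.name: sorted(d.glob("*.wav"))
            for d in dataset_root.iterdir() if d.is_dir()}

pairs = []                                     # (file_a, file_b, same_speaker)
for spk, files in speakers.items():
    for a, b in itertools.combinations(files, 2):          # 45 intra-class pairs
        pairs.append((a, b, True))
    others = [s for s in speakers if s != spk]
    for _ in range(45):                                     # 45 inter-class pairs
        other = random.choice(others)
        pairs.append((random.choice(files), random.choice(speakers[other]), False))

print(len(pairs))                              # 50 speakers x 90 pairs = 4500
```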
To evaluate and compare each of the speaker verification models’ performance, the Equal Error Rate (EER), False Acceptance Rate (FAR) and False Rejection Rate (FRR) were used, which are popular performance metrics in biometric systems. The False Acceptance Rate (FAR) represents the proportion of incorrect acceptances, indicating the likelihood that an unauthorized speaker is mistakenly verified. The False Rejection Rate (FRR) indicates the proportion of incorrect rejections, reflecting the frequency with which an authorized speaker is erroneously denied. The Equal Error Rate (EER) metric is a measure of the error rate of the system where the threshold is modified in such a way that the False Acceptance Rate and False Rejection Rate are equal [44], providing a single value to summarize the overall trade-off between the FAR and FRR values.
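A minimal sketch of how FAR, FRR and EER can be computed from similarity scores and ground-truth labels is given below; the exact threshold search used by the authors may differ.

```python
# Illustrative sketch of the FAR/FRR/EER computation from similarity scores
# and ground-truth labels (1 = same speaker, 0 = different speakers).
import numpy as np

def far_frr(scores: np.ndarray, labels: np.ndarray, threshold: float):
    accept = scores >= threshold
    far = np.mean(accept[labels == 0])         # impostor pairs wrongly accepted
    frr = np.mean(~accept[labels == 1])        # genuine pairs wrongly rejected
    return far, frr

def equal_error_rate(scores: np.ndarray, labels: np.ndarray) -> float:
    # Sweep candidate thresholds and keep the point where FAR and FRR are closest.
    rates = [far_frr(scores, labels, t) for t in np.sort(np.unique(scores))]
    far, frr = min(rates, key=lambda r: abs(r[0] - r[1]))
    return (far + frr) / 2.0                   # EER approximated as their mean
```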
The experiment is conducted in the following steps for each pair in the collected dataset. Each audio file of the pair is passed to the speaker verification model, which outputs an embedding for it. The two embeddings (corresponding to the two input audio files) are compared using the cosine similarity function, which yields a score between 0 and 1. Then, based on the retrieved similarity and a threshold, a decision is made: if the similarity is equal to or greater than the threshold, the decision is to accept (the speakers in the input audio files are the same person); if it is lower than the threshold, the decision is to reject (the speakers are not the same person). After this process is repeated for every pair in the dataset, the predicted decisions are compared against the ground truth and the metrics are calculated. The experiment structure is displayed in Figure 4.
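This per-pair verification loop can be sketched as follows; `embed` stands for any of the embedding extractors sketched in Section 4 (a hypothetical callable mapping an audio file path to a 1-D numpy embedding), and the default threshold value is an illustrative assumption.

```python
# Illustrative sketch of the per-pair verification loop described above.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def run_verification(pairs, embed, threshold: float = 0.5):
    scores, decisions, labels = [], [], []
    for file_a, file_b, same_speaker in pairs:
        emb_a, emb_b = embed(file_a), embed(file_b)   # one embedding per audio file
        score = cosine_similarity(emb_a, emb_b)
        scores.append(score)
        decisions.append(score >= threshold)          # accept iff similarity >= threshold
        labels.append(same_speaker)
    return np.array(scores), np.array(decisions), np.array(labels)
```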
All experiments were run on the following hardware: a virtual cloud server (powered by RunPod) with an A4500 GPU with 20 GB of VRAM.
The experiment results are presented in Table 3.
From the experiment results, it can be seen that, out of the five evaluated speaker verification models, the TitaNet and Ecapa models provide superior speaker verification performance, achieving the lowest FAR, FRR and EER. Pyannote ranks third in performance, while both variants of the WavLM model fare worse than the others. To delve deeper into these results, we computed additional parameters: the embedding distance between different classes, the embedding distance within the same class and the Detection Cost Function (DCF) [45], which combines false acceptance and rejection rates into a single cost metric reflecting the overall effectiveness of the system. These findings are detailed in Table 4; a minimal computational sketch of these quantities is given below.
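In the sketch below, the distance metric and the DCF cost and prior parameters are illustrative assumptions rather than the exact values behind Tables 3 and 4; `pairs` and `embed` are the hypothetical objects from the earlier sketches.

```python
# Illustrative sketch: intra-/inter-class distance statistics and a detection
# cost function in the spirit of [45]; metric and parameters are assumptions.
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def distance_stats(pairs, embed):
    intra, inter = [], []
    for file_a, file_b, same_speaker in pairs:
        d = cosine_distance(embed(file_a), embed(file_b))
        (intra if same_speaker else inter).append(d)
    return (np.mean(intra), np.std(intra)), (np.mean(inter), np.std(inter))

def detection_cost(far: float, frr: float,
                   p_target: float = 0.05, c_miss: float = 1.0, c_fa: float = 1.0) -> float:
    # Weighted combination of the miss (FRR) and false-alarm (FAR) rates.
    return c_miss * frr * p_target + c_fa * far * (1.0 - p_target)
```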
Further analysis shows that the better-performing models, namely TitaNet and Ecapa, exhibit greater disparities between intra-class (same class) and inter-class (different class) distances compared to the less effective models, such as WavLM. Notably, the standard deviation of the different-class distances for the WavLM model is much higher than for the other models. These observations explain why the TitaNet and Ecapa models performed so well and why the WavLM model did not: the greater variability implies that WavLM embeddings are more dispersed, complicating accurate differentiation between speakers. Figure 5 visualizes some of the speaker embeddings, offering a clearer perspective on the experimental findings.
The prior observations are further confirmed by the visualizations. It can be seen that the embeddings of the better-performing models (TitaNet, Ecapa) are more tightly clustered among the speakers, which facilitates more accurate speaker verification. In contrast, the WavLM model produces more scattered embeddings, resulting in less distinct groupings, and, consequently, less reliable verification.

6. Conclusions

In summary, our study provides a comprehensive comparison of modern deep learning models for speaker verification based on speaker embeddings on a custom-created dataset. The findings highlight the robust capabilities of speaker embeddings in this domain and provide valuable insights for further research in this field. Out of all the tested speaker verification models, the TitaNet and Ecapa models emerged as superior, with the lowest Equal Error Rates of 1.91% and 1.71%, respectively. While Pyannote trailed with a higher EER of 3.8%, WavLM resulted in an EER of 10.88%. The capability of speaker embedding models to generate distinctly separable embeddings plays a crucial role in a speaker verification system.
A critical aspect of our study was the application of large models to small datasets, specifically for testing purposes. This approach, not intended for training, was instrumental in assessing the models’ efficiency and accuracy in real-world scenarios where data availability (of both language-related datasets and particular speaker samples) can be limited.
The model with the shortest per-sample inference time was Pyannote (49.44 ms), while TitaNet had the longest (110.18 ms). Notably, the Ecapa model provides the best combination of a short inference time (69.43 ms), only slightly longer than that of PyAnnote, and superior accuracy, as reflected by the lowest Equal Error Rate among the models evaluated. This balance is especially relevant in the context of applying large models to smaller datasets for testing, demonstrating the Ecapa model’s adaptability and efficiency in varied data environments.

Author Contributions

Conceptualization, V.B., Y.K. and D.S.; methodology, V.B. and D.S.; software, V.B. and M.K.; validation, Y.K., D.S., V.K. and A.K.; formal analysis, Y.K., V.K. and D.S.; investigation, V.B., Y.K. and D.S.; resources, V.B. and D.S.; data curation, V.B. and M.K.; writing—original draft preparation, V.B., D.S. and M.P.; writing—review and editing, M.P.; visualization, V.B. and D.S.; supervision, V.B., Y.K. and D.S.; project administration, Y.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset created, labeled and used within this paper is made available as open source on the HuggingFace repository, at https://github.com/vbrydik/speaker-verification-test, with DOI: https://huggingface.co/datasets?other=doi%3A10.57967%2Fhf%2F0701 (accessed on 6 June 2023).

Conflicts of Interest

Authors Vitalii Brydinskyi, Yuriy Khoma and Alexander Konovalov were affiliated with the company Vidby AG. All authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Bai, Z.; Zhang, X.L. Speaker recognition based on deep learning: An overview. Neural Netw. 2021, 140, 65–99. [Google Scholar] [CrossRef]
  2. Kabir, M.M.; Mridha, M.F.; Shin, J.; Jahan, I.; Ohi, A.Q. A survey of speaker recognition: Fundamental theories, recognition methods and opportunities. IEEE Access 2021, 9, 79236–79263. [Google Scholar] [CrossRef]
  3. Khoma, V.; Khoma, Y.; Brydinskyi, V.; Konovalov, A. Development of Supervised Speaker Diarization System Based on the PyAnnote Audio Processing Library. Sensors 2023, 23, 2082. [Google Scholar] [CrossRef] [PubMed]
  4. Dovydaitis, L.; Rasymas, T.; Rudžionis, V. Speaker authentication system based on voice biometrics and speech recognition. In Proceedings of the Business Information Systems Workshops: BIS 2016 International Workshops, Leipzig, Germany, 6–8 July 2016; Springer: Berlin/Heidelberg, Germany, 2017; pp. 79–84. [Google Scholar]
  5. Hansen, J.H.; Hasan, T. Speaker recognition by machines and humans: A tutorial review. IEEE Signal Process. Mag. 2015, 32, 74–99. [Google Scholar] [CrossRef]
  6. Jahangir, R.; Teh, Y.W.; Nweke, H.F.; Mujtaba, G.; Al-Garadi, M.A.; Ali, I. Speaker identification through artificial intelligence techniques: A comprehensive review and research challenges. Expert Syst. Appl. 2021, 171, 114591. [Google Scholar] [CrossRef]
  7. Alaliyat, S.; Waaler, F.F.; Dyvik, K.; Oucheikh, R.; Hameed, I. Speaker Verification Using Machine Learning for Door Access Control Systems. In Proceedings of the International Conference on Artificial Intelligence and Computer Vision, Settat, Morocco, 28–30 June 2021; Springer: Berlin/Heidelberg, Germany, 2021; pp. 689–700. [Google Scholar]
  8. Wells, J.H.; Williams, L.R. Embeddings and Extensions in Analysis; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2012; Volume 84. [Google Scholar]
  9. Tsoi, P.K.; Fung, P. A Novel Technique for Frame Selection for GMM-based text-independent Speaker Recognition. In Proceedings of the ICSLP 2000, Beijing, China, 16–20 October 2000. [Google Scholar]
  10. Bhattacharya, G.; Alam, M.J.; Kenny, P. Deep Speaker Embeddings for Short-Duration Speaker Verification. In Proceedings of the Interspeech, Stockholm, Sweden, 20–24 August 2017; pp. 1517–1521. [Google Scholar]
  11. Mohammed, T.S.; Aljebory, K.M.; Rasheed, M.A.A.; Al-Ani, M.S.; Sagheer, A.M. Analysis of Methods and Techniques Used for Speaker Identification, Recognition, and Verification: A Study on Quarter-Century Research Outcomes. Iraqi J. Sci. 2021, 62, 3256–3281. [Google Scholar] [CrossRef]
  12. Univaso, P. Forensic speaker identification: A tutorial. IEEE Lat. Am. Trans. 2017, 15, 1754–1770. [Google Scholar] [CrossRef]
  13. Echihabi, K.; Zoumpatianos, K.; Palpanas, T. Scalable machine learning on high-dimensional vectors: From data seriesto deep network embeddings. In Proceedings of the 10th International Conference on Web Intelligence, Mining and Semantics, Biarritz, France, 30 June–3 July 2020; pp. 1–6. [Google Scholar]
  14. Jurafsky, D.; Martin, J.H.; Kehler, A.; Vander Linden, K.; Ward, N. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition; Prentice Hall: Hoboken, NJ, USA, 2000. [Google Scholar]
  15. Brydinskyi, V. Dataset of 500 Short Speech Utterances of 50 Ukrainian Politicians. 2023. Available online: https://github.com/vbrydik/speaker-verification-test (accessed on 30 January 2024).
  16. Xie, W.; Nagrani, A.; Chung, J.S.; Zisserman, A. Utterance-level aggregation for speaker recognition in the wild. In Proceedings of the ICASSP 2019—2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 5791–5795. [Google Scholar]
  17. Poddar, A.; Sahidullah, M.; Saha, G. Speaker verification with short utterances: A review of challenges, trends and opportunities. IET Biom. 2018, 7, 91–101. [Google Scholar] [CrossRef]
  18. Viñals, I.; Ortega, A.; Miguel, A.; Lleida, E. An analysis of the short utterance problem for speaker characterization. Appl. Sci. 2019, 9, 3697. [Google Scholar] [CrossRef]
  19. Wan, L.; Wang, Q.; Papir, A.; Moreno, I.L. Generalized end-to-end loss for speaker verification. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 4879–4883. [Google Scholar]
  20. Li, J.; Yan, N.; Wang, L. FDN: Finite difference network with hierarchical convolutional features for text-independent speaker verification. arXiv 2021, arXiv:2108.07974. [Google Scholar]
  21. Liu, M.; Lee, K.A.; Wang, L.; Zhang, H.; Zeng, C.; Dang, J. Cross-Modal Audio-Visual Co-Learning for Text-Independent Speaker Verification. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar]
  22. Kim, S.H.; Nam, H.; Park, Y.H. Analysis-based Optimization of Temporal Dynamic Convolutional Neural Network for Text-Independent Speaker Verification. IEEE Access 2023, 11, 60646–60659. [Google Scholar] [CrossRef]
  23. Xia, W.; Huang, J.; Hansen, J.H. Cross-lingual text-independent speaker verification using unsupervised adversarial discriminative domain adaptation. In Proceedings of the ICASSP 2019—2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 5816–5820. [Google Scholar]
  24. Habib, H.; Tauseef, H.; Fahiem, M.A.; Farhan, S.; Usman, G. SpeakerNet for Cross-lingual Text-Independent Speaker Verification. Arch. Acoust. 2020, 45, 573–583. [Google Scholar]
  25. Bredin, H.; Yin, R.; Coria, J.M.; Gelly, G.; Korshunov, P.; Lavechin, M.; Fustes, D.; Titeux, H.; Bouaziz, W.; Gill, M.P. Pyannote.audio: Neural building blocks for speaker diarization. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 7124–7128. [Google Scholar]
  26. Chen, S.; Wang, C.; Chen, Z.; Wu, Y.; Liu, S.; Chen, Z.; Li, J.; Kanda, N.; Yoshioka, T.; Xiao, X.; et al. WavLM: Large-scale self-supervised pre-training for full stack speech processing. IEEE J. Sel. Top. Signal Process. 2022, 16, 1505–1518. [Google Scholar] [CrossRef]
  27. Koluguri, N.R.; Park, T.; Ginsburg, B. TitaNet: Neural Model for speaker representation with 1D Depth-wise separable convolutions and global context. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022; pp. 8102–8106. [Google Scholar]
  28. Desplanques, B.; Thienpondt, J.; Demuynck, K. ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN based speaker verification. arXiv 2020, arXiv:2005.07143. [Google Scholar]
  29. Dawalatabad, N.; Ravanelli, M.; Grondin, F.; Thienpondt, J.; Desplanques, B.; Na, H. ECAPA-TDNN embeddings for speaker diarization. arXiv 2021, arXiv:2104.01466. [Google Scholar]
  30. Jakubec, M.; Jarina, R.; Lieskovska, E.; Kasak, P. Deep speaker embeddings for Speaker Verification: Review and experimental comparison. Eng. Appl. Artif. Intell. 2024, 127, 107232. [Google Scholar] [CrossRef]
  31. Safavi, S.; Najafian, M.; Hanani, A.; Russell, M.J.; Jancovic, P. Comparison of speaker verification performance for adult and child speech. In Proceedings of the WOCCI, Singapore, 19 September 2014; pp. 27–31. [Google Scholar]
  32. Tobin, J.; Tomanek, K. Personalized automatic speech recognition trained on small disordered speech datasets. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022; pp. 6637–6641. [Google Scholar]
  33. Nammous, M.K.; Saeed, K.; Kobojek, P. Using a small amount of text-independent speech data for a BiLSTM large-scale speaker identification approach. J. King Saud Univ.-Comput. Inf. Sci. 2022, 34, 764–770. [Google Scholar] [CrossRef]
  34. Prihasto, B.; Azhar, N.F. Evaluation of recurrent neural network based on Indonesian speech synthesis for small datasets. Adv. Sci. Technol. 2021, 104, 17–25. [Google Scholar]
  35. Nagrani, A.; Chung, J.S.; Zisserman, A. Voxceleb: A large-scale speaker identification dataset. arXiv 2017, arXiv:1706.08612. [Google Scholar]
  36. Zeinali, H.; Burget, L.; Černockỳ, J.H. A multi purpose and large scale speech corpus in Persian and English for speaker and speech recognition: The DeepMine database. In Proceedings of the 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Singapore, 14–18 December 2019; pp. 397–402. [Google Scholar]
  37. Aldarmaki, H.; Ullah, A.; Ram, S.; Zaki, N. Unsupervised automatic speech recognition: A review. Speech Commun. 2022, 139, 76–91. [Google Scholar] [CrossRef]
  38. Chi, Z.; Huang, S.; Dong, L.; Ma, S.; Zheng, B.; Singhal, S.; Bajaj, P.; Song, X.; Mao, X.L.; Huang, H.; et al. Xlm-e: Cross-lingual language model pre-training via electra. arXiv 2021, arXiv:2106.16138. [Google Scholar]
  39. Coria, J.M.; Bredin, H.; Ghannay, S.; Rosset, S. A comparison of metric learning loss functions for end-to-end speaker verification. In Proceedings of the International Conference on Statistical Language and Speech Processing, Cardiff, UK, 14–16 October 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 137–148. [Google Scholar]
  40. Snyder, D.; Garcia-Romero, D.; Sell, G.; Povey, D.; Khudanpur, S. X-vectors: Robust DNN embeddings for speaker recognition. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 5329–5333. [Google Scholar]
  41. Ravanelli, M.; Bengio, Y. Speaker recognition from raw waveform with sincnet. In Proceedings of the 2018 IEEE Spoken Language Technology Workshop (SLT), Athens, Greece, 18–21 December 2018; pp. 1021–1028. [Google Scholar]
  42. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  43. Kuchaiev, O.; Li, J.; Nguyen, H.; Hrinchuk, O.; Leary, R.; Ginsburg, B.; Kriman, S.; Beliaev, S.; Lavrukhin, V.; Cook, J.; et al. Nemo: A toolkit for building ai applications using neural modules. arXiv 2019, arXiv:1909.09577. [Google Scholar]
  44. Cheng, J.M.; Wang, H.C. A method of estimating the equal error rate for automatic speaker verification. In Proceedings of the 2004 International Symposium on Chinese Spoken Language Processing, Hong Kong, China, 15–18 December 2004; pp. 285–288. [Google Scholar]
  45. Kinnunen, T.; Lee, K.A.; Delgado, H.; Evans, N.; Todisco, M.; Sahidullah, M.; Yamagishi, J.; Reynolds, D.A. t-DCF: A detection cost function for the tandem assessment of spoofing countermeasures and automatic speaker verification. arXiv 2018, arXiv:1804.09618. [Google Scholar]
Figure 1. Generalized speaker embedding model architecture.
Figure 2. Dataset structure.
Figure 3. Test set structure for each of the speaker classes in the dataset.
Figure 4. Speaker verification experiment structure.
Figure 5. Speaker embeddings visualization for respective models: PyAnnote, WavLM, TitaNet and ECAPA. The axes represent the main PCA components for the voice sample embeddings of five exemplary Ukrainian politicians.
Table 1. Experiment results of the study in [30].

Embedding Method | EER (%)
i-vector | 10.73
d-vector | 6.44
x-vector | 3.98
x-vector (E-TDNN) | 3.76
x-vector (F-TDNN) | 3.53
r-vector (ResNet) | 3.18
r-vector (Res2Net) | 2.71
Table 2. Comparison of the speaker embedding model parameters.

Model | Speaker Embedding Dimension | Training Dataset | Architecture
PyAnnote | 512 | VoxCeleb | X-vector with SincNet
WavLM | 256 | LibriSpeech | Transformer
TitaNet | 192 | VoxCeleb, NIST SRE, Fisher, LibriSpeech | ContextNet with channel attention pooling
Ecapa-TDNN | 192 | VoxCeleb | Improved TDNN
Table 3. Calculated metrics in the conducted experiment.

Model | FAR (%) | FRR (%) | EER (%) | DCF | Inference Time (ms)
PyAnnote | 3.78 | 3.82 | 3.8 | 0.259 | 49.44 ± 11.97
WavLM-Base-SV | 12.53 | 12.4 | 12.47 | 0.445 | 91.39 ± 12.85
WavLM-Base-Plus-SV | 10.84 | 10.93 | 10.88 | 0.407 | 92.25 ± 17.01
TitaNet-Large | 1.91 | 1.91 | 1.91 | 0.138 | 110.18 ± 8.59
Ecapa | 1.73 | 1.68 | 1.71 | 0.139 | 69.43 ± 8.06
Table 4. Calculated speaker embedding parameters.

Model | Distance between Samples of Different Classes | Distance between Samples of the Same Class
PyAnnote | 1.298 ± 0.071 | 0.941 ± 0.117
WavLM-Base-SV | 0.654 ± 0.238 | 0.286 ± 0.075
WavLM-Base-Plus-SV | 0.684 ± 0.254 | 0.286 ± 0.073
TitaNet-Large | 1.246 ± 0.084 | 0.773 ± 0.121
Ecapa | 1.293 ± 0.081 | 0.827 ± 0.119
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
