As previously mentioned, text-independent speaker verification can be accomplished via a variety of machine learning and deep learning techniques. To justify the concentration on SVMs over other classical machine learning techniques, and to ensure that a satisfactory range of deep learning techniques is represented, a number of both are briefly surveyed before a select few are analyzed in depth.
The following deep learning-based papers serve to supplement the later papers covered in depth by showing the range and diversity of techniques employed. Jung et al. [
7] improved upon their previous work, namely that in Jung et al. [
4], by using a DNN to perform text-independent speaker verification with only raw waveforms as input. Their network, which they dubbed RawNet, uses a convolutional neural network-gated recurrent unit (CNN-GRU) architecture in which the CNN portions are ResNet blocks. The improvement over the original network comes from incorporating filter-wise feature map scaling (FMS). With this addition, they were able to obtain an EER of 2.57% on the VoxCeleb-1 evaluation set, an improvement of 2.23 percentage points over their original method. Tang et al. [
8] created a hybrid TDNN-LSTM network with a multilevel pooling strategy to obtain speaker information from the TDNN and LSTM layers. The reasoning for combining these models is that the TDNN focuses on local features while the LSTM considers global and sequential information from the entire utterance. With this system, they were able to achieve an EER of 6.61% using the Tagalog and Cantonese portions of the NIST SRE16 evaluation set and the SRE18 development set. Wang et al. [
9] examined how a wide variety of hyperparameters affect a CNN used for text-independent speaker verification, including but not limited to input embeddings, the number of enrollment utterances, the duration of enrollment utterances, and loss functions. Ultimately, they found that a greater number of enrollment utterances of longer duration gives the best results. Using this information, they were able to obtain an EER of 2.0%. Paranatti et al. [
10] used data from the LibriSpeech dataset to train a CNN on MFCC spectrograms. Their model consists of three convolutional layers, three pooling layers, two dense layers, one flatten layer, and one dropout layer. With this model, they achieved a training accuracy of 98% and a validation accuracy of 93%.
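To make the kind of architecture just described concrete, the following is a minimal sketch (in PyTorch) of a comparable layer stack: three convolutional layers, three pooling layers, a flatten layer, one dropout layer, and two dense layers applied to an MFCC spectrogram treated as a single-channel image. The channel counts, kernel sizes, input shape, and number of speakers are illustrative assumptions rather than the configuration reported in the paper.

```python
# Minimal sketch of a CNN over MFCC "images": 3 conv, 3 pool, flatten,
# dropout, and 2 dense layers. All sizes are illustrative assumptions.
import torch
import torch.nn as nn

class MfccCnn(nn.Module):
    def __init__(self, n_speakers: int = 100):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # pooling layer 1
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # pooling layer 2
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # pooling layer 3
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                         # flatten layer
            nn.Dropout(0.5),                      # dropout layer
            nn.Linear(64 * 5 * 25, 128), nn.ReLU(),  # dense layer 1
            nn.Linear(128, n_speakers),           # dense layer 2
        )

    def forward(self, x):
        # x: (batch, 1, n_mfcc=40, n_frames=200) MFCC spectrogram
        return self.classifier(self.features(x))

model = MfccCnn(n_speakers=100)
logits = model(torch.randn(8, 1, 40, 200))        # (8, 100) speaker scores
```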
A review of machine learning papers from the last five to ten years revealed a number of different techniques being used for text-independent speaker verification. Three of these were more common than the rest: GMM-UBM, SVM, and i-vector-based approaches. Song et al. [
11] tested the effect of noise-invariant frame selection, a preprocessing method, on three machine learning models: vector quantization (VQ), GMM-UBM, and an i-vector-based approach. They used the TIMIT dataset with an input of 24 features per frame (12 MFCC and 12 ΔMFCC coefficients). Their GMM-UBM achieved an EER of 5.6%, their VQ achieved 8.3%, and the i-vector-based method obtained 5.0%. Thakur et al. [
12] trained a GMM-UBM and compared it to a generic i-vector-based system using PLDA. They created their own dataset by scraping YouTube and LibriVox; it contains 50 speakers, each with more than an hour of total data split into approximately 50 one-minute utterances. They removed silence using voice activity detection and then parameterized the speech with MFCCs. Their GMM-UBM achieved an EER of 5.31%, and their i-vector system achieved an EER of 4.74%. Rakhmanenko et al. [
13] tested nine different feature sets, all of which included MFCC with 14 coefficients. The tests were conducted on a GMM-UBM with 256 mixtures trained on features extracted from an in-house dataset of 50 speakers, split evenly by gender, each with at least 6 min of speech. The best EER they obtained was 0.763%, from the feature set containing 14 MFCC coefficients and voicing probability. Ditlovich et al. [
14] compared GMM-UBMs and GMM-IBM with and without mostly voiced speech (MVS) at three different SNR levels: clean, 3 dB, and 9 dB. MVS comprises speech data that have not undergone voice activity detection to remove portions containing silence. The best results were obtained with MVS at the clean SNR level, where the GMM-UBM obtained an EER of 14.5833% and the GMM-IBM obtained 13.02%. Pinheiro et al. [
15] compared a type-2 Fuzzy GMM-UBM (T2F-GMM-UBM) and a generic GMM-UBM. A fuzzy GMM is a GMM that uses a soft clustering method, meaning that each point is assigned a score based on how strongly it correlates with a cluster; this allows points to be assigned to multiple clusters [
16]. Type 2 means that two separate variables are used to obtain the score, as opposed to the usual one. For each model architecture, they tried four different numbers of mixtures: 32, 64, 128, and 256. The best EER obtained by the T2F-GMM-UBM was 13.73% with 128 mixtures, and the best EER obtained by the GMM-UBM was 16.86% with 64 mixtures. Sarmah et al. [
17] collected data from four different microphones and trained four different GMM-UBMs, evaluating each on data from the device it was trained on as well as the other three devices. The lowest EER for matched training and testing devices was 7.50%, and the lowest EER for mismatched devices was 18.70%. Li et al. [
18] created a hybrid GMM/i-vector system. The authors combined the two because the GMM is able to find features that the tokenizer cannot. Using the English portion of the NIST SRE dataset, they were able to achieve an EER of 1.71% with their hybrid system. Das et al. [
19] created an i-vector-based system using PLDA. They used a 39-dimensional feature vector made up of 13 MFCC, 13 ΔMFCC, and 13 ΔΔMFCC coefficients extracted from an in-house dataset collected from 100 students, each with 3 min of speech data. Tests were conducted to determine the effects that different amounts of limited data have on the model. The following test utterance durations were evaluated: 2, 3, 5, 10, 15, and 20 s, as well as full utterances. In addition, 1 and 3 min of training data were compared. From these tests, they were able to achieve an EER of 11.30%. Chen et al. [
20] created an i-vector-based system using PLDA and tested its efficacy on data where enrollment and test utterances were captured on different devices. Their paper calculated EERs for many different scenarios: the lowest was 0.164% and the highest was 6.802%. Ghoniem et al. [
21] used a fuzzy HMM (FHMM) on an in-house Arabic-language dataset. The FHMM is a standard HMM that incorporates the kernel fuzzy c-means (KFCM) algorithm, which is used to compute fuzzy membership values in order to reduce information loss and increase the recognition rate. Using this FHMM, they were able to achieve a recognition rate of 98.3%. Hourri et al. [
22] devised a novel clustering method loosely based on K-means clustering: a point is assigned according to the assignment of its nearest neighboring point, rather than the nearest centroid as in standard K-means. With this scoring method, they were able to achieve an EER of 0.32%. The aforementioned papers are briefly summarized in
Table 1 to enable an easier comparison to SVMs.
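To make the comparison more concrete, the following is a minimal sketch of the GMM-UBM verification pipeline that several of the papers above share in outline: short-term cepstral features (here 12 MFCC plus 12 ΔMFCC coefficients), a universal background model, mean-only MAP adaptation to the enrolled speaker, and a log-likelihood-ratio score. The mixture count, relevance factor, file paths, and decision threshold are illustrative assumptions rather than values taken from any single reviewed system.

```python
# Minimal sketch of a GMM-UBM speaker verifier with 12 MFCC + 12 delta-MFCC
# features, mean-only MAP adaptation, and log-likelihood-ratio scoring.
import copy
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_delta(wav_path: str) -> np.ndarray:
    """Return a (frames, 24) matrix of 12 MFCC + 12 delta-MFCC features."""
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12)
    return np.vstack([mfcc, librosa.feature.delta(mfcc)]).T

def adapt_means(ubm: GaussianMixture, X: np.ndarray, r: float = 16.0):
    """Mean-only MAP adaptation of the UBM to enrollment frames X."""
    gamma = ubm.predict_proba(X)                  # (frames, mixtures)
    n = gamma.sum(axis=0) + 1e-10                 # soft counts per mixture
    first = gamma.T @ X / n[:, None]              # per-mixture data means
    alpha = (n / (n + r))[:, None]                # adaptation weights
    spk = copy.deepcopy(ubm)
    spk.means_ = alpha * first + (1.0 - alpha) * ubm.means_
    return spk

# In practice the UBM is trained on many hours of background speech;
# the file names below are placeholders.
background = np.vstack([mfcc_delta(p) for p in ["bg1.wav", "bg2.wav"]])
ubm = GaussianMixture(n_components=256, covariance_type="diag").fit(background)

speaker_model = adapt_means(ubm, mfcc_delta("enroll.wav"))
test = mfcc_delta("test.wav")
llr = speaker_model.score_samples(test).mean() - ubm.score_samples(test).mean()
accept = llr > 0.0   # threshold chosen on development data in practice
```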
The choice to concentrate on SVMs over the other machine learning techniques was driven by several factors. First, preliminary comparisons found that SVMs achieve results comparable to or better than other machine learning techniques; for example, the average EER of the reviewed SVMs is 5.2%, compared with an average of 9.075% for the GMM-UBM, the most common machine learning technique. Second, while the recent SVM literature is limited, there are still enough papers to enable a comprehensive review and analysis, whereas recent work using HMMs or K-means clustering for text-independent speaker verification is too scarce to support one. The final factor was that this lack of recent research persists despite the advantages SVMs offer, and we wished to determine whether the neglect is justified.
2.2. SVM
All of the SVM papers reviewed compare multiple feature sets or multiple kernels; it is worth noting that all of them use Mel-frequency cepstral coefficients (MFCC), with varying numbers of filters, and most use the radial basis function (RBF) kernel in their analysis. Another commonality is the use of principal component analysis (PCA) as a dimensionality reduction technique. These similarities give a sense of what methods are standard in the field.
Table 3 gives an overview of the features used by each SVM paper.
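As a concrete illustration of the recipe these commonalities suggest, the following is a minimal sketch of an MFCC-PCA-RBF-SVM verifier. Collapsing each utterance into a single mean-and-standard-deviation MFCC vector, the PCA variance threshold, the SVM hyperparameters, and the placeholder file lists are assumptions made for illustration and do not reproduce any single reviewed paper.

```python
# Minimal sketch: MFCC utterance vectors -> PCA -> RBF-kernel SVM that
# separates the target speaker from impostors. Values are illustrative.
import numpy as np
import librosa
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def utterance_vector(wav_path: str, n_mfcc: int = 13) -> np.ndarray:
    """Collapse an utterance into a single vector of mean and std MFCCs."""
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

target_paths = ["target_01.wav", "target_02.wav"]    # enrolled speaker (placeholders)
impostor_paths = ["imp_01.wav", "imp_02.wav"]        # background speakers (placeholders)

# X: one row per utterance; y: 1 for the target speaker, 0 for impostors.
X = np.stack([utterance_vector(p) for p in target_paths + impostor_paths])
y = np.array([1] * len(target_paths) + [0] * len(impostor_paths))

verifier = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),       # keep components explaining 95% of variance
    SVC(kernel="rbf", C=1.0, gamma="scale", probability=True),
)
verifier.fit(X, y)
score = verifier.predict_proba(utterance_vector("trial.wav")[None, :])[:, 1]
```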
Abdalmalak et al. [
24] compared 13 different feature sets and three different kernel functions (logistic regression, RBF, and linear) on the English Language Speech Database for Speaker Recognition (ELSDSR). After finding the feature set that scored the highest ROC AUC across all three kernels, they further improved performance by combining the trained models. They tested four different methods of combining the models: majority vote, unanimous vote, at least one, and k-out-of-N votes. Of these, the “at least one” method worked best, providing an ROC AUC of 98%; it verifies the speaker when at least one of the three models predicts the input as the target speaker. Zerget et al. [
25] tested three different feature sets and the PCA of each using SVMs with an RBF kernel. The tested feature sets are as follows: 12 MFCC coefficients plus the energy parameter and their first and second derivatives, 12 LSF coefficients, and the combination of the previous two. The testing was performed using the TIMIT dataset, and the paper’s metric of choice was the equal error rate (EER). The best result was an EER of 0.51%, obtained when using the PCA of the combined MFCC-LSF feature set. Kamruzzaman et al. [
26] compared the accuracy of SVMs trained using chunking against those trained with sequential minimal optimization (SMO). Chunking and SMO are different methods of solving the quadratic programming problem that arises during SVM training. The testing was done on an in-house dataset and found that SVMs trained using SMO are slightly more accurate (95% vs. 91.88%). Charan et al. [
23] compared three feature extraction techniques, namely MFCC, LPCC, and PLP, as well as two dimensionality reduction techniques, namely PCA and t-SNE.
The paper covers several different machine learning algorithms in addition to SVMs, but those fall outside the scope of this review. The previously listed features undergo dimensionality reduction using PCA and t-SNE, after which SVMs are trained on the reduced features. Rashno et al. [
27] compared three different SVM kernels (RBF, polynomial, and multilayer perceptron (MLP)) with two different feature selection methods, one based on a genetic algorithm (GA) and one based on ant colony optimization (ACO), on the TIMIT dataset. Ali et al. [
28] tested seven different feature sets on the Urdu dataset, each obtained via a unique feature extraction technique. They obtained their best results by combining the MFCC features with the output of a restricted Boltzmann machine run on the PCA of the audio signal’s spectrogram. Of the SVM papers that we reviewed, four used publicly accessible datasets (the Urdu dataset, ELSDSR, and TIMIT), as shown in
Table 2.
The following papers, published before 2010, also used SVMs. They are not included in the in-depth analysis applied to the newer SVM papers and instead serve to further illustrate the diversity of techniques in this field. Kharroubi et al. [
35] combined a GMM and an SVM. The GMM was trained on a 33-dimensional feature vector made up of 16 LFCC coefficients, 16 ΔLFCC coefficients, and the delta of the energy. After the GMM was trained, its output was given to the SVM, where the actual classification occurs. Using this system, they were able to achieve an EER of 16%. Gu et al. [
36] constructed six different SVMs from two kernel functions (polynomial and RBF) and three decision functions (binary, sigmoid, and unthresholded). Three SVMs were made for each kernel function, each with one of the decision functions. The best SVM they trained uses the RBF kernel and the unthresholded decision function, obtaining an EER of 2.3%. Wan et al. [
37] implemented an SVM using a score-space kernel. Score-space kernels are generalizations of the Fisher kernel and are able to discriminate between whole sequences, as opposed to frame-based kernels. Using a score-space kernel, the authors were able to achieve an EER of 4.03%. Liu et al. [
38] created a hybrid GMM/SVM system. The system works by first running the data through a GMM that is adapted from a UBM; after that, 16 MFCC and 16 ΔMFCC coefficients are taken and fed into an SVM, which performs the classification. The system was trained and tested on a subset of the NIST 2004 speaker recognition dataset, and they were able to obtain an EER of 11.92%.
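A common way to realize the GMM-to-SVM hand-off used by these hybrids is the GMM mean-supervector SVM, in which each utterance is MAP-adapted from the UBM and the stacked adapted means form a fixed-length input to the SVM. The sketch below follows that formulation; it reuses the mfcc_delta and adapt_means helpers and the ubm object from the earlier GMM-UBM sketch, and it is an illustration of the general idea rather than the exact design of either Kharroubi et al. or Liu et al.

```python
# Minimal sketch of a GMM-supervector/SVM hybrid. Relies on mfcc_delta(),
# adapt_means(), and the trained `ubm` from the GMM-UBM sketch above;
# file names are placeholders.
import numpy as np
from sklearn.svm import SVC

def supervector(ubm, wav_path: str) -> np.ndarray:
    """Stack the MAP-adapted mixture means into one fixed-length vector."""
    return adapt_means(ubm, mfcc_delta(wav_path)).means_.ravel()

target_utts = ["target_01.wav", "target_02.wav"]      # enrolled speaker
impostor_utts = ["imp_01.wav", "imp_02.wav"]          # background speakers
X = np.stack([supervector(ubm, p) for p in target_utts + impostor_utts])
y = np.array([1] * len(target_utts) + [0] * len(impostor_utts))

hybrid = SVC(kernel="linear").fit(X, y)               # SVM makes the decision
decision = hybrid.decision_function(supervector(ubm, "trial.wav")[None, :])
```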
2.3. Deep Learning
Before discussing the specifics of each deep learning technique, it is worth noting that the papers reviewed have a few things in common. The first is that every paper reports the EER, and only one of them additionally reports the ROC AUC; this was surprising, as the SVM papers show no such clear preference for one metric over the other. The second commonality is the prevalence of the VoxCeleb dataset, which is likely because it is one of the largest publicly available speech datasets, making it well suited to deep learning applications.
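Since the EER is the metric that recurs throughout, the following is a minimal sketch of how it can be computed from a set of trial scores and labels by locating the operating point where the false-acceptance and false-rejection rates cross; the scores and labels shown are placeholders.

```python
# Minimal sketch of an equal error rate (EER) computation from trial scores.
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels: np.ndarray, scores: np.ndarray) -> float:
    """labels: 1 = target trial, 0 = impostor trial; higher score = more target-like."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))      # threshold where FAR ~= FRR
    return float((fpr[idx] + fnr[idx]) / 2.0)

labels = np.array([1, 1, 0, 0, 1, 0])          # placeholder trial labels
scores = np.array([0.9, 0.7, 0.4, 0.2, 0.6, 0.5])  # placeholder trial scores
print(f"EER = {equal_error_rate(labels, scores):.2%}")
```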
Tayebi et al. [
29] created a text-independent speaker verification system using Google’s Generalized End-to-End Loss for Speaker Verification (GE2E), trained several models with different numbers of enrollment utterances, and compared their EERs to each other and to three baseline GMMs. They conducted their experiments on the LibriSpeech dataset; for training, they used the train-clean-360 subset, from which they trained six different models, each with a different number of enrollment utterances: 2, 3, 5, 7, 10, and 15. These six models were evaluated on three different subsets: test-clean, test-other, and dev-clean. Choi et al. [
30] combined a CNN-meets-vision-Transformer (CMT) with broadcasting residual learning (BRL) to create a novel architecture they call BC-CMT. They trained three of these models on the VoxCeleb-1 dataset: BC-CMT-Tiny with 273.6K parameters, BC-CMT-Small with 1.4M parameters, and BC-CMT-Base with 6.3M parameters. The models were evaluated against others using the VoxCeleb text-independent speaker verification (TI-SV) benchmark, which is composed of three sets, namely the VoxCeleb-1 original, extended, and hard test sets. Xu et al. [
31] built a CNN from ResNet and Squeeze-and-Excitation (SE) blocks with four loss functions: triplet, n-pair, angular, and softmax. The authors combined these loss functions under the hypothesis that they would complement one another. The model was trained on the VoxCeleb2 training set, which contains over a million utterances from over 6000 speakers and was chosen because it was collected in natural, noisy environments, which translates well to real-world scenarios. They compared their model to a number of other architectures using the VoxCeleb benchmark. Li et al. [
32] created a dual attention network that is trained end-to-end and evaluated on the VoxCeleb-1 database. Their model works by taking a pair of input utterances and generating utterance-level embeddings from which similarity is measured; this works because utterances from the same speaker are expected to have highly similar embeddings. Chen et al. [
33] used the Mandarin Chinese regional dataset within the Common Voice dataset to train and evaluate two 3D-CNNs for text-independent speaker verification, a lightweight one and a full-sized one. Since voice data are two-dimensional, they randomly segmented and stacked the data to make them three-dimensional. Zhang et al. [
34] used contrastive self-supervised learning (CSSL) to train a ResNet34-based model using only 5% of the labeled data from the VoxCeleb1 and 2 datasets, and were able to achieve results comparable to models using significantly more labeled data. All of the deep learning papers used publicly accessible datasets, and one of them provided publicly available code, as shown in
Table 2.
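Most of the deep systems above ultimately verify a trial by comparing fixed-length utterance embeddings. The following is a minimal sketch of that final scoring step: average the enrollment embeddings, compute the cosine similarity with the test embedding, and compare the score with a threshold. The stand-in encoder, feature shape, and threshold are illustrative assumptions standing in for any of the reviewed architectures.

```python
# Minimal sketch of embedding-based verification by cosine scoring.
import torch
import torch.nn.functional as F

def verify(encoder, enroll_utts, test_utt, threshold=0.7):
    """enroll_utts: list of feature tensors for the claimed speaker."""
    with torch.no_grad():
        enroll = torch.stack([encoder(u) for u in enroll_utts]).mean(dim=0)
        test = encoder(test_utt)
    score = F.cosine_similarity(enroll, test, dim=-1).item()
    return score >= threshold, score

# Stand-in encoder mapping a 40x200 feature matrix to a 192-d embedding;
# any of the reviewed architectures could take its place.
encoder = torch.nn.Sequential(torch.nn.Flatten(0), torch.nn.Linear(40 * 200, 192))
enroll = [torch.randn(40, 200) for _ in range(3)]       # enrollment utterances
accepted, score = verify(encoder, enroll, torch.randn(40, 200))
```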