As previously mentioned, text-independent speaker verification can be accomplished via a variety of machine learning and deep learning techniques. To justify the concentration on SVMs over other classical machine learning techniques, and to ensure that a satisfactory range of deep learning techniques is represented, a number of both are briefly surveyed before a select few are analyzed in depth.
The following deep learning-based papers serve to supplement the later papers covered in depth by showing the range and diversity of techniques employed. Jung et al. [
7] improved upon their previous work, namely that in Jung et al. [
4], by using a DNN to perform text-independent speaker verification with only raw waveforms as input. Their network, which they dubbed RawNet, uses a convolutional neural network-gated recurrent unit (CNN-GRU) architecture in which the CNN portions are ResNet blocks. The improvement over the original network comes from incorporating filter-wise feature map scaling (FMS). With this addition, they were able to obtain an EER of 2.57% on the VoxCeleb-1 evaluation set, an improvement of 2.23 percentage points over their original method. Tang et al. [
8] created a hybrid TDNN-LSTM network with a multilevel pooling strategy to obtain speaker information from the TDNN and LSTM layers. The reasoning for combining these models is that the TDNN focuses on local features while the LSTM considers global and sequential information from the entire utterance. With this system, they were able to achieve an EER of 6.61% using the Tagalog and Cantonese portions of the NIST SRE16 evaluation set and the SRE18 development set. Wang et al. [
9] examined how a wide variety of hyperparameters affect a CNN used for text-independent speaker verification, including but not limited to input embeddings, the number of enrollment utterances, the duration of enrollment utterances, and loss functions. Ultimately, they found that a greater number of enrollment utterances of longer duration gives the best results. Using this information, they were able to obtain an EER of 2.0%. Paranatti et al. [
10] used data from the LibriSpeech dataset to train a CNN on MFCC spectrograms. Their model consists of three convolutional layers, three pooling layers, two dense layers, one flatten layer, and one dropout layer. With this model, they achieved a training accuracy of 98% and a validation accuracy of 93%.
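To make the kind of architecture just described concrete, the following is a minimal sketch (in PyTorch) of a comparable layer stack: three convolutional layers, three pooling layers, a flatten layer, one dropout layer, and two dense layers applied to an MFCC spectrogram treated as a single-channel image. The channel counts, kernel sizes, input shape, and number of speakers are illustrative assumptions rather than the configuration reported in the paper.

```python
# Minimal sketch of a CNN over MFCC "images": 3 conv, 3 pool, flatten,
# dropout, and 2 dense layers. All sizes are illustrative assumptions.
import torch
import torch.nn as nn

class MfccCnn(nn.Module):
    def __init__(self, n_speakers: int = 100):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # pooling layer 1
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # pooling layer 2
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # pooling layer 3
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                         # flatten layer
            nn.Dropout(0.5),                      # dropout layer
            nn.Linear(64 * 5 * 25, 128), nn.ReLU(),  # dense layer 1
            nn.Linear(128, n_speakers),           # dense layer 2
        )

    def forward(self, x):
        # x: (batch, 1, n_mfcc=40, n_frames=200) MFCC spectrogram
        return self.classifier(self.features(x))

model = MfccCnn(n_speakers=100)
logits = model(torch.randn(8, 1, 40, 200))        # (8, 100) speaker scores
```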
A review of machine learning papers from the last five to ten years revealed a number of different techniques being used for text-independent speaker verification. Three of these were more common than the rest: GMM-UBM, SVM, and i-vector-based approaches. Song et al. [
11] tested the effect of noise-invariant frame selection, a preprocessing method, on three machine learning models: vector quantization (VQ), GMM-UBM, and an i-vector-based approach. They used the TIMIT dataset with an input of 24 features per frame (12 MFCC and 12 ΔMFCC coefficients). Their GMM-UBM achieved an EER of 5.6%, their VQ achieved 8.3%, and the i-vector-based method obtained 5.0%. Thakur et al. [
12] trained a GMM-UBM and compared it to a generic i-vector-based system using PLDA. They created their own dataset by scraping YouTube and LibriVox; it contains 50 speakers, each with more than an hour of total data split into approximately 50 one-minute utterances. They removed silence using voice activity detection and then parameterized the speech with MFCCs. Their GMM-UBM achieved an EER of 5.31%, and their i-vector system achieved an EER of 4.74%. Rakhmanenko et al. [
13] tested nine different feature sets, all of which included MFCC with 14 coefficients. The tests were conducted on a GMM-UBM with 256 mixtures trained on features extracted from an in-house dataset of 50 speakers, split evenly by gender, each with at least 6 min of speech. The best EER they obtained was 0.763%, from the feature set containing 14 MFCC coefficients and voicing probability. Ditlovich et al. [
14] compared GMM-UBMs and GMM-IBM with and without mostly voiced speech (MVS) at three different SNR levels: clean, 3 dB, and 9 dB. MVS comprises speech data that have not undergone voice activity detection to remove portions containing silence. The best results were obtained with MVS at the clean SNR level, where the GMM-UBM obtained an EER of 14.5833% and the GMM-IBM obtained 13.02%. Pinheiro et al. [
15] compared a type-2 Fuzzy GMM-UBM (T2F-GMM-UBM) and a generic GMM-UBM. A fuzzy GMM is a GMM that uses a soft clustering method, meaning that each point is assigned a score based on how strongly it correlates with a cluster; this allows points to be assigned to multiple clusters [
16]. Type 2 means that two separate variables are used to obtain the score, as opposed to the usual one. For each model architecture, they tried four different numbers of mixtures: 32, 64, 128, and 256. The best EER obtained by the T2F-GMM-UBM was 13.73% with 128 mixtures, and the best EER obtained by the GMM-UBM was 16.86% with 64 mixtures. Sarmah et al. [
17] collected data from four different microphones and trained four different GMM-UBMs, evaluating each on data from the device it was trained on as well as the other three devices. The lowest EER for matched training and testing devices was 7.50%, and the lowest EER for mismatched devices was 18.70%. Li et al. [
18] created a hybrid GMM/i-vector system. The authors combined the two because the GMM is able to find features that the tokenizer cannot. Using the English portion of the NIST SRE dataset, they were able to achieve an EER of 1.71% with their hybrid system. Das et al. [
19] created an i-vector-based system using PLDA. They used a 39-dimensional feature vector made up of 13 MFCC, 13 ΔMFCC, and 13 ΔΔMFCC coefficients extracted from an in-house dataset collected from 100 students, each with 3 min of speech data. Tests were conducted to determine the effects that different amounts of limited data have on the model. The following test utterance durations were evaluated: 2, 3, 5, 10, 15, and 20 s, as well as full utterances. In addition, 1 and 3 min of training data were compared. From these tests, they were able to achieve an EER of 11.30%. Chen et al. [
20] created an i-vector-based system using PLDA and tested its efficacy on data where enrollment and test utterances were captured on different devices. Their paper calculated EERs for many different scenarios: the lowest was 0.164% and the highest was 6.802%. Ghoniem et al. [
21] used a fuzzy HMM (FHMM) on an in-house Arabic-language dataset. The FHMM is a standard HMM that incorporates the kernel fuzzy c-means (KFCM) algorithm, which is used to compute fuzzy membership values in order to reduce information loss and increase the recognition rate. Using this FHMM, they were able to achieve a recognition rate of 98.3%. Hourri et al. [
22] devised a novel clustering method loosely based on K-means clustering: a point is assigned according to the assignment of its nearest neighboring point, rather than the nearest centroid as in standard K-means. With this scoring method, they were able to achieve an EER of 0.32%. The aforementioned papers are briefly summarized in
Table 1 to enable an easier comparison to SVMs.
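To make the comparison more concrete, the following is a minimal sketch of the GMM-UBM verification pipeline that several of the papers above share in outline: short-term cepstral features (here 12 MFCC plus 12 ΔMFCC coefficients), a universal background model, mean-only MAP adaptation to the enrolled speaker, and a log-likelihood-ratio score. The mixture count, relevance factor, file paths, and decision threshold are illustrative assumptions rather than values taken from any single reviewed system.

```python
# Minimal sketch of a GMM-UBM speaker verifier with 12 MFCC + 12 delta-MFCC
# features, mean-only MAP adaptation, and log-likelihood-ratio scoring.
import copy
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_delta(wav_path: str) -> np.ndarray:
    """Return a (frames, 24) matrix of 12 MFCC + 12 delta-MFCC features."""
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12)
    return np.vstack([mfcc, librosa.feature.delta(mfcc)]).T

def adapt_means(ubm: GaussianMixture, X: np.ndarray, r: float = 16.0):
    """Mean-only MAP adaptation of the UBM to enrollment frames X."""
    gamma = ubm.predict_proba(X)                  # (frames, mixtures)
    n = gamma.sum(axis=0) + 1e-10                 # soft counts per mixture
    first = gamma.T @ X / n[:, None]              # per-mixture data means
    alpha = (n / (n + r))[:, None]                # adaptation weights
    spk = copy.deepcopy(ubm)
    spk.means_ = alpha * first + (1.0 - alpha) * ubm.means_
    return spk

# In practice the UBM is trained on many hours of background speech;
# the file names below are placeholders.
background = np.vstack([mfcc_delta(p) for p in ["bg1.wav", "bg2.wav"]])
ubm = GaussianMixture(n_components=256, covariance_type="diag").fit(background)

speaker_model = adapt_means(ubm, mfcc_delta("enroll.wav"))
test = mfcc_delta("test.wav")
llr = speaker_model.score_samples(test).mean() - ubm.score_samples(test).mean()
accept = llr > 0.0   # threshold chosen on development data in practice
```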
The choice to concentrate on SVMs over the other machine learning techniques was driven by several factors. First, preliminary comparisons found that SVMs achieve results comparable to or better than other machine learning techniques; for example, the average EER of the reviewed SVMs is 5.2%, compared with an average of 9.075% for the GMM-UBM, the most common machine learning technique. Second, while the recent SVM literature is limited, there are still enough papers to enable a comprehensive review and analysis, whereas recent work using HMMs or K-means clustering for text-independent speaker verification is too scarce to support one. The final factor was that this lack of recent research persists despite the advantages SVMs offer, and we wished to determine whether the neglect is justified.
2.2. SVM
All of the SVM papers reviewed compare multiple feature sets or multiple kernels; it is worth noting that all of them use Mel-frequency cepstral coefficients (MFCC), with varying numbers of filters, and most use the radial basis function (RBF) kernel in their analysis. Another commonality is the use of principal component analysis (PCA) as a dimensionality reduction technique. These similarities give a sense of what methods are standard in the field.
Table 3 gives an overview of the features used by each SVM paper.
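As a concrete illustration of the recipe these commonalities suggest, the following is a minimal sketch of an MFCC-PCA-RBF-SVM verifier. Collapsing each utterance into a single mean-and-standard-deviation MFCC vector, the PCA variance threshold, the SVM hyperparameters, and the placeholder file lists are assumptions made for illustration and do not reproduce any single reviewed paper.

```python
# Minimal sketch: MFCC utterance vectors -> PCA -> RBF-kernel SVM that
# separates the target speaker from impostors. Values are illustrative.
import numpy as np
import librosa
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def utterance_vector(wav_path: str, n_mfcc: int = 13) -> np.ndarray:
    """Collapse an utterance into a single vector of mean and std MFCCs."""
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

target_paths = ["target_01.wav", "target_02.wav"]    # enrolled speaker (placeholders)
impostor_paths = ["imp_01.wav", "imp_02.wav"]        # background speakers (placeholders)

# X: one row per utterance; y: 1 for the target speaker, 0 for impostors.
X = np.stack([utterance_vector(p) for p in target_paths + impostor_paths])
y = np.array([1] * len(target_paths) + [0] * len(impostor_paths))

verifier = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),       # keep components explaining 95% of variance
    SVC(kernel="rbf", C=1.0, gamma="scale", probability=True),
)
verifier.fit(X, y)
score = verifier.predict_proba(utterance_vector("trial.wav")[None, :])[:, 1]
```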
Abdalmalak et al. [
24] compared 13 different feature sets and three different kernel functions (logistic regression, RBF, and linear) on the English Language Speech Database for Speaker Recognition (ELSDSR). After finding the feature set that scored the highest ROC AUC across all three kernels, they further improved performance by combining the trained models. They tested four different methods of combining the models: majority vote, unanimous vote, at least one, and k-out-of-N votes. Of these, the “at least one” method worked best, providing an ROC AUC of 98%; it verifies the speaker when at least one of the three models predicts the input as the target speaker. Zerget et al. [
25] tested three different feature sets and the PCA of each using SVMs with an RBF kernel. The tested feature sets are as follows: 12 MFCC coefficients plus the energy parameter and their first and second derivatives, 12 LSF coefficients, and the combination of the previous two. The testing was performed using the TIMIT dataset, and the paper’s metric of choice was the equal error rate (EER). The best result was an EER of 0.51%, obtained when using the PCA of the combined MFCC-LSF feature set. Kamruzzaman et al. [
26] compared the accuracy of SVMs trained using chunking against those trained with sequential minimal optimization (SMO). Chunking and SMO are different methods of solving the quadratic programming problem that arises during SVM training. The testing was done on an in-house dataset and found that SVMs trained using SMO are slightly more accurate (95% vs. 91.88%). Charan et al. [
23] compared three feature extraction techniques, namely MFCC, LPCC, and PLP, as well as two dimensionality reduction techniques, namely PCA and t-SNE.
The paper covers several different machine learning algorithms in addition to SVMs, but those fall outside the scope of this review. The previously listed features undergo dimensionality reduction using PCA and t-SNE, after which SVMs are trained on the reduced features. Rashno et al. [
27] compared three different SVM kernels (RBF, polynomial, and multilayer perceptron (MLP)) with two different feature selection methods, one based on a genetic algorithm (GA) and one based on ant colony optimization (ACO), on the TIMIT dataset. Ali et al. [
28] tested seven different feature sets on the Urdu dataset, each obtained via a unique feature extraction technique. They obtained their best results by combining the MFCC features with the output of a restricted Boltzmann machine run on the PCA of the audio signal’s spectrogram. Of the SVM papers that we reviewed, four used publicly accessible datasets (the Urdu dataset, ELSDSR, and TIMIT), as shown in
Table 2.
The following papers, published before 2010, also used SVMs. They are not included in the in-depth analysis applied to the newer SVM papers and instead serve to further illustrate the diversity of techniques in this field. Kharroubi et al. [
35] combined a GMM and an SVM. The GMM was trained on a 33-dimensional feature vector made up of 16 LFCC coefficients, 16 ΔLFCC coefficients, and the delta of the energy. After the GMM was trained, its output was given to the SVM, where the actual classification occurs. Using this system, they were able to achieve an EER of 16%. Gu et al. [
36] constructed six different SVMs from two kernel functions (polynomial and RBF) and three decision functions (binary, sigmoid, and unthresholded). Three SVMs were made for each kernel function, each with one of the decision functions. The best SVM they trained uses the RBF kernel and the unthresholded decision function, obtaining an EER of 2.3%. Wan et al. [
37] implemented an SVM using a score-space kernel. Score-space kernels are generalizations of the Fisher kernel and are able to discriminate between whole sequences, as opposed to frame-based kernels. Using a score-space kernel, the authors were able to achieve an EER of 4.03%. Liu et al. [
38] created a hybrid GMM/SVM system. The system works by first running the data through a GMM that is adapted from a UBM; after that, 16 MFCC and 16 ΔMFCC coefficients are taken and fed into an SVM, which performs the classification. The system was trained and tested on a subset of the NIST 2004 speaker recognition dataset, and they were able to obtain an EER of 11.92%.
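A common way to realize the GMM-to-SVM hand-off used by these hybrids is the GMM mean-supervector SVM, in which each utterance is MAP-adapted from the UBM and the stacked adapted means form a fixed-length input to the SVM. The sketch below follows that formulation; it reuses the mfcc_delta and adapt_means helpers and the ubm object from the earlier GMM-UBM sketch, and it is an illustration of the general idea rather than the exact design of either Kharroubi et al. or Liu et al.

```python
# Minimal sketch of a GMM-supervector/SVM hybrid. Relies on mfcc_delta(),
# adapt_means(), and the trained `ubm` from the GMM-UBM sketch above;
# file names are placeholders.
import numpy as np
from sklearn.svm import SVC

def supervector(ubm, wav_path: str) -> np.ndarray:
    """Stack the MAP-adapted mixture means into one fixed-length vector."""
    return adapt_means(ubm, mfcc_delta(wav_path)).means_.ravel()

target_utts = ["target_01.wav", "target_02.wav"]      # enrolled speaker
impostor_utts = ["imp_01.wav", "imp_02.wav"]          # background speakers
X = np.stack([supervector(ubm, p) for p in target_utts + impostor_utts])
y = np.array([1] * len(target_utts) + [0] * len(impostor_utts))

hybrid = SVC(kernel="linear").fit(X, y)               # SVM makes the decision
decision = hybrid.decision_function(supervector(ubm, "trial.wav")[None, :])
```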
2.3. Deep Learning
Before discussing the specifics of each deep learning technique, it is worth noting that the papers reviewed have a few things in common. The first is that every paper reports the EER, and only one of them additionally reports the ROC AUC; this was surprising, as the SVM papers show no such clear preference for one metric over the other. The second commonality is the prevalence of the VoxCeleb dataset, which is likely because it is one of the largest publicly available speech datasets, making it well suited to deep learning applications.
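Since the EER is the metric that recurs throughout, the following is a minimal sketch of how it can be computed from a set of trial scores and labels by locating the operating point where the false-acceptance and false-rejection rates cross; the scores and labels shown are placeholders.

```python
# Minimal sketch of an equal error rate (EER) computation from trial scores.
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels: np.ndarray, scores: np.ndarray) -> float:
    """labels: 1 = target trial, 0 = impostor trial; higher score = more target-like."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))      # threshold where FAR ~= FRR
    return float((fpr[idx] + fnr[idx]) / 2.0)

labels = np.array([1, 1, 0, 0, 1, 0])          # placeholder trial labels
scores = np.array([0.9, 0.7, 0.4, 0.2, 0.6, 0.5])  # placeholder trial scores
print(f"EER = {equal_error_rate(labels, scores):.2%}")
```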
Tayebi et al. [
29] created a text-independent speaker verification system using Google’s Generalized End-to-End Loss for Speaker Verification (GE2E), trained several models with different numbers of enrollment utterances, and compared their EERs to each other and to three baseline GMMs. They conducted their experiments on the LibriSpeech dataset; for training, they used the train-clean-360 subset, from which they trained six different models, each with a different number of enrollment utterances: 2, 3, 5, 7, 10, and 15. These six models were evaluated on three different subsets: test-clean, test-other, and dev-clean. Choi et al. [
30] combined a CNN-meets-vision-Transformer (CMT) with broadcasting residual learning (BRL) to create a novel architecture they call BC-CMT. They trained three of these models on the VoxCeleb-1 dataset: BC-CMT-Tiny with 273.6K parameters, BC-CMT-Small with 1.4M parameters, and BC-CMT-Base with 6.3M parameters. The models were evaluated against others using the VoxCeleb text-independent speaker verification (TI-SV) benchmark, which is composed of three sets, namely the VoxCeleb-1 original, extended, and hard test sets. Xu et al. [
31] built a CNN from ResNet and Squeeze-and-Excitation (SE) blocks with four loss functions: triplet, n-pair, angular, and softmax. The authors combined these loss functions under the hypothesis that they would complement one another. The model was trained on the VoxCeleb2 training set, which contains over a million utterances from over 6000 speakers and was chosen because it was collected in natural, noisy environments, which translates well to real-world scenarios. They compared their model to a number of other architectures using the VoxCeleb benchmark. Li et al. [
32] created a dual attention network that is trained end-to-end and evaluated on the VoxCeleb-1 database. Their model works by taking a pair of input utterances and generating utterance-level embeddings from which similarity is measured; this works because utterances from the same speaker are expected to have highly similar embeddings. Chen et al. [
33] used the Mandarin Chinese regional dataset within the Common Voice dataset to train and evaluate two 3D-CNNs for text-independent speaker verification, a lightweight one and a full-sized one. Since voice data are two-dimensional, they randomly segmented and stacked the data to make them three-dimensional. Zhang et al. [
34] used contrastive self-supervised learning (CSSL) to train a ResNet34-based model using only 5% of the labeled data from the VoxCeleb1 and 2 datasets, and were able to achieve results comparable to models using significantly more labeled data. All of the deep learning papers used publicly accessible datasets, and one of them provided publicly available code, as shown in
Table 2.
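Most of the deep systems above ultimately verify a trial by comparing fixed-length utterance embeddings. The following is a minimal sketch of that final scoring step: average the enrollment embeddings, compute the cosine similarity with the test embedding, and compare the score with a threshold. The stand-in encoder, feature shape, and threshold are illustrative assumptions standing in for any of the reviewed architectures.

```python
# Minimal sketch of embedding-based verification by cosine scoring.
import torch
import torch.nn.functional as F

def verify(encoder, enroll_utts, test_utt, threshold=0.7):
    """enroll_utts: list of feature tensors for the claimed speaker."""
    with torch.no_grad():
        enroll = torch.stack([encoder(u) for u in enroll_utts]).mean(dim=0)
        test = encoder(test_utt)
    score = F.cosine_similarity(enroll, test, dim=-1).item()
    return score >= threshold, score

# Stand-in encoder mapping a 40x200 feature matrix to a 192-d embedding;
# any of the reviewed architectures could take its place.
encoder = torch.nn.Sequential(torch.nn.Flatten(0), torch.nn.Linear(40 * 200, 192))
enroll = [torch.randn(40, 200) for _ in range(3)]       # enrollment utterances
accepted, score = verify(encoder, enroll, torch.randn(40, 200))
```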