Article

Investigation of Text-Independent Speaker Verification by Support Vector Machine-Based Machine Learning Approaches

1 Computer Science Department, Clarkson University, 8 Clarkson Ave., Potsdam, NY 13699, USA
2 Electrical & Computer Engineering Department, Clarkson University, 8 Clarkson Ave., Potsdam, NY 13699, USA
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Electronics 2025, 14(5), 963; https://doi.org/10.3390/electronics14050963
Submission received: 31 December 2024 / Revised: 15 February 2025 / Accepted: 26 February 2025 / Published: 28 February 2025

Abstract

Speaker verification is a common problem with numerous biometric security applications. Speaker verification comes in two different forms: text-independent and text-dependent. Each of these forms can be implemented via many different machine learning and deep learning techniques. From our research, we found that there is significantly less work implementing text-independent speaker verification with machine learning techniques than with deep learning techniques. Because of this gap, we were motivated to build our own SVM and CNN models for text-independent speaker verification and compare them to other systems using SVMs or deep learning techniques. We limited ourselves to SVMs because they are commonly used for speech recognition and have achieved very high accuracies. The main motivation behind this was two-fold. The first reason is to demonstrate that SVMs can be, and have been, successfully used for text-independent speaker verification at a level comparable to deep learning techniques; the second reason is to make work using SVMs for text-independent speaker verification more accessible so it can be expanded upon easily. The analysis and comparison conducted in this paper demonstrate how SVMs achieve results comparable to deep learning techniques and allow future researchers to more easily find SVMs used for text-independent speaker verification and derive a sense of what is being implemented in the field.

1. Introduction

Speaker verification plays a critical role in many biometric security systems, and it is important not to confuse it with an equally important, similar process, speaker identification. The goal of speaker verification is to confirm or deny the identity of the input speaker as the claimed speaker; this is achieved by comparing the input speech to a ground truth of the claimed identity. The goal of speaker identification is to assign an identity to the input speech; this is achieved by comparing the input speech to many ground truths of many different speakers. Another important distinction for speaker verification is that it comes in two flavors: text-dependent and text-independent [1]. Text-dependent verification depends on a specific piece of text or a predefined passcode, which is used as a baseline to compare future inputs; this approach is more restrictive, as it only works when the predefined code word is uttered. This limitation is not shared by text-independent systems, as they function by extracting the characteristics of the target’s voice that are present across different words and sounds. The advantages of text-independent speaker verification come with a cost; it is much more complicated and difficult than text-dependent speaker verification. Text-independent speaker verification is a multistep process that involves choosing between many different feature extraction methods, feature engineering, dimensionality reduction, and choosing a machine learning model. Text-dependent speaker verification, by contrast, can be as simple as comparing two waveforms, the input and the target, as in Kaczmarek et al. [2]. The complexity of using traditional machine learning techniques for text-independent speaker verification means that it is often easier to use deep learning techniques, as they perform the feature extraction automatically, as in Kok et al. [3] and Jung et al. [4]. However, deep learning techniques have a few disadvantages when compared to traditional machine learning techniques, namely that they are often more computationally expensive, as shown in Thompson et al. [5], and less explainable, as shown in Roscher et al. [6]. Herein lie the two issues that this paper aims to address. The first and primary issue is that SVMs are underutilized and under-researched in this field despite being able to perform on par with deep learning techniques while maintaining several advantages over them, i.e., increased explainability, decreased computation time, and a smaller data requirement. The second issue is that, because SVMs are underutilized and under-researched in this field, there are significantly fewer published works using them for this task, especially when it comes to recent research. This gap in research makes it all the more important that existing papers on the subject are as accessible as possible so they can be expanded upon.
With these motivating factors in mind, we create our own text-independent speaker verification SVM and CNN. With these, we conduct a comprehensive comparison and analysis of our models as well as twelve papers implementing text-independent speaker verification, six using SVMs and six using various deep learning techniques. In order to obtain the most comprehensive comparison, we limited our search to one classical machine learning model, namely support vector machines. SVMs were chosen for three reasons: they are commonly used in this field, so their applicability has already been established and there is enough available work to allow for an in-depth comparison; they perform on par with or better than other machine learning techniques in the field; and there is a lack of recent research on them despite their advantages. When comparing the papers, we kept four aspects in mind: the metrics, the specifics of the model used, the features, and the data. The metrics calculated for our study are accuracy, precision, recall, F1-score, ROC AUC, EER, MSE, and MCC. When speaking to the specifics of the SVM model, we are mainly concerned with the kernel used, except in one case where multiple SVMs are combined in a sort of voting fashion. For deep learning techniques, this concerns the model architecture. When comparing the features, we kept in mind the exact feature extraction method used as well as the parameters, e.g., the number of coefficients. When comparing the data, we kept a number of things in mind: the dataset availability, the dataset size, the gender split, and the amount of data per speaker.
In this study, which is the first of its kind to our knowledge, we comprehensively compared SVMs and various deep learning techniques used for text-independent speaker verification, with the primary goal of showing comparability between the two. We accomplish this by showing that the performance of SVMs is on par with that of various deep learning techniques and, in some cases, better. This is important because it means that the advantages that SVMs have over deep learning techniques are worth considering and that more effort should be put into researching and improving the use of SVMs in text-independent speaker verification. All of our work can be found at https://github.com/0dink/BioMed_project (accessed on 1 February 2025).

2. Relevant Work

As previously mentioned, text-independent speaker verification can be accomplished via a variety of machine and deep learning techniques. In order to justify the concentration on SVMs over other classical machine learning techniques, and to ensure that a satisfactory range of deep learning techniques is represented, a number of both are briefly surveyed before an in-depth analysis of a select few.
The following deep learning-based papers serve to supplement the later papers covered in depth by showing the range and diversity of techniques employed. Jung et al. [7] improved upon their previous work, namely that in Jung et al. [4], using a DNN to perform text-independent speaker verification with only raw waveforms as input. Their network, which they dubbed RawNet, uses a convolutional neural network-gated recurrent unit (CNN-GRU) where the CNN portions are ResNet blocks. They improved upon their original network by incorporating filter-wise feature map scaling (FMS). With this addition, they were able to obtain an EER of 2.57% on the VoxCeleb-1 evaluation set, an improvement of 2.23% over their original method. Tang et al. [8] created a hybrid TDNN-LSTM network with a multilevel pooling strategy to obtain speaker information from the TDNN and LSTM layers. The reasoning for combining these models is that the TDNN focuses on local features while the LSTM considers global and sequential information from the entire utterance. With this system, they were able to achieve an EER of 6.61% using the Tagalog and Cantonese portions of the NIST SRE16 eval test and the SRE18 dev test. Wang et al. [9] compared how a wide variety of hyperparameters affect a CNN used for text-independent speaker verification, including but not limited to input embeddings, the number of enrollment utterances, the duration of enrollment utterances, and loss functions. Ultimately, they found that a greater number of enrollment utterances of longer duration gives the best results. Using this information, they are able to obtain an EER of 2.0%. Paranatti et al. [10] used data from the LibriSpeech dataset to train a CNN on an MFCC spectrogram. Their model consists of three convolutional layers, three pooling layers, two dense layers, one flatten layer, and one dropout layer. With this model, they are able to achieve a training accuracy of 98% and a validation accuracy of 93%.
During a review of machine learning papers from the last five to ten years, a number of different techniques being used for text-independent speaker verification were found. It is worth noting that three of these techniques were more common than the rest, those being GMM-UBM, SVM, and i-vector-based approaches. Song et al. [11] tested the effect of noise invariant frame selection, a preprocessing method, on three machine learning models: vector quantization (VQ), GMM-UBM, and an i-vector-based approach. They used the TIMIT dataset with an input of 24 MFCC features (12 MFCC and 12 Δ MFCC). Their GMM-UBM achieved an EER of 5.6%, their VQ achieved 8.3%, and the i-vector-based method obtained 5.0%. Thakur et al. [12] trained a GMM-UBM and compared it to a generic i-vector-based system using PLDA. They created their own dataset by scraping YouTube and Librivox; this dataset contains 50 speakers, each with more than an hour of total data split into approximately 50 one-minute utterances. They removed any silence from the data using voice activity detection, and then they parameterized the speech with MFCC. Their GMM-UBM achieves an EER of 5.31% and their i-vector system has an EER of 4.74%. Rakhmanenko et al. [13] tested nine different feature sets, all of which included MFCC with 14 coefficients. The tests were conducted on a GMM-UBM with 256 mixtures trained on the various features extracted from an in-house dataset of 50 speakers split evenly by gender, with at least 6 min of speech each. The best EER they obtained was 0.763%, from the feature set containing 14 MFCC coefficients and voicing probability. Ditlovich et al. [14] compared GMM-UBM and GMM-IBM systems with and without mostly voiced speech (MVS) at three different SNR levels: clean, 3 dB, and 9 dB. MVS comprises speech data that have not undergone voice activity detection to remove portions containing silence. The best results are obtained with MVS at a clean SNR, where the GMM-UBM obtains an EER of 14.5833% and the GMM-IBM obtains 13.02%. Pinheiro et al. [15] compared a type-2 fuzzy GMM-UBM (T2F-GMM-UBM) and a generic GMM-UBM. A fuzzy GMM is a GMM that uses a soft clustering method, meaning that each point is assigned a score based on how strongly it correlates with a cluster; this allows points to be assigned to multiple clusters [16]. Type 2 means that two separate variables are used to obtain the score, as opposed to the usual one. For each model architecture, they tried four different numbers of mixtures: 32, 64, 128, and 256. The best EER that the T2F-GMM-UBM obtains is 13.73% with 128 mixtures, and the best EER that the GMM-UBM obtains is 16.86% with 64 mixtures. Sarmah et al. [17] collected data from four different microphones, trained four different GMM-UBMs, and evaluated each on data from the device it was trained on as well as the other three. The lowest EER for matching training and testing devices was 7.50%, and the lowest EER for mismatched devices was 18.70%. Li et al. [18] created a hybrid GMM i-vector-based system. The authors decided to combine the two because the GMM is able to find features that the tokenizer cannot. Using the English portion of the NIST SRE dataset, they were able to achieve an EER of 1.71% with their hybrid system. Das et al. [19] created an i-vector-based system using PLDA. They used a 39-dimensional feature vector made of 13 MFCC, 13 Δ MFCC, and 13 ΔΔ MFCC coefficients extracted from an in-house dataset collected from 100 students, each with 3 min of speech data.
Tests were conducted to determine the effects that different amounts of limited data have on the model. The following test utterance durations were evaluated: 2, 3, 5, 10, 15, and 20 s, as well as the full utterance. Additionally, 1 and 3 min of training data were compared. From these tests, they were able to achieve an EER of 11.30%. Chen et al. [20] created an i-vector-based system using PLDA and tested its efficacy on data where enrollment and test utterances were captured on different devices. Their paper calculated the EER of many different scenarios: the lowest was 0.164% and the highest was 6.802%. Ghoniem et al. [21] used a fuzzy HMM (FHMM) on an in-house Arabic language dataset. The FHMM was a normal HMM that incorporated the fuzzy c-means kernel (KFCM). The KFCM was incorporated to compute fuzzy membership values in order to reduce information loss and increase the recognition rate. Using this FHMM, they were able to achieve a recognition rate of 98.3%. Hourri et al. [22] devised a novel clustering method loosely based on K-means clustering, which assigns a point based on the assignment of the nearest point, as opposed to K-means clustering, which uses the nearest centroid. With this scoring method, they were able to achieve an EER of 0.32%. The aforementioned papers are briefly summarized in Table 1 to enable an easier comparison to SVMs.
The choice to concentrate on SVMs over the other machine learning techniques was affected by several factors. Preliminary comparisons found that SVMs achieve results comparable to or better than other machine learning techniques; for example, the average EER of the reviewed SVMs is 5.2%, as opposed to the average of 9.075% for the GMM-UBM, the most common machine learning technique. While limited, there are still enough recent SVM papers to enable a comprehensive review and analysis; by comparison, there is very little recent work using, for example, HMMs or K-means clustering for text-independent speaker verification. The final factor that led us to choose SVMs over other machine learning techniques was the lack of recent research on the subject despite their advantages, and we wished to determine whether this neglect is justified.

2.1. Works for Comparison

The search for relevant work was limited to papers published in the last 15 years. Our search turned up six papers that used SVMs, the most recent being Charan et al. [23], published in 2017. It is important to note that using SVMs for this task has not been common of late; they have been largely overlooked in favor of deep learning techniques. We found this surprising because the majority of the reviewed SVMs are comparable to the reviewed deep learning techniques, despite being older. To keep our comparison balanced, we chose six papers that use various deep learning techniques. These are listed in Table 2, along with an overview of their code and dataset availability.

2.2. SVM

All of the SVM papers reviewed compare multiple feature sets or multiple kernels, but it is worth noting that all of them use Mel-frequency cepstral coefficients (MFCC), with varying numbers of filters, and most use the radial basis function (RBF) kernel in their analysis. Another commonality between the SVM papers is the use of principal component analysis (PCA) as a dimensionality reduction technique. These similarities give a sense of which methods are standard in the field. Table 3 gives an overview of the features used by each SVM paper.
Abdalmalak et al. [24] compared 13 different feature sets and three different kernel functions, logistic regression, RBF, and linear, on the English Language Speech Database for Speaker Recognition (ELSDSR). After finding the feature set that scored the highest ROC AUC across all three kernels, they further improved performance by combining the trained models. They tested four different methods of combining their models: majority vote, unanimous vote, at least one, and k out of N votes. After testing all of these methods, they found that the “at least one” method worked best, providing an ROC AUC of 98%. Their chosen method works by verifying the speaker when at least one of the three models predicts the input as the target speaker. Zergat et al. [25] tested three different feature sets, and the PCA of each, using SVMs with an RBF kernel. The tested feature sets are as follows: 12 MFCC coefficients plus the energy parameter as well as the first and second derivatives, 12 LSF coefficients, and the combination of the previous two. The testing was performed using the TIMIT dataset, and the paper’s metric of choice was the equal error rate (EER). The best result they obtained was an EER of 0.51%, achieved when using the PCA of the MFCC-LSF feature set. Kamruzzaman et al. [26] compared the accuracy of SVMs trained using chunking against those trained with sequential minimal optimization (SMO). Chunking and SMO are different methods of solving the quadratic programming problem that occurs during SVM training. The testing was done on an in-house dataset and found that SVMs trained using SMO are slightly more accurate than those trained using chunking (95% vs. 91.88%). Charan et al. [23] compared three feature extraction techniques, namely MFCC, LPCC, and PLP, as well as two dimensionality reduction techniques, namely PCA and t-SNE.
The paper covers several different machine learning algorithms in addition to SVMs, but those are not within the scope of this paper. The previously listed features undergo dimensionality reduction using PCA and t-SNE; after this, SVMs are trained on the new reduced features. Rashno et al. [27] compared three different SVM kernels, RBF, polynomial, and multilayer perceptron (MLP), with two different feature selection methods, one based on a genetic algorithm (GA) and one based on ant colony optimization (ACO), on the TIMIT dataset. Ali et al. [28] tested seven different feature sets on the Urdu dataset, each obtained via a unique feature extraction technique. They obtained their best results from combining the MFCC features with the output of a restricted Boltzmann machine run on the PCA of the audio signal’s spectrogram. Of the SVM papers that we reviewed, four used publicly accessible datasets (the Urdu dataset, ELSDSR, and TIMIT), as shown in Table 2.
The following papers, published before 2010, also used SVMs. These papers are not part of the in-depth analysis applied to the newer SVM papers and instead serve to further illustrate the diversity of techniques in this field. Kharroubi et al. [35] combined a GMM and an SVM. The GMM was trained on a 33-dimensional feature vector made up of 16 LFCC coefficients, 16 Δ LFCC coefficients, and the delta of the energy. After the GMM was trained, its output was given to the SVM, where the actual classification occurs. Using this system, they were able to achieve an EER of 16%. Gu et al. [36] constructed six different SVMs from two kernel functions, polynomial and RBF, and three decision functions, namely binary, sigmoid, and unthresholded. Three SVMs were made for each kernel function, each with one of the decision functions. The best SVM they trained uses the RBF kernel and the unthresholded decision function, obtaining an EER of 2.3%. Wan et al. [37] implemented an SVM using a score-space kernel. Score-space kernels are generalized Fisher kernels and are able to discriminate between whole sequences, as opposed to frame-based kernels. Using a score-space kernel, the authors were able to achieve an EER of 4.03%. Liu et al. [38] created a hybrid GMM/SVM system. The system works by first running the data through a GMM that is adapted from a UBM; after that, 16 MFCC and 16 Δ MFCC coefficients are taken and fed into an SVM, which accomplishes the classification. The system is trained and tested on a subset of the NIST 2004 speaker recognition dataset, and they were able to obtain an EER of 11.92%.

2.3. Deep Learning

Before discussing the specifics of each deep learning technique, it is worth noting that the papers reviewed have a few things in common. EER is the only metric calculated by all but one of the papers; the remaining paper additionally calculates the ROC AUC. This was surprising, as the SVM papers do not show a preference one way or the other. The second commonality is the prevalence of the VoxCeleb dataset, likely because it is one of the largest available speech datasets, making it well suited to deep learning applications.
Tayebi et al. [29] created a text-independent speaker verification system using Google’s Generalized End-to-End Loss for Speaker Verification (GE2E), trained several models with different numbers of enrollment utterances, and compared their EERs to each other and to three baseline GMMs. They conducted their experiments on the LibriSpeech dataset; for training, they used the train-clean-360 subset, and from that, they trained six different models, each with a different number of enrollment utterances: 2, 3, 5, 7, 10, and 15. These six models were evaluated on three different subsets: test–clean, test–other, and dev–clean. Choi et al. [30] combined a CNN-meets-Vision-Transformer (CMT) architecture with broadcasting residual learning (BRL) to create a novel architecture they call BC-CMT. They trained three of these models, BC-CMT-Tiny with 273.6K parameters, BC-CMT-Small with 1.4M parameters, and BC-CMT-Base with 6.3M parameters, on the VoxCeleb-1 dataset and evaluated them against other models using the VoxCeleb text-independent speaker verification (TI-SV) benchmark, which is composed of three sets, namely VoxCeleb-1 original, extended, and hard. Xu et al. [31] built a CNN from ResNet and Squeeze-and-Excite (SE) blocks with four loss functions: triplet, n-pair, angular, and Softmax. The authors combined these loss functions because they hypothesized that the loss functions would complement each other. The model is trained on the VoxCeleb-2 training set, which contains over a million utterances from over 6000 speakers and was chosen because it was collected in natural noisy environments, which translates well to real-world scenarios. They compared their model to a number of other architectures using the VoxCeleb benchmark. Li et al. [32] created a dual attention network that is trained end-to-end and evaluated on the VoxCeleb-1 database. Their model works by taking a pair of input utterances and generating utterance-level embeddings from which the similarity is measured; this works because utterances from the same speaker are expected to have highly similar embeddings. Chen et al. [33] used the Mandarin Chinese regional dataset within the Common Voice dataset to train and evaluate two text-independent speaker verification 3D-CNNs, a lightweight one and a normal one. Since voice data are two-dimensional, they randomly segmented and stacked the data to make them three-dimensional. Zhang et al. [34] used contrastive self-supervised learning (CSSL) to train a ResNet34-based model using only 5% of the labeled data from the VoxCeleb-1 and -2 datasets and were able to achieve results comparable to models using significantly more data. All of the deep learning papers had publicly accessible datasets, and one of them had publicly available code, as shown in Table 2.

2.4. Datasets

2.4.1. ELSDSR

Abdalmalak et al. [24] used the English Language Speech Database for Speaker Recognition (ELSDSR). This dataset is composed of 22 non-native English speakers, 10 male and 12 female, each speaking nine paragraphs worth of text. The speech is recorded with a sampling rate of 16 kHz and a bit depth of 16 bits. The dataset’s authors suggest splitting the dataset into training and testing sets, where each speaker has seven paragraphs for training and two for testing. This works out to an average speech duration of 83 s for training and 17.6 s for testing [39]; it is worth noting that Abdalmalak et al. [24] used the suggested training and testing split for their work.

2.4.2. TIMIT

Zergat et al. [25] and Rashno et al. [27] used the TIMIT Acoustic–Phonetic Continuous Speech Corpus for their work. This dataset is composed of 630 speakers, 70% male and 30% female, each speaking ten sentences. The speech is recorded with a sampling rate of 16 kHz and a bit depth of 16 bits [40]. It is worth noting that Zergat et al. [25] did not use the entire dataset, using only 180 speakers. Rashno et al. [27] also did not use the entire dataset, using only 100 speakers, 72 male and 28 female.

2.4.3. VoxCeleb-1

This dataset was used by Choi et al. [30], Xu et al. [31], Zhang et al. [34], and Li et al. [32]. Choi, Zhang, and Li used this dataset for training and evaluation, while Xu used it for evaluation. The VoxCeleb-1 dataset is compiled from automatically annotated YouTube videos of celebrities. The dataset contains 153,516 utterances from 1251 celebrities, 45% female and 55% male. On average, each celebrity has 123 utterances, with the most being 250 and the least being 45. The average utterance length is 8.2 s, with the longest being 145 s and the shortest being 4 s, Nagrani et al. [41].

2.4.4. VoxCeleb-2

Xu et al. [31], Zhang et al. [34], and Choi et al. [30] used the VoxCeleb-2 dataset to train their models; Choi used it in addition to data from VoxCeleb-1, and Zhang also evaluated on this dataset. This dataset, similar to VoxCeleb-1, is generated from automatically annotated YouTube videos of celebrities. It is much larger than VoxCeleb-1, containing 1,126,246 utterances from 6112 celebrities, split 39% female and 61% male. On average, each celebrity has 185 utterances with an average length of 7.8 s, as in Chung et al. [42]. It is also worth noting that Xu et al. [31] used the entirety of the dataset for training, so evaluation was done on the VoxCeleb-1 dataset.

2.4.5. Urdu Dataset

Ali et al. [28] used the Urdu dataset [43] to train and test their models. This dataset is the first of its kind for the Urdu language, the national language of Pakistan. The dataset contains 250 of the most commonly spoken words in the Urdu language, including digits. There are 50 speakers split evenly by gender and nativity, and each speaker utters each word once in a soundproof recording studio. It is also worth noting that Ali et al. [28] used the entirety of the dataset.

2.4.6. Common Voice

Chen et al. [33] used the Common Voice dataset, specifically the Mandarin Chinese regional dataset. This dataset contains a testing and a training subset. The training subset is made up of 271 speakers with 4899 samples distributed amongst them. The testing subset contains 6945 samples distributed over 170 speakers. It is also worth noting that Chen et al. [33] used the entirety of the dataset.

3. Materials and Methods

3.1. Dataset

For our work, we used the LibriSpeech dataset. We chose this dataset over the other datasets mentioned in this paper for a number of reasons. LibriSpeech has the best gender split of any of the datasets, with a nearly perfect 50–50 male–female ratio. Additionally, it is sufficiently large, second only to VoxCeleb-2, which is more than twice as large, Nagrani et al. [44]. The third reason it was chosen over VoxCeleb-2 is that the data were collected from audiobooks, meaning that they were collected in a controlled environment. This enabled us to use a simple denoising technique that saved time, as opposed to more complicated methods such as deep learning-based denoising autoencoders.
The LibriSpeech dataset was created by taking an 87 GB subset of the LibriVox project and filtering out samples with excessive audio degradation and samples containing multiple speakers. After preprocessing, the dataset is composed of 2484 different subjects with a roughly 50–50 split by gender. This filtered dataset contains about 1000 h of speech sampled at 16 kHz, adding up to a total of 61 GB. Due to the dataset’s large size, it is divided into seven subsets: dev–clean, test–clean, dev–other, test–other, train–clean–100, train–clean–360, and train–other–500, which contain 5.4, 5.4, 5.3, 5.1, 100.6, 363.6, and 496.7 h of audio recording, respectively. These subsets are divided into clean and other; those in the clean category scored a low word error rate (WER) when fed through an acoustic model trained on the WSJ si-84 data subset [45].
To train and test the models, the train–clean–100 subset was chosen, for two reasons. The first is that it is considered clean, meaning that the speech is easier to make out and contains less background noise. The second reason is its size; train–clean–360 is much bigger than needed, so there was no reason to deal with its large overhead, and test–clean was too small, containing only eight minutes of audio per speaker. Ultimately, this left only train–clean–100, which is still larger than what was needed, but it provided as much data as would ever realistically be needed, as it contains 251 speakers, each speaking for roughly 25 min.
As previously mentioned, the train–clean–100 subset contained more data than needed, so two subsets were created. The first subset contained 20 arbitrarily chosen speakers split evenly by gender, each with 5 min of audio. The second subset contained every speaker within train–clean–100 with at least 16.6 min of speech. This worked out to 245 speakers with a roughly 50–50 split by gender. The first, smaller subset was used to compare different feature sets, as its small size kept extraction time low, allowing different combinations of features to be rapidly tested. Once an optimal set of features was found, the features of the second, larger subset were computed.

Preprocessing

Before feature extraction could occur, two preprocessing steps had to be performed: denoising and segmentation. The speech data were denoised using a technique called spectral gating.
At a high level, spectral gating works by discarding any time-frequency component below a threshold for each frequency component of the signal. It does this by computing the mean and standard deviation for each frequency channel of a Short-Time Fourier Transform (STFT). A threshold, or gate, for each frequency component is then set at some level above the mean. This threshold determines whether a given time-frequency component in the spectrogram is considered to be signal or noise; the spectrogram is masked based on this threshold and finally inverted back into the time domain with an inverse STFT [46].
During this step, care was taken not to be too aggressive with the noise reduction, as there was very little noise in the data. To ensure that the denoising did not remove important data, several different levels of denoising were tested by changing the proportion of the spectral gating mask that was applied. After trial and error, it was determined that an 80% decrease was ideal. The average change in SNR from cleaning is a decrease of 22% with a standard deviation of 8%; Table 4 gives a more detailed breakdown.
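The paper does not state which spectral gating implementation was used, but the cited method [46] is the one behind the noisereduce Python library, whose prop_decrease argument corresponds to the proportion of the mask that is applied. A minimal sketch under that assumption:
```python
import librosa
import noisereduce as nr  # spectral gating, per the method described in [46]

def denoise(path: str, sr: int = 16000, prop: float = 0.8):
    """Load a recording and apply spectral gating, attenuating masked
    time-frequency components by 80% rather than removing them outright."""
    audio, _ = librosa.load(path, sr=sr)
    return nr.reduce_noise(y=audio, sr=sr, prop_decrease=prop)
```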
The second preprocessing step was to divide the audio into one-second increments so that speakers could be easily compared to one another. The one-second increments were not chosen arbitrarily. From testing, it was determined that audio segments as small as 0.2 s could work when used for text-dependent speaker verification. Keeping in mind that text-independent speaker verification needs more data, it was obvious that longer segments would be needed, but this gave a good starting point. Initially, four-second segments were tested; these worked well but were computationally expensive for the feature extractor. The segment lengths were iteratively shortened by one second until one-second segments were reached; these were significantly cheaper computationally and had the same effect on the model as the four-second segments.
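As an illustration, a one-second segmentation at LibriSpeech’s 16 kHz sampling rate amounts to slicing the denoised waveform into 16,000-sample blocks. The sketch below assumes non-overlapping segments and drops any trailing remainder, details the text does not specify:
```python
import numpy as np

SR = 16000          # LibriSpeech sampling rate (Hz)
SEG_LEN = 1 * SR    # one-second segments

def segment(signal: np.ndarray) -> np.ndarray:
    """Split a waveform into non-overlapping one-second segments,
    discarding the final partial segment."""
    n = len(signal) // SEG_LEN
    return signal[: n * SEG_LEN].reshape(n, SEG_LEN)
```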

3.2. Features

Features were extracted using Mel-frequency cepstral coefficients (MFCC). There are a number of alternatives, such as LPCC, PLP, and BFCC, that are used in the SVM papers we review. However, MFCC was chosen as the feature extraction method because it is based on the Mel scale, which approximates the human auditory system, making it well suited to speech analysis applications. Additionally, it gave us results that showed comparability, which was our goal. From each speech segment, 20 coefficients were calculated; 20 was chosen instead of the traditional 13 because testing with the Mann–Whitney and Kruskal–Wallis tests showed a statistically significant decrease in the quality of SVMs and CNNs trained with only 13 coefficients (Table 5).
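A minimal sketch of the extraction step, assuming librosa with its default frame and hop lengths (the paper does not name the extraction library or the window settings):
```python
import librosa
import numpy as np

def extract_mfcc(segment: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Compute 20 MFCCs for a one-second segment.
    Returns an array of shape (20, n_frames)."""
    return librosa.feature.mfcc(y=segment, sr=sr, n_mfcc=20)
```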
In order to further refine the feature set, it was necessary to determine whether any features were highly correlated and could be removed with principal component analysis (PCA). PCA was chosen over t-SNE and UMAP for a number of reasons. The first reason is that PCA is deterministic, which allows for the best comparison when testing between different feature sets. The second reason pertains only to t-SNE, which is significantly slower than PCA for large datasets like the one we are using. A heat map of the Spearman correlation matrix was generated to visualize the correlations (Figure 1). As shown in Figure 1, there are quite a few highly correlated features, indicating that the feature set should undergo PCA. To compute the independent feature set, PCA with a variance threshold of 95% was applied. In addition to correcting the correlation, the independent feature set is 50% smaller.
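A sketch of this refinement step using scipy and scikit-learn; standardizing the features before PCA is our assumption, as the paper does not state how the features were scaled:
```python
from scipy.stats import spearmanr
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def refine_features(X):
    """X: (n_segments, n_features). Returns the Spearman correlation matrix
    (what Figure 1 visualizes) and a PCA-reduced feature set that retains
    95% of the variance."""
    corr, _ = spearmanr(X)                      # pairwise feature correlations
    X_std = StandardScaler().fit_transform(X)
    X_reduced = PCA(n_components=0.95).fit_transform(X_std)
    return corr, X_reduced
```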
Ultimately, the models did well because the features were intelligently chosen. The maximum and minimum values give the model important information about the range of energy in the speech sample, allowing the model to derive a sense of how lethargic or energetic the speaker’s speech is. The mean allows the model to derive a sense of where the majority of the energy is within the signal. The median does this as well but is immune to outliers, which may throw off the mean, such as any noise that was missed in the denoising process. The standard deviation gives the model an idea of how spread out the energy within the speech is. The skew can reveal whether there is more low energy or high energy in the coefficients by measuring the symmetry of the distribution. Finally, there is kurtosis, which can show the model how much variation there is within the speech.
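Our reading of this description is that each of the 20 coefficients is summarized across the frames of a segment with these seven statistics; the sketch below reflects that interpretation and should be checked against the authors’ repository for the exact feature layout:
```python
import numpy as np
from scipy.stats import kurtosis, skew

def summarize(mfcc: np.ndarray) -> np.ndarray:
    """Summarize a (20, n_frames) MFCC matrix with max, min, mean, median,
    standard deviation, skew, and kurtosis per coefficient (20 x 7 values)."""
    return np.concatenate([
        mfcc.max(axis=1), mfcc.min(axis=1), mfcc.mean(axis=1),
        np.median(mfcc, axis=1), mfcc.std(axis=1),
        skew(mfcc, axis=1), kurtosis(mfcc, axis=1),
    ])
```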

3.3. Model

To obtain the best comparison possible, we trained our own SVM and CNN. Training our own models allowed us to do a number of things we would not otherwise have been able to do. It allowed us to directly compare the computation time of an SVM and a deep learning model, and to compare an SVM and a deep learning model on the exact same feature set. The models also gave us baselines, enabling a comparison of two papers that do not compute the same metrics. To give a little background, SVMs are supervised learning models used for classification; they can perform linear separation as well as nonlinear separation using what is called the kernel trick, which works by mapping the data to a higher dimension where they are linearly separable [47]. CNNs are supervised feed-forward neural networks usually made up of three main layer types: convolutional, pooling, and fully connected [48].

3.3.1. Hyperparameters

To ensure that the results from the SVM and CNN were satisfactory and comparable to the papers reviewed, several different combinations of hyperparameters were tested. For the sake of brevity, and to avoid unnecessarily expanding the scope of the paper, as only a demonstration of comparability is necessary for our case, this section will be kept succinct. Nineteen different combinations of hyperparameters were tested for the SVM, including four different kernels: linear, polynomial, sigmoid, and RBF. The linear kernel did the worst, followed by sigmoid and then polynomial, which varied depending on the specified degree. The kernel that worked best was RBF. In an attempt to further improve results, the regularization and kernel coefficients were modified, but this resulted either in a severe decrease in metrics or in a negligible increase in one metric while other metrics saw moderate or negligible decreases. In the end, the default hyperparameters, as given by scikit-learn’s implementation, gave the best results. In the case of the CNN, 14 different hyperparameter combinations were tested; it is worth noting that, unlike the SVM’s hyperparameter tuning, the competing models were compared to each other on a fraction of the whole dataset to allow for rapid iteration. When comparing the competing models, three different activation functions were tested: sigmoid, Softmax, and ReLU. While ReLU ended up being the most effective activation function, most of the gains were achieved by changing the number and shape of layers until we obtained the network shown in Figure 2.
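For reference, the winning SVM configuration therefore reduces to scikit-learn’s defaults with an RBF kernel; a minimal sketch is shown below (probability=True is our addition, needed only if ROC AUC is computed from predicted probabilities rather than decision scores):
```python
from sklearn.svm import SVC

# scikit-learn defaults: C=1.0, gamma='scale'
svm = SVC(kernel="rbf", probability=True)
```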

3.3.2. Training

As previously mentioned, each speaker’s audio was divided into one-second segments, resulting in around 1500 segments per speaker, as each speaker had roughly twenty-five minutes of data. Out of these 1500 segments, 1000 segments, or 16.6 min worth, were chosen at random. The training was conducted using k-fold cross-validation with five folds. K-fold cross-validation was chosen over nested cross-validation because the hyperparameters had already been tuned, and five folds were chosen because increasing the number of folds had no notable effect on the computed metrics. The CNN training used ten epochs with a batch size of 50.
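A sketch of the per-speaker training loop under these choices; how the impostor (non-target) segments were sampled is not described here, so the labeling below is an assumption:
```python
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.svm import SVC

def cross_validate_speaker(X, y):
    """X: feature vectors for the target speaker's 1000 segments plus impostor
    segments; y: 1 for the target speaker, 0 otherwise. Returns mean scores
    over five stratified folds."""
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    scores = cross_validate(SVC(kernel="rbf"), X, y, cv=cv,
                            scoring=["accuracy", "precision", "recall", "f1"])
    return {k: v.mean() for k, v in scores.items()}
```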

4. Results

For both the SVM and CNN, each speaker had their own set of models, and from each set of models, the following metrics were calculated: accuracy, precision, recall, F1-score, ROC AUC, EER, MSE, and MCC. With k-fold cross-validation, each speaker had five models, one per fold, and for each of the five models the aforementioned eight metrics were calculated. These were then averaged together, and from those averages, a five-number summary and mean were calculated, as shown in Table 6 and Table 7. In addition, the ROC curves were generated, as can be seen in Figure 3 and Figure 4.
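Most of these metrics are available directly in scikit-learn; EER is not, but it can be read off the ROC curve as the operating point where the false-positive and false-negative rates meet. A sketch of that calculation (a nearest-threshold approximation; the authors’ exact implementation may differ):
```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(y_true, scores) -> float:
    """EER: the rate at the ROC threshold where FPR is closest to FNR."""
    fpr, tpr, _ = roc_curve(y_true, scores)
    fnr = 1.0 - tpr
    i = np.nanargmin(np.abs(fnr - fpr))
    return float((fpr[i] + fnr[i]) / 2.0)
```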

5. Discussion and Conclusions

In this paper, we conduct a comprehensive comparison and summarization of 14 models that implement text-independent speaker verification. Seven are SVMs, one of which is of our own design, and seven are deep learning techniques, one of which is a CNN of our own design. This process consisted of summarizing the methodology of all the models and the datasets they used. In addition to the summarization, we made several tables that compare the metrics, features, models, data, and dataset availability.
This paper had two main motivations. The primary motivation was to show that SVMs can achieve results comparable to deep learning methods in the field of text-independent speaker verification and outperform them in certain areas, namely explainability and computational efficiency. The secondary motivation was to compile works in which SVMs are used for text-independent speaker verification, to make these works more accessible and to highlight how limited this category is. Accomplishing the second goal was straightforward: we found, compiled, and analyzed works that use SVMs for text-independent speaker verification. The primary goal was more challenging; the main difficulty was finding a way to compare all of the works with each other, as many of them used different datasets, methods, features, and metrics. The best way to overcome these differences and enable the best possible comparison was to make comprehensive tables covering all of the works in detail; see Table 8 and Table 9.

5.1. Comparability

The first obstacle when comparing the SVM papers to the deep learning papers is that only two of the six SVM papers computed the EER: Zergat et al. [25] and Rashno et al. [27]. This made it difficult to compare the other SVM papers, because EER was used exclusively by all except one deep learning paper. So while Zergat et al. [25] and Rashno et al. [27] can be directly compared, the other four papers cannot, at least not easily. Comparing Abdalmalak et al. [24] was more challenging, as they only computed the ROC AUC. However, from the ROC AUC we are able to estimate the EER, because a high ROC AUC implies a low EER; see Figure 5, which illustrates the relationship. This relationship gives us two optimization problems for each curve, in which we look for the highest and lowest EER consistent with the AUC constraint. Solving these, we can calculate the range of possible EER values for a given ROC AUC, which can be seen in Table 8.
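One way to make this bound explicit: any ROC curve is monotone and passes through the EER point (EER, 1 - EER), so its area satisfies (1 - EER)^2 <= AUC <= 1 - EER^2. Inverting these inequalities gives an admissible EER interval for a reported AUC. This is our formalization of the optimization described above, not necessarily the authors’ exact procedure:
```python
import math

def eer_bounds(auc: float) -> tuple[float, float]:
    """Smallest and largest EER consistent with a monotone ROC curve of the
    given AUC: 1 - sqrt(AUC) <= EER <= sqrt(1 - AUC)."""
    return 1.0 - math.sqrt(auc), math.sqrt(1.0 - auc)

# Example: an ROC AUC of 0.98 admits an EER between roughly 1.0% and 14.1%.
print(eer_bounds(0.98))
```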
Finding a way to compare the final three SVM papers, namely Kamruzzaman et al. [26], Charan et al. [23], and Ali et al. [28], posed the greatest challenge, as there is no way to estimate EER from accuracy alone, but this is where the SVM model we trained comes into play. Since we calculate both the accuracy and the EER for our SVM, we have a rough idea of what EER a given accuracy corresponds to. So, for the SVM with an accuracy of 95%, as in Kamruzzaman et al. [26], we can reasonably say that their model had a similar EER, as their accuracy is within 2 percentage points of ours. The same can be said of the SVM by Ali et al. [28], which has an accuracy of 92.6%, within 4 percentage points of ours. Conversely, since the accuracies of the SVMs in Charan et al. [23] are significantly lower, it is safe to say that their EER is significantly higher.
Keeping all of this in mind, we are able to accurately and reliably compare the SVMs from Zergat et al. [25], Rashno et al. [27], and Abdalmalak et al. [24], the last of which is possible because of the ability to estimate EER from ROC AUC. The remaining three papers, Kamruzzaman et al. [26], Ali et al. [28], and Charan et al. [23], cannot be compared as reliably because EER cannot be obtained from raw accuracy, but we are able to derive a good idea of how they compare based on the SVM created in this study.

5.1.1. Comparison

As previously mentioned, we are able to reliably compare the SVMs of Abdalmalak et al. [24], Zergat et al. [25], Rashno et al. [27], and our own, as they all have a calculated EER. Based on the metrics in Table 8 and Table 9, we can conclude that SVMs are capable of performing on par with or better than deep learning techniques, as in the case of Zergat et al. [25], whose EER is lower than that of all the deep learning papers reviewed. Table 8 and Table 9 also show that all of the SVMs used less data than their deep learning counterparts, which was not an advantage of SVMs that we had anticipated, but it is nonetheless present. We believe this is because SVMs require careful feature extraction and engineering to work properly, so they can get away with less data, as the data they are given are more carefully curated. This is opposed to deep learning techniques, which are able to perform some of the feature extraction and engineering themselves, given a sufficiently large dataset. The final advantage we found SVMs to have over deep learning techniques was their computational efficiency. Unfortunately, none of the papers reviewed disclosed training times for their classifiers, so we were only able to compare the training times of the SVM and CNN from our study. Training the CNN took 41 min and 43.2 s on a machine with an AMD Ryzen 9 7950X 16-core processor, an NVIDIA GeForce RTX 3060 Ti with 8 GB of VRAM, and 64 GB of DDR5 RAM. This is in contrast to the SVM, which took only 1 min and 0.5 s to train on the same machine.

5.1.2. Future Work

While our comparison is comprehensive, there is room for future work. The first and most obvious area of improvement is to increase the number of papers reviewed, but this may be more challenging than it would initially appear, as there are not many recent SVM papers on this subject. The second area for improvement could be to divide the deep learning techniques by architecture; in this vein, one could also include the alternative machine learning techniques in the in-depth review. The final area for improvement would be to recreate the code from the reviewed papers, as this would allow for a better analysis and comparison of the computational cost of SVMs and deep learning techniques.
Besides ways in which our review and comparison may be expanded, the reviewed papers listed a number of avenues for future work. Rashno et al. [27] suggested that intrinsic properties of data, such as relief weight, may be used in conjunction with particle swarm optimization for faster convergence in feature selection algorithms. Kamruzzaman et al. [26] suggested that HMMs may be used to improve segmentation when cross-talk and laughter are present in the data. Additionally, we have a few thoughts on potential areas for future work on SVMs for text-independent speaker verification spurred by our review. We observed that SVMs need significantly less data than deep learning techniques to obtain comparable results; it would be prudent to further investigate how far this difference can be pushed. During the literature review, we saw that many papers combine two or more different models to obtain improved results; however, we did not see any papers combining SVMs with neural networks, so it would be prudent to investigate whether such a union would be advantageous. Finally, we saw that there is a large reliance on MFCC as a feature extraction technique, and while MFCC works well, it would be prudent to conduct an in-depth study on the effects that feature sets have on SVMs, as the papers we found did not review more than six different feature sets.

Author Contributions

Conceptualization, O.K. and M.I.; methodology, O.K. and M.I.; software, O.K.; validation, O.K. and M.I.; formal analysis, O.K.; investigation, O.K.; resources, O.K. and M.I.; data curation, O.K.; writing—original draft preparation, O.K.; writing—review and editing, O.K. and M.I.; visualization, O.K.; supervision, M.I.; project administration, O.K. and M.I. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The code presented in this study is openly available in “BioMed_project” at https://github.com/0dink/BioMed_project (accessed on 1 February 2025). The data used in this study are openly available in “librispeech” at https://www.openslr.org/12 (accessed on 10 October 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
SVM: Support Vector Machine
ROC: Receiver Operating Characteristic
AUC: Area Under Curve
RBF: Radial Basis Function
EER: Equal Error Rate
MSE: Mean Square Error
MCC: Matthews Correlation Coefficient
SNR: Signal-to-Noise Ratio
GMM: Gaussian Mixture Model
UBM: Universal Background Model
IBM: Individual Background Model
HMM: Hidden Markov Model
ACO: Ant Colony Optimization
DBN: Deep Belief Network
PLDA: Probabilistic Linear Discriminant Analysis
LSTM: Long Short-Term Memory
TDNN: Time Delay Neural Network
VQ: Vector Quantization

References

  1. Bäckström, T.; Räsänen, O.; Zewoudie, A.; Zarazaga, P.P.; Koivusalo, L.; Das, S.; Mellado, E.G.; Mansali, M.B.; Ramos, D.; Kadiri, S.; et al. Introduction to Speech Processing, 2nd ed.; Aalto University: Espoo, Finland, 2022. [Google Scholar] [CrossRef]
  2. Kaczmarek, A.; Staworko, M. Application of dynamic time warping and cepstrograms to text-dependent speaker verification. In Proceedings of the Signal Processing Algorithms, Architectures, Arrangements, and Applications SPA 2009, Poznan, Poland, 24–26 September 2009. [Google Scholar]
  3. Kok, C.L.; Ho, C.K.; Aung, T.H.; Koh, Y.Y.; Teo, T.H. Transfer Learning and Deep Neural Networks for Robust Intersubject Hand Movement Detection from EEG Signals. Appl. Sci. 2024, 14, 8091. [Google Scholar] [CrossRef]
  4. Jung, J.; Heo, H.-S.; Kim, J.; Shim, H.; Yu, H.-J. RawNet: Advanced end-to-end deep neural network using raw waveforms for text-independent speaker verification. arXiv 2019, arXiv:1904.08104. [Google Scholar]
  5. Thompson, N.C.; Greenewald, K.; Lee, K.; Manso, G.F. The Computational Limits of Deep Learning. arXiv 2022, arXiv:2007.05558. [Google Scholar]
  6. Roscher, R.; Bohn, B.; Duarte, M.F.; Garcke, J. Explainable Machine Learning for Scientific Insights and Discoveries. IEEE Access 2020, 8, 42200–42216. [Google Scholar] [CrossRef]
  7. Jung, J.; Kim, S.; Shim, H.; Kim, J.; Yu, H.-J. Improved RawNet with Feature Map Scaling for Text-independent Speaker Verification using Raw Waveforms. arXiv 2020, arXiv:2004.00526. [Google Scholar]
  8. Tang, Y.; Ding, G.; Huang, J.; He, X.; Zhou, B. Deep Speaker Embedding Learning with Multi-level Pooling for Text-independent Speaker Verification. In Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019. [Google Scholar] [CrossRef]
  9. Wang, S.; Huang, Z.; Qian, Y.; Yu, K. Discriminative Neural Embedding Learning for Short-Duration Text-Independent Speaker Verification. IEEE/ACM Trans. Audio Speech Lang. Process. 2019, 27, 1686–1696. [Google Scholar] [CrossRef]
  10. Paranatti, C.S.; Bhandari, R.R.; Patil, T.M.; Chikkamath, S.; Nirmala, S.R.; Budihal, S. Speaker Verification: A Raw Waveform Approach for Text Independent using CNN. In Proceedings of the 2024 3rd International Conference for Innovation in Technology (INOCON), Bangalore, India, 1–3 March 2024. [Google Scholar] [CrossRef]
  11. Song, S.; Zhang, S.; Schuller, B.W.; Shen, L.; Valstar, M. Noise Invariant Frame Selection: A Simple Method to Address the Background Noise Problem for Text-independent Speaker Verification. In Proceedings of the 2018 International Joint Conference on Neural Networks (IJCNN), Rio de Janeiro, Brazil, 8–13 July 2018. [Google Scholar] [CrossRef]
  12. Thakur, K.; Bhukya, R.K. Speaker Authentication Using GMM-UBM. In Proceedings of the 2022 IEEE 9th Uttar Pradesh Section International Conference on Electrical, Electronics and Computer Engineering (UPCON), Prayagraj, India, 2–4 December 2022. [Google Scholar] [CrossRef]
  13. Rakhmanenko, I.; Meshcheryakov, R. Speech Features Evaluation for Small Set Automatic Speaker Verification Using GMM-UBM System. In Speech and Computer; Springer: Cham, Switzerland, 2016. [Google Scholar]
  14. Ditlovich, V.; Bistritz, Y. Speaker verification with mostly voiced speech for GMM/UBM and GMM/IBM systems. In Proceedings of the 2017 IEEE First Ukraine Conference on Electrical and Computer Engineering (UKRCON), Kyiv, Ukraine, 29 May–2 June 2017. [Google Scholar] [CrossRef]
  15. Pinheiro, H.N.B.; Vieira, S.R.F.; Ren, T.I.; Cavalcanti, G.D.C.; de Mattos Neto, P.S.G. Type-2 fuzzy GMM for text-independent speaker verification under unseen noise conditions. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016. [Google Scholar] [CrossRef]
  16. MathWorks. Available online: https://www.mathworks.com/help/stats/clustering-using-gaussian-mixture-models.html (accessed on 15 January 2025).
  17. Sarmah, K.; Bhattacharjee, U. Text-Independent Multi-Sensor Speaker Verification System. Int. J. Comput. Sci. Eng. 2015, 4, 7–16. Available online: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2657491 (accessed on 25 February 2025).
  18. Li, M.; Liu, L.; Cai, W.; Liu, W. Generalized I-vector Representation with Phonetic Tokenizations and Tandem Features for both Text Independent and Text Dependent Speaker Verification. J. Sign. Process. Syst. 2016, 82, 207–215. [Google Scholar] [CrossRef]
  19. Das, R.K.; Jelil, S.; Prasanna, S.R.M. Significance of constraining text in limited data text-independent speaker verification. In Proceedings of the 2016 International Conference on Signal Processing and Communications (SPCOM), Bangalore, India, 12–15 June 2016. [Google Scholar] [CrossRef]
  20. Chen, L.; Kong, A.L.; Ma, B.; Ma, L.; Li, H.; Dai, L.R. Adaptation of PLDA for multi-source text-independent speaker verification. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017. [Google Scholar] [CrossRef]
  21. Ghoniem, R.M.; Shaalan, K. A Novel Arabic Text-independent Speaker Verification System based on Fuzzy Hidden Markov Model. Procedia Comput. Sci. 2017, 117, 274–286. [Google Scholar] [CrossRef]
  22. Hourri, S.; Kharroubi, J. A Novel Scoring Method Based on Distance Calculation for Similarity Measurement in Text-Independent Speaker Verification. Procedia Comput. Sci. 2019, 148, 256–265. [Google Scholar] [CrossRef]
  23. Charan, R.; Manisha, A.; Karthik, R.; Kumar, M.R. A text-independent speaker verification model: A comparative analysis. arXiv 2017, arXiv:1712.00917. [Google Scholar]
  24. Abdalmalak, K.A.; Gallardo-Antolín, A. Enhancement of a text-independent speaker verification system by using feature combination and parallel structure classifiers. Neural Comput. Appl. 2018, 29, 637–651. [Google Scholar] [CrossRef]
  25. Zergat, K.Y.; Amrouche, A. Robust Support Vector Machines for Speaker Verification Task. arXiv 2013, arXiv:1306.2906. [Google Scholar]
  26. Kamruzzaman, S.M.; Karim, A.N.M.R.; Islam, M.S.; Haque, M.E. Speaker Identification using MFCC-Domain Support Vector Machine. arXiv 2010, arXiv:1009.4972. [Google Scholar]
  27. Rashno, A.; Ahadi, S.M.; Kelarestaghi, M. Text-independent speaker verification with ant colony optimization feature selection and support vector machine. In Proceedings of the 2015 2nd International Conference on Pattern Recognition and Image Analysis (IPRIA), Rasht, Iran, 11–12 March 2015. [Google Scholar] [CrossRef]
  28. Ali, H.; Tran, S.N.; Benetos, E.; d’Avila Garcez, A.S. Speaker recognition with hybrid features from a deep belief network. Neural Comput. Appl. 2018, 29, 13–19. [Google Scholar] [CrossRef]
  29. Arasteh, S.T. An Empirical Study on Text-Independent Speaker Verification based on the GE2E Method. arXiv 2022, arXiv:2011.04896. [Google Scholar]
  30. Choi, J.-H.; Yang, J.-Y.; Jeoung, Y.-R.; Chang, J.-H. Improved CNN-Transformer Using Broadcasted Residual Learning for Text-Independent Speaker Verification; ISCA Interspeech: Incheon, Republic of Korea, 2022. [Google Scholar]
  31. Xu, J.; Wang, X.; Feng, B.; Liu, W. Deep multi-metric learning for text-independent speaker verification. Neurocomputing 2020, 410, 394–400. [Google Scholar] [CrossRef]
  32. Li, J.; Lee, T. Text-Independent Speaker Verification with Dual Attention Network. arXiv 2020, arXiv:2009.05485. [Google Scholar]
  33. Chen, J.Y.; Jeng, J.T. Text-Independent Speaker Verification Using Lightweight 3D Convolutional Neural Networks. In Proceedings of the 2024 International Conference on System Science and Engineering (ICSSE), Hsinchu, Taiwan, 26–28 June 2024. [Google Scholar] [CrossRef]
  34. Zhang, H.; Zou, Y.; Wang, H. Contrastive Self-Supervised Learning for Text-Independent Speaker Verification. In Proceedings of the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021. [Google Scholar] [CrossRef]
35. Kharroubi, J.; Petrovska-Delacrétaz, D.; Chollet, G. Combining GMM’s with Support Vector Machines for Text-Independent Speaker Verification; ISCA Eurospeech: Aalborg, Denmark, 2001. [Google Scholar]
  36. Gu, Y.; Thomas, T. A Text-Independent Speaker Verification System Using Support Vector Machines Classifier; ISCA Eurospeech: Aalborg, Denmark, 2001. [Google Scholar]
  37. Wan, V.; Renals, S. Speaker verification using sequence discriminant support vector machines. IEEE Trans. Speech Audio Process. 2005, 13, 203–210. [Google Scholar] [CrossRef]
  38. Liu, M.; Xie, Y.; Yao, Z.; Dai, B. A New Hybrid GMM/SVM for Speaker Verification. In Proceedings of the 18th International Conference on Pattern Recognition (ICPR’06), Hong Kong, China, 20–24 August 2006. [Google Scholar] [CrossRef]
  39. Feng, L.; Hansen, L.K. A New Database for Speaker Recognition; Technical University of Denmark: Kongens Lyngby, Denmark, 2004; Available online: https://orbit.dtu.dk/en/publications/a-new-database-for-speaker-recognition (accessed on 25 February 2025).
  40. Garofolo, J.S.; Lamel, L.F.; Fisher, W.M.; Fiscus, J.G.; Pallett, D.S.; Dahlgren, N.L.; Zue, V. TIMIT Acoustic-Phonetic Continuous Speech Corpus; UPENN: Philadelphia, PA, USA, 1993. [Google Scholar] [CrossRef]
  41. Nagrani, A.; Chung, J.S.; Zisserman, A. VoxCeleb: A Large-Scale Speaker Identification Dataset; ISCA Interspeech: Stockholm, Sweden, 2017. [Google Scholar] [CrossRef]
  42. Chung, J.S.; Nagrani, A.; Zisserman, A. VoxCeleb2: Deep Speaker Recognition; ISCA Interspeech: Hyderabad, India, 2018. [Google Scholar] [CrossRef]
  43. Ali, H.; Yahya, K.M.; Ahmad, N.; Farooq, O. A Medium Vocabulary Urdu Isolated Words Balanced Corpus for Automatic Speech Recognition. In Proceedings of the 2012 International Conference on Electronics Computer Technology, Kanyakumari, India, 6–8 April 2012. [Google Scholar]
  44. Nagrani, A.; Chung, J.S.; Xie, W.; Zisserman, A. Voxceleb: Large-scale speaker verification in the wild. Comput. Speech Lang. 2020, 60, 101027. [Google Scholar] [CrossRef]
  45. Panayotov, V.; Chen, G.; Povey, D.; Khudanpur, S. Librispeech: An ASR corpus based on public domain audio books. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, QLD, Australia, 19–24 April 2015. [Google Scholar] [CrossRef]
  46. Sainburg, T.; Gentner, T.Q. Toward a Computational Neuroethology of Vocal Communication: From Bioacoustics to Neurophysiology, Emerging Tools and Future Directions. Front. Behav. Neurosci. 2021, 15, 811737. [Google Scholar] [CrossRef] [PubMed]
  47. IBM. What Is Support Vector Machine?|IBM—ibm.com. Available online: https://www.ibm.com/think/topics/support-vector-machine (accessed on 28 January 2025).
  48. IBM. What Are Convolutional Neural Networks?|IBM—ibm.com. Available online: https://www.ibm.com/think/topics/convolutional-neural-networks (accessed on 28 January 2025).
Figure 1. Spearman correlation matrix without principal component analysis.
Figure 2. Our CNN architecture.
Figure 3. ROC curves of speaker verification SVMs. Each color represents the curve from one speaker's model.
Figure 4. ROC curves of speaker verification CNNs. Each color represents the curve from one speaker's model.
Figure 5. Relationship between ROC AUC and EER.
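Figure 5 visualizes the empirical relationship between ROC AUC and EER across the per-speaker models. As an illustrative aid only, the sketch below shows how both quantities can be computed from a common set of verification scores with scikit-learn and NumPy; the scores are synthetic placeholders, not data from this study.

```python
# Illustrative only: synthetic verification scores, not the paper's data.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)
# Genuine trials score higher on average than impostor trials.
genuine = rng.normal(loc=2.0, scale=1.0, size=500)
impostor = rng.normal(loc=0.0, scale=1.0, size=500)
y_true = np.concatenate([np.ones_like(genuine), np.zeros_like(impostor)])
y_score = np.concatenate([genuine, impostor])

fpr, tpr, _ = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)

# EER is the operating point where the false-acceptance rate (fpr)
# equals the false-rejection rate (1 - tpr).
fnr = 1.0 - tpr
idx = np.nanargmin(np.abs(fpr - fnr))
eer = (fpr[idx] + fnr[idx]) / 2.0

print(f"ROC AUC = {auc:.4f}, EER = {eer:.4%}")
```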
Table 1. Alternatives to SVMs.
Source | Date Published | Model | EER | Accuracy
Ghoniem et al. [21] | 2017 | FHMM | n/a | 98.3%
Thakur et al. [12] | 2022 | GMM-UBM | 5.31% | n/a
Thakur et al. [12] | 2022 | i-vector | 4.74% | n/a
Rakhmanenko et al. [13] | 2016 | GMM-UBM | 0.763% | n/a
Ditlovich et al. [14] | 2017 | GMM-UBM | 16.3% | n/a
Ditlovich et al. [14] | 2017 | GMM-IBM | 13.9% | n/a
Pinheiro et al. [15] | 2016 | GMM-UBM | 16.86% | n/a
Pinheiro et al. [15] | 2016 | T2F-GMM-UBM | 13.73% | n/a
Sarmah et al. [17] | 2015 | GMM-UBM | 7.50% | n/a
Song et al. [11] | 2018 | GMM-UBM | 5.6% | n/a
Song et al. [11] | 2018 | VQ | 8.3% | n/a
Song et al. [11] | 2018 | i-vector | 5.0% | n/a
Li et al. [18] | 2015 | GMM-i-vector | 1.71% | n/a
Das et al. [19] | 2016 | i-vector | 11.30% | n/a
Chen et al. [20] | 2017 | i-vector | 0.164% to 6.802% | n/a
Hourri et al. [22] | 2019 | K-means | 0.32% | n/a
Note: “n/a” indicates that the metric was not reported in the corresponding study.
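Several of the baselines in Table 1 are GMM-UBM systems. The sketch below is a minimal, hypothetical illustration of that family using scikit-learn's GaussianMixture: a universal background model is fitted on pooled frames, a speaker model on the claimed speaker's frames, and a trial is scored by the average log-likelihood ratio. The features are synthetic, and this is not the implementation of any cited study; practical systems additionally use MAP adaptation of the UBM and score normalization.

```python
# Minimal GMM-UBM-style scoring sketch (illustrative, synthetic features).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
ubm_features = rng.normal(size=(5000, 20))              # pooled background frames
target_features = rng.normal(loc=0.3, size=(800, 20))   # claimed speaker's frames
trial_features = rng.normal(loc=0.3, size=(200, 20))    # frames from a test utterance

# Universal background model fitted over many speakers.
ubm = GaussianMixture(n_components=16, covariance_type="diag", random_state=0)
ubm.fit(ubm_features)

# Speaker model; real systems usually MAP-adapt the UBM instead of refitting.
target = GaussianMixture(n_components=16, covariance_type="diag", random_state=0)
target.fit(target_features)

# Average per-frame log-likelihood ratio; accept if it exceeds a tuned threshold.
llr = target.score(trial_features) - ubm.score(trial_features)
print(f"Log-likelihood ratio = {llr:.3f}")
```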
Table 2. Accessibility of work related to the target domain.
Source | Date Published | Code Availability | Dataset | Model
Ours | 2024 | Yes | LibriSpeech | SVM and DL
Abdalmalak et al. [24] | 2016 | No | ELSDSR | SVM
Zergat et al. [25] | 2013 | No | TIMIT | SVM
Kamruzzaman et al. [26] | 2010 | No | Not Specified | SVM
Charan et al. [23] | 2017 | No | Not Specified | SVM
Rashno et al. [27] | 2015 | No | TIMIT | SVM
Ali et al. [28] | 2016 | No | Urdu dataset | SVM
Tayebi et al. [29] | 2022 | No | LibriSpeech | DL
Choi et al. [30] | 2022 | No | VoxCeleb-1 | DL
Xu et al. [31] | 2020 | Yes | VoxCeleb-1 and 2 | DL
Li et al. [32] | 2020 | No | VoxCeleb-1 | DL
Chen et al. [33] | 2024 | No | Common Voice | DL
Zhang et al. [34] | 2021 | No | VoxCeleb-1 and 2 | DL
Table 3. Features computed in related studies.
Papers | MFCC | PLP | R-PLP | LSF | BFCC | LPCC | PCA | t-SNE
Ours
Abdalmalak et al. [24]
Zergat et al. [25]
Kamruzzaman et al. [26]
Charan et al. [23]
Rashno et al. [27]
Ali et al. [28]
Note: “✓” indicates that the corresponding method was used while blank indicates that it was not.
Table 4. Five-number summary and mean of SNR for pre- and post-denoising, in decibels.
Metric | Min | Q1 | Median | Q3 | Max | Mean
Before denoising | 0.954 | 11.802 | 15.8058 | 18.518 | 37.3284 | 15.1214
After denoising | 0.5972 | 8.7658 | 12.239 | 14.8351 | 29.0571 | 11.8289
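Table 4 summarizes per-recording SNR estimates before and after denoising. The sketch below illustrates, under stated assumptions, how such figures could be produced around a spectral-gating denoiser such as the noisereduce package associated with [46]; the synthetic tone-plus-noise signal and the simple frame-energy SNR estimator are placeholders, not necessarily the pipeline used in this work, so the resulting numbers will differ.

```python
# Hedged sketch (not the paper's pipeline): spectral-gating denoising with the
# noisereduce package and a rough frame-energy SNR estimate. The synthetic
# signal and the SNR estimator are illustrative assumptions.
import numpy as np
import noisereduce as nr

def rough_snr_db(audio: np.ndarray, frame_len: int = 2048, hop: int = 512) -> float:
    """Treat the quietest 10% of frames as the noise floor, the rest as speech."""
    energies = np.array([
        np.mean(audio[i:i + frame_len] ** 2)
        for i in range(0, len(audio) - frame_len, hop)
    ])
    energies = np.sort(energies)
    k = max(1, len(energies) // 10)
    return 10.0 * np.log10(np.mean(energies[k:]) / (np.mean(energies[:k]) + 1e-12))

sr = 16000
t = np.arange(5 * sr) / sr
rng = np.random.default_rng(4)
gate = (t % 1.0) < 0.6                              # "speech" active 60% of each second
clean = 0.5 * np.sin(2 * np.pi * 220 * t) * gate    # stand-in for a speech signal
noisy = clean + 0.05 * rng.standard_normal(len(t))  # additive background noise

denoised = nr.reduce_noise(y=noisy, sr=sr)
print(f"SNR before denoising: {rough_snr_db(noisy):.2f} dB")
print(f"SNR after denoising:  {rough_snr_db(denoised):.2f} dB")
```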
Table 5. Twenty versus thirteen coefficients employed in model training; p-values rounded to the nearest 10,000th.
Model/Test | Accuracy | Precision | Recall | F1-Score | ROC AUC | EER | MSE | MCC
SVM Mann | 0.055 | 0.176 | 0.057 | 0.053 | 0.042 | 0.066 | 0.055 | 0.060
SVM Kruskal | 0.053 | 0.172 | 0.055 | 0.051 | 0.041 | 0.064 | 0.053 | 0.058
CNN Mann | 0 | 0 | 0 | 0.001 | 0 | 0.001 | 0 | 0.001
CNN Kruskal | 0 | 0 | 0 | 0.001 | 0 | 0.001 | 0 | 0.001
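The p-values in Table 5 come from Mann-Whitney U and Kruskal-Wallis tests comparing per-speaker metric distributions obtained with twenty versus thirteen coefficients. A minimal SciPy sketch of such a comparison is given below; the per-speaker metric arrays are synthetic placeholders, not the study's results.

```python
# Hedged sketch: nonparametric tests comparing one metric (e.g., accuracy)
# across two feature configurations. The arrays below are synthetic.
import numpy as np
from scipy.stats import kruskal, mannwhitneyu

rng = np.random.default_rng(2)
acc_20_coeff = rng.normal(loc=0.967, scale=0.02, size=245)  # per-speaker accuracies
acc_13_coeff = rng.normal(loc=0.962, scale=0.02, size=245)

u_stat, p_mann = mannwhitneyu(acc_20_coeff, acc_13_coeff, alternative="two-sided")
h_stat, p_kruskal = kruskal(acc_20_coeff, acc_13_coeff)

print(f"Mann-Whitney U p-value: {p_mann:.4f}")
print(f"Kruskal-Wallis p-value: {p_kruskal:.4f}")
```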
Table 6. Five-number summary and mean of speaker verification SVMs.
Metric | Min | Q1 | Median | Q3 | Max | Mean
Accuracy | 0.9044 | 0.9570 | 0.9691 | 0.9803 | 0.9970 | 0.9666
Precision | 0.9097 | 0.9648 | 0.9769 | 0.9870 | 0.9979 | 0.9740
Recall | 0.8890 | 0.9445 | 0.9618 | 0.9739 | 0.9970 | 0.9581
F1-score | 0.9015 | 0.9555 | 0.9685 | 0.9798 | 0.9969 | 0.9658
ROC AUC | 0.9691 | 0.9950 | 0.9982 | 0.9995 | 1.0000 | 0.9964
EER | 0.0010 | 0.0182 | 0.0289 | 0.0438 | 0.0974 | 0.0324
MCC | 0.8084 | 0.9137 | 0.9388 | 0.9606 | 0.9939 | 0.9335
MSE | 0.0030 | 0.0197 | 0.0309 | 0.0430 | 0.0956 | 0.0334
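Table 6 above (and Table 7 below) condense the per-speaker results into a five-number summary plus the mean. The short NumPy sketch that follows shows this aggregation for one metric; the per-speaker values are synthetic placeholders.

```python
# Hedged sketch: five-number summary (min, Q1, median, Q3, max) plus mean
# for a per-speaker metric; the values below are synthetic placeholders.
import numpy as np

rng = np.random.default_rng(3)
per_speaker_accuracy = rng.beta(a=40, b=2, size=245)  # one value per speaker model

summary = {
    "Min": np.min(per_speaker_accuracy),
    "Q1": np.percentile(per_speaker_accuracy, 25),
    "Median": np.percentile(per_speaker_accuracy, 50),
    "Q3": np.percentile(per_speaker_accuracy, 75),
    "Max": np.max(per_speaker_accuracy),
    "Mean": np.mean(per_speaker_accuracy),
}
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```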
Table 7. Five-number summary and mean of speaker verification CNNs.
Metric | Min | Q1 | Median | Q3 | Max | Mean
Accuracy | 0.9942 | 0.9999 | 1.0000 | 1.0000 | 1.0000 | 0.9999
Precision | 0.9934 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 0.9998
Recall | 0.9928 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 0.9999
F1-score | 0.6622 | 0.8040 | 0.8513 | 0.8848 | 0.9288 | 0.8395
ROC AUC | 0.9418 | 0.9952 | 0.9981 | 0.9994 | 1.0000 | 0.9951
EER | 0.0000 | 0.0030 | 0.0046 | 0.0086 | 0.0521 | 0.0074
MCC | 0.9000 | 0.9818 | 0.9899 | 0.9930 | 0.9990 | 0.9846
MSE | 0.0000 | 0.0000 | 0.0000 | 0.0001 | 0.0051 | 0.0001
Table 8. Paper comparisons.
Source: SVM Ours (an illustrative configuration sketch follows this table)
Accuracy | Precision | Recall | F1 | ROC AUC | EER | MSE | MCC | Model | Features | Data
96.91% | 97.69% | 96.18% | 96.85% | 99.82% | 2.89% | 4.30% | 96.06% | RBF kernel SVM | 20 MFCC coefficients with PCA applied | 245 speakers with 50/50 gender split
Source: CNN Ours
Accuracy | Precision | Recall | F1 | ROC AUC | EER | MSE | MCC | Model | Features | Data
100% | 100% | 100% | 85.13% | 99.81% | 0.46% | 0% | 98.99% | See Figure 2 | See SVM Ours | See SVM Ours
Source: Abdalmalak et al. [24]
Accuracy | Precision | Recall | F1 | ROC AUC | EER | MSE | MCC | Model | Features | Data
n/a | n/a | n/a | n/a | 98% | 5.6% to 7.1% | n/a | n/a | 3 parallel SVMs (linear, RBF, logistic regression kernels) | MFCC, BFCC, PLP, R-PLP | All 22 speakers from the ELSDSR dataset; each speaker has 9 utterances, split 7/2 for training and testing
n/a | n/a | n/a | n/a | 96% | 9.3% to 11.68% | n/a | n/a | Logistic regression kernel SVM
n/a | n/a | n/a | n/a | 94% | 12.5% to 14.9% | n/a | n/a | Linear kernel SVM
n/a | n/a | n/a | n/a | 94% | 12.5% to 14.9% | n/a | n/a | RBF kernel SVM
Source: Zergat et al. [25]
Accuracy | Precision | Recall | F1 | ROC AUC | EER | MSE | MCC | Model | Features | Data
n/a | n/a | n/a | n/a | n/a | 0.51% | n/a | n/a | RBF kernel SVM | 12 MFCC coefficients and LSF with PCA applied | 180 speakers from TIMIT dataset split 50/50 for training/testing with 15/9 s per speaker
n/a | n/a | n/a | n/a | n/a | 0.63% | n/a | n/a | | 12 MFCC with PCA applied
n/a | n/a | n/a | n/a | n/a | 2.56% | n/a | n/a | | LSF with PCA applied
Source: Kamruzzaman et al. [26]
Accuracy | Precision | Recall | F1 | ROC AUC | EER | MSE | MCC | Model | Features | Data
91.88% | n/a | n/a | n/a | n/a | n/a | n/a | n/a | SVM trained with Chunking | 20 MFCC coefficients | 8 speakers with 20 samples of the word “zero” per speaker
95% | n/a | n/a | n/a | n/a | n/a | n/a | n/a | SVM trained with SMO
Source: Charan et al. [23]
Accuracy | Precision | Recall | F1 | ROC AUC | EER | MSE | MCC | Model | Features | Data
23% | n/a | n/a | n/a | n/a | n/a | n/a | n/a | SVM | 20 MFCC coefficients with PCA | 15 total speakers with 40–45 samples each
18.5% | n/a | n/a | n/a | n/a | n/a | n/a | n/a | | 20 LPCC coefficients with PCA
22% | n/a | n/a | n/a | n/a | n/a | n/a | n/a | | 20 PLP coefficients with PCA
57.9% | n/a | n/a | n/a | n/a | n/a | n/a | n/a | | 20 MFCC coefficients with t-SNE
38% | n/a | n/a | n/a | n/a | n/a | n/a | n/a | | 20 LPCC coefficients with t-SNE
52.8% | n/a | n/a | n/a | n/a | n/a | n/a | n/a | | 20 PLP coefficients with t-SNE
Source: Rashno et al. [27]
Accuracy | Precision | Recall | F1 | ROC AUC | EER | MSE | MCC | Model | Features | Data
n/a | n/a | n/a | n/a | n/a | 2.140% | n/a | n/a | RBF kernel | 39 → 14 with ACO | 72 male and 26 female speakers chosen randomly from TIMIT, with 6 sentences each for training and 4 each for testing
n/a | n/a | n/a | n/a | n/a | 8.751% | n/a | n/a | MLP kernel | 39 → 13 with ACO
n/a | n/a | n/a | n/a | n/a | 1.745% | n/a | n/a | Polynomial kernel | 39 → 14 with ACO
n/a | n/a | n/a | n/a | n/a | 5.122% | n/a | n/a | RBF kernel | 39 → 16 with GA
n/a | n/a | n/a | n/a | n/a | 10.123% | n/a | n/a | MLP kernel | 39 → 15 with GA
n/a | n/a | n/a | n/a | n/a | 3.128% | n/a | n/a | Polynomial kernel | 39 → 14 with GA
Source: Ali et al. [28]
Accuracy | Precision | Recall | F1 | ROC AUC | EER | MSE | MCC | Model | Features | Data
88.6% | n/a | n/a | n/a | n/a | n/a | n/a | n/a | RBF kernel | 36 MFCC coeff | 250 speakers from the Urdu dataset split with a 2:1:1 train, validation, and test ratio
90.40% | n/a | n/a | n/a | n/a | n/a | n/a | n/a | DBN-1
91.40% | n/a | n/a | n/a | n/a | n/a | n/a | n/a | DBN-1 and 36 MFCC coeff
72.20% | n/a | n/a | n/a | n/a | n/a | n/a | n/a | DBN-2
87.00% | n/a | n/a | n/a | n/a | n/a | n/a | n/a | DBN-2 and 36 MFCC coeff
90.60% | n/a | n/a | n/a | n/a | n/a | n/a | n/a | DBN-1 and 2
92.60% | n/a | n/a | n/a | n/a | n/a | n/a | n/a | DBN-1 and 2 and 36 MFCC coeff
Source: Tayebi et al. [29]
Accuracy | Precision | Recall | F1 | ROC AUC | EER | MSE | MCC | Model | Features | Data
n/a | n/a | n/a | n/a | n/a | 3.92% | n/a | n/a | GE2E trained on 2 enrollment utterances | 40-dimensional log-Mel-filterbank of 25 ms frames with 10 ms steps | Evaluated on LibriSpeech test-clean subset
n/a | n/a | n/a | n/a | n/a | 2.57% | n/a | n/a | GE2E trained on 3 enrollment utterances
n/a | n/a | n/a | n/a | n/a | 2.41% | n/a | n/a | GE2E trained on 4 enrollment utterances
n/a | n/a | n/a | n/a | n/a | 2.27% | n/a | n/a | GE2E trained on 7 enrollment utterances
n/a | n/a | n/a | n/a | n/a | 2.17% | n/a | n/a | GE2E trained on 10 enrollment utterances
n/a | n/a | n/a | n/a | n/a | 2.01% | n/a | n/a | GE2E trained on 15 enrollment utterances
Note: “n/a” indicates that the metric was not reported in the corresponding study.
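The "Source: SVM Ours" entry in Table 8 describes an RBF-kernel SVM trained on 20 MFCC coefficients with PCA applied. The sketch below is an illustrative, hypothetical version of such a per-speaker verification pipeline built with librosa and scikit-learn; the synthetic "speakers", PCA dimensionality, and SVM hyperparameters are assumptions and do not reproduce the configuration or results reported in this paper.

```python
# Hedged sketch of an RBF-kernel SVM verifier over 20 MFCC coefficients with
# PCA, in the spirit of the "Source: SVM Ours" row. The synthetic "speakers",
# PCA dimensionality, and hyperparameters are assumptions, not the paper's setup.
import numpy as np
import librosa
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

SR = 16000
rng = np.random.default_rng(5)

def fake_utterance(f0: float, seconds: float = 2.0) -> np.ndarray:
    """Synthetic stand-in for a recording: a tone plus background noise."""
    t = np.arange(int(seconds * SR)) / SR
    return np.sin(2 * np.pi * f0 * t) + 0.1 * rng.standard_normal(len(t))

def utterance_features(audio: np.ndarray) -> np.ndarray:
    """Mean of 20 MFCCs over all frames -> one fixed-length vector per clip."""
    mfcc = librosa.feature.mfcc(y=audio, sr=SR, n_mfcc=20)  # shape (20, n_frames)
    return mfcc.mean(axis=1)

# Target "speaker" near 120 Hz, impostors elsewhere; label 1 = target, 0 = impostor.
target_clips = [fake_utterance(120 + rng.normal(0, 2)) for _ in range(10)]
impostor_clips = [fake_utterance(rng.uniform(180, 300)) for _ in range(10)]
X = np.stack([utterance_features(a) for a in target_clips + impostor_clips])
y = np.array([1] * 10 + [0] * 10)

# One binary verifier per claimed identity: scale, reduce with PCA, RBF-kernel SVM.
verifier = make_pipeline(
    StandardScaler(),
    PCA(n_components=10),   # assumed dimensionality, not the paper's choice
    SVC(kernel="rbf", C=1.0, gamma="scale", probability=True),
)
verifier.fit(X, y)

trial = utterance_features(fake_utterance(121))
print("Acceptance probability:", verifier.predict_proba(trial.reshape(1, -1))[0, 1])
```

Averaging MFCCs over frames is only one way to obtain a fixed-length utterance representation; frame-level scoring followed by score fusion is an equally common alternative.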
Table 9. Paper comparisons continued.
Source: Choi et al. [30]
Accuracy | Precision | Recall | F1 | ROC AUC | EER | MSE | MCC | Model | Features | Data
n/a | n/a | n/a | n/a | n/a | 2.70% | n/a | n/a | BC-CMT-Tiny (273.6K params) | 80-dimensional log-Mel-filterbank of 25 ms frames with 10 ms steps | Trained on 5994 speakers with an average of 185 utterances per speaker, each lasting on average 7.8 s
n/a | n/a | n/a | n/a | n/a | 1.05% | n/a | n/a | BC-CMT-Small (1.4M params)
n/a | n/a | n/a | n/a | n/a | 0.86% | n/a | n/a | BC-CMT-Base (6.3M params)
Source: Xu et al. [31]
Accuracy | Precision | Recall | F1 | ROC AUC | EER | MSE | MCC | Model | Features | Data
n/a | n/a | n/a | n/a | n/a | 3.48% | n/a | n/a | ResNet-50 architecture | Spectrograms generated by a sliding window using a Hamming window with a width of 20 ms and a step of 10 ms | VoxCeleb2 training set consisting of 1,128,246 utterances from 5994 speakers with an average duration of 8.28 s
Source: Li et al. [32]
Accuracy | Precision | Recall | F1 | ROC AUC | EER | MSE | MCC | Model | Features | Data
n/a | n/a | n/a | n/a | n/a | 2.16% | n/a | n/a | ResNet34-AM-Softmax | Calculated by DNN | Trained on 7205 speakers each with two sets of 64 random 3 s segments of utterances
Source: Chen et al. [33]
Accuracy | Precision | Recall | F1 | ROC AUC | EER | MSE | MCC | Model | Features | Data
n/a | n/a | n/a | n/a | 90.1% | 16.4% | n/a | n/a | 3D-CNN with 469.465K params | 40 Mel-scale logarithmic energy coefficients | Uses the Common Voice breakdown described in Section 2.4.6
n/a | n/a | n/a | n/a | 92.5% | 14.3% | n/a | n/a | Lightweight 3D-CNN with 235.381K params
Source: Zhang et al. [34]
Accuracy | Precision | Recall | F1 | ROC AUC | EER | MSE | MCC | Model | Features | Data
n/a | n/a | n/a | n/a | n/a | 3.88% | n/a | n/a | Fine-tuned thin-ResNet34 with all labels and all layers | 40-dimensional log-Mel features with a Hamming window of 25 ms and 10 ms step | Random 2 s non-overlapping speech segments from VoxCeleb-1
n/a | n/a | n/a | n/a | n/a | 1.87% | n/a | n/a | | | Random 2 s non-overlapping speech segments from VoxCeleb-2
Note: “n/a” indicates that the metric was not reported in the corresponding study.
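Several of the deep learning systems summarized in Tables 8 and 9 operate on 40- or 80-dimensional log-Mel-filterbank features computed from 25 ms frames with 10 ms steps. The sketch below illustrates that feature format with librosa; the synthetic audio, sample rate, and parameter values are assumptions for demonstration and do not reproduce any cited system's front end.

```python
# Hedged sketch: 40-dimensional log-Mel-filterbank features from 25 ms frames
# with 10 ms hops, as described in several Table 8/9 entries. The synthetic
# audio below is a placeholder for a real utterance.
import numpy as np
import librosa

sr = 16000
rng = np.random.default_rng(6)
audio = rng.standard_normal(3 * sr).astype(np.float32)  # placeholder "speech"

mel = librosa.feature.melspectrogram(
    y=audio,
    sr=sr,
    n_fft=int(0.025 * sr),       # 25 ms analysis window
    hop_length=int(0.010 * sr),  # 10 ms step
    n_mels=40,                   # 40 Mel bands (80 in some of the cited systems)
)
log_mel = librosa.power_to_db(mel)  # log-compressed filterbank energies
print(log_mel.shape)                # (40, n_frames)
```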