1. Introduction
Speaker verification (SV) is an authentication technique that verifies a person's identity from a speech sample, and it is formulated as a binary classification problem. Owing to its non-invasive nature, SV has attracted great interest for authentication services such as voice mail, home automation, computer login, online resource access, and the Internet of Things (IoT). Depending on whether the lexicon or phonetic content of the speech sample is constrained, SV systems are broadly categorized as text-independent (TI) or text-dependent (TD). In TD-SV, speakers utter the same pass-phrase during both the enrollment and test phases to maintain matched phonetic content, whereas in TI-SV, speakers are free to speak any text during both phases. Consequently, TD-SV yields much lower error rates than TI-SV, especially for short utterances. Moreover, because TD-SV requires only short utterances, its response time is much shorter than that of TI-SV, which makes it attractive for real-time applications.
A variety of methods have been proposed in the literature to improve the performance of TD-SV. These methods can be grouped into the feature domain [1], model domain [2,3], and score domain [4]. In the feature domain, one type of feature comprises engineered short-time cepstral features, such as Mel-frequency cepstral coefficients (MFCC) [5], power-normalized cepstral coefficients [6], and perceptual linear prediction [7]. Another type comprises learned bottleneck (BN) features, which are derived from deep neural networks (DNNs): a DNN is trained to discriminate or predict a chosen target, and the frame-level output of a particular hidden layer is then projected onto a low-dimensional space, usually found with principal component analysis (PCA), to obtain BN features [8]. In [9], audio segments of variable length are represented by fixed-length vectors using a sequence-to-sequence autoencoder. A fusion embedding network is proposed in [10] to combine the advantages of TI-SV and TD-SV in joint learning. A multi-task learning network based on a phoneme-aware and channel-wise attentive learning strategy is proposed for TD-SV to disentangle speaker and text information [11]. A DNN based on a memory layer and a multi-head attention mechanism is proposed in [12] to improve the efficiency of TD-SV systems. A synthesis-based data augmentation method is introduced in [13] to increase the amount of speaker- and text-controlled speech data for TD-SV. In this work, we focus on the feature domain, in particular deep features at the frame level.
In training DNNs for feature extraction, various training targets are used; examples are speakers [8], phones [1], pass-phrases [8], senones [14], time-contrastive learning targets [1], and autoregressive predictive coding (APC) targets [15]. Most BN feature extraction methods require label information such as speaker identities, pass-phrases, and phones, and generating such labels can be time-consuming and expensive. As an alternative, self-supervised and semi-supervised learning is very appealing, as it can leverage the large amount of unlabeled data available in the real world. Recently, APC [16] and time-contrastive learning (TCL) [1] BN features were introduced for speech representation learning for SV. In APC-BN, a DNN is trained to predict a future feature vector from the current and past frames, and the last hidden layer is then used for BN feature extraction. Given that the objective of APC is to predict the content of the next frame, it is unknown whether the last hidden layer is the optimal choice. TCL, on the other hand, uniformly divides the speech signal into a number of predefined segments and assigns all frames within a particular segment the same class label; a DNN is then trained to discriminate these classes for BN feature extraction. TCL aims to capture the temporal/phonetic information in the speech signal in a self-supervised manner and has been shown to be very useful for TD-SV. As both APC and TCL BN features are extracted in a self-supervised manner, it is of interest and relevance to compare their performance and behavior in the same framework.
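To make the TCL target construction concrete, the following minimal sketch assigns frame-level TCL classes, assuming only that labels index uniform consecutive segments (the segment count used here is illustrative, not taken from the paper):

```python
import numpy as np

def tcl_targets(num_frames: int, num_classes: int) -> np.ndarray:
    """Assign TCL targets: split the utterance uniformly into
    `num_classes` consecutive segments and give every frame in
    segment i the class label i."""
    # np.array_split tolerates lengths that are not exact
    # multiples of num_classes.
    segments = np.array_split(np.arange(num_frames), num_classes)
    labels = np.empty(num_frames, dtype=np.int64)
    for class_id, frame_ids in enumerate(segments):
        labels[frame_ids] = class_id
    return labels

# Example: a 100-frame utterance with 10 TCL classes;
# frames 0-9 get label 0, frames 10-19 get label 1, and so on.
print(tcl_targets(100, 10))
```

A DNN trained with a discriminative loss on such labels then provides the hidden-layer outputs from which TCL-BN features are extracted.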
In addition to the selection of training targets, other essential choices in DNN design include activation functions and loss functions, both key elements for DNN training. A loss function measures the error between the network output and the desired target, and in error back-propagation training, its derivative guides the gradient descent updates. Various loss functions have been introduced in the literature for improved representation learning in tasks such as speech recognition, speaker verification, and image classification; examples are the joint softmax-center loss [17], modified-softmax [18], ArcFace [19], focal loss [20], the orthogonal softmax layer (OSL) [21], the triplet loss [22], the simple framework for contrastive learning (SimCLR) [23], and cross-entropy.
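As a concrete example of one of these objectives, the sketch below implements the multi-class focal loss in TensorFlow; note that gamma = 2 and alpha = 0.25 are the defaults from the original focal loss paper [20], used here as placeholders rather than values confirmed by this work:

```python
import tensorflow as tf

def focal_loss(labels, logits, gamma=2.0, alpha=0.25):
    """Multi-class focal loss: scales cross-entropy by (1 - p_t)^gamma,
    down-weighting well-classified examples so training focuses on hard
    ones. `labels` holds integer class ids; `logits` are the
    unnormalized network outputs."""
    ce = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=labels, logits=logits)   # per-example -log(p_t)
    p_t = tf.exp(-ce)                   # probability of the true class
    return tf.reduce_mean(alpha * tf.pow(1.0 - p_t, gamma) * ce)
```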
Activation functions, on the other hand, control the output of DNN hidden neurons, as well as the gradient contribution during the error back-propagation process for network parameter optimization. Among others, sigmoid [24,25] and ReLU [26] are the most widely used in state-of-the-art systems for speaker recognition [27,28,29,30], language recognition [31,32], speech recognition [33,34], prosodic representation [35], and image processing [19,22,23].
Although widely used, the sigmoid function suffers from a major problem, namely vanishing gradients. The function squashes its input into the range between 0 and 1, so a large change in input can produce only a small change in output, yielding a very small derivative. Multiplying such derivatives through the hidden layers in back-propagation decreases the gradient exponentially; in the end, the initial layers are not updated properly, so the model is not trained effectively and generalizes poorly [36,37]. To avoid the vanishing gradient problem, the ReLU activation function is widely used instead. As it preserves a large dynamic range of the input in the output (from 0 to the maximum), it provides better generalization performance than the sigmoid function and is simple to compute. According to [38], the sigmoid function is ineffective for training DNNs due to vanishing gradients, while ReLU lacks a probabilistic interpretation and thus requires stochastic regularization for better DNN training. To combine stochastic regularization with a deterministic activation function, the GELU activation function was introduced in [38], where it is shown to outperform ReLU and the exponential linear unit (ELU) in different tasks, including speech recognition, language processing, and computer vision. For extracting speaker embeddings, GELU has been used in Transformer-based and multi-layer perceptron-based speaker verification network (MLP-SVNet) systems [39,40].
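For reference, the three activation functions discussed above can be written compactly as follows; the tanh-based form of GELU is the approximation commonly used in practice:

```python
import numpy as np
from scipy.stats import norm

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # squashes input into (0, 1)

def relu(x):
    return np.maximum(0.0, x)         # identity for x > 0, else 0

def gelu(x):
    return x * norm.cdf(x)            # exact GELU: x * Phi(x)

def gelu_tanh(x):
    # Widely used tanh approximation of GELU.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi)
                                    * (x + 0.044715 * x ** 3)))
```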
Methods for BN feature extraction in TD-SV usually adopt the sigmoid activation function and, when a discriminative loss is needed, use cross-entropy for discriminating, e.g., speakers, pass-phrases, senones, or TCL segments. The focus has been on defining training targets, while loss functions and activation functions remain significantly under-explored. Therefore, in this work, we aim to fill this gap, namely, to study the effect of different loss and activation functions, in connection with training targets, on BN feature extraction for TD-SV.
The contributions of this work are five-fold. First, we systematically study the impact of training targets, activation functions, and loss functions for BN feature extraction on TD-SV performance in one joint framework, i.e., all training targets and activation and loss functions are evaluated with the same DNN structure for BN feature extraction and the same TD-SV back-end and task. Second, we investigate the ReLU and GELU activation functions for BN feature extraction for TD-SV and compare them with the sigmoid function commonly used in this context. Third, we study the impact of a set of loss functions on TD-SV performance. Fourth, we compare the performance of speaker-discriminant (Spkr) BN, TCL-BN, and APC-BN features, the first being supervised and the last two self-supervised. Finally, we analyze the performance of BN features extracted from different hidden layers as well as the score-level fusion of TD-SV systems based on different features.
We show that (1) both ReLU and GELU are able to reduce TD-SV error rates significantly compared with the commonly used sigmoid function in most cases, with GELU generally performing the best; (2) the cross-entropy, joint softmax-center, and focal loss functions outperform the others; (3) TCL is the best-performing training target; and (4) fusing different systems in the score domain further reduces the error rate. For the TD-SV back-end, we consider two well-known state-of-the-art techniques: the Gaussian mixture model-universal background model (GMM-UBM) [41] and the i-vector [2] with scoring based on supervised probabilistic linear discriminant analysis (PLDA) training [42,43].
The paper is organized as follows. Section 2 presents three training targets and their corresponding BN features. Section 3 and Section 4 introduce loss functions and activation functions, respectively. Section 5 presents the GMM-UBM and i-vector methods used for speaker modeling. The experimental setup is described in Section 6. Section 7 provides results and discussions. Finally, the paper is concluded in Section 8.
6. Experimental Setup
For evaluation, male speakers of the m-part-01 task in the RedDots challenge 2016 database are used as per the protocol [51]; the database comprises 35 target male speakers, 14 unseen male imposters, 6 target female speakers, and 7 unseen female speakers. The task consists of 320 target models (from the 35 target male speakers), each trained using three voice recordings of a particular pass-phrase. Each utterance is very short, 2–3 s on average. Three types of non-target trials are available for assessing TD-SV performance (the sketch after this list illustrates how trials are categorized):
Target-wrong (TW): When a genuine speaker utters a wrong pass-phrase, i.e., a pass-phrase/sentence in testing different from their enrollment pass-phrase.
Imposter-correct (IC): When an imposter utters a sentence/pass-phrase in testing that is the same as that of the target's enrollment sessions.
Imposter-wrong (IW): When an imposter utters a sentence/pass-phrase to access the system that is different from that of the target's enrollment sessions.
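The sketch below, with hypothetical identifiers, illustrates how a single trial would be categorized under this protocol:

```python
def trial_type(true_spk: str, claimed_spk: str,
               enrol_phrase: str, test_phrase: str) -> str:
    """Categorize a RedDots-style trial; argument names are illustrative."""
    if true_spk == claimed_spk:
        # Genuine speaker: "target-correct" is the only accept condition.
        return "target-correct" if test_phrase == enrol_phrase else "target-wrong"
    return "imposter-correct" if test_phrase == enrol_phrase else "imposter-wrong"
```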
The evaluation data set is further divided into a development set (devset) and a test set (also called the evaluation set) as per [52,53,54]. The development set consists of a disjoint set of nine speakers (who are excluded from the system evaluation), and the remaining speakers are used for evaluation. This yields 72 and 248 target models for development and evaluation, respectively. It is important to note that the trials in the devset are derived by cross-claiming one speaker against the others (within the nine speakers). The evaluation set, however, contains some imposter trials from speakers outside the enrollment set, i.e., unknown imposters, which makes it more challenging than the devset and representative of real-world scenarios where the system can encounter unknown imposters.
Table 1 shows the number of different trials available in the development and evaluation sets. For more details about the database, see [51].
For the spectral feature, 57-dimensional MFCC feature vectors (19 static coefficients plus their first and second derivatives) are extracted from the speech samples with RASTA filtering [55], using a 25 ms Hamming window and a 10 ms frame shift. After feature extraction, rVAD [56], an open-source unsupervised voice activity detection (VAD) algorithm (https://github.com/zhenghuatan/rVAD, accessed on 16 March 2022), is applied to discard low-energy frames. Finally, the selected frames are normalized to zero mean and unit variance at the utterance level.
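A rough librosa-based sketch of this front-end is given below; it approximates the pipeline and, for brevity, omits the RASTA filtering and rVAD frame selection used in the actual work:

```python
import librosa
import numpy as np

def extract_mfcc_features(wav_path: str, sr: int = 16000) -> np.ndarray:
    """Approximate front-end: 19 static MFCCs plus deltas and
    double-deltas (57 dims), 25 ms Hamming window, 10 ms shift, and
    utterance-level mean/variance normalization."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=19,
                                n_fft=int(0.025 * sr),
                                hop_length=int(0.010 * sr),
                                window="hamming")
    feats = np.vstack([mfcc,
                       librosa.feature.delta(mfcc),
                       librosa.feature.delta(mfcc, order=2)])  # 57 x T
    # rVAD-based dropping of low-energy frames would go here,
    # before normalization.
    feats = (feats - feats.mean(axis=1, keepdims=True)) / \
            (feats.std(axis=1, keepdims=True) + 1e-8)
    return feats.T  # T x 57
```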
In the GMM-UBM system, a GMM-UBM of 512 mixtures (with diagonal covariance matrices) is trained using 6300 speech files from the TIMIT database [57] (438 male and 192 female speakers). Three iterations of MAP adaptation are used when training the speaker-dependent models, with a relevance factor of 10. For training the DNNs for BN feature extraction and for training the total variability space and PLDA of the i-vector system, 72,764 utterances covering 27 pass-phrases (from 157 male and 143 female speakers) from the RSR2015 database [58] are used.
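As a sketch of the speaker-model training step, the following shows Reynolds-style mean-only MAP adaptation of a diagonal-covariance UBM with scikit-learn, using the three iterations and relevance factor 10 stated above; the actual systems need not be implemented this way:

```python
from copy import deepcopy

import numpy as np
from sklearn.mixture import GaussianMixture

def map_adapt_means(ubm: GaussianMixture, X: np.ndarray,
                    relevance: float = 10.0, n_iter: int = 3):
    """Mean-only MAP adaptation: interpolate between the UBM means and
    the speaker's data statistics, weighted by the soft counts n_k.
    X is a (frames x dims) matrix of one speaker's enrollment data."""
    gmm = deepcopy(ubm)  # keep weights/covariances; adapt means only
    for _ in range(n_iter):
        post = gmm.predict_proba(X)            # frames x mixtures
        n_k = post.sum(axis=0) + 1e-10         # soft occupation counts
        ex_k = (post.T @ X) / n_k[:, None]     # first-order statistics
        alpha = (n_k / (n_k + relevance))[:, None]
        gmm.means_ = alpha * ex_k + (1.0 - alpha) * gmm.means_
    return gmm
```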
For BN feature extraction, DNNs with six hidden layers are trained with the following configuration: a batch size of 1024, a learning rate of , 30 training epochs, 1024 neurons per hidden layer, and a contextual input of 11 frames (i.e., 5 left frames, 1 current frame, and 5 right frames). The number of target speakers in BN-Spkr is 300. BN features are extracted by projecting the frame-level output of a particular hidden layer (before applying the activation function) onto a 57-dimensional space using PCA, to match the dimensionality of the MFCC features for a fair comparison.
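A hedged sketch of this extraction step is shown below; the Keras model, layer name, and pre-fitted PCA object are assumptions for illustration:

```python
import numpy as np
import tensorflow as tf
from sklearn.decomposition import PCA

def extract_bn_features(model: tf.keras.Model, layer_name: str,
                        frames: np.ndarray, pca: PCA) -> np.ndarray:
    """Tap the linear (pre-activation) output of one hidden layer and
    project it onto 57 dimensions with a previously fitted PCA.
    Assumes activations are applied as separate layers, so `layer_name`
    addresses the Dense output before the nonlinearity."""
    tap = tf.keras.Model(model.input, model.get_layer(layer_name).output)
    hidden = tap(frames, training=False).numpy()  # frames x 1024
    return pca.transform(hidden)                  # frames x 57
```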
TensorFlow [59] is used for training the DNNs for all BN features except APC-BN. For the loss functions that require positive and negative examples (the triplet loss and SimCLR), examples from the same class within a mini-batch are treated as positive, and examples from all other classes are treated as negative in the similarity measures; this process is repeated for all samples within the mini-batch. The values of s, m, and the temperature are set to 64, , and , respectively, in ArcFace and SimCLR. L2 regularization is applied during the training of DNNs with a penalty value of . In Leaky ReLU, the value of the slope parameter is set to .
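The in-batch positive/negative scheme described above corresponds to a supervised variant of the NT-Xent objective; a minimal TensorFlow sketch follows, with the temperature as a placeholder rather than the value used in the experiments:

```python
import tensorflow as tf

def supervised_ntxent(embeddings, labels, temperature=0.1):
    """SimCLR-style loss where all same-class examples in the
    mini-batch act as positives and all other examples as negatives."""
    z = tf.math.l2_normalize(embeddings, axis=1)
    sim = tf.matmul(z, z, transpose_b=True) / temperature  # B x B
    eye = tf.eye(tf.shape(z)[0])
    logits = sim - 1e9 * eye  # exclude self-similarity
    pos_mask = tf.cast(tf.equal(labels[:, None], labels[None, :]),
                       tf.float32) - eye  # same class, excluding self
    log_prob = tf.nn.log_softmax(logits, axis=1)
    pos_count = tf.maximum(tf.reduce_sum(pos_mask, axis=1), 1.0)
    loss = -tf.reduce_sum(pos_mask * log_prob, axis=1) / pos_count
    return tf.reduce_mean(loss)
```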
For extracting APC-BN features, the DNN encoder is trained as per [15]; it consists of 3 hidden layers of gated recurrent units (GRUs) with the following configuration: a batch size of 32, a learning rate of , and the prediction step n in Equation (2) set to the value that gives the best performance in [15].
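A minimal sketch of such an APC encoder and its objective is given below; the hidden size of 512, the input dimension matching the 57-dimensional MFCC front-end, and n = 3 are plausible defaults borrowed from the APC literature [15], not settings confirmed by this work:

```python
import tensorflow as tf

def build_apc_encoder(input_dim=57, hidden=512, num_layers=3):
    """Stacked-GRU APC encoder followed by a linear projection that
    predicts a future input frame."""
    inp = tf.keras.Input(shape=(None, input_dim))
    x = inp
    for _ in range(num_layers):
        x = tf.keras.layers.GRU(hidden, return_sequences=True)(x)
    out = tf.keras.layers.Dense(input_dim)(x)  # frame prediction head
    return tf.keras.Model(inp, out)

def apc_loss(model, feats, n=3):
    """L1 loss between the prediction at time t and the true frame at
    time t + n. `feats` has shape (batch, time, input_dim)."""
    pred = model(feats[:, :-n, :], training=True)
    return tf.reduce_mean(tf.abs(pred - feats[:, n:, :]))
```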
In PLDA, the speaker and channel factors are kept full, and the same pass-phrase utterances from a particular speaker are treated as one individual class; this gives 8100 classes (4239 male and 3861 female). The i-vector system is implemented using the Kaldi toolkit [60]. PCA is trained on the data set used for training the GMM-UBM.
System performance is measured in terms of the equal error rate (EER) and the minimum detection cost function (minDCF), as per the 2008 NIST SRE [61]. Note that, to be concise, our discussion of experimental results is primarily centered on the EER, as the EER and minDCF results mostly agree with each other. The detection cost function is defined as

$C_{\mathrm{det}} = C_{\mathrm{miss}} \times P_{\mathrm{miss}} \times P_{\mathrm{target}} + C_{\mathrm{fa}} \times P_{\mathrm{fa}} \times (1 - P_{\mathrm{target}})$,

where $C_{\mathrm{miss}} = 10$, $C_{\mathrm{fa}} = 1$, and $P_{\mathrm{target}} = 0.01$.
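For completeness, a small threshold-sweep implementation of both metrics is sketched below, using the cost parameters above:

```python
import numpy as np

def eer_and_mindcf(target_scores, nontarget_scores,
                   c_miss=10.0, c_fa=1.0, p_target=0.01):
    """Compute the EER and the minimum DCF by sweeping the decision
    threshold over every observed score."""
    target_scores = np.asarray(target_scores)
    nontarget_scores = np.asarray(nontarget_scores)
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    p_miss = np.array([(target_scores < t).mean() for t in thresholds])
    p_fa = np.array([(nontarget_scores >= t).mean() for t in thresholds])
    eer_idx = np.argmin(np.abs(p_miss - p_fa))   # where P_miss ~ P_fa
    eer = 0.5 * (p_miss[eer_idx] + p_fa[eer_idx])
    dcf = c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)
    return eer, dcf.min()
```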