On Training Targets and Activation Functions for Deep Representation Learning in Text-Dependent Speaker Verification

Deep representation learning has gained significant momentum in advancing text-dependent speaker verification (TD-SV) systems. When designing deep neural networks (DNN) for extracting bottleneck features, key considerations include training targets, activation functions, and loss functions. In this paper, we systematically study the impact of these choices on the performance of TD-SV. For training targets, we consider speaker identity, time-contrastive learning (TCL) and auto-regressive prediction coding with the first being supervised and the last two being self-supervised. Furthermore, we study a range of loss functions when speaker identity is used as the training target. With regard to activation functions, we study the widely used sigmoid function, rectified linear unit (ReLU), and Gaussian error linear unit (GELU). We experimentally show that GELU is able to reduce the error rates of TD-SV significantly compared to sigmoid, irrespective of training target. Among the three training targets, TCL performs the best. Among the various loss functions, cross entropy, joint-softmax and focal loss functions outperform the others. Finally, score-level fusion of different systems is also able to reduce the error rates. Experiments are conducted on the RedDots 2016 challenge database for TD-SV using short utterances. For the speaker classifications, the well-known Gaussian mixture model-universal background model (GMM-UBM) and i-vector techniques are used.


I. INTRODUCTION
Speaker verification (SV) is an authentication technique to verify a person using their speech sample. It is a binary classification system. Due to its non-invasive nature, SV has attracted great interest for many authentication services such as voice mail, home automation, computer login, online resource access, IoT, etc. Depending on the constraint of lexicon or phonetic content in the speech sample, SV systems can be broadly categorized as text-independent (TI) or text-dependent (TD). In TD-SV, speakers utter the same pass-phrase during both enrollment and test phases to maintain the matched phonetic content. Therefore, TD-SV is able to yield much lower error rates than TI-SV, especially when using short utterances. Besides, because TD-SV needs only short utterances, its response time is much shorter than that of TI-SV, which makes it attractive for real-time applications.
A variety of methods have been proposed in the literature to improve the performance of TD-SV. These methods are grouped into feature domain [1], model domain [2], [3] and score domain [4]. In the feature domain, one type of features includes engineered short-time cepstral features, such as Mel-frequency cepstral coefficients (MFCC) [5], power normalized cepstral coefficients [6], and perceptual linear prediction [7]. Another contains learned bottleneck (BN) features, which are derived from deep neural networks (DNN) where a DNN is trained to discriminate or predict a chosen target. Afterward, the frame-level output of a particular hidden layer is projected onto a low dimensional space to obtain BN features [8]. The projection onto the low dimensional space is usually learned using principal component analysis (PCA). In this work we focus on the feature domain, in particular, deep features.

A. K. Sarkar is with Indian Institute of Information Technology, Sri City, India (E-mail: sarkar.achintya@gmail.com). Z.-H. Tan is with the Department of Electronic Systems, Aalborg University, and Pioneer Centre for AI, Denmark (E-mail: zt@es.aau.dk).
In training DNNs for feature extraction, various training targets have been used, and examples are speakers [8], phones [1], pass-phrases [8], senones [9], time-contrastive learning (TCL) target [1], and auto-regressive prediction coding (APC) target [10]. Most of the BN feature extraction methods require label information such as speaker identities, pass-phrases and phones. Generation of label information can be time-consuming and expensive. As an alternative, self-supervised and semi-supervised learning is very appealing, as it can leverage the large amount of unlabelled data available in the real world. Recently, APC [11] and TCL [1] BN features have been introduced for speech representation learning for SV. In APC-BN, a DNN is trained with the objective of predicting the future feature vector using the current and past frames. Then, the last hidden layer is used for BN feature extraction. Given that the objective of APC is to predict the content of the next frame, it is unknown whether the last hidden layer is the optimal choice. On the other hand, TCL uniformly divides the speech signal into a number of predefined segments and then the frames within a particular segment are assigned the same class label. Afterward, a DNN is trained to discriminate these classes for BN feature extraction. TCL aims to capture the temporal information in the speech signal in a self-supervised manner. As both the recently proposed APC and TCL BN features are extracted in a self-supervised manner, it is of interest and relevance to compare their performance and behaviour in the same framework.
Besides the selection of training targets, the other essential choices in DNN design include activation functions and loss functions, which are both key elements for DNN training. A loss function measures the error between the network output and the desired target and, in error back-propagation training, the derivative of the loss function is used to guide the training through the gradient descent approach. Various loss functions have been introduced in the literature for improved representation learning for such tasks as speech recognition, speaker verification and image classification, and examples are joint softmax-center loss [12], modified-softmax [13], ArcFace [14], focal [15], orthogonal softmax layer (OSL) [16], triplet-loss [17], simple framework for contrastive learning (SimCLR) [18] and cross-entropy.
Although widely used, the sigmoid function suffers from a major problem, namely gradient vanishing. This is because the function squashes the input space into a range between 0 and 1 and hence a large change in the input may produce only a small change in the output, leading to a very small derivative. The multiplication through hidden layers in back-propagation decreases the gradient exponentially. In the end, the initial layers do not get updated properly and thus the model is not trained effectively and lacks generalization ability [31], [32]. To avoid the vanishing gradient problem, the ReLU activation function is widely used as well. As it preserves the large dynamic range of the input in the output (from 0 to maximum), as compared to the sigmoid function, it provides better generalization performance and it is simple. As per [33], the sigmoid function is ineffective for training DNNs due to the gradient vanishing problem, and ReLU lacks a probabilistic interpretation and thus requires stochastic regularization for a better training of DNNs. To combine stochastic regularization with a deterministic activation function, the GELU activation function is introduced in [33]. It is shown in [33] that GELU outperforms ReLU and exponential linear unit (ELU) in different tasks including speech recognition, language processing and computer vision.
Methods for BN feature extraction in TD-SV usually consider the sigmoid activation function and, if a discriminative loss function is needed, cross-entropy is used for discriminating, e.g., speakers, pass-phrases, senones and TCL segments. The focus has been on defining training targets, while loss functions and activation functions are significantly under-explored. Therefore, we aim at filling this gap in this work, namely to study the effect of different loss and activation functions, in connection with training targets, for BN feature extraction in TD-SV.
The contributions of this work are five-fold. First, we systematically study the impact of training targets, activation functions and loss functions on the performance of TD-SV in one joint framework. Secondly, we introduce ReLU and GELU activation functions for BN feature extraction for TD-SV and compare them with the commonly used sigmoid function in this context. Thirdly, we study the impact of a set of loss functions on TD-SV performance. Fourthly, we compare the performance of speaker-discriminant (Spkr) BN, TCL-BN, and APC-BN features with the first being supervised and the last two being self-supervised. Finally, we analyse the performance of BN features extracted from different hidden layers and the performance of score-level fusion of TD-SV systems based on different features. We show that (1) both ReLU and GELU are able to reduce TD-SV error rates significantly as compared with the commonly used sigmoid function in most cases, and GELU generally performs the best, (2) cross entropy, joint-softmax and focal loss functions outperform the others, (3) TCL is the best performing training target, and (4) the fusion of different systems in the score domain further reduces the error rate.

Fig. 1: A DNN system, trained to discriminate or predict targets, for generating BN features using the second hidden layer.
For the TD-SV system, we consider two well-known state-of-the-art techniques: Gaussian mixture model-universal background model (GMM-UBM) [34] and i-vector [2] with probabilistic linear discriminant analysis (PLDA) based scoring [35], [36]. It is observed in [35], [36] that GMM-UBM and i-vector remain a better choice for TD-SV using short utterances than other methods such as x-vector [37]. This is likely due to the limited training data/speakers available in the existing TD-SV databases [38].
The paper is organized as follows. Section II presents three training targets and their corresponding BN features. Sections III and IV introduce loss functions and activation functions, respectively. Section V presents the GMM-UBM and i-vector methods used for speaker modeling. Experimental setup is described in Section VI. Section VII provides results and discussions. Finally, the paper is concluded in Section VIII.

II. BN FEATURES AND THEIR TRAINING TARGETS
We consider both supervised and self-supervised learning methods. The former uses manually generated labels while the latter derives the training target from the data itself without using human labels. More specifically, for supervised learning, speaker identities are used as the training targets, and for self-supervised learning, TCL and APC training targets are used.
After training the DNN, the frame-level output from a particular hidden layer is projected onto a low dimensional space to get BN features. Figure 1 shows a block diagram of extracting BN features from the second hidden layer of a DNN.

A. Spkr-BN
This is a supervised feature extraction method where a feed-forward DNN is trained using speaker identity labels as the training target to discriminate the speakers at the output layer [8], [22], [23]. The generated BN feature is called Spkr-BN.

B. TCL-BN [1]
This is a self-supervised learning method where each speech signal is uniformly segmented into a fixed number of segments and then the data points within a particular segment are assigned the same class label as the training target; the first segment of a signal belongs to class one, the second segment to class two, and so on. These generated targets are then used for the training of a DNN with the cross-entropy loss function, and the derived feature is called uTCL-BN. The objective is to capture temporal information in the speech signal in an unsupervised manner (without using automatic speech recognition or any manual label information).
In another case, speech signals are first randomized and then concatenated into a single long-duration stream. The stream is split into chunks of M frames with M = 6, and the derived feature is called sTCL-BN.
For the c classes in TCL, c segments are taken each time from one entire signal or a part of a stream and the frames within the n-th segment are assigned class label n as

$$\{x_{(n-1)M+1}, \ldots, x_{nM}\} \rightarrow \text{class } n, \quad n = 1, \ldots, c \tag{1}$$

where x denotes the frame-based feature vector and M is the number of frames in a segment. In this study, we consider c = 10 as per [1].

C. APC-BN [10]
In this self-supervised learning method, a DNN encoder is trained to output a sequence $(o_1, o_2, \ldots, o_N)$ as a prediction of a given target sequence $(t_1, t_2, \ldots, t_N)$ that is generated by right-shifting the input sequence $(x_1, x_2, \ldots, x_N)$ by $t_n$ time steps. Then the objective function is defined as the ℓ1 loss between them

$$\sum_{i=1}^{N - t_n} |t_i - o_i|, \quad t_i = x_{i + t_n} \tag{2}$$

which is to be minimized. The frame-level output from a particular hidden layer of the DNN for a given utterance is extracted to get the high dimensional deep APC feature for text-independent speaker verification and identification [10]. In [11], the deep APC feature vectors are further projected onto a low dimensional space to get the APC-BN feature for TD-SV.
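The two self-supervised targets of this section can be illustrated with a minimal Python sketch. The function names `tcl_labels` and `apc_l1_loss` are hypothetical, and the handling of utterances whose length is not a multiple of the number of classes (integer division of the frame index) is an assumption, not a detail taken from [1] or [10].

```python
def tcl_labels(num_frames, num_classes=10):
    """Uniform TCL targets: frame i gets the 1-based index of its segment.

    Assumption: segment boundaries come from integer division, so segment
    lengths may differ by one frame when num_frames % num_classes != 0.
    """
    if num_frames < num_classes:
        raise ValueError("need at least one frame per segment")
    return [i * num_classes // num_frames + 1 for i in range(num_frames)]


def apc_l1_loss(inputs, outputs, shift=5):
    """l1 loss between predictions o_i and targets t_i = x_{i+shift}.

    inputs and outputs are lists of frames (lists of floats); only the
    first len(inputs) - shift predictions have a target.
    """
    total = 0.0
    for i in range(len(inputs) - shift):
        target = inputs[i + shift]
        total += sum(abs(t - o) for t, o in zip(target, outputs[i]))
    return total
```

For example, a 20-frame utterance with 10 classes yields two frames per class, and a predictor that simply copies its input is penalized by the distance between each frame and the frame `shift` steps ahead.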

III. LOSS FUNCTIONS
In this section, we describe a set of loss functions that have been successfully applied to various application domains and will be used in this work for training DNNs to extract bottleneck features. In particular, we focus on loss functions for classification. Note that, in the case of APC-BN, the ℓ1 loss is used for prediction/regression as already presented in the section above.

A. Cross-entropy
In this method, a feed-forward DNN is trained to discriminate the classes at the output layer with cross-entropy (CE) as the loss function

$$L_{CE}(\theta) = -\sum_{i} \log p(y_i \mid x_i; \theta) \tag{3}$$

where $L_{CE}$, $\theta$, $y_i$, $x_i$ and $p(\cdot)$ denote the CE loss, parameters of the DNN, the class label of the i-th input feature vector and the a posteriori output at the DNN output layer, respectively.

B. Joint-softmax-center [12]
This loss function is introduced in [12] to develop robust discriminative deep features by considering two loss functions together in training DNNs for face recognition. To investigate the effectiveness of this loss function for TD-SV, we train a feed-forward DNN with joint supervision of the softmax loss $L_s$ and the center loss $L_c$ for extracting BN features as

$$L = L_s + \lambda L_c \tag{4}$$

$$L = -\sum_{i=1}^{N} \log \frac{e^{W_{y_i}' z_i + b_{y_i}}}{\sum_{j=1}^{n} e^{W_j' z_i + b_j}} + \frac{\lambda}{2} \sum_{i=1}^{N} \| z_i - c_{y_i} \|_2^2 \tag{5}$$

where $z_i \in \mathbb{R}^d$ denotes the i-th d-dimensional deep feature belonging to class $y_i$. $W_j \in \mathbb{R}^d$ and $b \in \mathbb{R}^n$ denote the j-th column of the weight matrix $W \in \mathbb{R}^{d \times n}$ in the last layer of the DNN and the bias, respectively. N and n denote the number of samples in a mini-batch and the number of classes, respectively. $c_{y_i} \in \mathbb{R}^d$ denotes the centroid of class $y_i$ in the deep feature space. $c_{y_i}$ is updated over each mini-batch and $L_c$ characterizes the intra-class variation. $(\cdot)'$ denotes the transpose operation. We consider d = 128 (the embedding feature dimension, i.e., the dimension of the last DNN layer) and set the balancing factor λ between the two losses to 0.003 (as per [12]).
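As an illustration of the joint supervision described above, the following minimal Python sketch evaluates the softmax loss plus the λ-weighted center loss for one mini-batch. The function name and the list-of-lists data layout are hypothetical, and the per-mini-batch centroid update of [12] is deliberately left out.

```python
import math


def softmax_center_loss(feats, labels, W, b, centers, lam=0.003):
    """Softmax loss plus lam times center loss over one mini-batch.

    feats:   N embeddings z_i, each a list of d floats
    labels:  N class indices y_i (0-based)
    W:       n class weight vectors W_j, each a list of d floats
    b:       n biases
    centers: n class centroids c_j, each a list of d floats
    Assumption: centroids are given; their update rule is not shown.
    """
    loss_s, loss_c = 0.0, 0.0
    for z, y in zip(feats, labels):
        logits = [sum(w_k * z_k for w_k, z_k in zip(W[j], z)) + b[j]
                  for j in range(len(W))]
        m = max(logits)  # subtract the max for numerical stability
        log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
        loss_s += log_norm - logits[y]
        loss_c += 0.5 * sum((z_k - c_k) ** 2
                            for z_k, c_k in zip(z, centers[y]))
    return loss_s + lam * loss_c
```

With λ = 0.003 the center term acts as a mild regularizer that pulls embeddings towards their class centroid, while the softmax term remains the dominant part of the objective.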

C. Modified softmax [13]
It is observed in [13] that the feature learned with softmax exhibits an angular distribution and hence the combination of different Euclidean distance based loss functions (triplet loss [17] and contrastive loss [39]) may not be well suited to softmax. Therefore, a softmax function with angular margin is introduced in [13] for face recognition and the feature learned with this loss function will be angularly distributed. In our work, a feed-forward DNN is trained to discriminate the speakers at the output layer with a modified softmax based cross-entropy function $L_{ms}$ as

$$L_{ms} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{\|z_i\| \cos\theta_{y_i}}}{\sum_{j=1}^{n} e^{\|z_i\| \cos\theta_{j,i}}} \tag{6}$$

where $\theta_{j,i}$ ($0 \le \theta_{j,i} \le \pi$) denotes the angle between the d-dimensional deep feature (or embedding) $z_i$ (of the i-th sample belonging to the $y_i$-th class) and the weight vector $W_j$ (the j-th column of the weight matrix $W \in \mathbb{R}^{d \times n}$). n denotes the number of classes. $\theta_{y_i}$ defines the angle between the learned feature $z_i$ and the weight vector $W_{y_i}$ (the $y_i$-th column of W). We consider the embedding feature dimension (i.e., the dimension of the last layer of the DNN) d = 128. For more details see [13].

D. ArcFace [14]
This loss function is introduced in [14] to improve the discrimination capability of a face recognition model by adding an angular penalty margin on the embedding features in the hyper-plane. The discrimination is obtained by increasing the inter-class dispersion and decreasing the intra-class dispersion. It is shown in [14] that ArcFace yields better accuracy in face recognition than the existing 10 benchmark methods such as triplet-loss, softmax-loss and center-loss. The ArcFace loss function is defined as

$$L_{arc} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{s \cos(\theta_{y_i} + m)}}{e^{s \cos(\theta_{y_i} + m)} + \sum_{j=1, j \neq y_i}^{n} e^{s \cos\theta_j}} \tag{7}$$

where $\theta_j$ defines the angle between the weight vector $W_j$ (the j-th column vector of the weight matrix $W \in \mathbb{R}^{d \times n}$) and the deep feature vector $z_i \in \mathbb{R}^d$ (of the i-th sample belonging to the $y_i$-th class). $\theta_{y_i}$ defines the angle between the feature $z_i$ (of class $y_i$) and the weight vector $W_{y_i}$. d denotes the dimension of the embedded deep feature of the i-th sample of class $y_i$. N and n denote the batch size and the number of classes, respectively. m adds the angular margin penalty between $z_i$ and $W_{y_i}$ to increase the intra-class compactness and the inter-class discrepancy. s is a scaling factor. The angle $\theta_j$, feature $z_i$ and weight vector $W_j$ are related as

$$\cos\theta_j = \frac{W_j' z_i}{\|W_j\| \, \|z_i\|} \tag{8}$$

In our experiments, the dimension of the DNN output layer, i.e., the value of d, is set to 128. For more details see [14].
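To make the role of the margin m and the scale s concrete, here is a minimal per-sample Python sketch of the ArcFace loss. The function name is hypothetical, the default values s = 64 and m = 0.5 follow the experimental setup of this paper, and averaging over the mini-batch is left to the caller.

```python
import math


def arcface_loss(z, y, W, s=64.0, m=0.5):
    """ArcFace loss for one embedding z with true class index y (0-based).

    W is a list of n class weight vectors. The angular margin m is added
    only to the angle between z and the weight vector of the true class.
    """
    def cos_angle(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v)

    logits = []
    for j, w in enumerate(W):
        c = max(-1.0, min(1.0, cos_angle(w, z)))  # guard acos against rounding
        if j == y:
            logits.append(s * math.cos(math.acos(c) + m))
        else:
            logits.append(s * c)
    mx = max(logits)  # log-sum-exp with max subtraction for stability
    return mx + math.log(sum(math.exp(l - mx) for l in logits)) - logits[y]
```

With m = 0 the expression reduces to a scaled cosine softmax; a positive m lowers the true-class logit, so the network has to produce a smaller angle to the true-class weight vector to reach the same loss.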

E. Focal [15]
This loss function is proposed in [15] specifically for object detection in class-imbalanced scenarios. It down-weights the easily classified examples so that model training is not overwhelmingly dominated by the easy negative examples. This system is analogous to Spkr-BN with the cross-entropy loss. The only difference is that it incorporates a modulating factor $(1 - p_t)^{\Gamma}$ into the cross-entropy based loss function. It can be expressed as

$$L_{focal} = -(1 - p_t)^{\Gamma} \log(p_t) \tag{9}$$

where $\Gamma \in [0, 5]$. For Γ = 0, Eq. (9) becomes equivalent to the cross-entropy based loss function, and a high value of Γ increases the effect of the modulating factor. For a well-classified sample of target class t, $p_t \rightarrow 1$ and the modulating factor approaches 0, and thus the loss is down-weighted for the well-classified examples. More details can be found in [15]. The value of Γ is set to 2 as in [15]. In our experiments, the number of speech samples and their duration vary across speakers, so it represents a class-imbalanced scenario.
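The down-weighting effect of the modulating factor can be checked numerically with a minimal sketch; the function name `focal_loss` is hypothetical and `p_t` stands for the posterior of the true class.

```python
import math


def focal_loss(p_t, gamma=2.0):
    """Focal loss for one sample whose true-class posterior is p_t.

    gamma = 0 gives the plain cross-entropy term -log(p_t).
    """
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


# A well-classified sample (p_t = 0.9) keeps only (1 - 0.9)^2 = 1% of its
# cross-entropy loss, whereas a hard sample (p_t = 0.1) keeps 81%.
easy_ratio = focal_loss(0.9) / focal_loss(0.9, gamma=0.0)
hard_ratio = focal_loss(0.1) / focal_loss(0.1, gamma=0.0)
```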

F. OSL [16]
To reduce the over-fitting problem of a DNN trained with a small training set, the inclusion of an orthogonal softmax layer in classification is proposed in [16] for scene classification. It maximizes the classification margin by increasing the angle among the weight vectors of different classes. In this method, an orthogonal softmax layer is defined at the output layer of the DNN as

$$r = (\Omega * W)' \psi \tag{10}$$

where $*$ represents the element-wise product and Ω indicates the predefined fixed block diagonal mask matrix. OSL keeps the weight vectors in the classification layer orthogonal during both the training and test processes, which leads to a tighter generalization error bound. ψ and r stand for the input and output vectors of the layer, respectively.
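A small Python sketch can show why a fixed block diagonal mask makes the class weight vectors orthogonal: each class only sees its own, disjoint slice of the input. The function names are hypothetical, and the equal-sized blocks of `d // n` rows per class are an assumption rather than a detail confirmed by [16].

```python
def osl_mask(d, n):
    """d x n block-diagonal 0/1 mask: class j owns rows [j*block, (j+1)*block).

    Assumption: equal block size block = d // n; leftover rows are unused.
    """
    block = d // n
    return [[1.0 if j * block <= r < (j + 1) * block else 0.0
             for j in range(n)] for r in range(d)]


def osl_forward(psi, W, mask):
    """Output of the masked layer: element-wise mask on W, then W' psi."""
    d, n = len(W), len(W[0])
    return [sum(mask[k][j] * W[k][j] * psi[k] for k in range(d))
            for j in range(n)]
```

Because the masked columns have disjoint supports, their inner product is zero for any values of W, so orthogonality holds throughout training without an explicit constraint.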

G. Triplet-loss [17]
This loss function is proposed in [17] for embedding a face image into a low dimensional space with the purpose of discriminating the positive examples from the negative ones based on a distance margin. This method achieves very high accuracy in face recognition. To use the loss function for BN feature extraction in TD-SV, a feed-forward DNN is trained to discriminate speakers with a loss function that minimizes the distance between the anchor and the positive class and maximizes the distance between the anchor and the negative class. It can be expressed as

$$L_{triplet} = \max\big(d(z_a, z_p) - d(z_a, z_n) + \text{margin},\ 0\big) \tag{11}$$

where $z_a$, $z_p$ and $z_n$ represent anchor, positive and negative embeddings, respectively. For the distance measure $d(\cdot, \cdot)$ in Eq. (11), input feature vectors of training speakers are embedded into a 128-dimensional vector space at the last layer of the DNN. The triplet score is calculated on the embedded space, i.e., at the last layer of the DNN. We consider online triplet loss, i.e., within the data samples of a particular mini-batch, an example from the same class as the anchor is considered as positive and an example from a class different from that of the anchor is considered as negative. Afterward, the frame-level output from a particular hidden layer of the DNN for a given utterance is projected onto a low dimensional space to get the BN feature.
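The online triplet selection described above can be sketched in a few lines of Python. The function names are hypothetical; the Euclidean distance, the margin value of 1.0 and the averaging over all valid triplets are assumptions, since the text does not specify them.

```python
import math


def triplet_loss(z_a, z_p, z_n, margin=1.0):
    """Hinge on d(anchor, positive) - d(anchor, negative) + margin.

    Assumption: d is the Euclidean distance and margin = 1.0.
    """
    return max(math.dist(z_a, z_p) - math.dist(z_a, z_n) + margin, 0.0)


def batch_triplet_loss(embeddings, labels, margin=1.0):
    """Average loss over all (anchor, positive, negative) triplets in a batch.

    Positives share the anchor's label; negatives have any other label.
    """
    losses = []
    for a, (z_a, y_a) in enumerate(zip(embeddings, labels)):
        for p, (z_p, y_p) in enumerate(zip(embeddings, labels)):
            if p == a or y_p != y_a:
                continue
            for z_n, y_n in zip(embeddings, labels):
                if y_n != y_a:
                    losses.append(triplet_loss(z_a, z_p, z_n, margin))
    return sum(losses) / len(losses) if losses else 0.0
```

Enumerating all triplets grows cubically with the mini-batch size, so practical systems usually restrict the selection, for example to the hardest negatives.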

H. SimCLR [18]
SimCLR is proposed in [18] for learning useful visual representations for image classification. It yields the best top-1 accuracy compared to other methods on the ImageNet dataset. The SimCLR loss function $L_{CLR}(i, j)$ for a positive (same class) pair of examples is defined as

$$L_{CLR}(i, j) = -\log \frac{\exp(\mathrm{sim}(z_i, z_j) / \tau)}{\sum_{k} 1_{[k \neq i]} \exp(\mathrm{sim}(z_i, z_k) / \tau)} \tag{12}$$

where $\mathrm{sim}(z_i, z_j) = \frac{z_i^{\top} z_j}{\|z_i\| \, \|z_j\|}$ and $1_{[k \neq i]}$ equals 1 iff $k \neq i$. τ is called the temperature parameter. $z_i$ denotes the d-dimensional embedded deep feature for input sample $x_i$. We consider d = 128, i.e., the dimension of the DNN output/embedding layer. The final loss is computed over all positive pairs available, i.e., both (i, j) and (j, i), in the particular mini-batch. For more details see [18].
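For a single positive pair, the contrastive loss above can be sketched as follows. The function names are hypothetical, and the sum in the denominator is taken over all other samples of the mini-batch, which is an assumption about how the batch is formed.

```python
import math


def cosine_sim(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))


def simclr_pair_loss(embeddings, i, j, tau=0.5):
    """Contrastive loss for the positive pair (i, j) within one mini-batch.

    The denominator sums over every sample k != i in the batch.
    """
    sims = [cosine_sim(embeddings[i], z) / tau for z in embeddings]
    denom = sum(math.exp(s) for k, s in enumerate(sims) if k != i)
    return -(sims[j] - math.log(denom))
```

A smaller temperature τ sharpens the distribution over the batch, so that negatives close to the anchor contribute more strongly to the loss.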

IV. ACTIVATION FUNCTIONS
In this section, we describe the different activation functions which are broadly used in many fields including speech processing.

A. Sigmoid [19]
This is a non-linear activation function defined as

$$\sigma(v) = \frac{1}{1 + e^{-v}} \tag{13}$$

with derivative

$$\frac{d\sigma(v)}{dv} = \sigma(v)\,(1 - \sigma(v)) \tag{15}$$

where v is the input to the activation function. As in Eq. (13), the sigmoid function squashes its input to a value between 0 and 1 and hence a large change in the input yields only a small change in the output (with the maximum value of 1) as shown in Fig. 2. So, the parameter optimization of a DNN through error back-propagation faces the known gradient vanishing problem. Specifically, multiplication of the gradient with a small value (as Eq. (15) shows) across different layers in deep networks during the back-propagation process yields an exponential decay of the gradient. As a result, the weights and biases of the initial layers will not be updated sufficiently during the training process. Nevertheless, this function is widely used in speaker and language recognition.
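The exponential decay of the gradient can be illustrated numerically with a minimal sketch. The function names are hypothetical, and the bound below considers only the activation derivatives and ignores the weight matrices, which is a simplifying assumption.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def sigmoid_grad(v):
    """Derivative of the sigmoid; its maximum is 0.25 at v = 0."""
    s = sigmoid(v)
    return s * (1.0 - s)


def max_activation_gradient(num_layers):
    """Upper bound on the product of sigmoid derivatives across layers.

    Assumption: weights are ignored; each layer contributes at most 0.25.
    """
    return 0.25 ** num_layers
```

For the six hidden layers used in this work, the bound is 0.25^6, i.e. roughly 2.4e-4, and it is smaller still whenever the units operate away from v = 0.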

B. ReLU [21]
ReLU is a piece-wise linear activation function defined as

$$\mathrm{ReLU}(v) = \max(0, v) \tag{16}$$

ReLU preserves the dynamic range of the input in the output when the input is greater than zero as shown in Eq. (16) and Fig. 2. Therefore, it does not suffer from the gradient vanishing problem as the sigmoid function does. Besides, it provides better and faster convergence [33] as compared with the sigmoid function, which makes it very popular in state-of-the-art DNN systems with a variety of applications [40]. However, it is not statistically motivated.

C. GELU [33]
As discussed above, the sigmoid function suffers from the gradient vanishing problem and the ReLU function is statistically less motivated. To tackle the lack of a probabilistic interpretation of ReLU, stochastic regularization, e.g., dropout, is often introduced to improve the training of DNNs. In an attempt to merge probabilistic regularization with an activation function, GELU is proposed. It uses the standard Gaussian cumulative distribution function to introduce the non-linearity, weighting the input of a DNN neuron by its value instead of gating it by its sign as in ReLU. GELU is defined as

$$\mathrm{GELU}(v) = v\,\phi(v) \tag{17}$$

where v and φ(v) are the input to the activation function and the cumulative distribution function of N(0, 1), respectively. Figure 2 illustrates the sigmoid, ReLU and GELU activation functions.
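The difference between gating by sign and weighting by value is easy to see in a minimal sketch. The function names are hypothetical; the Gaussian cumulative distribution function is computed exactly through the error function from the Python standard library.

```python
import math


def relu(v):
    """Gate by sign: negative inputs are set to zero."""
    return max(0.0, v)


def gelu(v):
    """Weight the input by the standard Gaussian CDF evaluated at v."""
    cdf = 0.5 * (1.0 + math.erf(v / math.sqrt(2.0)))
    return v * cdf
```

For large positive inputs both functions approach the identity, while for moderately negative inputs GELU returns a small negative value instead of exactly zero, so its gradient does not vanish there.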

V. CLASSIFIER
In this section, we describe the different modeling techniques which are commonly used in speaker verification.

A. GMM-UBM
In this method [34], a GMM-UBM is trained using data from many non-target speakers. Then target speaker models are obtained from the GMM-UBM, $\lambda_{ubm}$, with maximum a posteriori (MAP) adaptation in the enrollment phase. During test, the feature vectors of the test utterance $X = \{x_1, x_2, \ldots, x_N\}$ are scored against the claimant model $\lambda_{tar}$ and the GMM-UBM. Afterward, the log likelihood ratio (LLR) value is calculated for decision making:

$$\mathrm{LLR}(X) = \frac{1}{N} \sum_{i=1}^{N} \big[ \log p(x_i \mid \lambda_{tar}) - \log p(x_i \mid \lambda_{ubm}) \big] \tag{18}$$

Figure 3 illustrates a text-dependent speaker verification system using the GMM-UBM technique.
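The scoring step can be sketched for diagonal-covariance GMMs as follows. The function names and the representation of a model as a `(weights, means, variances)` tuple are hypothetical, and the frame-averaged form of the ratio is the usual convention, assumed here; MAP adaptation itself is not shown.

```python
import math


def gmm_loglik(x, weights, means, variances):
    """Log-likelihood of one frame x under a diagonal-covariance GMM."""
    comps = []
    for w, mu, var in zip(weights, means, variances):
        ll = math.log(w)
        for x_k, m_k, v_k in zip(x, mu, var):
            ll += -0.5 * (math.log(2.0 * math.pi * v_k) + (x_k - m_k) ** 2 / v_k)
        comps.append(ll)
    mx = max(comps)  # log-sum-exp over mixture components
    return mx + math.log(sum(math.exp(c - mx) for c in comps))


def llr_score(frames, target, ubm):
    """Average per-frame log-likelihood ratio of target model versus UBM.

    target and ubm are (weights, means, variances) tuples.
    """
    total = sum(gmm_loglik(x, *target) - gmm_loglik(x, *ubm) for x in frames)
    return total / len(frames)
```

A positive score means the frames are, on average, better explained by the claimed speaker model than by the background model; the accept/reject decision then compares the score with a threshold.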

B. i-vector
In this method [2], a speech signal is represented using a low-dimensional vector called i-vector, which is obtained by projecting the signal onto a low dimensional subspace (called total variability (T) space) of a speaker-independent GMM-UBM super-vector, where speaker and channel information is assumed to be dense. For a given speech signal of a speaker, the speaker and channel dependent GMM super-vector S can be expressed as

$$S = M + T\omega \tag{19}$$

where M denotes the speaker-independent GMM super-vector, T the total variability matrix, and ω is called an i-vector. During the enrollment phase, each speaker is represented by an average i-vector computed over the i-vectors of his/her individual training utterances (or speech sessions). In the test phase, the i-vector of a test utterance $\omega_t$ is scored against the claimant-specific i-vector $\omega_e$ (obtained during enrollment) with PLDA [4]. Figure 4 illustrates TD-SV using the i-vector technique.
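The linear model of the super-vector and the enrollment averaging can be illustrated with a toy Python sketch. The function names and the tiny dimensions are hypothetical; estimating the total variability matrix, extracting i-vectors from speech and PLDA scoring are all outside the scope of this sketch.

```python
def supervector(M, T, omega):
    """Compose S = M + T * omega for a list-based matrix T (rows of T)."""
    return [m + sum(t * w for t, w in zip(row, omega))
            for m, row in zip(M, T)]


def enrollment_ivector(ivectors):
    """Average the utterance-wise i-vectors of one enrolled speaker."""
    count = len(ivectors)
    return [sum(col) / count for col in zip(*ivectors)]
```

In the setup of this paper each target model is enrolled with three utterances, so `enrollment_ivector` would receive three i-vectors per speaker and pass-phrase.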

VI. EXPERIMENTAL SETUP
For evaluation, male speakers of the m-part-01 task in the RedDots challenge 2016 database are used as per protocol [41]. The task consists of 320 target models, each trained using recordings of three voice samples of a particular pass-phrase. Each utterance is very short, with an average duration of 2-3 s. Three types of non-target trials are available for evaluating the TD-SV system:
• Target-wrong (TW): When a genuine speaker speaks a wrong, i.e., a different pass-phrase/sentence in testing compared to their enrollment phrase.
• Imposter-correct (IC): When an imposter speaks a sentence/pass-phrase in testing, where the pass-phrase is the same as that of the target enrollment sessions.
• Imposter-wrong (IW): When an imposter speaks a sentence/pass-phrase to access the system where the pass-phrase is different from that of the target enrollment sessions.
The evaluation data set is further divided into a development set (devset) and a test set (called evaluation set interchangeably) as per [42]. The development set consists of a disjoint set of nine speakers (who are excluded from the system evaluation) and the remaining speakers are used for evaluation. This yields 72 and 248 target models for development and evaluation, respectively. It is important to note that the trials in the devset are derived by cross claiming of one speaker against the others (within the nine speakers). However, the evaluation set contains some imposter trials from speakers outside the enrollment speakers, i.e., unknown speakers. This makes the evaluation set more challenging than the devset and useful for real-world scenarios where the system can encounter unknown imposters. Table I shows the number of different trials available in the development and evaluation sets. For more details about the database see [41].
For cepstral features, 57-dimensional MFCC feature vectors (19 static and their ∆, ∆∆) are extracted from speech samples with RASTA filtering [43], using a 25 ms Hamming window and a 10 ms frame shift. After extracting features, rVAD [44], an open-source unsupervised voice activity detection (VAD) algorithm, is applied to discard the low-energy frames. Finally, the selected frames are normalized to zero mean and unit variance at the utterance level.
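The utterance-level normalization at the end of the feature pipeline can be sketched as follows. The function name is hypothetical, and the use of the population standard deviation and the guard for constant dimensions are assumptions; MFCC extraction, RASTA filtering and rVAD are not reproduced here.

```python
import math


def utterance_cmvn(frames):
    """Normalize each feature dimension to zero mean and unit variance.

    frames is a list of feature vectors (lists of floats) from one utterance,
    after voice activity detection.
    """
    count, dim = len(frames), len(frames[0])
    means = [sum(f[k] for f in frames) / count for k in range(dim)]
    stds = [math.sqrt(sum((f[k] - means[k]) ** 2 for f in frames) / count)
            for k in range(dim)]
    # Assumption: a constant dimension is left centred but unscaled.
    return [[(f[k] - means[k]) / (stds[k] if stds[k] > 0.0 else 1.0)
             for k in range(dim)] for f in frames]
```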
In the GMM-UBM system, the GMM-UBM of 512 mixtures (having diagonal covariance matrices) is trained using 6300 speech files from the TIMIT database [45], covering 438 male and 192 female speakers. Three iterations of MAP adaptation are considered during the training of the speaker-dependent models, with a relevance factor of 10. For training DNNs for BN feature extraction and training total-variability and PLDA for i-vector systems, 72764 utterances over 27 pass-phrases (of 157 male and 143 female speakers) from the RSR2015 database [46] are used.
For BN feature extraction, DNNs with six hidden layers are trained with the following configuration: batch size of 1024, learning rate of 0.001, 30 training epochs, 1024 neurons per hidden layer, and contextual input of 11 frames (i.e., 5 left frames, 1 current frame, and 5 right frames). The number of target speakers in Spkr-BN is 300. BN features are extracted by projecting the frame-level output of a particular hidden layer (before applying the activation function) of the DNNs onto a 57-dimensional space using PCA, to align with the dimension of the MFCC feature for a fair comparison.
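The contextual input of 11 frames can be formed by splicing each frame with its neighbours, as in the minimal sketch below. The function name is hypothetical, and padding at the utterance boundaries by repeating the first and last frame is an assumption, as the text does not state how edges are handled.

```python
def splice_context(frames, left=5, right=5):
    """Concatenate each frame with its left and right context frames.

    Assumption: boundary frames are padded by repeating the edge frame.
    """
    count = len(frames)
    spliced = []
    for i in range(count):
        context = []
        for offset in range(-left, right + 1):
            j = min(max(i + offset, 0), count - 1)
            context.extend(frames[j])
        spliced.append(context)
    return spliced
```

Assuming the 57-dimensional MFCC vectors are the DNN input, 11 spliced frames give an input layer of 11 × 57 = 627 values per frame.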
TensorFlow [47] is used for training the DNNs for all BN features, except for APC-BN. For the loss functions that require positive and negative examples (triplet-loss and SimCLR), examples from the same class within a mini-batch are considered as positive, and examples from classes other than a particular positive class are treated as negative for the similarity measures. The process is repeated for all samples within the mini-batch. The values of s and m in ArcFace and of τ in SimCLR are set to 64, 0.5 and 0.5, respectively. L2 regularization is applied during the training of the DNNs with a penalty value of 0.0001.
For extracting APC-BN features, the DNN encoder is trained as per [10]. It consists of 3 hidden layers of gated recurrent units (GRU) with the following configuration: batch size of 32, learning rate of 0.001, and $t_n = 5$ as in Eq. (2) (which gives the best performance in [10]).
In PLDA, speaker and channel factors are kept full and the same pass-phrase utterances from a particular speaker are considered as an individual speaker.It gives 8100 classes (4239 males and 3861 females).The i-vector system is implemented using the Kaldi toolkit [48].PCA is trained by the data set used for training the GMM-UBM.
System performance is measured in terms of equal error rate (EER) and minimum detection cost function (minDCF) as per the 2008 SRE [49]. Note that our discussions on experimental results will be primarily centered around EER to be concise, as EER and minDCF results mostly agree with each other.
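As an illustration of the primary metric, the following sketch estimates the EER from two score lists by sweeping the decision threshold over the observed scores. The function name is hypothetical, and taking the mean of the false acceptance and false rejection rates at the closest threshold is one common approximation, assumed here; evaluation toolkits typically interpolate the DET curve instead.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER: operating point where FAR and FRR are closest.

    A trial is accepted when its score is >= the threshold.
    """
    best_gap, best_eer = None, None
    for th in sorted(set(target_scores) | set(nontarget_scores)):
        frr = sum(s < th for s in target_scores) / len(target_scores)
        far = sum(s >= th for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2.0
    return best_eer
```

In this paper the EER is reported separately for the TW, IC and IW non-target trial types and then averaged, so a function like this would be called once per trial type.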

VII. RESULTS AND DISCUSSIONS
This section presents experimental results using the methods presented above and analyses the results.

A. Performance of Spkr-BN features
In Table II, we present the TD-SV performance of Spkr-BN features using different activation functions, different loss functions and different DNN hidden layers on the development and evaluation sets using the GMM-UBM technique for SV.For simplicity, the average EER and MinDCF values across TW, IC and IW non-target trials are included.The TD-SV performance of each BN feature is represented by its performance on the evaluation set, for which the particular hidden layer performing the best (giving the lowest average EER) on the development set is chosen.The same hidden layer (i.e., the best performing layer for GMM-UBM) is used for evaluating the i-vector technique.
First we compare the performance of different activation functions. From Table II, it is noticed that GELU based BN features give, in most cases, the lowest average EER values as compared with sigmoid and ReLU. More specifically, the widely used sigmoid function in general performs significantly worse, and the performance difference between GELU and ReLU is small. This demonstrates the superiority of GELU as the activation function for DNN based BN feature extraction in TD-SV.
Then we compare the different loss functions. It is seen that CE, joint-softmax-center and focal show the overall lowest average EER values and they are largely on par. They are followed by the ArcFace and OSL loss functions. Triplet and SimCLR loss functions perform the worst in these experiments, and when these two loss functions are applied, the impact of choosing different activation functions is negligible. This could be due to the fact that they require special care in selecting or even generating negative and positive examples (considering SimCLR is a self-supervised learning approach) [51], [18].
Now let us look at the TD-SV performance of BN features using different hidden layers on the devset as shown in Table II. We can see that for ReLU and GELU, early hidden layer based BN features in general perform better. Interestingly, when BN features are extracted from hidden layers close to the output of the DNN, sigmoid based features yield lower error rates than those using ReLU and GELU. This could be explained by the fact that the sigmoid function suffers from the vanishing gradient problem and thus the training focuses more on the later layers than on the initial layers.

B. Performance of TCL-BN features
In Table III, we compare the performance of TCL-BN features with the cross-entropy loss function but with different activation functions and different hidden layers using the GMM-UBM technique for SV. It can be seen that the uTCL-BN method outperforms sTCL-BN, which is in line with [1]. For uTCL-BN, the GELU activation function gives a significantly lower EER on the evaluation set than the sigmoid and ReLU functions. Furthermore, uTCL-BN with GELU (EER of 1.08%) also outperforms, by a large margin, the best performing Spkr-BN features, which are based on cross-entropy (EER of 1.26%) or joint-softmax-center (EER of 1.25%), also with GELU.
To further investigate why GELU-based BN features yield a much lower EER in TD-SV than sigmoid-based ones, we scatter-plot Spkr-BN and uTCL-BN features for different activation functions using t-SNE [50] with the same parameters, as shown in Fig. 5. The figure shows that GELU-based features exhibit more discriminative patterns than sigmoid-based ones and MFCCs. ReLU-based features show patterns similar to GELU-based ones, which is also reflected in the EER values of the corresponding features.
As SV is fundamentally a classification problem, a more discriminative feature is expected to yield better separability between classes in the score domain. Therefore, we plot in Fig. 6 the score distributions of target-true (genuine) and impostor-correct (impostor) trials of the Spkr-BN-based GMM-UBM systems on the evaluation set (see Table II) with sigmoid and GELU activation functions. The figure shows that the GELU-based system mostly yields higher scores for target-true trials and lower scores for impostor-correct trials than the sigmoid-based system. This further indicates that GELU is a better choice.
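The link between score separability and EER can be made concrete with a small sketch. The function below estimates the EER by a simple threshold sweep over the observed scores; this is an assumed, simplified procedure (scoring toolkits often interpolate the error curves instead), and the score values are hypothetical.

```python
def compute_eer(target_scores, nontarget_scores):
    """Estimate the equal error rate by sweeping a threshold over all observed scores.

    A trial is accepted when its score is >= threshold. The EER is taken at the
    threshold where the false acceptance and false rejection rates are closest.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection rate: target trials scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance rate: non-target trials scored at or above the threshold.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2.0
    return eer


# Hypothetical scores: the first system has overlapping distributions,
# the second separates target and non-target trials completely.
overlapping = compute_eer([0.9, 0.8, 0.3, 0.7], [0.1, 0.2, 0.6, 0.4])
separated = compute_eer([2.0, 3.0, 4.0], [-1.0, 0.0, 1.0])
print(overlapping, separated)
```

Shifting target-true scores upward and impostor-correct scores downward, as observed for GELU in Fig. 6, reduces the overlap between the two distributions and therefore the EER.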
TABLE II: TD-SV performance (average EER/MinDCF) of Spkr-BN features using different loss functions and hidden layers on the development and evaluation sets using the GMM-UBM technique. The performance on the evaluation set is based on the particular hidden layer that performs the best on the development set.

C. Performance of APC-BN features
In Table IV, we present the TD-SV performance of APC-BN features using different activation functions and different hidden layers on the development and evaluation sets using the GMM-UBM technique. From Table IV, it can be observed that GELU in general outperforms sigmoid and ReLU, and all three are significantly superior to MFCC. In addition, concatenating APC-BN features extracted from different hidden layers slightly reduces the average EER and MinDCF values further. This indicates that different layers of an APC network capture different speaker-related information, and hence it is beneficial to combine them. Note that we also performed experiments concatenating features extracted from different hidden layers for uTCL-BN or Spkr-BN, but none of the combinations yields any gain, and thus they are not shown in the paper.

Fig. 5: Scatter plots of MFCCs and BN features extracted for the target speakers whose utterances are available in the evaluation set, using t-SNE [50] with the same parameters. All features are extracted from the same utterances for a fair comparison.

D. Overall comparison and score fusion
In Table V, we first summarize and compare the results across three different types of BN features: Spkr-BN in Table II, uTCL-BN in Table III and APC-BN in Table IV, by picking the best performing configuration from each category. We can see that 1) all BN features outperform MFCCs significantly, 2) uTCL-BN performs the best, followed by APC-BN, both of which use self-supervised training targets, and 3) GELU is the best performing activation function across all three training targets. Table V further presents the detailed performance for each of the three types of non-target trials. An interesting observation from the table is that both APC-BN and uTCL-BN show a large reduction in EER for the target-wrong and impostor-wrong trials compared with Spkr-BN, while Spkr-BN performs better for impostor-correct trials. This indicates that APC-BN and uTCL-BN are better at modelling the temporal or phonetic information available in the speech signal in a self-supervised manner, which benefits TD-SV. It should be noted that a variety of supervised and self-supervised training targets are available in the literature; we select only a few typical examples in this work, with no intention of making an exhaustive comparison. Furthermore, simple score fusion (averaging scores with equal importance) of the three systems selected from each category brings further performance improvement over their standalone counterparts. This indicates that these features carry information complementary to each other.
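The equal-weight score fusion mentioned above amounts to a per-trial average of the scores from the individual systems. The following is a minimal sketch: the function name and score values are hypothetical, and no score normalization is applied before averaging because none is described here.

```python
def fuse_scores(system_scores):
    """Average the scores of several systems trial by trial, with equal weights.

    `system_scores` is a list with one score list per system; all systems are
    assumed to have scored the same trials in the same order.
    """
    num_systems = len(system_scores)
    num_trials = len(system_scores[0])
    if any(len(scores) != num_trials for scores in system_scores):
        raise ValueError("all systems must score the same set of trials")
    return [sum(trial) / num_systems for trial in zip(*system_scores)]


# Hypothetical scores for three trials from three systems
# (standing in for the Spkr-BN, uTCL-BN and APC-BN systems).
spkr_bn = [1.0, -0.5, 0.2]
utcl_bn = [3.0, -1.5, 0.4]
apc_bn = [2.0, -1.0, 0.6]
fused = fuse_scores([spkr_bn, utcl_bn, apc_bn])
print(fused)
```

Plain averaging is only meaningful when the systems produce scores on comparable scales; a weighted or calibrated fusion would be one possible form of the better fusion strategy left for future work.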

E. TD-SV performance of BN features with the i-vector technique
Table VI compares the performance of TD-SV on the evaluation set using the i-vector technique for the features shown in Table V. From Table VI, it is observed that the i-vector technique exhibits patterns in TD-SV performance similar to those of the GMM-UBM systems shown in Table V. Moreover, score fusion drastically reduces the EER/MinDCF values with respect to the standalone counterparts.

VIII. CONCLUSION
In this paper, we systematically studied a set of deep bottleneck (BN) feature extraction methods that are based on either supervised or self-supervised training targets for text-dependent speaker verification (TD-SV). We investigated their performance in combination with different activation functions and different loss functions in a joint framework. We further analysed the performance of using different hidden layers for deep feature extraction. We have obtained a set of interesting results. First, all BN features outperform cepstral features significantly. Second, the two self-supervised learning methods, utterance-wise time-contrastive learning (uTCL) and autoregressive prediction coding (APC), both demonstrate promising results that are better than those of the supervised learning approach that discriminates speaker identities. Among the three activation functions, Gaussian error linear unit (GELU) consistently and significantly outperforms sigmoid. Among a number of loss functions, cross-entropy, joint-softmax and focal outperform the others. Finally, we show that score-level fusion of different BN features gives further improvement. We also believe that a better fusion strategy can further improve the fused system. Fusion in the feature domain is of interest to investigate [52], and we leave it for future work.

Fig. 6: Distribution of the target-true and impostor-correct scores of the GMM-UBM TD-SV system on the evaluation set for Spkr-BN with sigmoid and GELU activation functions. All systems use the same trials for a fair comparison.

TABLE I: Number of trials available for the development and evaluation sets.

TABLE III: TD-SV performance (average EER/MinDCF) of uTCL-BN features using different activation functions and different hidden layers on the development and evaluation sets using the GMM-UBM technique. The loss function is cross entropy.

TABLE IV: TD-SV performance (average EER/MinDCF) of APC-BN features using different activation functions and different hidden layers on the development and evaluation sets using the GMM-UBM technique. The loss function is ℓ1. Ly{1,3} denotes the concatenation of outputs from hidden layers 1 and 3.

TABLE V: TD-SV performance for the different types of non-target trials for different combinations of activation functions and loss functions on the evaluation set using the GMM-UBM technique.

TABLE VI: TD-SV performance of using the i-vector technique for a number of features presented in Table V on the evaluation set.