Regularized Within-Class Precision Matrix Based PLDA in Text-Dependent Speaker Verification

Abstract: In the field of speaker verification, probabilistic linear discriminant analysis (PLDA) is the dominant method for back-end scoring. To estimate the PLDA model, the between-class covariance and within-class precision matrices must be estimated from samples. However, the empirical covariance and precision matrices contain estimation errors because the number of available samples is limited. In this paper, we propose a method to improve the conventional PLDA by estimating the PLDA model using a regularized within-class precision matrix. We use the graphical least absolute shrinkage and selection operator (GLASSO) for the regularization. The GLASSO regularization decreases the estimation errors in the empirical precision matrix by making the precision matrix sparse, which reflects the conditional independence structure. The experimental results on text-dependent speaker verification reveal that the proposed method reduces the equal error rate by up to 23% relative to the conventional PLDA.


Introduction
Automatic speaker verification (ASV) is a technique to verify a user's identity by comparing an utterance of a user (test utterance) with the reference utterance of a known target speaker (enrollment utterance). The procedure for ASV can be divided into two steps: front-end feature extraction and back-end scoring. In the front-end step, a fixed-size feature vector is extracted from a variable-length utterance, for both the enrollment and test utterance. The feature vector (called speaker embedding) should be extracted to represent the speaker information well. In the back-end step, a similarity score between the two speaker embeddings, one for the enrollment utterance and the other for the test utterance, is computed to accept or reject the identity claim [1].
Speaker verification can be divided into two categories: text-independent speaker verification (TI-SV) [2] and text-dependent speaker verification (TD-SV). In this paper, we focus on TD-SV only, and our explanations are based on TD-SV. The main difference between TI-SV and TD-SV is whether the phrase of the utterance is restricted. In TI-SV, the phrase is unrestricted, which enables users to speak any type of phrase. TI-SV systems must compensate for phrase variability to improve the verification performance. In this case, sufficiently long utterances, for example longer than about 10 s, are required to compensate for the phrase variability effectively; otherwise, the performance is significantly degraded. This means that users have to speak for a long time, which makes TI-SV systems inconvenient to use. Moreover, the longer the utterance, the larger the computational cost. These shortcomings can be addressed by TD-SV. In TD-SV, the available lexicon is limited to a few kinds of phrases, and the phrases of the enrollment and test utterances should be the same. Even though this restriction makes TD-SV less flexible than TI-SV, it enables TD-SV to show both the higher performance and lower
Owing to remarkable developments in deep learning, many studies have investigated the use of deep neural networks (DNNs) to extract more discriminative speaker embeddings. DNNs have the advantages that they can represent complex nonlinear models and can be directly optimized to discriminate between classes. A deep speaker embedding is extracted from a hidden layer of a speaker-and-phrase discriminative DNN and is expected to have more discriminative power. Deep speaker embeddings have become the state of the art in the field of ASV [25].
Early deep speaker embeddings (called d-vectors) were based on a fully connected neural network (FCNN) [26,27]. The FCNNs were trained to classify frame-level acoustic features in a temporal context into the corresponding speaker and phrase. However, the FCNN cannot properly model the time dependency of the frame-level features in a context. To overcome this drawback, methods using other kinds of DNNs, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs), were proposed [28][29][30]. An RNN is a neural network designed to model the time dependency of time-series data using additional recurrent connections. In this paper, we use the d-vector based on long short-term memory (LSTM) [31], which is an extension of the RNN that models long-term time dependencies efficiently by using gating mechanisms [32].
A residual network (ResNet) [33] is the network that has constant bypass weight connections between layers to optimize very deep networks efficiently. Motivated by the success of ResNet in the field of image recognition, many studies have employed ResNet to model deep speaker embedding (called the r-vector to distinguish from the original d-vector) and have shown remarkable performance [34][35][36][37]. In [38], the squeeze-and-excitation (SE) block was proposed, which is the building block of CNNs to recalibrate channel-wise responses adaptively by explicitly modeling the relationship between channels. The SE block has been successfully adopted in ResNet, which is called a squeeze-and-excitation residual network (SE-ResNet). In this study, we extract the r-vectors from SE-ResNet34 [6,11].
The x-vector [39,40] is the deep speaker embedding extracted from a model based on a time-delay neural network (TDNN) [41]. The TDNN is the network that computes an output at each time step using the inputs from a small temporal context window similarly to CNNs. We can model long-term context information of input by stacking multiple TDNN layers, even if each TDNN layer has a small temporal context. Therefore, the TDNN-based model can efficiently capture long-term speaker information. Each layer of the TDNN models a small temporal context of the output from the previous layer. The output frames of the TDNN are aggregated over temporal pooling to capture long-term information. The x-vector/PLDA has shown state-of-the-art performance in TI-SV [25].
Like the i-vector, deep speaker embeddings have also been used with the PLDA and exhibit performance improvements. One of the main differences between the i-vector and deep speaker embeddings is that no assumption exists regarding the shape, such as the probability distribution and covariance structure, of deep speaker embeddings. This difference also affects our proposed method, which is described later.

Probabilistic Linear Discriminant Analysis (PLDA)
The PLDA is a generative probabilistic method that models between-class and within-class variabilities using latent variables. The goal of PLDA is to determine a more discriminative subspace that maximizes between-class variability and minimizes within-class variability. Several variants of PLDA exist [15,21,23,42,43]. In this paper, we use the PLDA implemented in the Kaldi toolkit [44], which is the two-covariance PLDA [43] based on [15]. In Kaldi's PLDA, a speaker embedding $x$ is modeled as follows:

$$x = \mu + A u \tag{2}$$

$$u \mid v \sim \mathcal{N}(v, I) \tag{3}$$

$$v \sim \mathcal{N}(0, \Psi) \tag{4}$$

where $\mu$ is the global mean of the embeddings in the original space, $A$ is the PLDA projection matrix, $v$ represents the class in the projected space, $u$ represents an example of that class in the projected space, and $\Psi$ is the between-class diagonal covariance in the projected space. The PLDA model parameters $\{\mu, A, \Psi\}$ are estimated by solving the eigenvalue equation $\Phi_w^{-1} \Phi_b A = A \Psi$, where $\Phi_b$ and $\Phi_w$ are the between-class covariance matrix and within-class covariance matrix, respectively.
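The eigenvalue equation above can be illustrated with a minimal NumPy sketch. It is not Kaldi's implementation: the function name `solve_plda_eigen` and the normalization of the eigenvectors (chosen so that the within-class covariance becomes the identity after projection) are assumptions made for illustration.

```python
import numpy as np

def solve_plda_eigen(phi_b, phi_w):
    """Solve phi_w^{-1} phi_b A = A diag(psi).

    Hypothetical helper (not Kaldi code). phi_b must be symmetric and
    phi_w symmetric positive definite.
    """
    # Cholesky factor of the within-class covariance: phi_w = L L^T
    L = np.linalg.cholesky(phi_w)
    L_inv = np.linalg.inv(L)
    # Between-class covariance in the whitened space (symmetric)
    M = L_inv @ phi_b @ L_inv.T
    psi, Q = np.linalg.eigh(M)
    # Sort the eigenpairs from the largest to the smallest eigenvalue
    order = np.argsort(psi)[::-1]
    psi, Q = psi[order], Q[:, order]
    # Map the eigenvectors back to the original space
    A = L_inv.T @ Q
    return A, psi
```

With this normalization, `A.T @ phi_w @ A` is the identity and `A.T @ phi_b @ A` is diagonal with entries `psi`, which matches the roles of $I$ and $\Psi$ in the model.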
The expectation-maximization (EM) [45] algorithm is used to estimate $\Phi_b$ and $\Phi_w$. Note that the EM algorithm is an iterative algorithm whose updates never decrease the likelihood, so it converges; however, convergence to the global optimum is not guaranteed. Therefore, the estimated PLDA model is not guaranteed to be globally optimal. The algorithm starts with the initial values of the between-class covariance matrix $\Phi_b^{(0)}$ and the within-class covariance matrix $\Phi_w^{(0)}$, which are computed directly from the training dataset. In these initial estimates, $C$ is the number of speakers in the training dataset, $\mu_c$ is the mean of the embeddings for the $c$-th speaker, $N_c$ is the number of utterances for the $c$-th speaker, and $x_c^n$ is the $n$-th embedding for the $c$-th speaker. A detailed explanation can be found in [15,46].
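The initial estimates described above can be sketched as follows. The normalization is an assumption: here the between-class covariance is averaged over the speakers and the within-class covariance over all utterances, which may differ from the exact form used in the original equations. The function name `initial_covariances` is hypothetical.

```python
import numpy as np

def initial_covariances(embeddings, speaker_ids):
    """Empirical between-class and within-class covariances.

    Assumed normalization (not confirmed by the source): phi_b is divided
    by the number of speakers, phi_w by the total number of utterances.
    """
    X = np.asarray(embeddings, dtype=float)
    ids = np.asarray(speaker_ids)
    mu = X.mean(axis=0)                      # global mean
    dim = X.shape[1]
    speakers = np.unique(ids)
    phi_b = np.zeros((dim, dim))
    phi_w = np.zeros((dim, dim))
    for c in speakers:
        X_c = X[ids == c]                    # embeddings of the c-th speaker
        mu_c = X_c.mean(axis=0)              # speaker mean
        d = (mu_c - mu)[:, None]
        phi_b += d @ d.T                     # scatter of the speaker means
        R = X_c - mu_c
        phi_w += R.T @ R                     # scatter around the speaker mean
    phi_b /= len(speakers)
    phi_w /= X.shape[0]
    return phi_b, phi_w
```

For a balanced dataset, the two matrices returned by this sketch sum to the total (biased) covariance of the embeddings, which is a convenient sanity check.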
Appl. Sci. 2020, 10, 6571 5 of 21
The log-likelihood ratio, which is the log ratio between two likelihoods and is used as the similarity score in our experiments, between two embeddings $x_1$ and $x_2$ is computed by Equation (7) from the projected embeddings $u_1$ and $u_2$. Here, $u_i = A^T (x_i - \mu)$ is the projected embedding, the subscript $i$ is an arbitrary index that distinguishes the embeddings from each other, and $\mathcal{N}(\mu, \Sigma)$ denotes the Gaussian probability density function with mean $\mu$ and covariance $\Sigma$. The same transformation $u_i = A^T (x_i - \mu)$ is applied to both $x_1$ and $x_2$.
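As a hedged illustration of the scoring step, the sketch below computes a log-likelihood ratio for one enrollment and one test embedding under the model $u \mid v \sim \mathcal{N}(v, I)$, $v \sim \mathcal{N}(0, \Psi)$ with diagonal $\Psi$. The closed form used here follows from that model and is assumed to correspond to the paper's scoring equation; the function name `plda_llr` is hypothetical, and Kaldi's additional normalization steps are left out.

```python
import numpy as np

def plda_llr(x1, x2, mu, A, psi):
    """Log-likelihood ratio between two embeddings (illustrative sketch).

    psi holds the diagonal of the between-class covariance in the
    projected space.
    """
    psi = np.asarray(psi, dtype=float)
    # Project both embeddings with the same transformation
    u1 = A.T @ (np.asarray(x1, dtype=float) - mu)
    u2 = A.T @ (np.asarray(x2, dtype=float) - mu)
    # Same-speaker hypothesis: u2 | u1 is Gaussian with this mean/variance
    mean_same = psi / (psi + 1.0) * u1
    var_same = 1.0 + psi / (psi + 1.0)
    # Different-speaker hypothesis: u2 ~ N(0, I + Psi)
    var_diff = 1.0 + psi
    log_same = -0.5 * np.sum(np.log(2.0 * np.pi * var_same)
                             + (u2 - mean_same) ** 2 / var_same)
    log_diff = -0.5 * np.sum(np.log(2.0 * np.pi * var_diff)
                             + u2 ** 2 / var_diff)
    return float(log_same - log_diff)
```

A higher score supports the hypothesis that both embeddings come from the same speaker; the score is symmetric in the two embeddings for this single-enrollment case.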

Gaussian Markov Random Field
Consider a $D$-dimensional random vector $x = [x_1, \ldots, x_D]^T$ (corresponding to the embedding in our case), which is a set of $D$ random variables $x_i$. The random vector $x$ is a Markov random field (MRF; an undirected graphical model) if it satisfies Markov properties [47]. An undirected graph $G = (V, E)$, where $V$ is a set of vertices and $E$ is a set of edges, can describe the random vector. Each vertex in $V$ corresponds to one of the random variables in $x$, that is, $V = \{x_1, \ldots, x_D\}$. Each edge $e_{ij}$ in $E$ represents the dependency between $x_i$ and $x_j$ such that $i \neq j$. In undirected graphical models, all edges $e_{ij} \in E$ are unordered pairs, that is, $e_{ij} = e_{ji}$. The Markov property relates to conditional independence, and three kinds of Markov properties exist: the pairwise, local, and global Markov properties. In this study, we focus on the pairwise Markov property. The pairwise Markov property states that the variables $x_i$ and $x_j$ are conditionally independent given all the other variables $x_{-ij}$, written $x_i \perp x_j \mid x_{-ij}$, which is equivalent to stating that no edge $e_{ij}$ exists in $E$ [48].
The random vector $x$ is a Gaussian Markov random field (GMRF; a Gaussian undirected graphical model) if it satisfies the Markov property and follows a multivariate normal distribution $\mathcal{N}(\mu, \Sigma)$ with mean $\mu \in \mathbb{R}^D$ and covariance matrix $\Sigma \in \mathbb{R}^{D \times D}$. In other words, the GMRF is an MRF that follows a multivariate normal distribution. In the GMRF, the set of edges $E$ is represented by the precision matrix $\Sigma^{-1}$. The edge $e_{ij}$ corresponds to the element in the $i$-th row and the $j$-th column of the precision matrix, $(\Sigma^{-1})_{ij}$, and no edge $e_{ij}$ exists if and only if $(\Sigma^{-1})_{ij} = 0$. Therefore, the following three statements are equivalent: (i) there is no edge $e_{ij}$, (ii) $(\Sigma^{-1})_{ij} = 0$, and (iii) $x_i \perp x_j \mid x_{-ij}$. To summarize, we can impose a conditional independence structure on the variables by making $\Sigma^{-1}$ sparse [47].
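A small numerical example may make the equivalence concrete. In the sketch below, a three-variable chain has a tridiagonal precision matrix, so the first and third variables are conditionally independent given the second, even though their covariance is non-zero. The helper `graph_edges` and the example matrix are illustrative assumptions.

```python
import numpy as np

def graph_edges(precision, tol=1e-10):
    """Return the edges (i, j), i < j, implied by non-zero precision entries."""
    D = precision.shape[0]
    return {(i, j)
            for i in range(D)
            for j in range(i + 1, D)
            if abs(precision[i, j]) > tol}

# Chain x0 - x1 - x2: no direct edge between x0 and x2
precision = np.array([[2.0, -1.0, 0.0],
                      [-1.0, 2.0, -1.0],
                      [0.0, -1.0, 2.0]])
covariance = np.linalg.inv(precision)   # dense: x0 and x2 are still correlated
edges = graph_edges(precision)
```

The covariance matrix alone does not reveal the missing edge; only the zero in the precision matrix does, which is why the regularization targets the precision matrix.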

GLASSO
The GLASSO is a variable selection method to estimate a sparse precision matrix using the $L_1$ (lasso) penalty. In other words, it estimates a sparse undirected graphical model. Consider that we have samples (corresponding to speaker embeddings in our case) that follow a multivariate normal distribution $\mathcal{N}(\mu, \Sigma)$. Let $\Theta = \Sigma^{-1}$ be the true precision matrix and $S$ be the empirical covariance matrix. The GLASSO-regularized precision matrix $\hat{\Theta}$ is defined as the maximizer of the $L_1$-penalized Gaussian log-likelihood:

$$\hat{\Theta} = \arg\max_{\Theta \succ 0} \; \log\det(\Theta) - \operatorname{tr}(S\Theta) - \rho \lVert \Theta \rVert_1 \tag{8}$$

where $\det(\cdot)$ and $\operatorname{tr}(\cdot)$ are the determinant and trace of a matrix, respectively, $\rho$ ($> 0$) is a regularization parameter, and $\lVert \cdot \rVert_1$ is the $L_1$ norm operator (sum of the absolute values of the elements). We omit diagonal elements from the penalty. Therefore, only off-diagonal elements are penalized. The GLASSO is a biased estimator that shrinks all non-zero elements in the estimated precision matrix toward zero. A higher $\rho$ results in (i) more regularization, (ii) lower estimation error with an accompanying higher bias in the estimated precision matrix, and (iii) a sparser estimated precision matrix, and vice versa. Thus, the GLASSO requires the assumption that the true precision matrix is sparse to obtain good asymptotic properties, such as selection consistency [49,50]. In addition, selection consistency requires a strict condition on the covariance matrix $S$, the irrepresentable condition [49,50]. Typically, the GLASSO can detect the conditional independence structure of the considered covariates only when the covariates are not severely dependent. Therefore, a proper value of $\rho$ must be selected for good performance, and the underlying covariance structure of the considered model should be taken into account at the same time. To summarize, the GLASSO regularization can reduce the estimation error in the precision matrix by pursuing a sparse structure, but the associated improvement depends on the underlying model.
The authors of [51] solve (8) using convex duality, taking advantage of the fact that (8) is a convex optimization problem. In other words, [51] solves (8) by estimating $\Sigma$, rather than $\Sigma^{-1}$, using a block coordinate descent algorithm, as follows. Let $W$ be the regularized version of $S$. In general, $W$ is initialized to $S + \rho I$. The algorithm optimizes each column (and the corresponding row) of $W$ iteratively until convergence. In each iteration, for the $i$-th column, $W$ and $S$ are partitioned (after permuting the $i$-th row and column to the last position) as follows:

$$W = \begin{pmatrix} W_{11} & w_{12} \\ w_{12}^T & w_{22} \end{pmatrix}, \qquad S = \begin{pmatrix} S_{11} & s_{12} \\ s_{12}^T & s_{22} \end{pmatrix}$$

where $W_{11}$ and $S_{11}$ are the submatrices of $W$ and $S$ obtained by excluding the $i$-th row and column, respectively, and $w_{12}$ and $s_{12}$ are the $i$-th columns of $W$ and $S$ without the $i$-th diagonal elements $w_{22}$ and $s_{22}$, respectively. To optimize $w_{12}$, $\hat{\beta}$ is obtained using the following equation:

$$\hat{\beta} = \arg\min_{\beta} \left\{ \frac{1}{2} \left\lVert W_{11}^{1/2} \beta - W_{11}^{-1/2} s_{12} \right\rVert_2^2 + \rho \lVert \beta \rVert_1 \right\}$$

and $w_{12}$ is replaced by $W_{11} \hat{\beta}$ in each iteration. A more detailed explanation of the computation can be found in [17,51]. The complexity of the GLASSO algorithm is roughly $O(n^3)$ for reasonably sparse problems with $n$ vertices [52]. Notice that the block coordinate descent algorithm does not guarantee convergence. Therefore, reasonable conditions on $S$ are required for convergence [53], such as that the covariates in $S$ are not severely dependent, as mentioned above.
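The block coordinate descent procedure described above can be sketched in NumPy as follows. This is a minimal, unoptimized illustration, not the reference implementation of [51]: the function names, the inner coordinate-descent lasso solver, the stopping rules, and the recovery of the precision matrix from the stored lasso coefficients are choices made for this sketch. It follows the initialization $W = S + \rho I$ stated above, which shifts the diagonal of $W$ by $\rho$; the variant with an unpenalized diagonal is not covered.

```python
import numpy as np

def soft_threshold(x, t):
    """Soft-thresholding operator used by the lasso update."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def glasso(S, rho, max_iter=100, tol=1e-6, inner_iter=200):
    """Block coordinate descent for the graphical lasso (sketch).

    Returns the estimated precision matrix and the regularized covariance W.
    """
    S = np.asarray(S, dtype=float)
    D = S.shape[0]
    W = S + rho * np.eye(D)
    B = np.zeros((D, D))          # B[rest, j]: lasso coefficients for column j
    idx = np.arange(D)
    for _ in range(max_iter):
        W_old = W.copy()
        for j in range(D):
            rest = idx != j
            W11 = W[np.ix_(rest, rest)]
            s12 = S[rest, j]
            beta = B[rest, j]     # boolean indexing returns a copy
            # Coordinate descent on the lasso subproblem for this column
            for _ in range(inner_iter):
                beta_old = beta.copy()
                for k in range(D - 1):
                    r = s12[k] - W11[k] @ beta + W11[k, k] * beta[k]
                    beta[k] = soft_threshold(r, rho) / W11[k, k]
                if np.max(np.abs(beta - beta_old)) < tol:
                    break
            B[rest, j] = beta
            w12 = W11 @ beta      # updated off-diagonal column of W
            W[rest, j] = w12
            W[j, rest] = w12
        if np.mean(np.abs(W - W_old)) < tol:
            break
    # Recover the precision matrix column by column from W and beta
    theta = np.zeros((D, D))
    for j in range(D):
        rest = idx != j
        beta = B[rest, j]
        theta_jj = 1.0 / (W[j, j] - W[rest, j] @ beta)
        theta[j, j] = theta_jj
        theta[rest, j] = -beta * theta_jj
    return 0.5 * (theta + theta.T), W
```

With a very small `rho`, the result approaches the plain inverse of `S`; with a `rho` larger than every off-diagonal magnitude of `S`, all off-diagonal elements are set to zero.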

GLASSO Applied PLDA
In this paper, we propose a method of applying the GLASSO to the PLDA (denoted as GLASSO-PLDA), where the GLASSO-regularized within-class precision matrix is used to estimate the PLDA model, instead of the original within-class precision matrix. Once the empirical between-class covariance matrix $\Phi_b$ and the empirical within-class covariance matrix $\Phi_w$ are estimated, we regularize the empirical within-class precision matrix $\Phi_w^{-1}$ using the GLASSO (using Equation (8)):

$$\hat{\Phi}_w^{-1} = \arg\max_{\Theta \succ 0} \; \log\det(\Theta) - \operatorname{tr}(\Phi_w \Theta) - \rho \lVert \Theta \rVert_1$$

where $\hat{\Phi}_w^{-1}$ is the regularized within-class precision matrix. The PLDA parameters are then estimated using $\hat{\Phi}_w^{-1}$ instead of $\Phi_w^{-1}$. That is, we solve the following eigenvalue equation:

$$\hat{\Phi}_w^{-1} \Phi_b A = A \Psi$$

The usage of the GLASSO-PLDA for computing the log-likelihood ratio (7) is the same as that of the conventional PLDA.
The GLASSO-PLDA is based on our assumption that an optimal solution for the within-class precision matrix (as mentioned in Section 1, we focus on the within-class precision matrix only) exists at some point between a high variance (corresponding to low ρ) and high bias (corresponding to high ρ). Notice that there is a trade-off between estimation error and bias, as mentioned in Section 3.2. Naturally, the performance of the likelihood ratio test depends on a good estimation of the PLDA model parameters, which depends on a good estimation of the within-class precision matrix. Therefore, it is important to reduce the estimation error in the empirical within-class precision matrix by finding the optimal value of ρ. The optimal ρ can be considered as what can achieve the best performance on evaluation trials. In practice, it should be found with a validation dataset the domain of which is close to the target domain, because we do not know the target domain in advance.
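The selection of $\rho$ on a validation set can be sketched as a simple grid search that keeps the value with the lowest equal error rate (EER). The helpers `equal_error_rate` and `select_rho` are hypothetical, and `score_fn` is a placeholder for training a GLASSO-PLDA with a given $\rho$ and scoring the validation trials; none of these names comes from the original work.

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER from target and non-target trial scores."""
    t = np.asarray(target_scores, dtype=float)
    n = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.sort(np.concatenate([t, n]))
    # False acceptance: non-target trials scoring at or above the threshold
    far = np.array([(n >= th).mean() for th in thresholds])
    # False rejection: target trials scoring below the threshold
    frr = np.array([(t < th).mean() for th in thresholds])
    i = int(np.argmin(np.abs(far - frr)))
    return 0.5 * (far[i] + frr[i])

def select_rho(rhos, score_fn):
    """Pick the rho with the lowest validation EER.

    score_fn(rho) is assumed to return (target_scores, nontarget_scores)
    obtained on validation trials with a model regularized by rho.
    """
    eers = [equal_error_rate(*score_fn(rho)) for rho in rhos]
    best = int(np.argmin(eers))
    return rhos[best], eers[best]
```

In practice, `score_fn` would wrap the whole pipeline (regularization, PLDA estimation, and trial scoring), so each grid point costs one full model estimation.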
According to our experiments described in Section 5.3, the GLASSO-PLDA exhibits performance improvement in TD-SV only if a prerequisite is satisfied. The prerequisite is that the within-class covariance and the accompanying precision matrix of the embeddings should be close to diagonal. Unless the prerequisite is satisfied, the GLASSO converges at a point far from the optimal solution, does not converge, or even fails to produce an estimate. The estimation fails because $\hat{\Phi}_w^{-1}$ is ill-conditioned, which means that $\hat{\Phi}_w^{-1}$ contains elements that are infinite or not a number. We describe the effect of the prerequisite on the performance in Section 5.3.

Prerequisite: Close-to-Diagonal Within-Class Covariance/Precision Matrix
Even though both the PLDA and the GLASSO commonly assume that the embeddings are normally distributed, we do not consider the normality assumption in our research. As mentioned in Section 2.1 and , we used four kinds of speaker embeddings, that is, the i-vector, d-vector, r-vector, and x-vector. Among these embeddings, only the i-vector satisfies the normality assumption. The i-vector has demonstrated performance improvement with both the PLDA and the GLASSO-PLDA. However, all deep speaker embeddings used in our research, which have no assumption of normality, also exhibited performance improvement with the PLDA. In contrast, the deep speaker embeddings show performance degradation or failure of the estimation with the GLASSO-PLDA. The experimental results are described in Section 5.3. Some research has investigated making the deep speaker embeddings follow a normal distribution [54][55][56]. However, we found that deep speaker embeddings obtained using these methods are still not suitable for the GLASSO-PLDA (see Appendix A), and the embeddings obtained by [55] were corrupted and lost discriminative power. From this result, we conclude that the absence of the normality assumption on the embeddings does not matter, and we regard all embeddings used in our experiments as following a multivariate normal distribution.
We focus on the diagonality of the within-class covariance/precision matrix. A close-to-diagonal covariance matrix of normal variables means that the covariates are not severely dependent, which is the condition for the GLASSO, as mentioned in Section 3.2. In addition, the closer the covariance matrix is to diagonal, the closer the accompanying precision matrix tends to be to diagonal. For a quantitative comparison, we define the degree of diagonality (denoted as $\delta$) of a matrix $\Theta$ as the ratio of the $L_1$ norm of the diagonal elements $\lVert \operatorname{diag}(\Theta) \rVert_1$ to the $L_1$ norm of the whole matrix $\lVert \Theta \rVert_1$:

$$\delta = \frac{\lVert \operatorname{diag}(\Theta) \rVert_1}{\lVert \Theta \rVert_1}$$

A higher $\delta$ means that $\Theta$ is closer to a diagonal matrix, and $\delta$ satisfies $0 \leq \delta \leq 1$. As mentioned above, the prerequisite of the GLASSO-PLDA is that the empirical within-class covariance matrix $\Phi_w$ and the accompanying precision matrix $\Phi_w^{-1}$ should be close to diagonal. Therefore, it is important to check whether $\Phi_w$ and $\Phi_w^{-1}$ of each kind of embedding are close to diagonal. In most cases, the empirical covariance matrix is close to the diagonal matrix due to the estimation error. Therefore, both $\Phi_w$ and $\Phi_w^{-1}$ of the i-vectors are close to diagonal. In contrast, both $\Phi_w$ and $\Phi_w^{-1}$ of all deep speaker embeddings are far from diagonal, even considering the estimation error. In practice, no assumption of diagonality exists for the covariance of the deep speaker embeddings, and all kinds of covariances of all deep speaker embeddings are far from diagonal. This means that only the i-vector satisfies the prerequisite of the GLASSO-PLDA. The different covariance structures of the i-vector and the deep speaker embeddings lead to different results when the GLASSO regularization is applied to the PLDA. In practice, as described in Section 5.3, only the i-vector, which satisfies the prerequisite, showed the performance improvements when using the GLASSO-PLDA.
All the deep speaker embeddings, which have Φ w and Φ −1 w that are far from diagonal, showed the performance degradations when using the GLASSO-PLDA.
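The degree of diagonality can be computed with a few lines of NumPy. The sketch below takes $\delta$ as the sum of the absolute diagonal elements divided by the sum of the absolute values of all elements, which is the form consistent with $0 \leq \delta \leq 1$ and with a higher $\delta$ meaning a more diagonal matrix; the function name `diagonality` is an assumption.

```python
import numpy as np

def diagonality(theta):
    """Degree of diagonality: L1 norm of the diagonal over L1 norm of the matrix."""
    theta = np.asarray(theta, dtype=float)
    return np.abs(np.diag(theta)).sum() / np.abs(theta).sum()
```

For example, an identity matrix gives 1, and a matrix whose elements all have the same magnitude gives one over its dimension.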
Our goal in using the GLASSO is to reduce the estimation errors in the precision matrix, that is, to remove the small noisy elements that estimation errors introduce into the precision matrix. In the case of a close-to-diagonal precision matrix, as with the i-vector, the GLASSO-PLDA exhibited performance improvement. Therefore, the GLASSO can effectively remove many of these small noisy elements when the precision matrix is close to diagonal. However, in the case of a far-from-diagonal precision matrix, as with the deep speaker embeddings, the GLASSO-PLDA demonstrated performance degradation or failure of the estimation of the PLDA parameters. When the precision matrix is far from diagonal, the GLASSO also shrinks the principal elements, which have relatively large values, toward zero, resulting in too large a bias. Thus, the loss from the high bias is greater than the gain from the low estimation error in that case. This result may imply that the GLASSO can improve the performance only if the covariance/precision matrix is close to diagonal.
We assume that deep speaker embeddings also show performance improvements with the GLASSO-PLDA, like the i-vector, if we make the within-class covariance matrix $\Phi_w$ and the accompanying precision matrix $\Phi_w^{-1}$ close to diagonal. To make $\Phi_w$ and $\Phi_w^{-1}$ of the deep speaker embeddings close to diagonal, we orthogonalize the deep speaker embeddings using principal component analysis (PCA). The PCA transform makes the total covariance diagonal through orthogonalization. The diagonalization of the total covariance has the effect of making both the within-class covariance and the accompanying precision matrix close to diagonal. We use the orthogonalized deep speaker embeddings for the GLASSO-PLDA instead of the original embeddings (denoted as PCA-GLASSO-PLDA). We do not reduce the dimensionality in the PCA transform, to avoid information loss. It is important to check whether $\Phi_w$ and $\Phi_w^{-1}$ actually become closer to diagonal after the PCA transform. Figures 3 and 4 show the within-class covariance and within-class precision matrices estimated from the transformed embeddings (denoted as $\Phi_{w\_pca}$ and $\Phi_{w\_pca}^{-1}$), respectively. After applying the PCA transform, all the within-class covariance and precision matrices (corresponding to $\Phi_{w\_pca}$ and $\Phi_{w\_pca}^{-1}$, respectively) become closer to diagonal. The diagonality $\delta$ of $\Phi_{w\_pca}$ and Φ  Figure 4c) for the x-vectors, respectively. As described in Section 5.3, the orthogonalized deep speaker embeddings exhibited performance improvements with the GLASSO-PLDA (corresponding to the PCA-GLASSO-PLDA), as with the i-vector. Therefore, the empirical within-class covariance/precision matrix should be close to diagonal for the GLASSO regularization. Thus, if we find a transformation that makes the within-class covariance/precision matrix close to diagonal, the performance of the PLDA can be improved by soft-thresholding the noise in the accompanying empirical within-class precision matrix using the GLASSO regularization.
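The orthogonalization step can be sketched as a full-rank PCA rotation, estimated on training embeddings and then applied to all embeddings. The function names are hypothetical. The sketch only guarantees that the total covariance of the data used to fit the rotation becomes diagonal; how close the within-class covariance gets to diagonal depends on the data and should be checked, for example with the degree of diagonality.

```python
import numpy as np

def fit_pca_rotation(X):
    """Estimate a full-rank PCA rotation (no dimensionality reduction)."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    total_cov = np.cov(X - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(total_cov)
    # Order the principal axes by decreasing variance
    order = np.argsort(eigvals)[::-1]
    return mean, eigvecs[:, order]

def apply_pca_rotation(X, mean, rotation):
    """Center the embeddings and rotate them onto the principal axes."""
    return (np.asarray(X, dtype=float) - mean) @ rotation
```

Because the rotation is orthogonal and keeps all dimensions, it is invertible and discards no information from the embeddings.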

Database
For the evaluation of the task of TD-SV, we used parts 1 and 2 of the robust speaker recognition (RSR) 2015 dataset [57]. Both parts consist of utterances from 300 speakers and are divided into background (50 male and 47 female speakers), development (50 male and 47 female speakers), and evaluation (57 male and 49 female speakers) subsets. The speakers from parts 1 and 2 are the same, and no speaker overlap exists across the subsets. For each part, each speaker utters 30 different phrases in nine different sessions. The average duration of utterances is 3.2 s for part 1, and 1.99 s for part 2, including silence. We used the background set to build gender-independent models, the development set for validation on gender-independent trials, and the evaluation set to evaluate the performance of the proposed system on the gender-dependent trials. Figures 1-4 were based on RSR part 1.

Experimental Setup
We extracted 400-dimensional i-vectors. For each utterance, 25-ms frames were extracted at 10-ms intervals. Preprocessing was performed for all frames in the following order: removing the direct current (DC) offset, pre-emphasis filtering with a coefficient of 0.97, and applying a Hamming window. A 60-dimensional mel-frequency cepstral coefficient (MFCC) feature vector (19 static coefficients plus energy, with their delta and acceleration coefficients) was extracted from each preprocessed frame. We applied utterance-level cepstral mean normalization (CMN) using a 300-frame sliding window, and then voice activity detection (VAD) to remove silent frames. A gender-independent Gaussian mixture model universal background model (GMM-UBM) with 1024 mixture components and diagonal covariances was trained for 10 iterations. A gender-independent 400-dimensional i-vector extractor was trained for five iterations. Length normalization was applied to the i-vectors.
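The frame-level preprocessing order described above (framing, DC offset removal, pre-emphasis, Hamming window) can be sketched as follows. The 16 kHz sampling rate, the function name `preprocess_frames`, and the handling of the first sample of each frame in the pre-emphasis step are assumptions of this sketch, not details given in the text.

```python
import numpy as np

def preprocess_frames(signal, sample_rate=16000, frame_ms=25.0,
                      hop_ms=10.0, preemph=0.97):
    """Split a waveform into frames and preprocess each frame (sketch)."""
    signal = np.asarray(signal, dtype=np.float64)
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop = int(round(sample_rate * hop_ms / 1000.0))
    if signal.shape[0] < frame_len:
        return np.empty((0, frame_len))
    n_frames = 1 + (signal.shape[0] - frame_len) // hop
    # Index matrix: one row of sample indices per frame
    index = hop * np.arange(n_frames)[:, None] + np.arange(frame_len)[None, :]
    frames = signal[index]
    # 1) Remove the DC offset of each frame
    frames = frames - frames.mean(axis=1, keepdims=True)
    # 2) Pre-emphasis: y[t] = x[t] - preemph * x[t - 1]
    emphasized = np.empty_like(frames)
    emphasized[:, 1:] = frames[:, 1:] - preemph * frames[:, :-1]
    # Assumed convention: the first sample uses itself as its predecessor
    emphasized[:, 0] = frames[:, 0] - preemph * frames[:, 0]
    # 3) Apply a Hamming window
    return emphasized * np.hamming(frame_len)
```

At the assumed 16 kHz rate, each frame has 400 samples and consecutive frames are shifted by 160 samples.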
The common configurations for extracting deep speaker embeddings were as follows. A 40-dimensional mel-filterbank feature was extracted from each preprocessed frame. The preprocessing was the same as that used for i-vector extraction. As in the i-vector extraction, CMN and VAD were applied to the mel-filterbank features. Except for the d-vectors, the sequences of mel-filterbank features were truncated or padded along the time axis to lengths of 250 and 150 frames for parts 1 and 2, respectively. For the d-vectors, the sequences of mel-filterbank features were handled using the distortion-free method [58]. The speaker-and-phrase discriminative networks have two softmax classifiers: a speaker classifier and a phrase classifier. Unless otherwise noted, all weights of the networks were initialized from the Glorot normal distribution [59], and those of the classifiers were orthogonalized with no bias. AMSGrad [60], a variant of the Adam [61] optimizer, was used to minimize the cross-entropy loss with a learning rate of $10^{-3}$. We trained for 100 epochs with a mini-batch size of 32 and selected the best model by validation.
The extraction process for the d-vector is the same as in [58]. We extracted a 512-dimensional d-vector from a 2-layer LSTM. Each layer of the LSTM has 512 units. On top of the LSTM is a self-attention [62] layer with four attention heads, followed by a batch normalization [63] layer. The recurrent weights of the LSTM were orthogonalized. The biases of the forget gate of the LSTM were initialized to 1 [64], and the other biases were initialized to 0. The d-vector is the result of the batch normalization of the sum of the output of the attention layer.
We extracted the 256-dimensional r-vector from SE-ResNet34. Except for the SE blocks, SE-ResNet34 has the same structure as ResNet34 in [37] up to the last residual block. The layers were stacked on top of the last residual block in the following order: the statistical pooling layer to compute the mean and standard deviation along the time axis for each channel, the flattening layer, the 256-dimensional fully connected layer followed by a batch normalization layer, and the classifiers. We used the batch-normalized output vector as the r-vector.
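As a rough illustration of the statistical pooling and flattening steps, the NumPy sketch below reduces a variable-length feature map to a fixed-size vector. The input layout `(channels, frequency, time)`, the function name `statistics_pooling`, and the order in which means and standard deviations are concatenated are assumptions; the text does not specify them.

```python
import numpy as np

def statistics_pooling(feature_map):
    """Pool a (channels, frequency, time) map over time and flatten the result.

    Returns a 1-D vector holding all per-position means followed by all
    per-position standard deviations (assumed ordering).
    """
    feature_map = np.asarray(feature_map, dtype=np.float64)
    mean = feature_map.mean(axis=-1)  # shape (channels, frequency)
    std = feature_map.std(axis=-1)    # shape (channels, frequency)
    return np.concatenate([mean, std], axis=0).reshape(-1)
```

Because the time axis is averaged out, utterances of different lengths map to vectors of the same size, which is what lets a fixed-size fully connected layer follow the pooling layer.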
We extracted the 512-dimensional x-vector. The architecture of the x-vector network is the same as the standard DNN in [29]. We applied batch normalization after each activation layer. The x-vector is the output of the layer segment 2 in [29].
The PLDA model was trained on each embedding for 10 iterations. For the GLASSO-PLDA, we first estimated GLASSO-PLDA models with values of the regularization parameter ρ ranging from 0 to 0.5 in steps of 0.0005. The maximum number of GLASSO iterations was set to 100, and the convergence tolerance (corresponding to the duality gap [65]) was set to $10^{-4}$. We then computed the equal error rate (EER) of each GLASSO-PLDA model on the validation trials. The EER is the error rate at the operating point where the false positive rate and the false negative rate are equal, and it is the performance metric in all our experiments. For evaluation, we selected the GLASSO-PLDA model with the lowest EER on the validation trials.
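For readers unfamiliar with the metric, the following sketch shows one simple way to approximate the EER from trial scores. It is not the scoring tool used in the experiments. The function name `equal_error_rate` and the convention that a trial is accepted when its score is greater than or equal to the threshold are assumptions. Because the two error rates rarely cross exactly on finite data, the sketch returns their average at the threshold where they are closest.

```python
import numpy as np

def equal_error_rate(target_scores, impostor_scores):
    """Approximate the EER by scanning every observed score as a threshold."""
    target = np.sort(np.asarray(target_scores, dtype=np.float64))
    impostor = np.sort(np.asarray(impostor_scores, dtype=np.float64))
    thresholds = np.concatenate([target, impostor])
    # False negative rate: fraction of target trials scoring below the threshold.
    fnr = np.searchsorted(target, thresholds, side="left") / target.size
    # False positive rate: fraction of impostor trials scoring at or above it.
    fpr = 1.0 - np.searchsorted(impostor, thresholds, side="left") / impostor.size
    best = np.argmin(np.abs(fnr - fpr))
    return 0.5 * (fnr[best] + fpr[best])
```

Given a hypothetical dictionary `eer_by_rho` that maps each ρ to its validation EER, the model selection described above amounts to `min(eer_by_rho, key=eer_by_rho.get)`.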
All the acoustic features and linear models (corresponding to GMM, i-vector extractor, and PLDA) were implemented using the Kaldi toolkit. All the DNN-based models (corresponding to the extractors of all deep speaker embeddings) were implemented using PyTorch [66]. The GLASSO algorithm was implemented in scikit-learn [67].
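The GLASSO step can be sketched with scikit-learn's `graphical_lasso` function, using the tolerance and iteration limit reported above. This is a minimal sketch on synthetic data, not the authors' pipeline: the helper names, the pooled scatter-based estimate of the within-class covariance, and the toy data are assumptions, and the way the regularized precision matrix is plugged back into PLDA training is not shown.

```python
import numpy as np
from sklearn.covariance import graphical_lasso

def within_class_covariance(embeddings, labels):
    """Pooled covariance of embeddings after removing each class mean."""
    embeddings = np.asarray(embeddings, dtype=np.float64)
    labels = np.asarray(labels)
    centered = np.empty_like(embeddings)
    for label in np.unique(labels):
        mask = labels == label
        centered[mask] = embeddings[mask] - embeddings[mask].mean(axis=0)
    return centered.T @ centered / len(embeddings)

def glasso_precision(emp_cov, rho):
    """Sparse precision estimate; tol and max_iter follow the text."""
    _, precision = graphical_lasso(emp_cov, alpha=rho, tol=1e-4, max_iter=100)
    return precision

def count_offdiag_nonzeros(matrix):
    """Number of non-zero entries outside the main diagonal."""
    off_diagonal = matrix[~np.eye(matrix.shape[0], dtype=bool)]
    return int(np.count_nonzero(off_diagonal))

# Toy data (assumption): 20 classes, 30 samples each, 8 dimensions.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(20), 30)
class_means = rng.standard_normal((20, 8))
embeddings = class_means[labels] + rng.standard_normal((600, 8))

emp_cov = within_class_covariance(embeddings, labels)
sparsity = {rho: count_offdiag_nonzeros(glasso_precision(emp_cov, rho))
            for rho in (0.001, 0.05, 0.5)}
```

Larger values of ρ push more off-diagonal entries of the precision matrix to zero, which is the effect the validation sweep over ρ explores. The toy grid uses only strictly positive values, because some scikit-learn releases may reject `alpha=0`.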

Results
We first evaluated the EERs of the proposed method only with the i-vector, which satisfies the prerequisite (close-to-diagonal within-class covariance matrix) of the GLASSO-PLDA. Figure 5 displays the EERs on the validation trials of the RSR 2015 part 1 (Figure 5a) and part 2 (Figure 5b). The dashed red line depicts the EERs of the original PLDA (corresponding to the baseline), and the solid blue line depicts the EERs of the GLASSO-PLDA (corresponding to the proposed method). We confirmed that the GLASSO converged in all conditions, and the performances were improved with the proposed method. The same trend also can be observed in the evaluation trials. Figure 6 reveals the EERs on the evaluation trials of the RSR 2015 part 1 (Figure 6a,c) and part 2 (Figure 6b,d). Except for the interval of the regularization parameter ρ > 0.0535 on the male evaluation trials of RSR 2015 part 2 (Figure 6b), the proposed method also demonstrated performance improvements on the evaluation trials. Therefore, the proposed GLASSO-PLDA can improve the performance by detecting the optimum (sparse within-class precision matrix) when the sparse assumption of the true within-class precision matrix holds, and the prerequisite is satisfied.
Next, we performed experiments using deep speaker embeddings, which generally do not satisfy the prerequisite of the GLASSO-PLDA. Figure 7 lists the EERs of the validation trials of RSR 2015 part 1 for the (a) d-vector, (b) r-vector, and (c) x-vector. The dashed red line depicts the EERs of the original PLDA, and the solid blue and green lines depict the EERs of the GLASSO-PLDA and PCA-GLASSO-PLDA, respectively. Unlike for the i-vector, the GLASSO did not converge because the within-class covariance/precision matrices of all deep speaker embeddings are far from diagonal. For this reason, the GLASSO-PLDA demonstrated performance degradation for all deep speaker embeddings, as mentioned in Section 4. In contrast, the GLASSO with PCA converged by making the within-class covariance/precision matrix close to diagonal. As a result, the PCA-GLASSO-PLDA displayed performance improvements for all deep speaker embeddings in all the conditions. The same trends can be observed in the evaluation trials of RSR 2015 part 1 (Figure 8), the validation trials of RSR 2015 part 2 (Figure 9), and the evaluation trials of RSR 2015 part 2 (Figure 10).
There are no EER graphs of the GLASSO-PLDA for the x-vector on the trials of RSR 2015 part 2 (there is no solid blue line in Figures 9c and 10c,f), due to the failure of the estimation of the GLASSO-PLDA model caused by the failure of GLASSO regularization. These results indicate that the GLASSO-PLDA can improve performance when the within-class covariance/precision matrix is close to diagonal.
Tables 1 and 2 summarize the EERs of the baseline and proposed methods for each embedding on RSR 2015 parts 1 and 2, respectively. The proposed method is the GLASSO-PLDA for the i-vector and the PCA-GLASSO-PLDA for deep speaker embeddings. For the proposed method, the regularization parameter ρ was set to the value that gave the lowest EER on the validation trials. In RSR 2015 part 1 trials, the proposed method revealed relative EER reductions of approximately 7% (for the x-vector) to 19% (for the d-vector) on the validation trials, reductions of approximately 7% (for the i-vector) to 13% (for the d-vector) on the male evaluation trials, and reductions of approximately 5% (for the x-vector) to 23% (for the d-vector) on the female evaluation trials. In RSR 2015 part 2, the proposed method similarly revealed relative EER reductions of approximately 7% (for the x-vector) to 16% (for the r-vector) on the validation trials, reductions of approximately 3% (for the i-vector) to 13% (for the r-vector) on the male evaluation trials, and reductions of approximately 6% (for the x-vector) to 15% (for the d-vector) on the female evaluation trials. Finally, we performed a score-level fusion of all proposed systems for each dataset. The weights for each score were estimated using logistic regression with the scores from the validation trials. With score-level fusion, we achieved significant performance improvements.
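Score-level fusion with logistic regression can be sketched as follows. The text does not say which toolkit or settings were used for the fusion, so the use of scikit-learn's `LogisticRegression` with its default regularization, the helper names, and the synthetic four-system scores are all assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_fusion(val_scores, val_labels):
    """Learn one weight per system (plus a bias) from validation trials.

    val_scores: array of shape (n_trials, n_systems).
    val_labels: 1 for target trials, 0 for impostor trials.
    """
    model = LogisticRegression()
    model.fit(np.asarray(val_scores), np.asarray(val_labels))
    return model

def fuse(model, scores):
    """Fused score for each trial: weighted sum of system scores plus bias."""
    return model.decision_function(np.asarray(scores))

# Synthetic scores (assumption): four systems, 200 target and 200 impostor trials.
rng = np.random.default_rng(0)
val_labels = np.repeat([1, 0], 200)
val_scores = rng.standard_normal((400, 4)) + 2.0 * val_labels[:, None]

fusion = fit_fusion(val_scores, val_labels)
fused_scores = fuse(fusion, val_scores)
```

The learned weights are in `fusion.coef_`. In practice they would be estimated on validation trials and then applied unchanged to the evaluation trials.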

Discussion
In this section, we first evaluate the proposed method in TI-SV tasks. Next, we compare the performance of the proposed method with that of the banding method, which is a simpler way than the GLASSO to make a matrix sparse.

Evaluation in Text-Independent Speaker Verification
We first explain the difference between TD-SV and TI-SV in terms of variability. The total variability of samples $\Sigma$ can be decomposed as the sum of the between-class variability $\Sigma_b$ and the within-class variability $\Sigma_w$; that is, $\Sigma = \Sigma_b + \Sigma_w$. In TD-SV, $\Sigma_b$ contains the speaker variability $\Sigma_{spk}$ and the phrase variability $\Sigma_{phr}$, and $\Sigma_w$ contains the other variabilities, such as the channel variability $\Sigma_{ch}$ and the residual variability $\Sigma_{\epsilon}$. Therefore, $\Sigma_b = \Sigma_{spk} + \Sigma_{phr}$ and $\Sigma_w = \Sigma_{ch} + \Sigma_{\epsilon}$. In TI-SV, in contrast, $\Sigma_b$ contains only $\Sigma_{spk}$, and $\Sigma_w$ contains the other variabilities $\Sigma_{phr}$, $\Sigma_{ch}$, and $\Sigma_{\epsilon}$. Therefore, $\Sigma_b = \Sigma_{spk}$ and $\Sigma_w = \Sigma_{phr} + \Sigma_{ch} + \Sigma_{\epsilon}$. To summarize, the difference between TD-SV and TI-SV is whether $\Sigma_w$ contains $\Sigma_{phr}$: it does not for TD-SV, but it does for TI-SV. Therefore, the GLASSO regularization of the empirical within-class precision matrix $\Phi_w^{-1}$ does not affect $\Sigma_{phr}$ for TD-SV but affects $\Sigma_{phr}$ for TI-SV.
For the evaluation of the task of TI-SV, we used the VoxCeleb1 dataset [68], which is divided into development (148,642 utterances from 1211 speakers) and test (4874 utterances from 40 speakers) sets. The average duration of utterances is 8.2 s. We split the development set into training (144,990 utterances from 1183 speakers) and validation (3652 utterances from 28 speakers; speaker id10357-id10384) sets.
We created 37,904 validation trials (10,492 target and 27,412 impostor) using the validation set. We used the training set to build gender-independent models, the validation set for validation, and the test set to evaluate the performance of the proposed system. We used the VoxCeleb1 dataset to build and evaluate only the i-vector-based system, which satisfies the prerequisite of the GLASSO-PLDA. The configuration for extracting i-vectors was the same as that described in Section 5.2. Figure 11 displays the EERs on the validation (Figure 11a) and evaluation (Figure 11b) trials of the VoxCeleb1 dataset. Unlike in the TD-SV tasks, the proposed method degraded performance in the TI-SV task, even though the prerequisite was satisfied and the GLASSO converged. These results indicate that the proposed method is suitable only for TD-SV tasks, where $\Sigma_w = \Sigma_{ch} + \Sigma_{\epsilon}$ does not contain the phrase variability $\Sigma_{phr}$. In other words, the true within-class precision matrix is sparse only if $\Sigma_w$ does not contain $\Sigma_{phr}$. The optimums of both the channel variability $\Sigma_{ch}$ and the residual variability $\Sigma_{\epsilon}$ have a conditional independence structure, but that of $\Sigma_{phr}$ does not. Therefore, the sparsity assumption on $\Phi_w^{-1}$ does not hold, and the proposed method is not suitable for TI-SV tasks, where $\Sigma_w = \Sigma_{phr} + \Sigma_{ch} + \Sigma_{\epsilon}$ contains the phrase variability $\Sigma_{phr}$.
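The decomposition of the total variability into between-class and within-class parts, on which the argument in this section rests, can be checked numerically. The sketch below uses plain scatter-based estimates on labeled embeddings; it is not the PLDA estimation procedure, and the function name `variability_decomposition` is hypothetical. What counts as a class depends on the task: given the description above, a class in TD-SV would presumably be a speaker-phrase pair, whereas in TI-SV it would be a speaker.

```python
import numpy as np

def variability_decomposition(embeddings, labels):
    """Return (total, between-class, within-class) covariance estimates.

    All three are normalized by the number of samples, so that
    total == between + within holds up to rounding error.
    """
    x = np.asarray(embeddings, dtype=np.float64)
    labels = np.asarray(labels)
    n_samples, dim = x.shape
    global_mean = x.mean(axis=0)
    sigma_b = np.zeros((dim, dim))
    sigma_w = np.zeros((dim, dim))
    for label in np.unique(labels):
        members = x[labels == label]
        class_mean = members.mean(axis=0)
        # Between-class part: spread of class means, weighted by class size.
        diff = (class_mean - global_mean)[:, None]
        sigma_b += len(members) * (diff @ diff.T)
        # Within-class part: spread of samples around their own class mean.
        centered = members - class_mean
        sigma_w += centered.T @ centered
    centered_all = x - global_mean
    sigma_total = centered_all.T @ centered_all / n_samples
    return sigma_total, sigma_b / n_samples, sigma_w / n_samples
```

Relabeling the same embeddings by speaker only, instead of by speaker and phrase, moves the phrase-related spread out of the between-class term and into the within-class term, which is the shift that distinguishes TI-SV from TD-SV in the discussion above.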

Comparison with Matrix Banding
In this section, we consider an alternative to the GLASSO in our proposed method: matrix banding. Matrix banding is a simple method for turning an arbitrary matrix into a band matrix [69]. A band matrix is a sparse matrix whose non-zero elements are confined to a diagonal band consisting of the main diagonal and a limited number of adjacent diagonals on either side. In other words, all out-of-band elements of a band matrix are zero. Matrix banding can therefore make an arbitrary matrix sparse with a lower computational burden than the GLASSO. We compare the performance of the proposed method (GLASSO-PLDA) with that of the matrix banding-based PLDA (denoted as banding-PLDA), in which the within-class precision matrix is regularized by banding rather than by the GLASSO. There would be no reason to use the GLASSO-PLDA if the banding-PLDA generally performed better.
Tables 3 and 4 summarize the EERs of the proposed method and the banding-PLDA for each embedding on RSR 2015 parts 1 and 2, respectively. For both the proposed method and the banding-PLDA, all deep speaker embeddings were orthogonalized using PCA. Therefore, the EERs of the proposed method in Tables 3 and 4 are the same as those in Tables 1 and 2, respectively. For the banding-PLDA, we constrained the bands to be symmetric because the within-class precision matrix must be symmetric. Here, k denotes the bandwidth on one side (left or right) of the main diagonal, so the total bandwidth is 2k + 1; k = 0 corresponds to a diagonal matrix. The GLASSO-PLDA generally exhibited better performance than the banding-PLDA. In TD-SV, therefore, the GLASSO, which is based on the $L_1$-penalized Gaussian log-likelihood, is more effective than banding.
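Symmetric matrix banding with one-sided bandwidth k is simple enough to express in a few lines. The sketch below is an illustration only; the function name `band_matrix` is hypothetical and it is not taken from the experimental code.

```python
import numpy as np

def band_matrix(matrix, k):
    """Zero every entry that lies more than k diagonals from the main diagonal."""
    matrix = np.asarray(matrix, dtype=np.float64)
    rows, cols = np.indices(matrix.shape)
    return np.where(np.abs(rows - cols) <= k, matrix, 0.0)
```

With k = 0 only the main diagonal survives, and each interior row keeps at most 2k + 1 entries. Unlike the GLASSO, this operation ignores the data when deciding which entries to drop, and banding a positive definite matrix does not in general guarantee that the result stays positive definite.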

Conclusions
In this paper, we improved the conventional PLDA by proposing the GLASSO-PLDA, in which the GLASSO-regularized within-class precision matrix is used to estimate the PLDA model. The GLASSO makes the empirical within-class precision matrix sparse, which both reduces its estimation error and reflects the conditional independence structure among the variables. We assumed that the empirical within-class precision matrix would have large errors due to the limited amount of data and expected that reducing the estimation error would lead to performance improvement. The experimental results on parts 1 and 2 of the public RSR 2015 database show that GLASSO regularization of the empirical within-class precision matrix improves the performance of the PLDA when the within-class covariance/precision matrix of the embedding is close to diagonal. With system fusion, we also achieved significant performance improvements in the task of TD-SV. The GLASSO-PLDA can be applied directly to TD-SV systems based on the conventional PLDA without changing the structure of the systems. Therefore, it can be applied to any application that uses a PLDA-based TD-SV system.
In the future, we will apply the GLASSO-PLDA to noisy conditions, where the within-class variability $\Sigma_w = \Sigma_{ch} + \Sigma_{\epsilon} + \Sigma_{noise}$ contains not only the channel variability $\Sigma_{ch}$ and the residual variability $\Sigma_{\epsilon}$ but also the noise variability $\Sigma_{noise}$. Note that the within-class variability in the clean condition, $\Sigma_w = \Sigma_{ch} + \Sigma_{\epsilon}$, does not contain $\Sigma_{noise}$. Specifically, we will evaluate a GLASSO-PLDA-based TD-SV system that is trained using only clean utterances under various noisy conditions, a setting close to the environment in which real applications are used. This experiment is intended to confirm whether the GLASSO-PLDA can compensate for $\Sigma_w = \Sigma_{ch} + \Sigma_{\epsilon} + \Sigma_{noise}$ without noisy training utterances. As shown in our experiments, the optimums of both $\Sigma_{ch}$ and $\Sigma_{\epsilon}$ have a conditional independence structure. If the optimum of $\Sigma_{noise}$ also has a conditional independence structure, that is, if the sparsity assumption on the within-class precision matrix also holds in noisy conditions, the GLASSO-PLDA would bring performance improvements in those conditions. This would mean that the GLASSO-PLDA can compensate for the noise variability even without using noisy utterances.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Appendix A
In this appendix, we describe the effect of making the deep speaker embeddings follow a normal distribution [54][55][56]. Figures A1 and A2 list the EERs of the PLDA and the Gaussian-constrained (GC)-GLASSO-PLDA on the trials of RSR 2015 parts 1 and 2, respectively. We used d-vectors as speaker embeddings, which generally exhibited good performance in our experiments. In the GC-GLASSO-PLDA, the embeddings are extracted from a DNN trained using the GC training method [54][55][56]. If the GC embeddings were suitable for the GLASSO-PLDA, the GLASSO-PLDA with the GC embeddings would show performance improvements. However, it actually showed performance degradations, as shown in Figures A1 and A2. Therefore, we can claim that the GC embeddings are not suitable for the GLASSO-PLDA, regardless of whether they actually follow a normal distribution. This means either that the GC training cannot make the embeddings follow a Gaussian distribution or that the prerequisite of the GLASSO-PLDA may not be related to the normality assumption on the embeddings.