Self-Attentive Multi-Layer Aggregation with Feature Recalibration and Deep Length Normalization for Text-Independent Speaker Verification System

Abstract: One of the most important parts of a text-independent speaker verification system is speaker embedding generation. Previous studies demonstrated that shortcut connections-based multi-layer aggregation improves the representational power of a speaker embedding system. However, the model parameters are relatively large in number, and unspecified variations increase in the multi-layer aggregation. Therefore, in this study, we propose a self-attentive multi-layer aggregation with feature recalibration and deep length normalization for a text-independent speaker verification system. To reduce the number of model parameters, we set a ResNet with scaled channel width and layer depth as a baseline. To control the variability in training, we apply a self-attention mechanism to perform multi-layer aggregation with dropout regularizations and batch normalizations. Subsequently, we apply a feature recalibration layer to the aggregated feature using fully-connected layers and nonlinear activation functions. Further, deep length normalization is applied to the recalibrated feature in the training process. Experimental results using the VoxCeleb1 evaluation dataset showed that the performance of the proposed methods was comparable to that of state-of-the-art models (equal error rates of 4.95% and 2.86%, using the VoxCeleb1 and VoxCeleb2 training datasets, respectively).


Introduction
Speaker recognition aims to analyze the speaker representation from input audio. A subfield of speaker recognition is speaker verification, which determines whether the utterance of the claimed speaker should be accepted or rejected by comparing it to the utterance of the registered speaker. Speaker verification is divided into text-dependent and text-independent approaches. Text-dependent speaker verification aims to recognize only specified utterances when verifying the speaker. Examples include Google's "OK Google" and Samsung's "Hi Bixby." Meanwhile, text-independent speaker verification is not limited in the type of utterances to be recognized. Therefore, the problems that text-independent speaker verification must solve are more difficult. If the performance is guaranteed, text-independent speaker verification can be utilized in various biometric systems and e-learning platforms, such as biometric authentication for chatbots, voice ID, and virtual assistants.
Owing to advances in computational power and deep learning techniques, the performance of text-independent speaker verification has improved. Text-independent speaker verification using deep neural networks (DNN) is divided into two streams. The first is an end-to-end system [1]. The input of the DNN is a speech signal, and the output is the verification result. This is a single-pass operation in which all processes can be performed at once. However, input speech of variable length is difficult to handle. To address this problem, several studies have applied a pooling layer or temporal average layer to an end-to-end system [2,3]. The second is a speaker embedding-based system [4][5][6][7][8][9][10][11][12][13][14], which converts an input of variable length into a vector of fixed length using a DNN. The generated vector is used as an embedding to represent the speaker. The speaker embedding-based system can handle input speech of variable length and can generate speaker representations from various environments.
As shown in Figure 1, a DNN has been used as a speaker embedding extractor in a speaker embedding-based system. In general, a speaker embedding-based system executes the following processes [4][5][6][7]:
• The speaker classification model is trained.
• The speaker embedding is extracted by using the output value of an inner layer of the speaker classification model.
• The similarity between the embeddings of the registered speaker and the claimed speaker is computed.
• Acceptance or rejection is determined by a previously decided threshold value.
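The steps above can be sketched as follows. This is a minimal numpy illustration of the scoring and decision stage only (the embedding extractor itself is the trained DNN); the threshold value and function names are our own and purely illustrative.

```python
import numpy as np

def cosine_score(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings."""
    return float(np.dot(emb_a, emb_b) /
                 (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))

def verify(enrolled: np.ndarray, claimed: np.ndarray,
           threshold: float = 0.5) -> bool:
    """Accept the claimed speaker if the similarity between the enrolled
    and claimed embeddings exceeds a previously decided threshold."""
    return cosine_score(enrolled, claimed) >= threshold
```

In practice the threshold is tuned on held-out trials to balance false acceptances against false rejections.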
In addition, back-end methods, for example, probabilistic linear discriminant analysis (PLDA), can be used [8][9][10]. The most important part of the above system is the speaker embedding generation [13]. Speaker embedding is a high-dimensional feature vector that contains speaker information. An ideal speaker embedding maximizes inter-class variations and minimizes intra-class variations [10,14,15]. The component that directly affects the speaker embedding generation is the encoding layer. The encoding layer takes a frame-level feature and converts it into a compact utterance-level feature. It also converts variable-length features to fixed-length features.
To improve the representational power of the speaker embedding, residual learning derived from ResNet [21] and squeeze-and-excitation (SE) blocks [22] were adapted for the speaker models [10,[13][14][15][16]20,23]. Residual learning maintains input information through mappings between layers called "shortcut connections." A large-scale CNN using shortcut connections can avoid gradient degradation. The SE block consists of a squeeze operation (which condenses all of the information on the features) and an excitation operation (which scales the importance of each feature). Therefore, a channel-wise feature response can be adjusted without significantly increasing the model complexity in training.
The main limitation of the previous encoding layers is that the model uses only the output feature of the last pooling layer as input. In other words, the model uses only one frame-level feature when performing speaker embedding. Therefore, similar to [14,24], a previous study presented a shortcut connection-based multi-layer aggregation to improve the speaker representations when calculating the weight at the encoding layer [13]. Specifically, the frame-level features are extracted from between each residual layer in ResNet. Then, these frame-level features are fed into the input of the encoding layer using shortcut connections. Consequently, a high-dimensional speaker embedding is generated.
However, the previous study [13] has limitations. First, the model parameter size is relatively large, and the model generates high-dimensional speaker embeddings (1024 dimensions, about 15 million model parameters). This leads to inefficient training and thus requires a sufficiently large amount of training data. Second, the multi-layer aggregation approach increases not only the speaker's information but also intrinsic and extrinsic variation factors, for example, emotion, noise, and reverberation. Some of these unspecified factors increase variability while generating the speaker embedding. Hence, we propose a self-attentive multi-layer aggregation with feature recalibration and deep length normalization for a text-independent speaker verification system, as shown in Figure 2. We present an improved version of the previous study, as described in the following steps:
• A ResNet with a scaled channel width and layer depth is used as a baseline. The scaled ResNet has fewer parameters than the standard ResNet [21].
• A self-attention mechanism is applied to perform multi-layer aggregation with dropout regularizations and batch normalizations [25]. It helps construct a more discriminative utterance-level feature while considering the frame-level features of each layer.
• A feature recalibration layer is applied to the aggregated feature. Channel-wise dependencies are trained using fully-connected layers and nonlinear activation functions.
• Deep length normalization [11] is also used on the recalibrated feature in the training process.
The remainder of this paper is organized as follows. Section 2 describes a baseline system using shortcut connections-based multi-layer aggregation. Section 3 introduces the proposed self-attentive multi-layer aggregation method with feature recalibration and normalization. Section 4 discusses our experiments, and conclusions are drawn in Section 5.
Figure 2. Overview of proposed network architecture: Self-attentive multi-layer aggregation with a feature recalibration layer and a deep length normalization layer (we extract a speaker embedding after the normalization layer on each utterance).

Prior System
In a previous study [13], a shortcut connections-based multi-layer aggregation with ResNet-18 was proposed. Its main difference from the standard ResNet-18 [21] is the manner in which the speaker embedding is aggregated. Multi-layer aggregation uses not only the output feature of the last residual layer but also the output features of all previous residual layers. These features are concatenated into one feature through shortcut connections. The concatenated feature is fed into several fully-connected layers to construct a high-dimensional speaker embedding. The prior system improved performance through a simple method.
However, it has a large number of parameters because the system uses multi-layer aggregation, as presented in Table 1. The numbers of model parameters of standard ResNet-18 and standard ResNet-34 are approximately 11.8 million and 21.9 million, respectively. Conversely, the numbers of model parameters of the prior system based on ResNet-18 and ResNet-34 are approximately 15.6 million and 25.7 million, respectively. In addition, the forward-backward training times of standard ResNet-18 and standard ResNet-34 are approximately 6.025 ms and 10.326 ms, respectively, whereas those of the prior system based on ResNet-18 and ResNet-34 are approximately 6.576 ms and 10.820 ms, respectively (when measuring the forward-backward training time, three GTX1080Ti units and a mini-batch size of 96 were used).

Modifications
As discussed in Section 2.1, the prior system improved the performance; however, the model parameters were too large in number. The prior system is modified considering scaling factors, such as layer depth, channel width, and input resolution, for efficient learning in the CNN [26]. First, we used high-dimensional log-Mel filter banks with data augmentation for the input resolution. We extracted an input feature map of size D × L, where D is the number of single-frame spectral features and L is the number of frames. Here, the Mel-filter banks, spanning 0 to 8,000 Hz, determine dimension D. Subsequently, the channel width is reduced and the layer depth is expanded, because ResNet can improve the performance without significantly increasing the parameters when the layer depth is increased.
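The fixed-size input feature map described above implies forcing every utterance to a fixed number of frames. A minimal numpy sketch of that cropping-or-padding step follows; the helper name is our own, and the default of 1,200 frames is an assumption (a 12 s interval at a 10 ms frame shift), not the authors' exact implementation.

```python
import numpy as np

def crop_or_pad(feats: np.ndarray, target_len: int = 1200,
                rng=None) -> np.ndarray:
    """Force a (D, L) log-Mel feature map to a fixed number of frames.

    Longer utterances are randomly cropped; shorter ones are zero-padded
    on the right, so every training example has the same shape.
    """
    rng = rng or np.random.default_rng()
    d, length = feats.shape
    if length >= target_len:
        # random crop of target_len consecutive frames
        start = rng.integers(0, length - target_len + 1)
        return feats[:, start:start + target_len]
    # zero-pad short utterances up to target_len frames
    out = np.zeros((d, target_len), dtype=feats.dtype)
    out[:, :length] = feats
    return out
```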
Consequently, the scaled ResNet-34 was constructed, as shown in Table 2. The scaled ResNet-34 is composed of three, four, six, and three residual blocks. It reduces the number of channels by half compared to the standard ResNet-34 [21]. In addition, shortcut connections-based multi-layer aggregation is added to the model using the GAP encoding method. The output features of each GAP are concatenated and fed into the output layer. Then, a high-dimensional speaker embedding is generated from the penultimate layer of the network. Thus, the scaled ResNet-34 has only approximately 5.9 million model parameters, as presented in Table 3. In addition, the forward-backward training time of the scaled ResNet-34 is shorter than that of the prior system based on ResNet-34 (approximately 5.658 ms). Table 2. Architecture of scaled ResNet-34 using multi-layer aggregation as a baseline (D = input dimension; L = input length; N = number of speakers; GAP = global average pooling; SE = speaker embedding).

[Table 2 columns: Layer, Output Size, Channels, Blocks, Encoding]

Self-Attentive Multi-Layer Aggregation with Feature Recalibration and Normalization
As discussed in Section 1, the previous study has two problems. The model parameter problem is addressed by building a scaled ResNet-34. However, the problem of multi-layer aggregation remains. Multi-layer aggregation uses output features of multiple layers to develop the speaker embedding system. It is assumed that not only speaker information but also other unspecified factors exist in the output feature of each layer. These unspecified factors lower the speaker verification performance. Therefore, we propose three methods: self-attentive multi-layer aggregation, feature recalibration, and deep length normalization.

Model Architecture
As presented in Figure 2 and Table 4, the proposed network mainly consists of a scaled ResNet and an encoding layer. Frame-level features are trained in the scaled ResNet, and utterance-level features are trained in the encoding layer.
In the scaled ResNet, given an input feature X = [x 1 , x 2 , . . . , x l , . . . , x L ] of length L (x l ∈ R d ), output features P i = [p 1 , p 2 , . . . , p c , . . . , p C ] (p c ∈ R) from each residual layer of the scaled ResNet are generated using SAP. Here, the length C i is determined by the number of channels in the i-th residual layer. Then, the generated output features are concatenated into one feature V, as in Equation (1):

V = [P 1 , P 2 , . . . , P i , . . . , P I ], (1)

where I is the number of aggregated layers. The concatenated feature V is a set of frame-level features and is used as the input of the encoding layer.
The encoding layer comprises a feature recalibration layer and a deep length normalization layer. In the feature recalibration layer, the concatenated feature V is recalibrated by fully-connected layers and nonlinear activations. Consequently, a recalibrated feature Ṽ = [ṽ 1 , ṽ 2 , . . . , ṽ c , . . . , ṽ C ] (ṽ c ∈ R) is generated. Then, the recalibrated feature is normalized according to the length of the input Ṽ in the deep length normalization layer. The normalized feature is used as a speaker embedding and is fed into the output layer. Further, a log probability for speaker classes s, P(spk s | x 1 , x 2 , . . . , x l , . . . , x L ), is generated in the output layer.

[Table 4 columns: Layer, Output Size, Channels, Blocks, Encoding]

Self-Attentive Multi-Layer Aggregation
As shown in Figures 2 and 3, SAP is applied to each residual layer using shortcut connections. For every input feature, given an output feature of the first convolution layer or the i-th residual layer after conducting an average pooling, Y i = [y 1 , y 2 , . . . , y n , . . . , y N ] of length N (y n ∈ R c ) is obtained. The number of dimensions c is determined by the number of channels.

Figure 3. Overview of self-attentive pooling procedure.
Then, the average feature is fed into a fully-connected hidden layer to obtain H i = [h 1 , h 2 , . . . , h n , . . . , h N ] using a hyperbolic tangent activation function. Given h n ∈ R c and a learnable context vector u ∈ R c , the attention weight w n is measured by training the similarity between h n and u with a softmax normalization, as in Equation (2):

w n = exp(h n ᵀ u) / Σ j=1..N exp(h j ᵀ u). (2)

Then, the embedding e ∈ R c is generated using the weighted sum of the normalized attention weights w n and y n , as in Equation (3):

e = Σ n=1..N w n y n . (3)
The embedding vector e can be rewritten as P i = [p 1 , p 2 , . . . , p c , . . . , p C ] (p c ∈ R) in the order of the dimensions. Consequently, the SAP output feature P i is generated. This process helps generate a more discriminative feature while focusing on the frame-level features of each layer. Moreover, dropout regularization and batch normalization are applied to P i . Then, the generated features are concatenated into one feature, V, as in Equation (1).
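The SAP computation of Equations (2) and (3) can be sketched as follows. This is a numpy stand-in for the PyTorch layers actually used in the system; the square hidden-layer weight shape and all names are our own illustrative assumptions.

```python
import numpy as np

def self_attentive_pooling(Y, W, b, u):
    """Self-attentive pooling over N frame-level vectors.

    Y: (N, C) frame-level features from one residual layer.
    W, b: fully-connected hidden-layer weight (C, C) and bias (C,).
    u: learnable context vector, (C,).
    Returns the utterance-level embedding e of shape (C,).
    """
    H = np.tanh(Y @ W + b)           # hidden representations h_n
    scores = H @ u                   # similarity between h_n and u
    w = np.exp(scores - scores.max())
    w /= w.sum()                     # softmax-normalized attention weights, Eq. (2)
    return w @ Y                     # weighted sum of frames, Eq. (3)
```

With u = 0, every frame receives equal weight and SAP reduces to plain temporal average pooling, which makes the attention weights easy to sanity-check.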

Feature Recalibration
After the self-attentive multi-layer aggregation, the concatenated feature V is fed into the feature recalibration layer. The feature recalibration layer aims to train the correlations between each channel of the concatenated feature; this is inspired by [22].
Given an input feature V ∈ R C (where C is the sum of all channels), the feature channels are recalibrated using two fully-connected layers and nonlinear activations, as in Equation (4):

Ṽ = σ(δ(V W 1 ) W 2 ) ⊗ V. (4)
Here, δ refers to the leaky rectified linear unit activation; σ refers to the sigmoid activation; W 1 is the front fully-connected layer, W 1 ∈ R C×(C/r) , and W 2 is the back fully-connected layer, W 2 ∈ R (C/r)×C . According to the reduction ratio r, a dimensional transformation is performed between the two fully-connected layers, as in a bottleneck structure, and channel-wise multiplication is performed. The rescaled channels are then multiplied by the input feature V. Consequently, an output feature Ṽ is generated. This generated feature Ṽ is the result of recalibration according to the importance of the channels.
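A minimal numpy sketch of the recalibration of Equation (4), with δ as the leaky ReLU and σ as the sigmoid; the weight shapes follow the row-vector convention above, and all names are illustrative rather than the authors' code.

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    """delta: leaky rectified linear unit."""
    return np.where(x > 0, x, slope * x)

def sigmoid(x):
    """sigma: sigmoid activation, output in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def recalibrate(V, W1, W2):
    """Feature recalibration of a concatenated feature V (Eq. (4)).

    V:  (C,) concatenated multi-layer feature.
    W1: (C, C // r) bottleneck projection (reduction ratio r).
    W2: (C // r, C) expansion back to C channels.
    The sigmoid output acts as a per-channel importance and rescales V.
    """
    s = sigmoid(leaky_relu(V @ W1) @ W2)  # channel-wise importances
    return s * V                          # recalibrated feature V-tilde
```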

Deep Length Normalization
As in [11], deep length normalization was applied to the proposed model. The L2 constraint is applied to the length axis of the recalibrated feature Ṽ with a scale constant α, as in Equation (5):

Ṽ norm = α · Ṽ / ‖Ṽ‖ 2 . (5)
Then, the normalized Ṽ is fed into the output layer for speaker classification. This feature is used as the speaker embedding, as shown in Figure 4.
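Equation (5) amounts to projecting each embedding onto a hypersphere of radius α, so training focuses on the angular separation between speakers. A minimal numpy sketch:

```python
import numpy as np

def deep_length_normalize(v, alpha=10.0):
    """L2 length normalization with a scale constant alpha (Eq. (5)).

    Constrains the embedding to have L2 norm exactly alpha,
    i.e. it lies on a hypersphere of radius alpha.
    """
    return alpha * v / np.linalg.norm(v)
```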


Datasets
In our experiments, we used the VoxCeleb1 [27] and VoxCeleb2 [16] datasets presented in Table 5. These datasets comprise various utterances of celebrities collected in real environments from YouTube, including noise, laughter, cross talk, channel effects, music, and other sounds [27]. All utterances were encoded at a 16-kHz sampling rate with 2 bytes per sample. These are large-scale text-independent speaker verification datasets, comprising more than 100 thousand and 1 million utterances with 1,251 and 6,112 speakers, respectively. We used the VoxCeleb1 evaluation dataset, which includes 40 speakers and 37,720 pairs of official test protocols [27], as shown in Figure 5. The test protocols comprise eight pairs per utterance of the VoxCeleb1 evaluation set (four pairs of the same speaker and four pairs of different speakers). Among all 38,992 (4,874 × 8) possible pairs, 37,720 pairs were determined. Pairs were selected considering balance in gender, utterance length, and the number of pairs per speaker. In addition, it is an open-set test that evaluates only speaker pairs that are unavailable in the training dataset.
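A line of this trial protocol can be parsed with a helper like the following; the label/path layout follows the description of Figure 5 (label, then the two utterance paths to compare), while the concrete path strings below are illustrative.

```python
def parse_trial(line: str):
    """Parse one line of the VoxCeleb1 trial protocol.

    Each line holds a label (1 = same speaker, 0 = different speakers)
    followed by the two utterance paths to compare.
    """
    label, enroll_path, test_path = line.split()
    return int(label), enroll_path, test_path
```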

Experimental Setup
During data preprocessing, we used 64-dimensional log Mel-filter-bank energies with a 25 ms frame length and 10 ms frame shift, which are mean-and-variance normalized over a sliding window of 3 s. For each training step, a 12 s interval was extracted from each utterance through cropping or padding. In addition, a preprocessing method was used to conduct time and frequency masking on the input features [28].
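The time and frequency masking step [28] can be sketched as below. This SpecAugment-style numpy illustration uses mask widths chosen for the example only, not the paper's settings.

```python
import numpy as np

def mask_feature(feats, max_freq_mask=8, max_time_mask=20, rng=None):
    """Apply one frequency mask and one time mask to a (D, L) feature map.

    A random band of up to max_freq_mask filter-bank bins and a random
    span of up to max_time_mask frames are zeroed out, leaving the
    input array untouched.
    """
    rng = rng or np.random.default_rng()
    out = feats.copy()
    d, length = out.shape
    # frequency mask: zero f consecutive filter-bank bins
    f = rng.integers(0, max_freq_mask + 1)
    f0 = rng.integers(0, d - f + 1)
    out[f0:f0 + f, :] = 0.0
    # time mask: zero t consecutive frames
    t = rng.integers(0, max_time_mask + 1)
    t0 = rng.integers(0, length - t + 1)
    out[:, t0:t0 + t] = 0.0
    return out
```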
The model training specifications are as follows: we used a standard cross-entropy loss function with a standard stochastic gradient descent optimizer, a momentum of 0.9, a weight decay of 0.0001, and an initial learning rate of 0.1, reduced by a 0.1 scale factor on plateau [29]. All experiments were trained for 200 epochs with a mini-batch size of 96. The scaling constant α was set to 10, and the reduction ratio r was set to 8 [11,22]. As shown in Figure 6, we confirmed that the training loss converges for both the baseline model described in Section 2.2 and the proposed model described in Section 3.1.

Figure 5. Example of official test protocol from the VoxCeleb1 evaluation dataset (in the first column, 1 refers to the same speaker and 0 refers to different speakers; the second and third columns refer to the speakers to be compared).

From the trained model, we generated a 512-dimensional speaker embedding for each utterance, as shown in Figure 7. The standard cosine similarity is computed for each speaker pair, and the equal error rate (EER, %) is calculated. The EER is the crossing point of the false rejection rate and false acceptance rate curves as functions of the decision threshold. This can also be expressed on the receiver operating characteristic (ROC) curve using the true-positive rate and false-positive rate. All of our proposed methods were implemented using the PyTorch toolkit [30].
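The EER described above can be computed from trial scores with a simple threshold sweep, as in the numpy sketch below; it is an illustration of the metric's definition, not the exact evaluation code used in the paper.

```python
import numpy as np

def compute_eer(scores, labels):
    """Equal error rate from trial scores and binary labels (1 = same speaker).

    Sweeps the decision threshold over the observed scores and returns
    the error rate at the point where the false acceptance rate (FAR)
    and false rejection rate (FRR) are closest to crossing.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best_gap = float("inf")
    eer = 1.0
    for t in np.sort(scores):
        far = np.mean(scores[labels == 0] >= t)  # impostors accepted
        frr = np.mean(scores[labels == 1] < t)   # targets rejected
        if abs(far - frr) < best_gap:
            best_gap = abs(far - frr)
            eer = (far + frr) / 2.0
    return eer
```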


Experimental Results
To evaluate the proposed methods, we first tested the baseline with different encoding methods, compared it with other networks, and then compared our proposed methods with state-of-the-art encoding methods. Table 6 presents the results of the baseline modifications described in Section 2.2 and demonstrates the effectiveness of the modified encoding methods. We first experimented with basic encoding layers, namely GAP and SAP, and then added the proposed methods to the baseline one at a time: self-attentive multi-layer aggregation, feature recalibration, and deep length normalization. Specifically, the scaled ResNet-34 with GAP and with SAP achieved EER values of 6.85% and 6.68%, respectively. Because multi-layer aggregation was not applied with these encoding methods, the speaker embedding had 256 dimensions. In addition, the performance gap between GAP and SAP was small. We then applied multi-layer aggregation to the scaled ResNet-34 with GAP and with SAP; in particular, the scaled ResNet-34 with multi-layer aggregation and GAP is the baseline system described in Section 2.2. Although the speaker embedding dimension and the number of model parameters were larger than those of plain GAP and SAP, the EER was reduced from 6.85% to 5.83% and from 6.68% to 5.42%, respectively. Adding feature recalibration and deep length normalization to the self-attentive multi-layer aggregation further reduced the EER to 5.07% and 4.95%, respectively. The ROC curve of the proposed model, with its EER point, is shown in Figure 8. Consequently, when all of the proposed methods were applied, the number of model parameters increased by approximately 0.5 M compared to the scaled ResNet-34 with GAP, whereas the EER improved by 1.90 percentage points.

Table 6. Experimental results for modifying the baseline construction, using the VoxCeleb1 training and evaluation dataset (Dim = speaker embedding dimension; Params = model parameters; EER = equal error rate; GAP = global average pooling; SAP = self-attentive pooling; MLA = multi-layer aggregation; FR = feature recalibration; DLN = deep length normalization).
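The contrast between the two basic encoding layers can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's exact configuration: the single-head attention parameterization (W, b, u) and the feature shapes are assumptions for demonstration.

```python
import numpy as np

def gap(features):
    """Global average pooling: mean of the (T, D) frame-level
    features over the time axis, giving a (D,) utterance-level vector."""
    return features.mean(axis=0)

def sap(features, W, b, u):
    """Self-attentive pooling: a small attention network scores each
    frame, and the utterance-level vector is the attention-weighted
    mean. W (D, D), b (D,), and u (D,) are learnable in a real system."""
    h = np.tanh(features @ W + b)      # (T, D) hidden representation
    e = h @ u                          # (T,) scalar score per frame
    a = np.exp(e - e.max())
    a /= a.sum()                       # softmax attention weights
    return (a[:, None] * features).sum(axis=0)
```

Note that SAP reduces to GAP when the attention scores are uniform; the benefit comes from learning to emphasize speaker-discriminative frames.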

Table 7 shows a comparison of our proposed methods with other networks. All experiments used the VoxCeleb1 training and evaluation datasets. First, the i-vector extractor was trained following the implementation in [27]. After generating 400-dimensional i-vectors, PLDA was applied to reduce the i-vectors to 200 dimensions. The i-vector system with PLDA achieved an EER of 8.82%. In addition, an x-vector system was trained following the implementation in [18]. The x-vector system is based on time-delay neural networks (TDNN) with an SP method, which, along with ResNet-based systems, is commonly applied to text-independent speaker verification. The 1500-dimensional x-vector extracted from the TDNN achieved an EER of 8.19%. Our proposed methods based on the scaled ResNet-34 improved on these previous systems, with an EER of 4.95%. Tables 8 and 9 show a comparison of our proposed methods with state-of-the-art encoding approaches.
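The EER values reported throughout this section are the operating point at which the false-rejection rate on target trials equals the false-acceptance rate on impostor trials. A minimal sketch of reading such a value off a set of trial scores follows; the simple threshold sweep here is an illustrative simplification of a proper ROC interpolation.

```python
import numpy as np

def eer(target_scores, impostor_scores):
    """Equal error rate: sweep a decision threshold over all observed
    scores and return the point where the false-rejection rate (FRR)
    on targets meets the false-acceptance rate (FAR) on impostors."""
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    frr = np.array([(target_scores < t).mean() for t in thresholds])
    far = np.array([(impostor_scores >= t).mean() for t in thresholds])
    i = np.argmin(np.abs(frr - far))   # closest crossing of the two curves
    return 0.5 * (frr[i] + far[i])
```

For perfectly separated scores the EER is 0; overlapping target and impostor distributions push it upward, which is why lower EER indicates a better verification system.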
Here, we compared encoding methods using a ResNet-based model and the cross-entropy loss function. Various encoding methods were compared, including TAP [10,16], learnable dictionary encoding (LDE) [10], SAP [10], GAP [15], NetVLAD [7], and GhostVLAD [7].

Table 8. Experimental results comparing our proposed methods with state-of-the-art encoding methods using the VoxCeleb1 training and evaluation dataset (Dim = speaker embedding dimension; EER = equal error rate; TAP = temporal average pooling; LDE = learnable dictionary encoding; SAP = self-attentive pooling; GAP = global average pooling; MLA = multi-layer aggregation; FR = feature recalibration; DLN = deep length normalization).

In Table 8, all experiments used the VoxCeleb1 training and evaluation datasets. ResNet-34 with TAP, LDE, SAP, or GAP achieved EER values of 5.48%, 5.21%, 5.51%, and 5.39%, respectively [10,15]. The speaker embedding dimensions of these systems were 128 or 256, smaller than those of the proposed methods. Nevertheless, our proposed encoding methods based on the scaled ResNet-34 achieved an EER of 4.95%, an improvement over the other systems.

Table 9. Experimental results comparing our proposed methods with state-of-the-art encoding methods using the VoxCeleb2 training dataset and the VoxCeleb1 evaluation dataset (Dim = speaker embedding dimension; EER = equal error rate; TAP = temporal average pooling; SAP = self-attentive pooling; MLA = multi-layer aggregation; FR = feature recalibration; DLN = deep length normalization).
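The multi-layer aggregation that distinguishes our encoding from the single-stage methods above can be sketched as follows. For brevity this sketch pools each residual stage with a plain time average; in the proposed system each stage is instead pooled self-attentively with dropout regularization and batch normalization, so the stage dimensions below are illustrative assumptions.

```python
import numpy as np

def multi_layer_aggregate(stage_features):
    """Multi-layer aggregation: frame-level feature maps from every
    residual stage, each of shape (T, D_i), are pooled over time
    separately, and the stage-level vectors are concatenated so the
    utterance embedding combines low- and high-level representations."""
    pooled = [f.mean(axis=0) for f in stage_features]  # one (D_i,) vector per stage
    return np.concatenate(pooled)                      # (sum of D_i,) embedding
```

Concatenating all stages is what makes the embedding dimension (and parameter count) larger than single-stage GAP or SAP, as reflected in Table 6.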

In Table 9, all experiments used the VoxCeleb2 training dataset and the VoxCeleb1 evaluation dataset. As presented in Table 5, the VoxCeleb2 training dataset is approximately seven times larger than the VoxCeleb1 training dataset, and Table 9 shows that this increase in training data was effective for improving performance. ResNet-34 and ResNet-50 with TAP achieved EER values of 5.04% and 4.95%, respectively [16]. In addition, a thin-ResNet-34 with NetVLAD or GhostVLAD achieved EER values of 3.57% and 3.22%, respectively [7]. The speaker embedding dimension of these systems was 512, the same as that of our proposed methods. Our proposed encoding methods based on the scaled ResNet-34 achieved an EER of 2.86%. Consequently, the experimental results showed that our proposed methods outperformed the other state-of-the-art methods.
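The remaining two components of the proposed encoding, feature recalibration and deep length normalization, can be sketched as below. The two-layer bottleneck (ReLU then sigmoid) follows the squeeze-and-excitation pattern the recalibration layer resembles; the specific widths and parameter names are illustrative assumptions, not the paper's exact layer sizes.

```python
import numpy as np

def recalibrate(x, W1, b1, W2, b2):
    """Feature recalibration: two fully-connected layers with nonlinear
    activations produce per-dimension gates in (0, 1) that reweight the
    aggregated utterance-level feature x."""
    h = np.maximum(x @ W1 + b1, 0.0)           # FC + ReLU bottleneck
    s = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))   # FC + sigmoid gates
    return s * x                               # per-dimension reweighting

def deep_length_norm(x, eps=1e-8):
    """Deep length normalization: L2-normalize the recalibrated feature
    inside the network during training, rather than as a separate
    post-processing step on extracted embeddings."""
    return x / (np.linalg.norm(x) + eps)
```

Because the sigmoid gates lie in (0, 1), recalibration can only attenuate each dimension, and length normalization then places every embedding on the unit hypersphere, which stabilizes cosine-similarity scoring.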
Furthermore, for on-device speaker verification, the lower the speaker embedding dimension, the faster the system. Compared with other state-of-the-art encoding methods, our proposed methods are limited by their high-dimensional speaker embeddings. Future research is therefore required to address this dimensionality problem; in a future study, on-device speaker verification using low-dimensional speaker embeddings will be investigated.

Conclusions
In previous multi-layer aggregation methods for text-independent speaker verification, the number of model parameters was relatively large, and unspecified variations increased during training. Therefore, we proposed self-attentive multi-layer aggregation with feature recalibration and deep length normalization for a text-independent speaker verification system. First, we set a ResNet with scaled channel width and layer depth as the baseline. Second, self-attentive multi-layer aggregation was applied when training the frame-level features of each residual layer in the scaled ResNet. Finally, the feature recalibration layer and deep length normalization were applied to train the utterance-level feature in the encoding layer. Experimental results using the VoxCeleb1 evaluation dataset showed that the proposed method achieved an EER comparable to that of state-of-the-art models.