## 1. Introduction

Speaker recognition aims to analyze the speaker representation from input audio. A subfield of speaker recognition is speaker verification, which determines whether the utterance of the claimed speaker should be accepted or rejected by comparing it to the utterance of the registered speaker. Speaker verification is divided into text dependent and text independent. Text-dependent speaker verification aims to recognize only the specified utterances when verifying the speaker. Examples include Google’s “OK Google” and Samsung’s “Hi Bixby.” Meanwhile, text-independent speaker verification is not limited to the type of utterances to be recognized. Therefore, the problems to be solved using text-independent speaker verification are more difficult. If the performance is guaranteed, text-independent speaker verification can be utilized in various biometric systems and e-learning platforms, such as biometric authentication for chatbots, voice ID, and virtual assistants.

Owing to advances in computational power and deep learning techniques, the performance of text-independent speaker verification has been improved. Text-independent speaker verification using deep neural networks (DNN) is divided into two streams. The first one is an end-to-end system [

1]. The input of the DNN is a speech signal, and the output is the verification result. This is a single-pass operation in which all processes can be operated at once. However, the input speech of a variable length is difficult to handle. To address this problem, several studies have applied a pooling layer or temporal average layer to an end-to-end system [

2,

3]. The second is a speaker embedding-based system [

4,

5,

6,

7,

8,

9,

10,

11,

12,

13,

14], which generates an input of variable length into a vector of fixed length using a DNN. The generated vector is used as an embedding to represent the speaker. The speaker embedding-based system can handle input speech of variable length and can generate speaker representations from various environments.

As shown in

Figure 1, a DNN has been used as a speaker embedding extractor in a speaker embedding-based system. In general, a speaker embedding-based system executes the following processes [

4,

5,

6,

7]:

The speaker classification model is trained.

The speaker embedding is extracted by using the output value of the inner layer of the speaker classification model.

The similarity between the embedding of the registered speaker and the claimed speaker is computed.

The acceptance or rejection is determined by a previously decided threshold value.

In addition, back-end methods, for example, probabilistic linear discriminant analysis, can be used [

8,

9,

10].

The most important part in the above system is the speaker embedding generation [

13]. Speaker embedding is a high-dimensional feature vector that contains speaker information. An ideal speaker-embedding maximizes inter-class variations and minimizes intra-class variations [

10,

14,

15]. The component that directly affects the speaker embedding generation is the encoding layer. The encoding layer takes a frame-level feature and converts it into a compact utterance-level feature. It also converts variable-length features to fixed-length features.

Most encoding layers are based on various pooling methods, for example, temporal average pooling (TAP) [

10,

14,

16], global average pooling (GAP) [

13,

15], and statistical pooling (SP) [

6,

14,

17,

18]. In particular, self-attentive pooling (SAP) has improved performance by focusing on the frames for a more discriminative utterance-level feature [

10,

19,

20], and pooling layers provide compressed speaker information by rescaling the input size. These are mainly used with convolutional neural networks (CNN) [

10,

13,

14,

15,

16,

17,

20]. The speaker embedding is extracted using the output value of the last pooling layer in a CNN-based speaker model.

To improve the representational power of the speaker embedding, residual learning derived from ResNet [

21] and squeeze-and-excitation (SE) blocks [

22] were adapted for the speaker models [

10,

13,

14,

15,

16,

20,

23]. Residual learning maintains input information through mappings between layers called “shortcut connections.” A large-scale CNN using shortcut connections can avoid gradient degradation. The SE block consists of a squeeze operation (which condenses all of the information on the features) and an excitation operation (which scales the importance of each feature). Therefore, a channel-wise feature response can be adjusted without significantly increasing the model complexity in the training.

The main limitation of the previous encoding layers is that the model uses only the output feature of the last pooling layer as input. In other words, the model uses only one frame-level feature when performing speaker embedding. Therefore, similar to [

14,

24], a previous study presented a shortcut connection-based multi-layer aggregation to improve the speaker representations when calculating the weight at the encoding layer [

13]. Specifically, the frame-level features are extracted from between each residual layer in ResNet. Then, these frame-level features are fed into the input of the encoding layer using shortcut connections. Consequently, a high-dimensional speaker embedding is generated.

However, the previous study [

13] has limitations. First, the model parameter size is relatively large, and the model generates high-dimensional speaker embeddings (1024 dimensions, about 15 million model parameters). This leads to inefficient training and thus requires a sufficiently large amount of data for training. Second, the multi-layer aggregation approach increases not only the speaker’s information but also the intrinsic and extrinsic variation factors, for example, emotion, noise, and reverberation. Some of these unspecified factors increase variability while generating speaker embedding.

Hence, we propose a self-attentive multi-layer aggregation with feature recalibration and deep length normalization for a text-independent speaker verification system, as shown in

Figure 2. We present an improved version of the previous study, as described in the following steps:

The remainder of this paper is organized as follows.

Section 2 describes a baseline system using shortcut connections-based multi-layer aggregation.

Section 3 introduces the proposed self-attentive multi-layer aggregation method with feature recalibration and normalization.

Section 4 discusses our experiments, and conclusions are drawn in

Section 5.

## 3. Self-Attentive Multi-Layer Aggregation with Feature Recalibration and Normalization

As discussed in

Section 1, the previous study has two problems. The model parameter problem is addressed by building a scaled ResNet-34. However, the problem of multi-layer aggregation remains. Multi-layer aggregation uses output features of multiple layers to develop the speaker embedding system. It is assumed that not only speaker information but also other unspecified factors exist in the output feature of the layer. The unspecified factor lowers the speaker verification performance. Therefore, we proposed three methods: self-attentive multi-layer aggregation, feature recalibration, and deep length normalization.

#### 3.1. Model Architecture

As presented in

Figure 2 and

Table 4, the proposed network mainly consists of a scaled ResNet and an encoding layer. Frame-level features are trained in the scaled ResNet, and utterance-level features are trained in the encoding layer.

In the scaled ResNet, given an input feature

$\mathit{X}=\left[{\mathit{x}}_{1},{\mathit{x}}_{2},\dots ,{\mathit{x}}_{l},\dots ,{\mathit{x}}_{L}\right]$ of length

$L$ (

${\mathit{x}}_{l}\in {\mathbb{R}}^{d}$), output features

${\mathit{P}}_{i}=\left[{\mathit{p}}_{1},{\mathit{p}}_{2},\dots ,{\mathit{p}}_{c},\dots ,{\mathit{p}}_{C}\right]$ (

${\mathit{p}}_{c}\in \mathbb{R}$) from each residual layer of the scaled ResNet are generated using SAP. Here, the length

${C}_{i}$ is determined by the number of channels in the

${i}^{th}$ residual layer. Then, the generated output features are concatenated into one feature

$\mathit{V}$ as in Equation (1) (where

$[+]$ indicates concatenation).

The concatenated feature $\mathit{V}=\left[{\mathit{v}}_{1},{\mathit{v}}_{2},\dots ,{\mathit{v}}_{c},\dots ,{\mathit{v}}_{C}\right]$ (length $C={C}_{1}+{C}_{2}+{C}_{3}+{C}_{4}+{C}_{5}$, ${\mathit{v}}_{c}\in \mathbb{R}$) is a set of frame-level features and is used as the input of the encoding layer.

The encoding layer comprises a feature recalibration layer and a deep length normalization layer. In the feature recalibration layer, the concatenated feature $\mathit{V}$ is recalibrated by fully-connected layers and nonlinear activations. Consequently, a recalibrated feature $\stackrel{\xb4}{\mathit{V}}=\left[{\stackrel{\xb4}{\mathit{v}}}_{1},{\stackrel{\xb4}{\mathit{v}}}_{2},\dots ,{\stackrel{\xb4}{\mathit{v}}}_{c},\dots ,{\stackrel{\xb4}{\mathit{v}}}_{C}\right]$ (${\stackrel{\xb4}{\mathit{v}}}_{c}\in \mathbb{R}$) is generated. Then, the recalibrated feature is normalized according to the length of input $\stackrel{\xb4}{\mathit{V}}$ in the deep-length normalization layer. The normalized feature is used as a speaker embedding and is fed into the output layer. Further, a log probability for speaker classes $s$, $\mathit{P}(sp{k}_{s}|{\mathit{x}}_{1},{\mathit{x}}_{2},\dots ,{\mathit{x}}_{l},\dots ,{\mathit{x}}_{L}$), is generated in the output layer.

#### 3.2. Self-Attentive Multi-Layer Aggregation

As shown in

Figure 2 and

Figure 3, SAP is applied to each residual layer using shortcut connections. For every input feature, given an output feature of the first convolution layer or the

${i}^{th}$ residual layers after conducting an average pooling,

${\mathit{Y}}_{i}=\left[{\mathit{y}}_{1},{\mathit{y}}_{2},\dots ,{\mathit{y}}_{n},\dots ,{\mathit{y}}_{N}\right]$ of length

$N$ (

${\mathit{y}}_{n}\in {\mathbb{R}}^{c}$) is obtained. The number of dimensions

$c$ is determined by the number of channels.

Then, the average feature is fed into a fully-connected hidden layer to obtain

${\mathit{H}}_{i}=\left[{\mathit{h}}_{1},{\mathit{h}}_{2},\text{}\dots ,\text{}{\mathit{h}}_{n},\dots ,{\mathit{h}}_{N}\right]$ using a hyperbolic tangent activation function. Given

${\mathit{h}}_{n}\in {\mathbb{R}}^{c}$ and a learnable context vector

$\mathit{u}\in {\mathbb{R}}^{c}$, the attention weight

${\mathit{w}}_{n}$ is measured by training the similarity between

${\mathit{h}}_{n}$ and

$\mathit{u}$ with a softmax normalization as in Equation (2).

Then, the embedding

$\mathit{e}\in {\mathbb{R}}^{c}$ is generated using the weighted sum of the normalized attention weights

${\mathit{w}}_{n}$ and

${\mathit{y}}_{n}$ as in Equation (3).

The embedding vector $\mathit{e}$ can be rewritten as ${\mathit{P}}_{i}=\left[{\mathit{p}}_{1},{\mathit{p}}_{2},\dots ,{\mathit{p}}_{c},\dots ,{\mathit{p}}_{C}\right]$ (${\mathit{p}}_{c}\in \mathbb{R}$) in the order of the dimensions. Consequently, the SAP output feature ${\mathit{P}}_{i}$ is generated. This process helps generate a more discriminative feature while focusing on the frame-level features of each layer. Moreover, dropout regularization and batch normalization are used in ${\mathit{P}}_{i}$. Then, the generated features are concatenated into one feature, $\mathit{V}$, as in Equation (1).

#### 3.3. Feature Recalibration

After the self-attentive multi-layer aggregation, the concatenated feature

$\mathit{V}$ is fed into the feature recalibration layer. The feature recalibration layer aims to train the correlations between each channel of the concatenated feature; this is inspired by [

22].

Given an input feature

$\mathit{V}=\left[{\mathit{v}}_{1},{\mathit{v}}_{2},\dots ,{\mathit{v}}_{c},\dots ,{\mathit{v}}_{C}\right]$ (

${\mathit{v}}_{c}\in \mathbb{R}$, where

$C$ is the sum of all channels), the feature channels are recalibrated using two fully-connected layers and nonlinear activations, as in Equation (4).

Here, $\delta $ refers to the leaky rectified linear unit activation; $\sigma $ refers to the sigmoid activation; ${\mathit{W}}_{1}$ is the front fully-connected layer, ${\mathit{W}}_{1}\in {\mathbb{R}}^{c\times \frac{c}{r}}$, and ${\mathit{W}}_{2}$ is the back fully-connected layer, ${\mathit{W}}_{2}\in {\mathbb{R}}^{\frac{c}{r}\times c}$. According to the reduction ratio $r$, a dimensional transformation is performed between the two fully-connected layers, such as a bottleneck structure, while channel-wise multiplication is performed. The rescaled channels are then multiplied by the input feature $\mathit{V}$. Consequently, an output feature $\stackrel{\xb4}{\mathit{V}}=\left[{\stackrel{\xb4}{\mathit{v}}}_{1},{\stackrel{\xb4}{\mathit{v}}}_{2},\dots ,{\stackrel{\xb4}{\mathit{v}}}_{c},\dots ,{\stackrel{\xb4}{\mathit{v}}}_{C}\right]$ (${\stackrel{\xb4}{\mathit{v}}}_{c}\in \mathbb{R}$) is generated. This generated feature $\stackrel{\xb4}{\mathit{V}}$ is the result of recalibration according to the importance of the channels.

#### 3.4. Deep Length Normalization

As in [

11], deep length normalization was applied to the proposed model. The L2 constraint is applied to the length axis of the recalibrated feature

$\stackrel{\xb4}{\mathit{V}}$ with a scale constant,

$\alpha $, as in Equation (5).

Then, the normalized

$\stackrel{\xb4}{\mathit{V}}$ is fed into the output layer for speaker classification. This feature is used as a speaker embedding, as shown in

Figure 4.