Global–Local Self-Attention Based Transformer for Speaker Verification

Abstract: Transformer models are now widely used for speech processing tasks due to their powerful sequence modeling capabilities. Previous work established an efficient way to model speaker embeddings with the Transformer by combining transformers with convolutional networks. However, the traditional global self-attention mechanism lacks the ability to capture local information. To alleviate this problem, we propose a novel global–local self-attention mechanism. Instead of using local or global multi-head attention alone, this method performs local and global attention in two parallel groups of attention heads to enhance local modeling and reduce computational cost. To better handle local position information, we introduce locally enhanced positional encoding for the speaker verification task. Experimental results on the VoxCeleb1 test set and the VoxCeleb2 dev set demonstrate the effectiveness of the proposed global–local self-attention mechanism. Compared with the Transformer-based Robust Embedding Extractor Baseline System, the proposed Speaker Transformer network exhibited better performance on the speaker verification task.


Introduction
Speaker verification determines whether the identity of the speaker of a test utterance is the same as that of the speaker of a reference speech on the basis of the enrollment utterances of the speaker. Research in this area has mainly focused on obtaining a fixed-dimensional vector representing an utterance, known as speaker embedding. These speaker embeddings are then scored to verify the speaker's identity. Research on speaker embedding extractors aims to enhance inter-speaker variability and suppress intra-speaker variability. In general, extracting speaker embeddings is a crucial factor that largely determines the performance of speaker verification systems.
In recent years, the Transformer model [1] has demonstrated excellent performance in natural language processing (NLP), and interest in applying it to speech processing has grown rapidly. Inspired by its success in NLP, some studies [2][3][4] have tried to apply the Transformer model to speaker recognition by replacing segment-level pooling layers or frame-level convolutional layers. However, these self-attention approaches are still dominated by conventional backbones, such as Residual Networks (ResNets) [5][6][7], Time-Delay Neural Networks (TDNNs) [8,9], and Long Short-Term Memory (LSTM) networks [10]. Other works, such as s-vectors [11], have used the Transformer as the backbone but still did not alleviate the Transformer's limitations in speaker verification.
Since the input is a speech signal, the speaker verification task differs from NLP tasks. There are two challenges in applying the Transformer to speaker verification: (1) Transformers are difficult to scale efficiently because acoustic feature sequences are much longer than text sentences [12]; (2) compared with CNNs, the Transformer is weaker at capturing local information. Given the success of networks such as TDNNs that focus on local features, local features are generally believed to improve network performance, so we expect that enhancing the Transformer model's ability to capture local information will improve its performance in speaker recognition.
In this work, we propose global-local self-attention to enable the Transformer model to model local features while maintaining the ability to model long-distance dependencies. We divided the attention heads into different groups and performed local or global self-attention operations in each group. This strategy of splitting attention heads does not introduce additional computational cost, and it enhances the ability to capture global dependencies and model local information. In the encoder and decoder, we used additional skip connections to aggregate features at different levels. Furthermore, we introduced locally enhanced positional encoding to further enhance the locality of the model. Without adding extra computation, we improved the Transformer's performance on speaker verification tasks by combining multi-level features and enhancing local information.
The paper is organized as follows: Section 2 reviews the previous work related to the self-attention mechanism in speaker recognition. Section 3 presents and explains our proposed model. In Section 4, we discuss the experimental details and analyze the results. Section 5 concludes this paper.

Related Work
Convolutional neural networks dominate the field of speaker recognition and have achieved great success. Recently, owing to the excellent performance of the Transformer in NLP and speech recognition, several works have studied how to apply the Transformer to speaker recognition.
The attention mechanism is at the heart of Transformer's excellent performance. A line of work applies attention mechanisms to pooling mechanisms for speaker recognition as an alternative to aggregating temporal information. Okabe et al. [9] proposed an attentive statistics pooling method that provided the importance of the frame. The attention mechanism was combined with a TDNN-based embedding extractor to assign different weights to different frames and generate weighted means and standard deviation. Cai et al. [5] and Zhu et al. [13] proposed a pooling layer incorporating a self-attention mechanism to obtain utterance-level representations. Wu et al. [14] improved it by adopting vectorial attention instead of scalar attention. India et al. [2] presented double multi-head attention pooling, which extended the previously proposed self-multi-head attention-based method. An additional self-attention layer, which enhanced the pooling mechanism by assigning weights to the information captured by each head, was added to the pooling layer. Wang et al. [15] proposed multi-resolution multi-head attention pooling, which fused the attention weights of different resolutions to improve the diversity of attention heads. Instead of utilizing multi-head attention in parallel, Zhu et al. [3] proposed serialized multi-layer multi-head attention, which aimed to aggregate and propagate attention statistics from one layer to the next in a serialized manner.
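The attentive statistics pooling of Okabe et al. [9] can be summarized numerically: a small attention network scores each frame, and the softmax-normalized scores weight both the mean and the standard deviation of the frame-level features. The following is a minimal numpy sketch; the single-layer `tanh` attention network and all dimensions are illustrative assumptions, not the exact configuration of [9].

```python
import numpy as np

def attentive_stats_pooling(h, W, v):
    """h: (T, D) frame-level features; W: (D, A) and v: (A,) are the
    attention parameters. Returns a (2*D,) utterance-level vector:
    attention-weighted mean concatenated with weighted std."""
    e = np.tanh(h @ W) @ v                  # (T,) per-frame scores
    a = np.exp(e - e.max())                 # softmax over frames
    a /= a.sum()
    mu = (a[:, None] * h).sum(axis=0)       # weighted mean, (D,)
    var = (a[:, None] * h ** 2).sum(axis=0) - mu ** 2
    sigma = np.sqrt(np.clip(var, 1e-12, None))  # weighted std, (D,)
    return np.concatenate([mu, sigma])

rng = np.random.default_rng(0)
h = rng.standard_normal((200, 64))          # 200 frames, 64-dim features
pooled = attentive_stats_pooling(h, rng.standard_normal((64, 32)),
                                 rng.standard_normal(32))
print(pooled.shape)                         # (128,)
```

The weighted standard deviation is what distinguishes this from plain attentive pooling: it keeps second-order statistics that help discriminate speakers.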
Different from the above studies, some works have focused on channel-wise attention. Yu et al. [16] proposed a dynamic channel-wise selection mechanism based on softmax attention, integrating information from multiple network branches. Jiang et al. [17] introduced a gating mechanism that provides channel-wise attention by exploiting inter-dependencies across channels. These works extended the attention mechanism to the channel dimension to select more important channel information, but this yielded only limited improvement in speaker recognition performance.
In recent years, some works have directly stacked attention layers as part of, or as the whole of, the embedding extractor. Shi et al. [4,18] applied attention layers and stacked Transformer encoders on frame-level and segment-level encoders, respectively, to capture speaker information locally and globally. The study by Shi et al. [4] improved on Shi et al. [18] by using Transformer encoders with memory in place of the attention layer, and it proposed using Transformer blocks to process acoustic features divided into segments; however, it did not integrate the window-splitting operation into the Transformer module itself. Desplanques et al. [19] further incorporated channel attention with a global context into the frame-level layers and the statistics pooling layer for better performance. These works, such as [8], are still dominated by sophisticated convolutional networks. Conversely, Safari et al. [20] proposed a serialized multi-layer multi-head attention system consisting of three main stages: a frame-level feature processor, a serialized attention mechanism, and a speaker classifier. The frame-level feature processor used a TDNN to extract high-level representations of the input acoustic features. The serialized attention mechanism was a concatenated self-attention encoding structure that stacked Transformer encoder blocks followed by an additional attention pooling, aggregating variable-length feature sequences into a fixed-dimensional representation to create discriminative speaker embeddings. Metilda et al. [11] proposed s-vectors, which replaced the TDNN of [8] with stacked Transformer encoder modules followed by a statistical pooling layer and two linear layers; to better capture speaker characteristics, that work used self-attention as the backbone of its architecture.
Its advantage is that it is not restricted to a limited context and attends to all frames at each time step. These works show that Transformer models have the potential to be applied to speaker verification. However, Transformer-based embedding extractors suffer from inferior performance in speaker recognition due to their limited capacity to capture local features. Wang et al. [12] proposed a multi-view attention mechanism that captured long-distance dependencies and modeled locality by controlling the self-attention receptive field of each head with a head-wise masking matrix. This work made some progress on the problem, using a mask to realize local self-attention, but masking already-computed attention results wastes computing resources.

Proposed Architecture
Using the original self-attention alone may not be sufficient to capture local contextual features of utterances. To better capture speaker features, we proposed global-local self-attention in our architecture. In this section, we introduce the structure of each module and explain how these designs were incorporated into our proposed model. The following subsections focus on the different submodules. Figure 1 presents the complete architecture of the model. BN stands for Batch Normalization [21].

Overall Architecture
The overall architecture of our proposed method is shown in Figure 1. The input was 80-dimensional mel-filter banks, passed through a one-dimensional convolutional layer (kernel size 3, stride 1) to obtain a C × T output, where C and T denote the channel and time dimensions, respectively. The convolution uses overlapping windows to form coarse features, which lay the foundation for extracting speaker-discriminative embeddings. We used an encoder-decoder architecture as the embedding extractor. After the decoder, we employed an x-vector-like architecture consisting of attentive pooling [9] followed by a fully connected layer to generate the final speaker-characterizing embedding. The whole system contains 25.2 million parameters.
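The shape bookkeeping of the front-end can be checked numerically. The sketch below applies a kernel-3, stride-1 one-dimensional convolution with same padding to an 80-dimensional filter-bank sequence; the choice of C = 512 channels follows the experimental setup in Section 4, and same padding is an assumption inferred from the stated C × T output size.

```python
import numpy as np

def conv1d_same(x, w):
    """x: (F, T) filter-bank input; w: (C, F, 3) kernel bank.
    Kernel size 3, stride 1, zero same-padding -> output (C, T)."""
    F, T = x.shape
    xp = np.pad(x, ((0, 0), (1, 1)))       # pad 1 frame on each side
    out = np.empty((w.shape[0], T))
    for t in range(T):
        # correlate each output channel's kernel with a 3-frame context
        out[:, t] = np.tensordot(w, xp[:, t:t + 3], axes=([1, 2], [0, 1]))
    return out

rng = np.random.default_rng(0)
fbank = rng.standard_normal((80, 300))     # 80 mel bins, 300 frames
feat = conv1d_same(fbank, rng.standard_normal((512, 80, 3)) * 0.01)
print(feat.shape)                          # (512, 300)
```

The time dimension is preserved, so the subsequent Transformer blocks see one feature vector per frame.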

Transformer Block
The overall topology of the Transformer block is illustrated in Figure 2a. It differs from the original Transformer module [1] in two ways: we replaced the multi-head self-attention mechanism with our proposed global-local self-attention mechanism, and, to introduce a local inductive bias, locally enhanced positional encoding [22] was added as a module parallel to the self-attention mechanism. The Transformer block maintains the size of the feature maps and uses an MLP ratio of 3.2 and 8 attention heads. The Transformer block is formally defined as:

X̂_l = X_{l−1} + GL-Attention(LN(X_{l−1})),
X_l = X̂_l + MLP(LN(X̂_l)),

where X_l denotes the output of the l-th Transformer block of the encoder or decoder (for the first block of each module, X_{l−1} is the output of the previous module) and LN denotes layer normalization.
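The pre-norm residual composition above can be sketched numerically. In the following, the attention and MLP bodies are stand-in functions (not the paper's trained sub-modules); only the residual structure and the ~3.2 MLP expansion ratio follow the text.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(-1, keepdims=True) + eps)

def transformer_block(x, attn, mlp):
    """x: (T, C). X_hat = X + Attn(LN(X)); X_out = X_hat + MLP(LN(X_hat))."""
    x_hat = x + attn(layer_norm(x))
    return x_hat + mlp(layer_norm(x_hat))

rng = np.random.default_rng(0)
x = rng.standard_normal((50, 64))
# MLP with hidden size ~ 3.2 * 64 = 205 (expansion ratio from the text)
W1 = rng.standard_normal((64, 205)) * 0.05
W2 = rng.standard_normal((205, 64)) * 0.05
out = transformer_block(x,
                        attn=lambda h: h * 0.5,                    # placeholder attention
                        mlp=lambda h: np.maximum(h @ W1, 0) @ W2)  # ReLU MLP
print(out.shape)                                                   # (50, 64)
```

Because both branches are residual, the block preserves the (T, C) feature-map size, as stated above.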

Global-Local Self-Attention
Despite its strong ability to model global dependencies, the original full self-attention mechanism struggles to capture local information in utterances, which are much longer than text. In a recent study [12], local self-attention with a sliding window was applied to speaker recognition and achieved competitive performance. Inspired by [23], we propose a novel global-local self-attention mechanism that improves the capability to capture local features while retaining the capability to model long-distance dependencies. As shown in Figure 2b, for half of the channels of the feature maps, self-attention is implemented as local self-attention, with each attention head using a sliding window of the same size, while for the other half of the channels it is implemented as global self-attention without a sliding window. As in the original full self-attention mechanism, the input features X ∈ R^{C×T} are linearly transformed into K attention heads, and each attention head then performs local or global self-attention. For local self-attention, we used a non-overlapping sliding window to partition X into segments [X^1, · · ·, X^N] of equal window size w. Assuming that the queries, keys, and values of the k-th attention head all have dimension d_k, the proposed local self-attention mechanism for the k-th head is defined as:

Y_k^i = Attention(X^i W_k^Q, X^i W_k^K, X^i W_k^V), i = 1, · · ·, N,
Local-Attention_k(X) = [Y_k^1, · · ·, Y_k^N],

where Attention(Q, K, V) = softmax(QK^T / √d_k)V, and W_k^Q, W_k^K, W_k^V ∈ R^{C×d_k} are the linear projection parameter matrices of the queries, keys, and values for the k-th attention head, with d_k set to C/K. We divided the K attention heads equally into two distinct groups; K is usually even, so the heads split evenly. The first group of attention heads performs local self-attention, while the second group performs global self-attention.
The global self-attention differs from standard multi-head self-attention only in that it adds locally enhanced positional encoding; the output of the k-th head is denoted Global-Attention_k(X).
Finally, the outputs of the two kinds of attention are concatenated as the input of the MLP, denoted GL-Attention(X):

GL-Attention(X) = cat(head_1, · · ·, head_K) W, (7)

where head_k is Local-Attention_k(X) for the local group and Global-Attention_k(X) for the global group, and W ∈ R^{C×C} is the projection matrix that maps the concatenated self-attention results to the target output dimension. The key design is to split the attention heads into two groups and perform local and global self-attention operations in parallel. This lets local attention operate under the guidance of global attention, so global and local information interact more effectively. We adopted the window size that empirically outperformed the other candidates.
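The head-splitting scheme can be sketched end to end in numpy. The code below is a minimal single-utterance sketch: the first K/2 heads attend within non-overlapping windows of w frames, the remaining heads attend globally, and the concatenated heads are projected by W as in Eq. (7). Matrix sizes and the 0.1 initialization scale are illustrative assumptions; LePE is omitted here for clarity.

```python
import numpy as np

def softmax(z):
    z = z - z.max(-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

def attention(q, k, v):
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def gl_attention(x, Wq, Wk, Wv, Wo, num_heads=8, window=25):
    """x: (T, C) with T divisible by window. Heads 0..K/2-1 perform
    windowed local attention; heads K/2..K-1 perform global attention.
    Head outputs are concatenated and projected by Wo (Eq. (7))."""
    T, C = x.shape
    dk = C // num_heads
    heads = []
    for h in range(num_heads):
        sl = slice(h * dk, (h + 1) * dk)
        q, k, v = x @ Wq[:, sl], x @ Wk[:, sl], x @ Wv[:, sl]
        if h < num_heads // 2:   # local group: attend inside each window
            out = np.concatenate(
                [attention(q[i:i + window], k[i:i + window], v[i:i + window])
                 for i in range(0, T, window)])
        else:                    # global group: attend over all frames
            out = attention(q, k, v)
        heads.append(out)
    return np.concatenate(heads, axis=-1) @ Wo

rng = np.random.default_rng(0)
x = rng.standard_normal((100, 64))                  # 100 frames, C = 64
W = lambda: rng.standard_normal((64, 64)) * 0.1
out = gl_attention(x, W(), W(), W(), W(), num_heads=8, window=25)
print(out.shape)                                    # (100, 64)
```

Note that the local group's cost is linear in the number of windows (each window is w × w), which is where the computational saving over full attention comes from.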

Locally Enhanced Positional Encoding
The positional encoding mechanism plays a pivotal role in the Transformer model. Since the self-attention operation is permutation-invariant, it ignores position information within the input features. To add this information, we considered a straightforward way of adding position information to the linearly projected values. In addition, we wanted each input element to pay more attention to the position information of its local neighborhood. Therefore, we adopted the locally enhanced positional encoding (LePE) method. LePE is generated by applying a depth-wise convolutional layer [24] to the value V. Given the matrices Q, K, and V in the Transformer model, after adding LePE, the proposed self-attention mechanism can be formulated as:

Attention(Q, K, V) = softmax(QK^T / √d_k)V + DWConv(V),

where DWConv denotes the depth-wise convolution that produces LePE. In this way, LePE conveniently adds local contextual position information to the input elements.
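A minimal numpy sketch of LePE follows: a per-channel (depth-wise) 3-tap convolution is applied to V and added to the attention output. The 3-tap kernel size and the same-padding choice are illustrative assumptions; the paper only specifies that LePE is a depth-wise convolution on V.

```python
import numpy as np

def softmax(z):
    z = z - z.max(-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

def lepe(v, w):
    """Depth-wise 1-D convolution on V: channel c gets its own 3-tap
    kernel w[c]. v: (T, d), w: (d, 3) -> (T, d)."""
    vp = np.pad(v, ((1, 1), (0, 0)))                 # same padding in time
    return np.stack([np.convolve(vp[:, c], w[c][::-1], mode='valid')
                     for c in range(v.shape[1])], axis=1)

def attention_with_lepe(q, k, v, w):
    """softmax(QK^T / sqrt(d)) V + DWConv(V), per the LePE formulation."""
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v + lepe(v, w)

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((60, 16)) for _ in range(3))
out = attention_with_lepe(q, k, v, rng.standard_normal((16, 3)) * 0.1)
print(out.shape)                                     # (60, 16)
```

Because the convolution acts directly on V, the positional bias travels with the values rather than with the attention scores, which is what makes it cheap to add in parallel with attention.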

Encoder
In our work, the encoder consisted of N_i sequential Transformer encoder blocks. Deeper features are generally considered more complex and can effectively represent the speaker's identity. However, evidence in [19] suggests that shallow feature maps in hierarchical networks also contribute to more robust speaker embeddings, and we argue that this also holds for our proposed Transformer model. After the sequence of Transformer blocks, we concatenated the outputs of each Transformer block via skip connections to generate new feature maps. A fully connected layer (called the sub-block aggregation net) then processed the aggregated features and adjusted their dimension to match the input of the next module. The decoder has the same architecture as the encoder, comprising several sequential Transformer blocks and a sub-block aggregation net; the difference is that the encoder is generally deeper than the decoder. For both the encoder and the decoder, we applied layer normalization [25] to the aggregated information before the sub-block aggregation net. In this work, we used a 4-layer encoder and a 3-layer decoder.
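The aggregation described above can be sketched as follows: run the blocks sequentially, concatenate every block's output along the channel axis, layer-normalize, and project back to C dimensions with the sub-block aggregation net. The stand-in residual blocks and the initialization scales are illustrative assumptions.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(-1, keepdims=True) + eps)

def encoder(x, blocks, W_agg):
    """x: (T, C). Concatenate each block's output (skip connections),
    layer-normalize, then project (N*C -> C) with the aggregation net."""
    outs = []
    for blk in blocks:
        x = blk(x)
        outs.append(x)
    agg = layer_norm(np.concatenate(outs, axis=-1))  # (T, N*C)
    return agg @ W_agg                               # (T, C)

rng = np.random.default_rng(0)
C, N = 64, 4                                         # 4-layer encoder
blocks = [lambda h, s=rng.standard_normal((C, C)) * 0.05: h + h @ s
          for _ in range(N)]                         # stand-in residual blocks
x = rng.standard_normal((100, C))
out = encoder(x, blocks, rng.standard_normal((N * C, C)) * 0.05)
print(out.shape)                                     # (100, 64)
```

Since the concatenation is projected back to C immediately, only one N·C-wide tensor exists at a time, which is consistent with the memory argument in the next paragraph.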
In our proposed architecture, we used the sub-block aggregation net in the encoder and decoder to aggregate features at different levels, which can prevent the model from consuming too much memory. The encoder extracted the fixed-length representation from the coarse speech features output by the convolutional layer as intermediate features passed to the decoder to obtain utterance-level speaker embeddings. Finally, the output of the decoder was used by the pooling layer to generate the final speaker embedding.

Data and Features
For this work, we used the VoxCeleb dataset as our training and test data. The dataset has two versions, VoxCeleb1 [26] and VoxCeleb2 [27]. VoxCeleb1 contains over 100,000 utterances from 1251 celebrities, while VoxCeleb2 contains over 1 million utterances from 6112 identities; there is no overlap between the two versions. All data preparation steps were performed with the SpeechBrain VoxCeleb recipe [28]. All our systems were trained in SpeechBrain and evaluated on the VoxCeleb1 test sets.
The input features were 80-dimensional mel-filter banks computed with a 25 ms window and a 10 ms shift, and the channel dimension of the model was 512. To make the input length divisible by the window size, we dropped the last frames as needed. Data augmentation followed the SpeechBrain VoxCeleb recipe during training, in combination with the publicly available RIR dataset provided in [29]. Finally, we applied SpecAugment [30], which randomly masked 0 to 5 frames in the time domain and 0 to 10 channels in the frequency domain.
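The SpecAugment masking described above can be sketched in a few lines of numpy: one random time span of up to 5 frames and one random band of up to 10 frequency channels are zeroed out. Using a single mask per axis is an assumption; the recipe's exact number of masks may differ.

```python
import numpy as np

def spec_augment(feat, rng, max_t=5, max_f=10):
    """feat: (F, T) filter-bank features. Zero a random span of up to
    max_t frames and up to max_f frequency channels (Section 4 ranges)."""
    feat = feat.copy()
    F, T = feat.shape
    t = rng.integers(0, max_t + 1)          # time-mask width in [0, 5]
    t0 = rng.integers(0, T - t + 1)
    f = rng.integers(0, max_f + 1)          # frequency-mask width in [0, 10]
    f0 = rng.integers(0, F - f + 1)
    feat[:, t0:t0 + t] = 0.0
    feat[f0:f0 + f, :] = 0.0
    return feat

rng = np.random.default_rng(0)
x = np.ones((80, 300))                      # dummy 80 x 300 feature map
aug = spec_augment(x, rng)
print(aug.shape)                            # (80, 300)
```

Masking is applied on the fly during training, so each epoch sees differently corrupted copies of the same utterance.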

Experiment Setup
The AdamW optimizer with a weight decay of 0.1 was used. We used a mini-batch size of 64 and an initial learning rate of 5 × 10^−4. All models were trained with the CyclicLR scheduler and the AdamW optimizer, with a minimum learning rate of 1 × 10^−5; the step size of one cycle was set to 80 k iterations. All models were trained on the VoxCeleb2 dev set with AAM-softmax [31,32], using a margin of 0.2 and a softmax prescaling of 30. To make the size of the feature maps in the Transformer block divisible by w, we chose the window size w from 20, 25, and 30 following the evidence in [4]. Each window size was tested on the VoxCeleb1 test set, and the results are analyzed in the results section.
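The AAM-softmax objective with margin 0.2 and scale 30 can be sketched as follows: embeddings and class weights are L2-normalized, the target class's cosine similarity cos(θ) is replaced by cos(θ + m), and everything is scaled by s before the cross-entropy. This is a generic numpy sketch of the additive-angular-margin logit computation in [31,32], not the exact SpeechBrain implementation (which also handles numerical edge cases near θ = π).

```python
import numpy as np

def aam_softmax_logits(emb, W, labels, margin=0.2, scale=30.0):
    """emb: (B, D) embeddings; W: (D, n_classes) class weights;
    labels: (B,) target class indices. Returns (B, n_classes) logits:
    s * cos(theta + m) on the target class, s * cos(theta) elsewhere."""
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    W = W / np.linalg.norm(W, axis=0, keepdims=True)
    cos = np.clip(emb @ W, -1.0, 1.0)
    theta = np.arccos(cos)
    target = np.zeros_like(cos, dtype=bool)
    target[np.arange(len(labels)), labels] = True
    return scale * np.where(target, np.cos(theta + margin), cos)

rng = np.random.default_rng(0)
emb = rng.standard_normal((4, 32))
Wc = rng.standard_normal((32, 10))
labels = np.array([1, 3, 5, 7])
logits = aam_softmax_logits(emb, Wc, labels)
print(logits.shape)                     # (4, 10)
```

The margin shrinks the target logit, forcing the network to enlarge the angular gap between speakers; the scale of 30 sharpens the resulting softmax.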

System Evaluation
We adopted the standard equal error rate (EER) and the minimum normalized detection cost (MinDCF) as evaluation metrics to compare our proposed system with previous work. For the MinDCF calculation, we assumed P_target = 10^−2 and C_FA = C_Miss = 1. The EER is the operating point at which the false acceptance rate equals the false rejection rate. MinDCF takes into account the different costs of false rejections and false acceptances as well as the prior probabilities of true speakers and impostors. We also show the DET curve of our proposed method. All our proposed models use a cosine similarity backend for scoring. We analyze the proposed model architecture with a concise ablation study in the next section.
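Both metrics can be computed by sweeping a decision threshold over the trial scores. The numpy sketch below is one standard way to estimate EER and normalized MinDCF from labeled scores; the synthetic score distributions are illustrative, and production toolkits interpolate the EER crossing more carefully than this sketch does.

```python
import numpy as np

def eer_mindcf(scores, labels, p_target=0.01, c_fa=1.0, c_miss=1.0):
    """scores: similarity scores; labels: 1 = target (same speaker) trial.
    Sweeps thresholds over the sorted scores; EER is where the false
    rejection rate meets the false acceptance rate, MinDCF is the minimum
    normalized detection cost."""
    order = np.argsort(scores)
    lab = np.asarray(labels, dtype=float)[order]
    n_tar, n_non = lab.sum(), (1 - lab).sum()
    frr = np.cumsum(lab) / n_tar              # misses grow as threshold rises
    far = 1 - np.cumsum(1 - lab) / n_non      # false accepts shrink
    i = np.argmin(np.abs(frr - far))
    eer = (frr[i] + far[i]) / 2
    dcf = c_miss * p_target * frr + c_fa * (1 - p_target) * far
    return eer, dcf.min() / min(c_miss * p_target, c_fa * (1 - p_target))

rng = np.random.default_rng(0)
tar = rng.normal(1.0, 0.5, 500)               # synthetic target-trial scores
non = rng.normal(-1.0, 0.5, 500)              # synthetic non-target scores
eer, mdcf = eer_mindcf(np.concatenate([tar, non]),
                       np.concatenate([np.ones(500), np.zeros(500)]))
print(round(eer, 3), round(mdcf, 3))
```

With P_target = 0.01, the normalization divisor is min(C_Miss · P_target, C_FA · (1 − P_target)) = 0.01, so MinDCF is bounded above by 1 (the cost of always rejecting).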

Results
In this section, we compare the proposed method with several strong prior systems and analyze the results. An overview of system performance is shown in Table 1, covering VGG [26], TDNN [8,9,14], ResNet [6], Transformer [11,12,20], and our proposed architecture. According to the VoxCeleb2 speaker verification results in Table 1, our model achieved better EER and MinDCF than the baseline systems that use VGG, TDNN, ResNet, or Transformer feature extractors. This shows that enhancing the locality of the Transformer can effectively improve performance. The DET curve of our proposed method is shown in Figure 3.
When performing local self-attention, we used fixed-size, non-overlapping sliding windows to divide the input features into equal-length segments. To study the effect of window size on model performance, we set the candidate window sizes according to previous work, such as [4]. Table 2 presents the results for different window sizes. Among the tested sizes, the best EER and MinDCF were obtained with a window size of 25 frames, and performance dropped when the window became larger (30 frames) or smaller (20 frames). This shows that a reasonable window size improves performance: when the window is too large or too small, the model may not capture the speaker information in the current segment well, resulting in performance degradation.
We note that our work outperformed the numbers reported by all systems used for comparison. To investigate the impact of each part of the model, we performed an ablation study on the architecture introduced in Section 3. The results of the ablation experiments are given in Table 3.

In experiment (a), we replaced the proposed attention with the original full self-attention and kept everything else the same. The results showed that our method outperformed the original full self-attention, yielding relative improvements of 6.8% in EER and 4.1% in MinDCF on the VoxCeleb2 dev set. This suggests that enhancing the locality of the self-attention mechanism in the Transformer improves performance on the speaker verification task. Experiment (b) clearly demonstrates the importance of the sub-block aggregation net described in Section 3: aggregating different levels of features through it leads to relative improvements of 23% in EER and 17.8% in MinDCF, indicating that multi-level aggregation gives the model richer information and yields more robust speaker embeddings. In experiment (c), we removed LePE and kept the other configurations the same. Introducing LePE yielded relative improvements of 3.5% in EER and 11.5% in MinDCF, suggesting that further enhancing the locality of the model through positional encoding is effective.

Conclusions
In this work, we proposed a Transformer-based speaker embedding extractor for speaker verification with a novel global-local self-attention mechanism. The method balances modeling long-distance dependencies and capturing local features by performing local and global attention in parallel. We aggregated features at different levels in the encoder and decoder to obtain more powerful speaker embeddings. The combination of these designs enables our proposed method to achieve excellent results compared with several strong baselines. Fine-tuning the hyperparameters and more extensive training may further improve the results. In future work, we will further improve performance by combining this method with other techniques, such as pre-training, while exploring how to better apply the Transformer to speaker recognition tasks.