Improved Convolutional Neural Network–Time-Delay Neural Network Structure with Repeated Feature Fusions for Speaker Verification

Abstract: The development of deep learning has greatly promoted the progress of speaker verification (SV). Studies show that both convolutional neural networks (CNNs) and dilated time-delay neural networks (TDNNs) achieve advanced performance in text-independent SV, owing to their ability to sufficiently extract local features and temporal contextual information, respectively. Combinations of the two have achieved even better results. However, we found a serious gridding effect when applying the 1D-Res2Net-based dilated TDNN proposed in ECAPA-TDNN to SV, which indicates discontinuity and local information loss in frame-level features. To achieve high-resolution processing for speaker embeddings, we improve the CNN–TDNN structure with proposed repeated multi-scale feature fusions. Through the proposed structure, we effectively improve the channel utilization of the TDNN and achieve higher performance at the same TDNN channel width. Moreover, unlike previous studies, which all convert CNN features to TDNN features directly, we also study the latent-space transformation between the CNN and the TDNN to achieve efficient conversion. Our best method obtains 0.72% EER and 0.0672 MinDCF on the VoxCeleb1-O test set, and the proposed method performs better in cross-domain SV without additional parameters or computational complexity.


Introduction
Speaker verification (SV) refers to confirming the identity of a speaker through the speaker's personal information in speech signals. SV has been widely used in various fields, such as speaker diarization, speech enhancement [1], and voice conversion [2]. In recent years, research on SV algorithms based on deep learning has made great progress. In general, the procedure of the SV task falls into two steps. First, the embeddings of speaker utterances are extracted through deep neural networks, such as X-Vector systems [3][4][5]. Then, the similarity scores between the registered enroll-test pairs are calculated by normalized cosine similarity methods [6][7][8][9], PLDA [10], or other back-ends [11,12]. Speaker encoders are mainly trained with the well-known angular softmax loss functions, including AM-softmax [13] and AAM-softmax [14], which are highly effective. Some studies have also introduced efficient end-to-end triplet [15] or contrastive [16,17] losses for SV.
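The two-step procedure above ends with a cosine-similarity comparison of fixed-length embeddings. As an illustrative sketch (not the authors' implementation; the toy two-dimensional embeddings are invented), the scoring step can be written as:

```python
import math

def cosine_similarity(a, b):
    """Normalized cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

enroll = [0.6, 0.8]        # toy enrollment embedding
same = [0.6, 0.8]          # same direction: score near 1
different = [0.8, -0.6]    # orthogonal: score near 0

print(cosine_similarity(enroll, same))       # close to 1.0
print(cosine_similarity(enroll, different))  # close to 0.0
```

In practice the embeddings are the 192-dimensional vectors produced by the speaker encoder, and the raw scores are usually post-processed (e.g., by score normalization) before thresholding.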
Today, the most effective SV methods are TDNNs and their variants. The TDNN is generally considered capable of fully extracting the long-range temporal information of the input acoustic signal [18]. ECAPA-TDNN [19], proposed in 2020, achieved state-of-the-art performance on the large-scale speaker recognition dataset VoxCeleb [20,21] by introducing a TDNN-based model with dilated convolution, propagation, and aggregation strategies. In addition, studies [22][23][24] show that CNNs are well suited to processing the speech spectrogram, as they can fully perceive and extract texture information at different receptive fields. The ResNet backbone, for example, can build a lightweight, stable, and robust speaker recognition model [23,25].
Naturally, a combination of the two achieves better results [24,[26][27][28]. In the CNN-TDNN structure, a CNN front-end is added before the TDNN to extract sufficient local information from the input spectrogram. The features are then sent to TDNN blocks to compute temporal contextual features. The CNN-TDNN framework is undoubtedly effective, but it always leads to a larger model size and a higher computational load. Motivated by [29], we believe there is still plenty of scope for improvement by introducing high-resolution fusions.
Furthermore, residual dilated TDNN networks achieve advanced performance in SV [26,27,30] owing to their excellent ability to capture temporal contextual information. Dilated convolution extends the receptive field of the TDNN layers, enabling dense feature representation without introducing any additional parameters [19], thereby extracting features at full temporal resolution up to the global embedding and achieving high-resolution feature extraction. Compared with pooling strategies, it also avoids the loss of temporal-frequency information and maintains the resolution of the feature maps [31]. However, we find that the discontinuous sampling of the dilated TDNN introduces a serious gridding effect [32,33], which results in information loss and a decline in the quality of acoustic frame-level features.
In this letter, we propose a CNN-TDNN architecture with repeated fusions for SV; the framework is exhibited in Figure 1. Our contributions mainly include the following four aspects: (1) we propose a competitive structure without additional parameters for high-resolution speaker embedding, which yields better SV performance; (2) we search for CNN encoders with temporal-frequency bottlenecks to extract multi-scale features in the time dimension; (3) we study the structures of repeated multi-scale feature fusions (RMSFs) to ensure high-resolution feature extraction in the TDNNs; and (4) we train the parameter weights on English datasets and test them on the CN-Celeb set. The results indicate that our method yields a surprising improvement in cross-domain adaptation SV tasks.

Multi-Scale CNN Encoder

In the SV task, it is often crucial to extract key signals that are conducive to speaker representation. To achieve a high-resolution representation of speakers, making full use of multi-scale feature representations is considered effective. Thus, we employ a deep CNN encoder to obtain temporal-frequency features at different scales.
We exploit the popular residual network backbone [25] to obtain features of different scales at different depths, which are then reshaped into 2D matrices through the proposed bottleneck transformation layers. In our research, we found that the number of residual units in each convolutional layer should be kept at three or more to extract sufficiently good feature maps, whereas large channel counts are unnecessary. We also add an SE module [35] to each unit of the residual network to compute channel attention, which is widely considered effective for highlighting important local frequency regions.
Suppose the input frame-level filterbank feature of the CNN encoder is X_spec ∈ R^{F×T}. Four branches with different downsampling rates are obtained in total. We call the branch without downsampling the main branch; the other branches are minor branches, marked in red in Figure 1b. We found that C = 512 is sufficient for an effective SV task in our network structure.

Bottleneck Transformation
In our preliminary experiments, we found that a direct feature transformation between CNNs and TDNNs results in inferior performance. We believe this is because the CNN encoder and the TDNNs have different latent feature spaces. To obtain frame-level feature maps matching the input of the following TDNN blocks, we propose the bottleneck transformation structure located at the junction of the CNN and the TDNN blocks. The feature maps of each branch from the CNN encoder are first flattened and then expanded through the transformation. As shown in Figure 2, for each branch with a shape of R^{C_in×F×T}, the transformation is expressed as:

M = Concat(M_1, M_2, …, M_{C_in}),
X_neck = ReLU(BN(Conv1d(M))),
X = ReLU(BN(Conv1d(X_neck))),

where M_i ∈ R^{F×T} indicates the feature map of channel i and M ∈ R^{(C_in×F)×T} represents the reshaped 2D matrix. 'BN' means batchnorm. The channels are first squeezed to C_out/4, giving X_neck ∈ R^{(C_out/4)×T}; then the feature map is extended to X ∈ R^{C_out×T}. The transformation layer compresses the feature maps, which helps retain the salient information at lower computational cost. Let X ∈ R^{C×T} be the output of the main branch, and let X_i ∈ R^{(C/2)×(T/2^i)} be the output of the i-th minor branch with a downsampling factor of 2^i in the time dimension, i ∈ {1, 2, 3}. The shape of X is maintained through all the TDNN blocks to ensure a high-resolution representation, while the minor branches are fused with the main branch repeatedly in every TDNN block.
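To make the shape bookkeeping concrete, the following pure-Python sketch traces the flatten-squeeze-expand steps. It is illustrative only: the 1×1 convolutions are modelled as per-frame matrix products with dummy all-ones weights, BN and ReLU are omitted, and all sizes are toy values rather than the paper's settings.

```python
def flatten_branch(x):
    """Flatten a (C_in, F, T) nested-list feature map to (C_in*F, T)."""
    return [row for channel in x for row in channel]

def pointwise(x, w):
    """A 1x1 conv as a per-frame matrix product: w is (C_out, C_in), x is (C_in, T)."""
    T = len(x[0])
    return [[sum(w[o][c] * x[c][t] for c in range(len(x))) for t in range(T)]
            for o in range(len(w))]

C_in, F, T, C_out = 2, 3, 4, 8
x = [[[1.0] * T for _ in range(F)] for _ in range(C_in)]    # dummy branch feature

m = flatten_branch(x)                                       # (C_in*F, T) = (6, 4)
w_squeeze = [[1.0] * (C_in * F) for _ in range(C_out // 4)]
neck = pointwise(m, w_squeeze)                              # (C_out/4, T) = (2, 4)
w_expand = [[1.0] * (C_out // 4) for _ in range(C_out)]
out = pointwise(neck, w_expand)                             # (C_out, T) = (8, 4)

print(len(m), len(neck), len(out))  # 6 2 8
```

The squeeze to C_out/4 channels is where the compression happens; the subsequent expansion restores the channel count expected by the TDNN blocks.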

TDNN Blocks with Multiple Fusion Layers
The Res2Net [36] module obtains multi-scale frame-level features by enhancing the middle layer of the bottleneck block. However, we notice that a serious gridding effect [32] emerges and neighboring information is not fully taken into account when a constant dilation rate is introduced into the 1D-Res2Net, as shown in Figure 3. Assume that the input feature X ∈ R^{C×T} is split into s feature map subsets:

X = [X_1, X_2, …, X_s], X_j ∈ R^{(C/s)×T}.

For convenience, C/s is set to 1, i.e., we assume each subset is single-channel. The dilated convolution can then be written as:

y[i] = Σ_{l} w[l] · x[i + r·l],

where x[i] and y[i] denote the input and output signal, respectively, w is the filter with a filter length of 3, l ∈ {−1, 0, 1}, and r is the dilation rate. Dilated convolutions reduce to standard convolutions when r = 1. For the cascaded group convolutions shown in Figure 3, we obtain:

y_k[i] = Σ_{l} w_k[l] · y_{k−1}[i + r·l].

Thus, taking y_4 as an example, where w_k is the weight of TDNN_k in Figure 3a, we can deduce that:

y_4[i] = Σ_{l_1} Σ_{l_2} Σ_{l_3} w_2[l_1] · w_3[l_2] · w_4[l_3] · x[i + r·(l_1 + l_2 + l_3)].

Although the receptive field of y_4 is extended to 2mr + 1, the actual receptive field consists of the isolated frames {i, i ± r, i ± 2r, i ± 3r}. There is no correlation between neighboring frames when r > 1, and the accumulation across different TDNN blocks may cause an even larger gridding effect, resulting in information loss. The success of the CNN-TDNN framework can be attributed to the CNN's ability to better handle local feature information. However, the analysis above shows that the dilated TDNN structure suffers from information loss due to the gridding effect during computation. Moreover, the existing CNN-TDNN structure only reinforces local features at the input end of the TDNNs, resulting in progressive information loss during the computation of the TDNN backbone.
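The isolated-frame sampling derived above can be checked numerically. The sketch below (illustrative, not the authors' code) enumerates which input frames can influence one output frame after stacking dilated convolutions whose kernel reads offsets {−1, 0, 1}·r:

```python
def dilated_taps(i, r, depth):
    """Input frame indices that can influence output frame i after `depth`
    stacked dilated convolutions with kernel offsets {-1, 0, 1} * r."""
    taps = {i}
    for _ in range(depth):
        taps = {t + l * r for t in taps for l in (-1, 0, 1)}
    return sorted(taps)

# r = 1: contiguous neighbourhood, no gridding.
print(dilated_taps(10, 1, 3))  # [7, 8, 9, 10, 11, 12, 13]

# r = 2, three layers: receptive field spans 2*3*2 + 1 = 13 frames,
# but only 7 isolated frames are ever sampled; neighbours of frame 10 are skipped.
print(dilated_taps(10, 2, 3))  # [4, 6, 8, 10, 12, 14, 16]
```

This matches the text: for m = 3 cascaded layers the taps are exactly {i, i ± r, i ± 2r, i ± 3r}, so for r > 1 adjacent frames never interact along this path.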
To address the information loss problem, and inspired by HRNet [29], which obtains high-quality features, a combination of repeated multi-scale fusions and dilated TDNNs is introduced to compensate the TDNN blocks, as demonstrated in Figure 4a. A fusion operator can be formulated as:

X_out = X + Σ_{i=1}^{3} f_i[BN(Conv1d(X_i))],

with f_i[·] denoting the upsampling function with a scale factor of 2^i, for which we simply employ nearest-neighbor interpolation. 'BN' means batchnorm. The minor branches are thus extended to the same shape as the main branch to implement the multi-scale fusion. This enables each TDNN block's computation to receive multi-scale information from the CNN encoder repeatedly, with information upsampled from different scales enhancing the relatedness of local neighboring features in the time dimension. All of the strongly correlated features upsampled from the minor branches are fused with the main branch to compensate the dilated TDNN blocks.
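A minimal sketch of one fusion step follows. It is an illustration under simplifying assumptions: single-channel sequences stand in for (C, T) feature maps, the Conv1d + BN branch processing of Figure 4a is omitted, each minor branch is upsampled by its own downsampling factor, and the values are invented.

```python
def upsample_nearest(x, factor):
    """1D nearest-neighbour upsampling: repeat each frame `factor` times."""
    return [v for v in x for _ in range(factor)]

def fuse(main, minors):
    """Add every upsampled minor branch onto the main branch element-wise."""
    out = list(main)
    for i, minor in enumerate(minors, start=1):
        up = upsample_nearest(minor, 2 ** i)   # restore length T/2**i back to T
        out = [a + b for a, b in zip(out, up)]
    return out

main = [0.0] * 8                            # main branch, T = 8
minors = [[1.0] * 4, [10.0] * 2, [100.0]]   # lengths T/2, T/4, T/8
print(fuse(main, minors))  # every frame receives 1 + 10 + 100 = 111.0
```

Because nearest-neighbour upsampling copies each low-resolution frame over a contiguous run of high-resolution frames, neighbouring frames of the main branch receive identical injected values, which is what restores local correlation lost to the gridding effect.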
To exploit more information from the different TDNN blocks, multi-layer feature aggregation (MFA) and residual summation connections [19] are adopted. The input of each TDNN block is connected to the corresponding previous position by element-wise addition, as shown in Figure 4b. For MFA in particular, we made some adjustments to match the RMSFs: we aggregate the final TDNN block with all previous fusion layers except the first, as Figure 1b depicts.

Dataset
We conduct our experiments on the well-known large-scale VoxCeleb1&2 corpora [20,21]. For training, we use the VoxCeleb2 development set, consisting of 1,092,009 utterances from 5994 speakers, with a small subset reserved for validation. After training, the network is evaluated on different test subsets of VoxCeleb1: VoxCeleb1-O (Vox1-O) and VoxCeleb1-H (Vox1-H). To demonstrate the benefits of our proposed method on more complex verification tasks, we also test directly on a different target-domain set: the CNCeleb1 test set [37]. We additionally fine-tuned the pre-trained models on the CNCeleb1&2 SV datasets [37,38].
It should be noted that no CNCeleb subsets overlap with the VoxCeleb2 development set, and we use only the original utterances for training and evaluation, without voice activity detection.

Preprocessing and Data Augmentation
All audio clips are randomly cropped to 2 s during training; clips shorter than 2 s are duplicated and concatenated to create a 2 s segment. The sampling rate is 16,000 Hz. Pre-emphasis with a coefficient of 0.97 is applied first. Then, 80-dimensional log Mel filterbanks are extracted with a 25 ms Hamming window and a 10 ms shift. The frequency range of the Mel spectrogram is limited to 20-7600 Hz. Owing to the downsampling requirements of the CNN encoder, the number of frames is cut to an integer multiple of 8. In addition, mean normalization along the time dimension is applied.
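Two of the steps above can be sketched directly from the stated constants: pre-emphasis with coefficient 0.97, and trimming the frame axis to a multiple of 8 so the CNN encoder can downsample three times. This is an illustrative sketch, not the authors' pipeline.

```python
def pre_emphasis(signal, coeff=0.97):
    """y[t] = x[t] - coeff * x[t-1]; the first sample passes through unchanged."""
    return [signal[0]] + [signal[t] - coeff * signal[t - 1]
                          for t in range(1, len(signal))]

def crop_frames(num_frames, multiple=8):
    """Largest multiple of `multiple` not exceeding num_frames."""
    return num_frames - num_frames % multiple

print(crop_frames(203))               # 200
print(pre_emphasis([1.0, 1.0, 1.0]))  # first sample 1.0, the rest near 0.03
```

The multiple-of-8 constraint corresponds to the three stride-2 stages of the encoder (2^3 = 8), matching the minor-branch downsampling factors given earlier.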
Training audio is randomly augmented to obtain a robust model. Our augmentation methods mainly rely on the MUSAN [39] and RIR [40] datasets. MUSAN provides three types of additive noise that can be mixed with the original sounds to generate augmented data. RIR provides a variety of simulated room impulse responses for reverberation augmentation, which enhances the robustness of SV systems against the various environmental interferences encountered during recording. In addition, we apply SpecAugment [41] in the pre-training stage, with random time masking and frequency masking applied to all spectrograms. Considering the computational cost, no speed perturbation [42] is adopted in our experiments.
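The time and frequency masking of SpecAugment can be sketched in a few lines: a contiguous block of frames (or Mel bins) is zeroed out. The sketch below fixes the mask positions for reproducibility, whereas in training they are drawn at random; the tiny spectrogram is invented.

```python
def mask_time(spec, start, width):
    """Zero frames [start, start+width) in a (F, T) nested-list spectrogram."""
    return [[0.0 if start <= t < start + width else v
             for t, v in enumerate(row)] for row in spec]

def mask_freq(spec, start, width):
    """Zero Mel bins [start, start+width)."""
    return [[0.0] * len(row) if start <= f < start + width else list(row)
            for f, row in enumerate(spec)]

spec = [[1.0] * 6 for _ in range(4)]             # toy (F=4, T=6) spectrogram
masked = mask_freq(mask_time(spec, 2, 2), 1, 1)  # mask frames 2-3 and bin 1
print(masked[0])  # [1.0, 1.0, 0.0, 0.0, 1.0, 1.0]
```

Zeroing whole time or frequency stripes forces the encoder not to rely on any single region of the spectrogram, which is the usual motivation for SpecAugment.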

Baseline Systems
Considering the outstanding performance in SV of both the CNN framework with a ResNet backbone and the TDNN network, we reproduce the prevailing SE-ResNet [23], ECAPA-TDNN [19], MFA-TDNN [27], and ECAPA-CNN-TDNN [26] as our baseline systems.
Because the fine-tuning strategies and hyperparameters differ across systems, some of the models are adjusted to an appropriate size, and all baseline systems are trained and evaluated under the same conditions for a fair comparison. All parameters are reported in our experiments. For MFA-TDNN, we reproduced the model and adjusted it to a larger size. All other reproductions primarily follow the original references, with minor modifications as needed for our comparative experiments.

Training Strategies and Fine-Tuning Details
Trainable parameters are first optimized by Adam for 120 epochs, with an initial learning rate of 0.001 and a decay factor of 0.97. The batch size is fixed at 512 for all training tasks. We apply the AAM-softmax loss function [14] in all experiments, with the margin and scale set to 0.2 and 30, respectively. We then fine-tune [43] the pre-trained models for a further 10 epochs.
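The effect of the AAM-softmax margin on the target-class logit can be illustrated with the margin m = 0.2 and scale s = 30 quoted above: the target cosine similarity cos(θ) is replaced by s·cos(θ + m), which always lowers the target logit and thus tightens the decision boundary. This is a sketch of that single term, not a full loss implementation.

```python
import math

def aam_target_logit(cos_theta, margin=0.2, scale=30.0):
    """AAM-softmax target logit: s * cos(theta + m), theta = arccos(cos_theta)."""
    theta = math.acos(max(-1.0, min(1.0, cos_theta)))  # clamp for safety
    return scale * math.cos(theta + margin)

# For cos(theta) = 0.8 the unpenalised logit is s * 0.8 = 24.0;
# the additive angular margin strictly reduces it.
print(aam_target_logit(0.8) < 24.0)  # True
```

Non-target logits keep the plain s·cos(θ), so the margin is paid only by the correct class, forcing embeddings of the same speaker to cluster more tightly in angle.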

• Large-margin fine-tune: The large-margin fine-tuning strategy is applied to the models pre-trained on the VoxCeleb dataset. In the fine-tuning stage, we reset the batch size to 64. All input utterances are cropped to 6 s, and the margin of AAM-softmax is increased to 0.5. The initial learning rate is 2 × 10^−5 with a decay rate of 0.9. SpecAugment is disabled, and the other settings remain unchanged.
• Cross-language fine-tune: In addition, we fine-tune the above pre-trained models on the cross-language CN-Celeb1&2 dataset [37,38] to compare cross-language SV performance between the models. Taking into account the distribution of utterance durations within the CNCeleb dataset, we made slight adjustments to the training parameters: utterances are cropped to 4 s for fine-tuning, and the initial learning rate is reset to 1 × 10^−5 with a decay rate of 0.9, while the other settings remain the same as above.

Evaluation Protocol
After training, we extract 192-dimensional speaker embeddings for the test pairs and calculate the cosine similarity between the embeddings of each pair; adaptive s-norm [9] is then applied with an imposter cohort size of 600. We measure the systems by the equal error rate (EER) and the minimum of the normalized detection cost function (MinDCF). The EER is the point at which the false acceptance rate and the false rejection rate are equal. The MinDCF, which weights acceptance and rejection errors, minimizes the detection cost:

C_det = C_miss × P_miss × P_tar + C_fa × P_fa × (1 − P_tar).

In our research, we use a prior target probability P_tar = 0.01 and costs C_miss = C_fa = 1.
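The MinDCF computation can be sketched by sweeping every score as a candidate threshold and keeping the minimum detection cost under the settings above (P_tar = 0.01, C_miss = C_fa = 1). The score lists are invented toy values, not results from the paper.

```python
def min_dcf(target_scores, nontarget_scores, p_tar=0.01, c_miss=1.0, c_fa=1.0):
    """Minimum of C_det = C_miss*P_miss*P_tar + C_fa*P_fa*(1 - P_tar)
    over thresholds taken from the observed scores."""
    best = float("inf")
    for thr in sorted(target_scores + nontarget_scores):
        p_miss = sum(s < thr for s in target_scores) / len(target_scores)
        p_fa = sum(s >= thr for s in nontarget_scores) / len(nontarget_scores)
        cost = c_miss * p_miss * p_tar + c_fa * p_fa * (1 - p_tar)
        best = min(best, cost)
    return best

targets = [0.9, 0.8, 0.7, 0.6]     # toy same-speaker cosine scores
nontargets = [0.5, 0.4, 0.3, 0.2]  # toy different-speaker cosine scores
print(min_dcf(targets, nontargets))  # perfectly separable scores -> 0.0
```

Because P_tar = 0.01, false alarms are weighted 99 times more heavily than misses here, so the minimizing threshold typically sits high, unlike the EER operating point where the two error rates are forced equal.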

Comparison of Systems
In this section, we verify the performance of the proposed model by comparing it with different advanced methods; all results are displayed in Table 1. We mainly reproduce recent advanced models for comparison, especially the efficient ECAPA-CNN-TDNN structure. We list the performance of all systems on the VoxCeleb1 test subsets, Vox1-O and Vox1-H, together with the CNCeleb test set from a different domain. In addition to EER and MinDCF, we provide the model size and multiply-accumulate operations (MACs) of each reproduced system; MACs are measured under the training strategies above. When tested on the Vox1-O test set, our proposed method demonstrates an approximately 13% improvement in EER compared with the ECAPA-TDNN and ECAPA-CNN-TDNN baseline systems, without increasing the number of parameters or the computational complexity in MACs.
In comparison with the baseline systems using the ResNet architecture, our proposed method achieves a substantial performance enhancement. Relative to the SE-ResNet34 baseline with 64 channels, our approach achieves improvements of 41% and 36% in EER and MinDCF, respectively. Furthermore, our method performs better on the harder Vox1-H test set and on the cross-domain CN-Celeb test set. To further demonstrate the performance on the cross-domain SV task, we fine-tuned the pre-trained models on the CNCeleb development set and then tested them again on the CNCeleb test set. The experimental results are shown in Table 2. The comparison indicates that our method achieves highly competitive performance while maintaining a light structure with low computational complexity.
In the following, we conduct ablation experiments to prove the effectiveness of our method and to investigate the importance of each component, especially the proposed multi-scale feature fusion method. Models are evaluated on the Vox1-O test set, as shown in Table 3. In system 1, the RMSFs are removed and only the top fusion layer is retained. In system 2, we keep only the minor branch X_1 for repeated fusion. The RMSFs reinforce the correlation of local neighborhood information and achieve better performance. In system 3, we replace the bottleneck transformation structure with a temporal convolution layer; the results confirm that mapping the 2D features into a latent space conforming to the TDNN structure is effective. We also evaluate the adjusted MFA and the Res.Sum.Connection in systems 4 and 5, and the results show that each leads to about a 10% relative improvement in EER.

Comparison with Variants of Stacking
We argue that varying the number of multi-scale fusions together with different TDNN blocks has a considerable effect on SV performance. To study the effectiveness and soundness of the proposed structure, we design a group of experiments on network variants; the results on Vox1-O are listed in Table 4. The second line is our standard structure. In the fourth line, we increase the number of fusions to four and find a further improvement, notably an 18.3% relative improvement in MinDCF. We then double the TDNN blocks in the second and fifth lines, but the results suggest that performance deteriorates. Finally, we increase the number of fusions to six, but there is only a slight improvement in MinDCF. The experimental results show that three TDNN blocks with fusion layers are enough to achieve high performance with reasonable computational efficiency.

Conclusions
In this paper, we improve the CNN-TDNN architecture via the proposed repeated fusion method, leading to high-resolution and relatively lightweight speaker embeddings for text-independent SV. We utilize a ResNet backbone with bottleneck transformations to provide high-quality features at different time-frequency scales for the TDNN blocks, and the gridding effect of the dilated TDNN modules is compensated by the proposed fusion method. Experimental results demonstrate that our model achieves superior performance on the VoxCeleb1 test subsets and the CN-Celeb cross-domain evaluation set without additional model parameters or computational complexity. In future work, we hope to experiment on more diverse and challenging datasets to explore the generalization of the SV method. Furthermore, future work should consider real-world applications, such as addressing the challenges posed by recording processes and environmental interference.

Figure 1 .
Figure 1. (a) Block diagram of the CNN-TDNN structure. "Attentive Statistics Pooling" is referenced from [34], and the classifier settings are adopted from [14] in our experiments. (b) The overall architecture of the proposed model. "Trans" refers to the bottleneck transformation. The default setting of C is 512 and the settings of C_0-C_4 are [16, 16, 24, 48, 96]. "TDNN block" indicates the dilated 1D-Res2Net, with dilation rates [2, 3, 4] over 3 TDNN modules as the default. The "Trans" layers in the multi-scale CNN encoder denote the bottleneck transformation, which changes the channels to C. "ASP Layer" means the attentive statistics pooling layer. "Fully Connected Layer" changes the channels to the embedding dimension, defaulting to 192.

Figure 2 .
Figure 2. The structure of the bottleneck transformation. C_in is the number of channels of the input branch and C is the number of output channels.

Figure 3 .
Figure 3. Illustrations of (a) the dilated 1D-Res2Net and (b) the gridding effect. A 1D CNN with a kernel size of 1 is used for multi-channel information integration in the time dimension. TDNN_i means a dilated TDNN block with a set dilation rate.

Figure 4 .
Figure 4. The structure of (a) the fusion layer and (b) the residual summation connection (Res.Sum.Connection). In the fusion layer, 'Identity' indicates an identity mapping, 'Conv1d' is a 1D convolution layer with a kernel size of 1, 'BN1d' is 1D batchnorm, and 'Upsampling' means 1D nearest-neighbor upsampling. The main branch, with a shape of [C, T], remains constant, while the minor branches are upsampled after multi-channel information processing. Afterwards, all branches are added together to obtain the output.

Table 1 .
Performance comparison by EER and MinDCF of the reproduced networks and other SOTA models. All models are trained on the VoxCeleb2 dataset.

Table 2 .
Experimental results with fine-tuning on the CNCeleb development dataset. Fine-Tune indicates the large-margin fine-tuning strategy; ✓ and × indicate adoption and non-adoption of the fine-tuning, respectively. *

Table 3 .
Ablation study of the proposed method. The best performance is marked in bold below.
* w/o represents without.

Table 4 .
Comparison of different network structures. The best performance is marked in bold.
* Structure: in [d_1, d_2, d_3] × n, d_i means the dilation rate of TDNN block i, and n means stacking the TDNN unit n times in each block.