To address the challenge of capturing speaker-specific acoustic features from short, fixed utterances, we propose a lightweight end-to-end TD-SV framework based on the Conformer architecture. We introduce the CTA-Conformer, a structure that incorporates a CTA module to dynamically emphasize the time–frequency regions essential for capturing distinctive speaker attributes. By jointly optimizing the feature extraction and scoring processes, the proposed model generates highly discriminative speaker embeddings while maintaining computational efficiency.
Figure 1 shows the training and inference phases of the proposed end-to-end TD-SV system based on CTA-Conformer. Unlike conventional stage-wise TD-SV systems, which separately train feature extractors and back-end models (e.g., PLDA), the proposed system unifies all components into an end-to-end structure. During the training phase, the CTA-Conformer generates speaker embeddings that encode speaker-specific features, and an angular margin loss, AAM-Softmax, is applied to enhance speaker discrimination. The inference phase consists of two stages: enrollment and verification. In enrollment, a speaker’s voice is processed to generate a reference speaker embedding, comparable to voice registration in virtual assistants. In the verification stage, a test utterance embedding is compared with the reference embedding using cosine similarity to determine whether the speaker matches the enrolled speaker.
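The verification decision itself reduces to a thresholded cosine similarity between the enrollment and test embeddings. The snippet below is a minimal sketch of that step, assuming PyTorch tensors for the embeddings; the threshold value shown is purely illustrative, not the system's tuned operating point.

```python
import torch
import torch.nn.functional as F

def verify(enroll_emb: torch.Tensor, test_emb: torch.Tensor,
           threshold: float = 0.35) -> bool:
    """Accept the identity claim when the cosine similarity between the
    enrollment and test embeddings exceeds a decision threshold
    (the value 0.35 here is illustrative only)."""
    score = F.cosine_similarity(enroll_emb.unsqueeze(0),
                                test_emb.unsqueeze(0)).item()
    return score >= threshold
```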
The proposed end-to-end TD-SV framework enables global context modeling by processing entire utterances, allowing robust extraction of speaker-specific features, particularly for short and noisy utterances. Additionally, optimization strategies such as transfer learning and embedding-level averaging enhance robustness under diverse microphone conditions, resulting in consistent performance improvements across various acoustic environments.
3.1. Proposed CTA-Conformer Architecture
Conventional TD-SV systems primarily employ TDNN or CNN architectures, which are effective for extracting local acoustic patterns but often fail to capture long-range temporal dependencies across an entire utterance. This limitation becomes more critical in TD-SV, where fixed phrases inherently reduce phonetic diversity and short utterances are more vulnerable to noise.
To address this limitation, we adopt a Conformer-based architecture that integrates self-attention mechanisms to effectively model global contextual information [13]. This approach is particularly beneficial for TD-SV tasks, where subtle speaker-specific characteristics—including speaking rate, intonation, and articulation—are often distributed across the utterance. By utilizing global contextual information, the Conformer architecture captures distributed speaker-specific features and reduces the impact of local distortions, such as background noise, resulting in more robust speaker embeddings and improved generalization under diverse acoustic environments.
We adopt the MFA-Conformer architecture, an advanced variant of the standard Conformer [22], which includes macaron-style feed-forward networks (FFNs), multi-head self-attention (MHSA), convolution modules, and relative positional encoding to maintain temporal structure in variable-length utterances. The MFA strategy concatenates the outputs of all Conformer blocks along the feature dimension, effectively capturing speaker features at multiple temporal scales from different depths.
Figure 2 illustrates the architecture of the proposed CTA-Conformer. Unlike the MFA-Conformer, the CTA-Conformer places the CTA module before the Conformer blocks. A distinctive characteristic of this placement is that CTA functions as an input enhancement module, operating directly on the spectrogram prior to encoding. Unlike conventional attention mechanisms, which typically operate within encoder layers or at the post-encoder feature aggregation stage, the CTA module selectively emphasizes speaker-discriminative acoustic patterns at the spectrogram level. Given the established importance of the standard deviation in capturing speaker-specific variance patterns [26,27], the CTA module is designed as a lightweight convolution-based attention mechanism. It applies frequency-wise standard deviation pooling to emphasize essential speaker-specific acoustic features in short, fixed utterances, and it dynamically focuses on important regions across the time and channel dimensions, enabling more precise speaker embedding extraction.
This design, detailed in
Figure 3, is motivated by the observation that speaker-related cues are often concentrated in specific time segments and frequency bands, making it essential to selectively emphasize those areas. Unlike average pooling that smooths out spectral variations, frequency-wise standard deviation pooling explicitly quantifies variability across frequency bins, thereby retaining speaker-specific spectral dynamics such as acoustic variations and articulation differences. This approach is particularly beneficial in TD-SV scenarios, where limited phonetic diversity requires reliance on subtle spectral variations to achieve robust speaker discrimination.
The CTA module operates on the input feature map $X \in \mathbb{R}^{C \times T \times F}$, applying attention across the channel, time, and frequency dimensions. To compute the attention weights, the module performs frequency-wise standard deviation pooling along the frequency axis to capture acoustic variability patterns across frequency bins. The standard deviation for each channel $c$ and time step $t$ is calculated as follows:

$$s_{c,t} = \sqrt{\frac{1}{F}\sum_{f=1}^{F}\left(x_{c,t,f} - \mu_{c,t}\right)^{2}}$$
where $\mu_{c,t} = \frac{1}{F}\sum_{f=1}^{F} x_{c,t,f}$ represents the frequency-wise mean, and $x_{c,t,f}$ denotes the input feature value at channel $c$, time step $t$, and frequency bin $f$. The computed statistics $S = \left[s_{c,t}\right]$ are processed through two 3 × 3 convolutions with batch normalization (BN) and rectified linear unit (ReLU) activation, followed by sigmoid activation $\sigma(\cdot)$ to generate the attention weights $A$:

$$A = \sigma\!\left(\mathrm{Conv}_{3\times 3}\!\left(\mathrm{ReLU}\!\left(\mathrm{BN}\!\left(\mathrm{Conv}_{3\times 3}(S)\right)\right)\right)\right)$$
The CTA module computes the final output $Y$ by performing element-wise multiplication between the original input feature map $X$ and the attention weights $A$ (broadcast along the frequency dimension), enabling the model to focus on speaker-relevant time–frequency regions:

$$Y = X \odot A$$
Through this process, the CTA module suppresses irrelevant frequency regions and selectively emphasizes time–frequency regions critical for speaker features. In TD-SV environments, where fixed phrases are used, time alignment and subtle variations in frequency distribution provide essential cues for distinguishing between speakers. By leveraging these structural characteristics, the CTA module enhances speaker discriminability even under limited phonetic variability and short utterance duration, thereby contributing to more robust speaker embeddings. Unlike SENet’s global channel attention that loses frequency-specific information, CTA preserves frequency-wise variance patterns essential for speaker discrimination in short utterances while maintaining computational efficiency.
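To make the data flow concrete, the following is a minimal PyTorch sketch of the CTA module as we read the description above; the (batch, channel, time, frequency) tensor layout, the default middle-channel width, and the placement of BN/ReLU between the two convolutions are assumptions for illustration rather than the exact released implementation.

```python
import torch
import torch.nn as nn

class CTAModule(nn.Module):
    """Sketch of the CTA block: frequency-wise standard deviation pooling, two
    3x3 convolutions with BN/ReLU, and a sigmoid gate applied to the input.
    Assumed tensor layout: (batch, channels, time, frequency)."""

    def __init__(self, in_channels: int, mid_channels: int = 8, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        self.conv1 = nn.Conv2d(in_channels, mid_channels, kernel_size, padding=pad)
        self.bn = nn.BatchNorm2d(mid_channels)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = nn.Conv2d(mid_channels, in_channels, kernel_size, padding=pad)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frequency-wise standard deviation pooling: (B, C, T, F) -> (B, C, T, 1)
        stats = x.std(dim=-1, keepdim=True)
        # Two convolutions with BN and ReLU in between, then a sigmoid gate
        attn = self.sigmoid(self.conv2(self.relu(self.bn(self.conv1(stats)))))
        # Re-weight the input; the attention map broadcasts along the frequency axis
        return x * attn
```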
3.1.1. Computational Complexity Analysis
To further demonstrate the efficiency of the proposed CTA module, we provide a detailed computational complexity analysis.
CTA Module Complexity: Given an input feature map $X \in \mathbb{R}^{C \times T \times F}$ and the CTA module parameters $C$ (in_channels), $k$ (kernel_size), and $C_{\mathrm{mid}}$ (middle_channels), the computational complexity is analyzed as follows:
Frequency-wise standard deviation pooling: $\mathcal{O}(C\,T\,F)$;
Two $k \times k$ convolutions on the pooled statistics: $\mathcal{O}(C\,C_{\mathrm{mid}}\,k^{2}\,T)$;
Element-wise re-weighting: $\mathcal{O}(C\,T\,F)$;
Total: $\mathcal{O}_{\mathrm{CTA}} = \mathcal{O}\!\left(C\,T\,F + C\,C_{\mathrm{mid}}\,k^{2}\,T\right)$.
Baseline MFA-Conformer Complexity: For a single Conformer block with feature dimension d and sequence length T:
Multi-head self-attention: $\mathcal{O}(T^{2} d + T d^{2})$;
Macaron-style feed-forward networks: $\mathcal{O}(T d^{2})$;
Convolution module: $\mathcal{O}(T d^{2} + T d k)$, with depthwise convolution kernel size $k$;
Total per block: $\mathcal{O}_{\mathrm{block}} = \mathcal{O}\!\left(T^{2} d + T d^{2} + T d k\right)$.
Relative Overhead: We quantify the additional computation introduced by CTA at the point where it is applied, i.e., after convolutional subsampling. Let $X \in \mathbb{R}^{C \times T \times F}$ be the input to the first Conformer block (post-subsampling), and let the Conformer block use relative-position MHSA with expansion ratio $r$ and convolution kernel size $k$. The per-block relative overhead is as follows:

$$\mathrm{Overhead} = \frac{\mathcal{O}_{\mathrm{CTA}}}{\mathcal{O}_{\mathrm{block}}} \times 100\%$$

where, for 2 s audio segments, typical values yield Overhead ≈ 2.13%.
Parameter Analysis: The CTA module introduces minimal parameter overhead:
First Conv2D ($C \to C_{\mathrm{mid}}$, $k \times k$): $C\,C_{\mathrm{mid}}\,k^{2}$;
Second Conv2D ($C_{\mathrm{mid}} \to C$, $k \times k$): $C_{\mathrm{mid}}\,C\,k^{2}$;
BatchNorm: $2\,C_{\mathrm{mid}}$;
Total CTA parameters: $2\,C\,C_{\mathrm{mid}}\,k^{2} + 2\,C_{\mathrm{mid}} \approx 37$ K.
For a baseline Conformer model with ∼14.2 M parameters, this represents only a 0.26% parameter increase.
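As a quick sanity check on the figures above, the short script below evaluates the parameter formula for one plausible configuration; the specific values of C, C_mid, and k are assumptions chosen only because they reproduce the reported ∼37 K parameters, not the paper's stated settings.

```python
# Hypothetical configuration, chosen only to reproduce the reported ~37 K figure:
# C channels after convolutional subsampling, a small bottleneck, 3x3 kernels.
C, C_mid, k = 256, 8, 3

conv1 = C * C_mid * k * k            # first Conv2D weights
conv2 = C_mid * C * k * k            # second Conv2D weights
bn = 2 * C_mid                       # BatchNorm scale and shift

total = conv1 + conv2 + bn
print(f"CTA parameters: {total:,}")                        # 36,880 (~37 K)
print(f"Relative increase: {total / 14.2e6 * 100:.2f}%")   # ~0.26% of 14.2 M
```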
Comparison with Existing Mechanisms:
Table 1 summarizes the complexity of the attention modules. CTA adds ∼37 K parameters (0.26% of a 14.2 M baseline) and ≈2% per-block compute, scales linearly with $C$ (vs. $\mathcal{O}(C^{2})$ for SE/CBAM), and leverages frequency-wise statistics pooling with temporal gating for TD-SV.
In summary, the CTA module introduces only marginal computational and parameter overhead, ensuring the proposed model remains efficient for deployment in resource-constrained environments.
3.1.2. Utterance-Level Representation and Loss Functions
After frame-level features are processed through Conformer blocks and aggregated using the MFA strategy, attentive statistical pooling (ASP) [
26] is applied to convert the sequence into a fixed-dimensional utterance-level representation. ASP computes a weighted mean and standard deviation by assigning importance to each frame, allowing the model to focus on salient regions while preserving global distributional characteristics. The resulting representation is then passed through a fully connected layer to produce a 192-dimensional speaker embedding.
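A simplified PyTorch rendering of the ASP step is sketched below; the single-head attention form and the hidden size are assumptions for illustration, not the exact configuration used in the paper.

```python
import torch
import torch.nn as nn

class AttentiveStatsPooling(nn.Module):
    """Simplified ASP: frame-level attention weights produce a weighted mean and
    standard deviation, concatenated into a fixed-length utterance vector."""

    def __init__(self, feat_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (B, T, D) frame-level features aggregated by the MFA strategy
        w = torch.softmax(self.attention(h), dim=1)       # (B, T, 1) frame weights
        mu = torch.sum(w * h, dim=1)                       # weighted mean (B, D)
        var = torch.sum(w * h ** 2, dim=1) - mu ** 2       # weighted variance
        std = torch.sqrt(var.clamp(min=1e-8))              # weighted std (B, D)
        return torch.cat([mu, std], dim=1)                 # (B, 2D)
```

The concatenated weighted mean and standard deviation are then mapped by the fully connected layer to the 192-dimensional speaker embedding.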
The extracted speaker embeddings are trained using margin-based loss functions, such as additive margin softmax (AM-Softmax) and additive angular margin softmax (AAM-Softmax) [28]. While cross-entropy (CE) loss, widely used in TD-SV systems [12,29,30], optimizes class probabilities through Softmax outputs, it lacks explicit constraints to ensure within-speaker compactness in the embedding space. In contrast, margin-based loss functions improve within-speaker compactness and between-speaker separation, resulting in more discriminative speaker embeddings and more precise decision boundaries. These properties are particularly beneficial in TD-SV scenarios with short utterances [31] and limited training data, where the risk of misclassification is higher. In this study, AAM-Softmax is employed to train the speaker embeddings.
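For reference, a compact sketch of the AAM-Softmax objective is given below; the margin and scale values are common defaults in the speaker verification literature and are assumptions here rather than the hyperparameters reported in this work.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AAMSoftmax(nn.Module):
    """Additive angular margin softmax: add a margin m to the target-class angle
    before scaling, which tightens within-speaker clusters."""

    def __init__(self, emb_dim: int = 192, num_speakers: int = 1000,
                 margin: float = 0.2, scale: float = 30.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_speakers, emb_dim))
        nn.init.xavier_uniform_(self.weight)
        self.margin, self.scale = margin, scale

    def forward(self, emb: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Cosine similarity between L2-normalized embeddings and class weights
        cosine = F.linear(F.normalize(emb), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
        # Add the angular margin only to the target-class logits
        one_hot = F.one_hot(labels, cosine.size(1)).float()
        logits = torch.cos(theta + self.margin * one_hot) * self.scale
        return F.cross_entropy(logits, labels)
```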
3.2. Training and Optimization Strategies
3.2.1. Transfer Learning for Domain Adaptation
In practical TD-SV scenarios, discrepancies in microphone characteristics, channel conditions, and background noise introduce domain mismatch between the training and test distributions, adversely affecting model generalization. To address this problem, we employ transfer learning as a practical and effective domain adaptation (DA) strategy. First, the CTA-Conformer is pre-trained on a large-scale text-independent dataset that includes diverse recording conditions, enabling the model to learn speaker representations that are robust to such variability. Then, fine-tuning is performed using a small text-dependent dataset collected under realistic environments characterized by short utterances, far-field microphones, reverberation, and noise. During fine-tuning, only the parameters of the loss function module are updated, preserving general knowledge from text-independent data while adapting to the specific characteristics of the TD-SV domain.
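Under these assumptions, the fine-tuning step can be sketched as follows; the module and loader names (encoder, loss_head, td_loader) and the learning rate are hypothetical placeholders, shown only to illustrate freezing the pre-trained backbone and updating the loss-module parameters.

```python
import torch

def finetune_loss_head(encoder: torch.nn.Module,
                       loss_head: torch.nn.Module,
                       td_loader,
                       lr: float = 1e-4) -> None:
    """Fine-tune only the loss-module parameters on the text-dependent data,
    keeping the pre-trained CTA-Conformer backbone frozen.
    `encoder`, `loss_head`, and `td_loader` are hypothetical names."""
    for p in encoder.parameters():
        p.requires_grad = False                  # preserve text-independent knowledge
    optimizer = torch.optim.Adam(loss_head.parameters(), lr=lr)

    for feats, labels in td_loader:              # small text-dependent fine-tuning set
        with torch.no_grad():
            emb = encoder(feats)                 # embeddings from the frozen backbone
        loss = loss_head(emb, labels)            # e.g., the AAM-Softmax objective
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```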
3.2.2. Data Augmentation
We apply an online data augmentation strategy to improve generalization, utilizing background noise from the MUSAN dataset [
32] and reverberation from the simulated RIRs dataset [
33]. Volume change, noise injection, and reverberation are applied dynamically during training, effectively doubling the dataset size and improving acoustic robustness. Furthermore, to mitigate domain mismatch between clean enrollment and real-world test conditions, we simulate multiple far-field scenarios for enrollment using diverse RIRs, while test utterances remain unchanged.
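A minimal NumPy sketch of one augmentation pass is shown below; loading of the MUSAN noise segments and RIR filters is left abstract, and the SNR and gain values are placeholders rather than the ranges used in training.

```python
import numpy as np

def augment(wave: np.ndarray, noise: np.ndarray, rir: np.ndarray,
            snr_db: float = 10.0, gain_db: float = 0.0) -> np.ndarray:
    """Apply volume change, additive noise at a target SNR, and reverberation."""
    out = wave * 10 ** (gain_db / 20)                        # volume change
    noise = np.resize(noise, out.shape)                      # loop/crop noise to length
    snr = 10 ** (snr_db / 10)
    noise_scale = np.sqrt(np.sum(out ** 2) / (snr * np.sum(noise ** 2) + 1e-8))
    out = out + noise_scale * noise                          # noise injection
    out = np.convolve(out, rir)[: len(out)]                  # reverberation via RIR
    return out
```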
3.2.3. Embedding-Level Averaging
Real-world TD-SV applications often involve multiple microphones or multi-channel inputs, introducing acoustic variability due to differences in microphone placement and channel-specific signal-to-noise ratio (SNR). Such variability can degrade the consistency of speaker embeddings and negatively impact system performance. To address this issue, we employ embedding-level averaging during inference. For each speaker, embeddings from multiple channels are averaged to obtain a single representation vector, suppressing channel-specific noise and distortion while preserving speaker-relevant information. This approach reduces within-speaker variance and enhances between-speaker separability, resulting in more compact and discriminative embeddings.
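The corresponding inference-time operation is a simple average over per-channel embeddings; in the sketch below, re-normalizing the averaged vector before cosine scoring is our assumption.

```python
import torch
import torch.nn.functional as F

def average_channel_embeddings(channel_embs: torch.Tensor) -> torch.Tensor:
    """Average per-channel speaker embeddings of shape (num_channels, emb_dim)
    into a single representation; L2 re-normalization is an assumption here."""
    return F.normalize(channel_embs.mean(dim=0), dim=0)
```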
3.2.4. Score-Level Greedy Fusion
We perform system fusion at the score level to integrate the cosine similarity scores generated by multiple TD-SV models under different configurations. While simple averaging is widely used in prior work, including low-performing models in the fusion can degrade overall verification performance. To address this issue, we adopt a greedy score-level fusion strategy. Algorithm 1 presents the detailed greedy fusion procedure. As illustrated in
Figure 4, the method begins with the best-performing model and sequentially adds other models only if they improve performance. The final score is computed by averaging the selected models. This process leads to better performance than conventional averaging-based methods.
Algorithm 1 Score-level Greedy Fusion for TD-SV
Input: model set $M = \{m_{1}, \ldots, m_{N}\}$; validation scores $\{s_{j}\}_{j=1}^{N}$; labels $y$; metric $\mathrm{Perf}$ in Equation (6)
Output: selected subset $S$, weights $W$, best score $P^{*}$
  Compute the single-model performance $e_{j} = \mathrm{Perf}(s_{j}, y)$ for each $m_{j} \in M$ and sort the models by $e_{j}$ (ascending order)
  Initialize $S \leftarrow \{m_{(1)}\}$, $W \leftarrow \{1\}$, $P^{*} \leftarrow e_{(1)}$
  for each remaining model $m_{(j)}$, $j = 2, \ldots, N$ do
    Compute weights $W'$ by Equation (7) on $S \cup \{m_{(j)}\}$
    Fuse scores by Equation (8) using $W'$
    if $\mathrm{Perf}(\text{fused scores}, y) < P^{*}$ then
      $S \leftarrow S \cup \{m_{(j)}\}$, $W \leftarrow W'$, $P^{*} \leftarrow \mathrm{Perf}(\text{fused scores}, y)$
    end if
  end for
  return $S$, $W$, $P^{*}$
The algorithm addresses model subset selection by minimizing the fusion performance metric, which is formally defined by the objective function

$$S^{*} = \arg\min_{S \subseteq M} \mathrm{Perf}\!\left(f(S),\, y\right) \qquad (6)$$

where $S$ is a subset of models from the complete model set $M$, $f(S)$ represents the weighted combination of the similarity scores of the models in $S$ (defined in Equation (8)), and $\mathrm{Perf}$ denotes either EER or minDCF, depending on the evaluation criterion.
Unlike simple averaging approaches, our method employs performance-based weighting, in which each model's contribution is inversely proportional to its individual error rate:

$$w_{j} = \frac{1 / e_{j}}{\sum_{i \in S} 1 / e_{i}} \qquad (7)$$

where $w_{j}$ represents the weight of model $j$, $e_{j}$ is the performance score (EER or minDCF) of model $j$, and $s_{j}$ denotes the similarity scores from model $j$. The fused score is computed as follows:

$$f(S) = \sum_{j \in S} w_{j}\, s_{j} \qquad (8)$$
This weighting scheme ensures that better-performing models have a greater influence on the final decision, while the greedy selection strategy reduces the computational complexity from $\mathcal{O}(2^{|M|})$ for an exhaustive search to $\mathcal{O}(|M|)$ fusion evaluations while ensuring monotonic improvement of the validation metric. This makes the approach practical for real-world speaker verification systems with multiple model configurations.
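A compact NumPy sketch of Algorithm 1 with the weighting of Equations (7) and (8) is given below; the performance metric is abstracted behind a user-supplied perf_fn (e.g., an EER routine), and all variable names are illustrative.

```python
import numpy as np

def greedy_fusion(scores, labels, perf_fn):
    """Greedy score-level fusion (Algorithm 1): start from the best single model
    and add a model only if the inverse-error-weighted fusion (Eqs. (7)-(8))
    improves the metric (Eq. (6)).

    scores  -- list of per-model similarity score arrays, each of shape (num_trials,)
    labels  -- array of trial labels
    perf_fn -- callable returning EER or minDCF (lower is better)
    """
    perfs = [perf_fn(s, labels) for s in scores]        # single-model performance
    order = np.argsort(perfs)                           # ascending: best model first

    selected = [int(order[0])]
    weights = np.array([1.0])
    best = perfs[selected[0]]

    for j in order[1:]:
        cand = selected + [int(j)]
        w = np.array([1.0 / perfs[i] for i in cand])    # Eq. (7): inverse-error weights
        w = w / w.sum()
        fused = sum(wi * scores[i] for wi, i in zip(w, cand))   # Eq. (8): weighted sum
        p = perf_fn(fused, labels)
        if p < best:                                    # keep the model only if it helps
            selected, weights, best = cand, w, p
    return selected, weights, best
```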