To address the challenge of capturing speaker-specific acoustic features from short, fixed utterances, we propose a lightweight end-to-end TD-SV framework based on the Conformer architecture. We introduce the CTA-Conformer, a structure that incorporates a CTA module to dynamically emphasize the time–frequency regions essential for capturing distinctive speaker attributes. By jointly optimizing the feature extraction and scoring processes, the proposed model generates highly discriminative speaker embeddings while maintaining computational efficiency.
Figure 1 shows the training and inference phases of the proposed end-to-end TD-SV system based on CTA-Conformer. Unlike conventional stage-wise TD-SV systems, which separately train feature extractors and back-end models (e.g., PLDA), the proposed system unifies all components into an end-to-end structure. During the training phase, the CTA-Conformer generates speaker embeddings that encode speaker-specific features, and an angular margin loss, AAM-Softmax, is applied to enhance speaker discrimination. The inference phase consists of two stages: enrollment and verification. In enrollment, a speaker’s voice is processed to generate a reference speaker embedding, comparable to voice registration in virtual assistants. In the verification stage, a test utterance embedding is compared with the reference embedding using cosine similarity to determine whether the speaker matches the enrolled speaker.
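The verification decision itself reduces to a thresholded cosine similarity between the enrollment and test embeddings. The snippet below is a minimal sketch of that step, assuming PyTorch tensors for the embeddings; the threshold value shown is purely illustrative, not the system's tuned operating point.

```python
import torch
import torch.nn.functional as F

def verify(enroll_emb: torch.Tensor, test_emb: torch.Tensor,
           threshold: float = 0.35) -> bool:
    """Accept the identity claim when the cosine similarity between the
    enrollment and test embeddings exceeds a decision threshold
    (the value 0.35 here is illustrative only)."""
    score = F.cosine_similarity(enroll_emb.unsqueeze(0),
                                test_emb.unsqueeze(0)).item()
    return score >= threshold
```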
The proposed end-to-end TD-SV framework enables global context modeling by processing entire utterances, allowing robust extraction of speaker-specific features, particularly for short and noisy utterances. Additionally, optimization strategies such as transfer learning and embedding-level averaging enhance robustness under diverse microphone conditions, resulting in consistent performance improvements across various acoustic environments.
3.1. Proposed CTA-Conformer Architecture
Conventional TD-SV systems primarily employ TDNN or CNN architectures, which are effective for extracting local acoustic patterns but often fail to capture long-range temporal dependencies across an entire utterance. This limitation becomes more critical in TD-SV, where fixed phrases inherently reduce phonetic diversity and short utterances are more vulnerable to noise.
To address this limitation, we adopt a Conformer-based architecture that integrates self-attention mechanisms to effectively model global contextual information [13]. This approach is particularly beneficial for TD-SV tasks, where subtle speaker-specific characteristics—including speaking rate, intonation, and articulation—are often distributed across the utterance. By utilizing global contextual information, the Conformer architecture captures distributed speaker-specific features and reduces the impact of local distortions, such as background noise, resulting in more robust speaker embeddings and improved generalization under diverse acoustic environments.
We adopt the MFA-Conformer architecture, an advanced variant of the standard Conformer [22], which includes macaron-style feed-forward networks (FFNs), multi-head self-attention (MHSA), convolution modules, and relative positional encoding to maintain temporal structure in variable-length utterances. The MFA strategy concatenates the outputs of all Conformer blocks along the feature dimension, effectively capturing speaker features at multiple temporal scales from different depths.
Figure 2 illustrates the architecture of the proposed CTA-Conformer. Unlike the MFA-Conformer, the CTA-Conformer places the CTA module before the Conformer blocks. A distinctive characteristic of this placement is that CTA functions as an input enhancement module, operating directly on the spectrogram prior to encoding. Unlike conventional attention mechanisms, which typically operate within encoder layers or at the post-encoder feature aggregation stage, the CTA module selectively emphasizes speaker-discriminative acoustic patterns at the spectrogram level. Given the established importance of the standard deviation in capturing speaker-specific variance patterns [26,27], the CTA module is designed as a lightweight convolution-based attention mechanism. It applies frequency-wise standard deviation pooling to emphasize essential speaker-specific acoustic features in short, fixed utterances, and it dynamically focuses on important regions across the time and channel dimensions, enabling more precise speaker embedding extraction.
This design, detailed in
Figure 3, is motivated by the observation that speaker-related cues are often concentrated in specific time segments and frequency bands, making it essential to selectively emphasize those areas. Unlike average pooling that smooths out spectral variations, frequency-wise standard deviation pooling explicitly quantifies variability across frequency bins, thereby retaining speaker-specific spectral dynamics such as acoustic variations and articulation differences. This approach is particularly beneficial in TD-SV scenarios, where limited phonetic diversity requires reliance on subtle spectral variations to achieve robust speaker discrimination.
The CTA module operates on the input feature map $X \in \mathbb{R}^{C \times T \times F}$, applying attention across the channel, time, and frequency dimensions. To compute the attention weights, the module performs frequency-wise standard deviation pooling along the frequency axis to capture acoustic variability patterns across frequency bins. The standard deviation for each channel $c$ and time step $t$ is calculated as follows:

$$s_{c,t} = \sqrt{\frac{1}{F}\sum_{f=1}^{F}\left(x_{c,t,f} - \mu_{c,t}\right)^{2}}$$
where $\mu_{c,t} = \frac{1}{F}\sum_{f=1}^{F} x_{c,t,f}$ represents the frequency-wise mean, and $x_{c,t,f}$ denotes the input feature value at channel $c$, time step $t$, and frequency bin $f$. The computed statistics $S = \left[s_{c,t}\right]$ are processed through two 3 × 3 convolutions with batch normalization (BN) and rectified linear unit (ReLU) activation, followed by sigmoid activation $\sigma(\cdot)$ to generate the attention weights $A$:

$$A = \sigma\!\left(\mathrm{Conv}_{3\times 3}\!\left(\mathrm{ReLU}\!\left(\mathrm{BN}\!\left(\mathrm{Conv}_{3\times 3}(S)\right)\right)\right)\right)$$
The CTA module computes the final output $Y$ by performing element-wise multiplication between the original input feature map $X$ and the attention weights $A$ (broadcast along the frequency dimension), enabling the model to focus on speaker-relevant time–frequency regions:

$$Y = X \odot A$$
Through this process, the CTA module suppresses irrelevant frequency regions and selectively emphasizes time–frequency regions critical for speaker features. In TD-SV environments, where fixed phrases are used, time alignment and subtle variations in frequency distribution provide essential cues for distinguishing between speakers. By leveraging these structural characteristics, the CTA module enhances speaker discriminability even under limited phonetic variability and short utterance duration, thereby contributing to more robust speaker embeddings. Unlike SENet’s global channel attention that loses frequency-specific information, CTA preserves frequency-wise variance patterns essential for speaker discrimination in short utterances while maintaining computational efficiency.
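To make the data flow concrete, the following is a minimal PyTorch sketch of the CTA module as we read the description above; the (batch, channel, time, frequency) tensor layout, the default middle-channel width, and the placement of BN/ReLU between the two convolutions are assumptions for illustration rather than the exact released implementation.

```python
import torch
import torch.nn as nn

class CTAModule(nn.Module):
    """Sketch of the CTA block: frequency-wise standard deviation pooling, two
    3x3 convolutions with BN/ReLU, and a sigmoid gate applied to the input.
    Assumed tensor layout: (batch, channels, time, frequency)."""

    def __init__(self, in_channels: int, mid_channels: int = 8, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        self.conv1 = nn.Conv2d(in_channels, mid_channels, kernel_size, padding=pad)
        self.bn = nn.BatchNorm2d(mid_channels)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = nn.Conv2d(mid_channels, in_channels, kernel_size, padding=pad)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frequency-wise standard deviation pooling: (B, C, T, F) -> (B, C, T, 1)
        stats = x.std(dim=-1, keepdim=True)
        # Two convolutions with BN and ReLU in between, then a sigmoid gate
        attn = self.sigmoid(self.conv2(self.relu(self.bn(self.conv1(stats)))))
        # Re-weight the input; the attention map broadcasts along the frequency axis
        return x * attn
```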
3.1.1. Computational Complexity Analysis
To further demonstrate the efficiency of the proposed CTA module, we provide a detailed computational complexity analysis.
CTA Module Complexity: Given an input feature map $X \in \mathbb{R}^{C \times T \times F}$ and the CTA module parameters $C$ (in_channels), $k$ (kernel_size), and $C_{\mathrm{mid}}$ (middle_channels), the computational complexity is analyzed as follows:
Frequency-wise standard deviation pooling: $\mathcal{O}(C\,T\,F)$;
Two $k \times k$ convolutions on the pooled statistics: $\mathcal{O}(C\,C_{\mathrm{mid}}\,k^{2}\,T)$;
Element-wise re-weighting: $\mathcal{O}(C\,T\,F)$;
Total: $\mathcal{O}_{\mathrm{CTA}} = \mathcal{O}\!\left(C\,T\,F + C\,C_{\mathrm{mid}}\,k^{2}\,T\right)$.
Baseline MFA-Conformer Complexity: For a single Conformer block with feature dimension d and sequence length T:
Multi-head self-attention: $\mathcal{O}(T^{2} d + T d^{2})$;
Macaron-style feed-forward networks: $\mathcal{O}(T d^{2})$;
Convolution module: $\mathcal{O}(T d^{2} + T d k)$, with depthwise convolution kernel size $k$;
Total per block: $\mathcal{O}_{\mathrm{block}} = \mathcal{O}\!\left(T^{2} d + T d^{2} + T d k\right)$.
Relative Overhead: We quantify the additional computation introduced by CTA at the point where it is applied, i.e., after convolutional subsampling. Let $X \in \mathbb{R}^{C \times T \times F}$ be the input to the first Conformer block (post-subsampling), and let the Conformer block use relative-position MHSA with expansion ratio $r$ and convolution kernel size $k$. The per-block relative overhead is as follows:

$$\mathrm{Overhead} = \frac{\mathcal{O}_{\mathrm{CTA}}}{\mathcal{O}_{\mathrm{block}}} \times 100\%$$

where, for 2 s audio segments, typical values yield Overhead ≈ 2.13%.
Parameter Analysis: The CTA module introduces minimal parameter overhead:
First Conv2D ($C \to C_{\mathrm{mid}}$, $k \times k$): $C\,C_{\mathrm{mid}}\,k^{2}$;
Second Conv2D ($C_{\mathrm{mid}} \to C$, $k \times k$): $C_{\mathrm{mid}}\,C\,k^{2}$;
BatchNorm: $2\,C_{\mathrm{mid}}$;
Total CTA parameters: $2\,C\,C_{\mathrm{mid}}\,k^{2} + 2\,C_{\mathrm{mid}} \approx 37$ K.
For a baseline Conformer model with ∼14.2 M parameters, this represents only a 0.26% parameter increase.
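As a quick sanity check on the figures above, the short script below evaluates the parameter formula for one plausible configuration; the specific values of C, C_mid, and k are assumptions chosen only because they reproduce the reported ∼37 K parameters, not the paper's stated settings.

```python
# Hypothetical configuration, chosen only to reproduce the reported ~37 K figure:
# C channels after convolutional subsampling, a small bottleneck, 3x3 kernels.
C, C_mid, k = 256, 8, 3

conv1 = C * C_mid * k * k            # first Conv2D weights
conv2 = C_mid * C * k * k            # second Conv2D weights
bn = 2 * C_mid                       # BatchNorm scale and shift

total = conv1 + conv2 + bn
print(f"CTA parameters: {total:,}")                        # 36,880 (~37 K)
print(f"Relative increase: {total / 14.2e6 * 100:.2f}%")   # ~0.26% of 14.2 M
```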
Comparison with Existing Mechanisms:
Table 1 summarizes the complexity of the attention modules. CTA adds ∼37 K parameters (0.26% of a 14.2 M baseline) and ≈2% per-block compute, scales linearly with $C$ (vs. $\mathcal{O}(C^{2})$ for SE/CBAM), and leverages frequency-wise statistics pooling with temporal gating for TD-SV.
In summary, the CTA module introduces only marginal computational and parameter overhead, ensuring the proposed model remains efficient for deployment in resource-constrained environments.
3.1.2. Utterance-Level Representation and Loss Functions
After frame-level features are processed through Conformer blocks and aggregated using the MFA strategy, attentive statistical pooling (ASP) [
26] is applied to convert the sequence into a fixed-dimensional utterance-level representation. ASP computes a weighted mean and standard deviation by assigning importance to each frame, allowing the model to focus on salient regions while preserving global distributional characteristics. The resulting representation is then passed through a fully connected layer to produce a 192-dimensional speaker embedding.
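A simplified PyTorch rendering of the ASP step is sketched below; the single-head attention form and the hidden size are assumptions for illustration, not the exact configuration used in the paper.

```python
import torch
import torch.nn as nn

class AttentiveStatsPooling(nn.Module):
    """Simplified ASP: frame-level attention weights produce a weighted mean and
    standard deviation, concatenated into a fixed-length utterance vector."""

    def __init__(self, feat_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (B, T, D) frame-level features aggregated by the MFA strategy
        w = torch.softmax(self.attention(h), dim=1)       # (B, T, 1) frame weights
        mu = torch.sum(w * h, dim=1)                       # weighted mean (B, D)
        var = torch.sum(w * h ** 2, dim=1) - mu ** 2       # weighted variance
        std = torch.sqrt(var.clamp(min=1e-8))              # weighted std (B, D)
        return torch.cat([mu, std], dim=1)                 # (B, 2D)
```

The concatenated weighted mean and standard deviation are then mapped by the fully connected layer to the 192-dimensional speaker embedding.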
The extracted speaker embeddings are trained using margin-based loss functions, such as additive margin softmax (AM-Softmax) and additive angular margin softmax (AAM-Softmax) [28]. While cross-entropy (CE) loss, widely used in TD-SV systems [12,29,30], optimizes class probabilities through Softmax outputs, it lacks explicit constraints to ensure within-speaker compactness in the embedding space. In contrast, margin-based loss functions improve within-speaker compactness and between-speaker separation, resulting in more discriminative speaker embeddings and more precise decision boundaries. These properties are particularly beneficial in TD-SV scenarios with short utterances [31] and limited training data, where the risk of misclassification is higher. In this study, AAM-Softmax is employed to train the speaker embeddings.
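For reference, a compact sketch of the AAM-Softmax objective is given below; the margin and scale values are common defaults in the speaker verification literature and are assumptions here rather than the hyperparameters reported in this work.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AAMSoftmax(nn.Module):
    """Additive angular margin softmax: add a margin m to the target-class angle
    before scaling, which tightens within-speaker clusters."""

    def __init__(self, emb_dim: int = 192, num_speakers: int = 1000,
                 margin: float = 0.2, scale: float = 30.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_speakers, emb_dim))
        nn.init.xavier_uniform_(self.weight)
        self.margin, self.scale = margin, scale

    def forward(self, emb: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Cosine similarity between L2-normalized embeddings and class weights
        cosine = F.linear(F.normalize(emb), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
        # Add the angular margin only to the target-class logits
        one_hot = F.one_hot(labels, cosine.size(1)).float()
        logits = torch.cos(theta + self.margin * one_hot) * self.scale
        return F.cross_entropy(logits, labels)
```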
3.2. Training and Optimization Strategies
3.2.1. Transfer Learning for Domain Adaptation
In practical TD-SV scenarios, discrepancies in microphone characteristics, channel conditions, and background noise introduce domain mismatch between the training and test distributions, adversely affecting model generalization. To address this problem, we employ transfer learning as a practical and effective domain adaptation (DA) strategy. First, the CTA-Conformer is pre-trained on a large-scale text-independent dataset that includes diverse recording conditions, enabling the model to learn speaker representations that are robust to such variability. Then, fine-tuning is performed using a small text-dependent dataset collected under realistic environments characterized by short utterances, far-field microphones, reverberation, and noise. During fine-tuning, only the parameters of the loss function module are updated, preserving general knowledge from text-independent data while adapting to the specific characteristics of the TD-SV domain.
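Under these assumptions, the fine-tuning step can be sketched as follows; the module and loader names (encoder, loss_head, td_loader) and the learning rate are hypothetical placeholders, shown only to illustrate freezing the pre-trained backbone and updating the loss-module parameters.

```python
import torch

def finetune_loss_head(encoder: torch.nn.Module,
                       loss_head: torch.nn.Module,
                       td_loader,
                       lr: float = 1e-4) -> None:
    """Fine-tune only the loss-module parameters on the text-dependent data,
    keeping the pre-trained CTA-Conformer backbone frozen.
    `encoder`, `loss_head`, and `td_loader` are hypothetical names."""
    for p in encoder.parameters():
        p.requires_grad = False                  # preserve text-independent knowledge
    optimizer = torch.optim.Adam(loss_head.parameters(), lr=lr)

    for feats, labels in td_loader:              # small text-dependent fine-tuning set
        with torch.no_grad():
            emb = encoder(feats)                 # embeddings from the frozen backbone
        loss = loss_head(emb, labels)            # e.g., the AAM-Softmax objective
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```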
3.2.2. Data Augmentation
We apply an online data augmentation strategy to improve generalization, utilizing background noise from the MUSAN dataset [
32] and reverberation from the simulated RIRs dataset [
33]. Volume change, noise injection, and reverberation are applied dynamically during training, effectively doubling the dataset size and improving acoustic robustness. Furthermore, to mitigate domain mismatch between clean enrollment and real-world test conditions, we simulate multiple far-field scenarios for enrollment using diverse RIRs, while test utterances remain unchanged.
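A minimal NumPy sketch of one augmentation pass is shown below; loading of the MUSAN noise segments and RIR filters is left abstract, and the SNR and gain values are placeholders rather than the ranges used in training.

```python
import numpy as np

def augment(wave: np.ndarray, noise: np.ndarray, rir: np.ndarray,
            snr_db: float = 10.0, gain_db: float = 0.0) -> np.ndarray:
    """Apply volume change, additive noise at a target SNR, and reverberation."""
    out = wave * 10 ** (gain_db / 20)                        # volume change
    noise = np.resize(noise, out.shape)                      # loop/crop noise to length
    snr = 10 ** (snr_db / 10)
    noise_scale = np.sqrt(np.sum(out ** 2) / (snr * np.sum(noise ** 2) + 1e-8))
    out = out + noise_scale * noise                          # noise injection
    out = np.convolve(out, rir)[: len(out)]                  # reverberation via RIR
    return out
```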
3.2.3. Embedding-Level Averaging
Real-world TD-SV applications often involve multiple microphones or multi-channel inputs, introducing acoustic variability due to differences in microphone placement and channel-specific signal-to-noise ratio (SNR). Such variability can degrade the consistency of speaker embeddings and negatively impact system performance. To address this issue, we employ embedding-level averaging during inference. For each speaker, embeddings from multiple channels are averaged to obtain a single representation vector, suppressing channel-specific noise and distortion while preserving speaker-relevant information. This approach reduces within-speaker variance and enhances between-speaker separability, resulting in more compact and discriminative embeddings.
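The corresponding inference-time operation is a simple average over per-channel embeddings; in the sketch below, re-normalizing the averaged vector before cosine scoring is our assumption.

```python
import torch
import torch.nn.functional as F

def average_channel_embeddings(channel_embs: torch.Tensor) -> torch.Tensor:
    """Average per-channel speaker embeddings of shape (num_channels, emb_dim)
    into a single representation; L2 re-normalization is an assumption here."""
    return F.normalize(channel_embs.mean(dim=0), dim=0)
```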
3.2.4. Score-Level Greedy Fusion
We perform system fusion at the score level to integrate the cosine similarity scores generated by multiple TD-SV models under different configurations. While simple averaging is widely used in prior work, including low-performing models in the fusion can degrade overall verification performance. To address this issue, we adopt a greedy score-level fusion strategy. Algorithm 1 presents the detailed greedy fusion procedure. As illustrated in
Figure 4, the method begins with the best-performing model and sequentially adds other models only if they improve performance. The final score is computed by averaging the selected models. This process leads to better performance than conventional averaging-based methods.
Algorithm 1 Score-level Greedy Fusion for TD-SV
Input: model set $M = \{m_{1}, \ldots, m_{N}\}$; validation scores $\{s_{j}\}_{j=1}^{N}$; labels $y$; metric $\mathrm{Perf}$ in Equation (6)
Output: selected subset $S$, weights $W$, best score $P^{*}$
  Compute the single-model performance $e_{j} = \mathrm{Perf}(s_{j}, y)$ for each $m_{j} \in M$ and sort the models by $e_{j}$ (ascending order)
  Initialize $S \leftarrow \{m_{(1)}\}$, $W \leftarrow \{1\}$, $P^{*} \leftarrow e_{(1)}$
  for each remaining model $m_{(j)}$, $j = 2, \ldots, N$ do
    Compute weights $W'$ by Equation (7) on $S \cup \{m_{(j)}\}$
    Fuse scores by Equation (8) using $W'$
    if $\mathrm{Perf}(\text{fused scores}, y) < P^{*}$ then
      $S \leftarrow S \cup \{m_{(j)}\}$, $W \leftarrow W'$, $P^{*} \leftarrow \mathrm{Perf}(\text{fused scores}, y)$
    end if
  end for
  return $S$, $W$, $P^{*}$
The algorithm addresses model subset selection by minimizing the fusion performance metric, which is formally defined by the objective function

$$S^{*} = \arg\min_{S \subseteq M} \mathrm{Perf}\!\left(f(S),\, y\right) \qquad (6)$$

where $S$ is a subset of models from the complete model set $M$, $f(S)$ represents the weighted combination of the similarity scores of the models in $S$ (defined in Equation (8)), and $\mathrm{Perf}$ denotes either EER or minDCF, depending on the evaluation criterion.
Unlike simple averaging approaches, our method employs performance-based weighting, in which each model's contribution is inversely proportional to its individual error rate:

$$w_{j} = \frac{1 / e_{j}}{\sum_{i \in S} 1 / e_{i}} \qquad (7)$$

where $w_{j}$ represents the weight of model $j$, $e_{j}$ is the performance score (EER or minDCF) of model $j$, and $s_{j}$ denotes the similarity scores from model $j$. The fused score is computed as follows:

$$f(S) = \sum_{j \in S} w_{j}\, s_{j} \qquad (8)$$
This weighting scheme ensures that better-performing models have a greater influence on the final decision, while the greedy selection strategy reduces the computational complexity from $\mathcal{O}(2^{|M|})$ for an exhaustive search to $\mathcal{O}(|M|)$ fusion evaluations while ensuring monotonic improvement of the validation metric. This makes the approach practical for real-world speaker verification systems with multiple model configurations.
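A compact NumPy sketch of Algorithm 1 with the weighting of Equations (7) and (8) is given below; the performance metric is abstracted behind a user-supplied perf_fn (e.g., an EER routine), and all variable names are illustrative.

```python
import numpy as np

def greedy_fusion(scores, labels, perf_fn):
    """Greedy score-level fusion (Algorithm 1): start from the best single model
    and add a model only if the inverse-error-weighted fusion (Eqs. (7)-(8))
    improves the metric (Eq. (6)).

    scores  -- list of per-model similarity score arrays, each of shape (num_trials,)
    labels  -- array of trial labels
    perf_fn -- callable returning EER or minDCF (lower is better)
    """
    perfs = [perf_fn(s, labels) for s in scores]        # single-model performance
    order = np.argsort(perfs)                           # ascending: best model first

    selected = [int(order[0])]
    weights = np.array([1.0])
    best = perfs[selected[0]]

    for j in order[1:]:
        cand = selected + [int(j)]
        w = np.array([1.0 / perfs[i] for i in cand])    # Eq. (7): inverse-error weights
        w = w / w.sum()
        fused = sum(wi * scores[i] for wi, i in zip(w, cand))   # Eq. (8): weighted sum
        p = perf_fn(fused, labels)
        if p < best:                                    # keep the model only if it helps
            selected, weights, best = cand, w, p
    return selected, weights, best
```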