Article

Symmetry and Asymmetry Principles in Deep Speaker Verification Systems: Balancing Robustness and Discrimination Through Hybrid Neural Architectures

by Sundareswari Thiyagarajan and Deok-Hwan Kim *
Department of Electrical and Computer Engineering, Inha University, Incheon 22212, Republic of Korea
* Author to whom correspondence should be addressed.
Symmetry 2026, 18(1), 121; https://doi.org/10.3390/sym18010121
Submission received: 24 November 2025 / Revised: 23 December 2025 / Accepted: 5 January 2026 / Published: 8 January 2026

Abstract

Symmetry and asymmetry are foundational design principles in artificial intelligence, defining the balance between invariance and adaptability in multimodal learning systems. In audio-visual speaker verification, where speech and lip-motion features are jointly modeled to determine whether two utterances belong to the same individual, these principles govern both fairness and discriminative power. In this work, we analyze how symmetry and asymmetry emerge within a gated-fusion architecture that integrates Time-Delay Neural Networks and Bidirectional Long Short-Term Memory encoders for speech, ResNet-based visual lip encoders, and a shared Conformer-based temporal backbone. Structural symmetry is preserved through weight-sharing across paired utterances and symmetric cosine-based scoring, ensuring verification consistency regardless of input order. In contrast, asymmetry is intentionally introduced through modality-dependent temporal encoding, multi-head attention pooling, and a learnable gating mechanism that dynamically re-weights the contribution of audio and visual streams at each timestep. This controlled asymmetry allows the model to rely on visual cues when speech is noisy and, conversely, on speech when lip visibility is degraded, yielding adaptive robustness under cross-modal degradation. Experimental results demonstrate that combining symmetric embedding space design with adaptive asymmetric fusion significantly improves generalization, reducing the Equal Error Rate (EER) to 3.419% on the VoxCeleb2 test set without sacrificing interpretability. The findings show that symmetry ensures stable and fair decision-making, while learnable asymmetry enables modality awareness; together, they form a principled foundation for next-generation audio-visual speaker verification systems.

1. Introduction

Symmetry and asymmetry serve as foundational design principles in both natural and artificial systems, defining the balance between stability, fairness, and adaptability. In artificial intelligence (AI), symmetry enforces invariance and interpretability, enabling learned representations to remain stable under equivalent transformations such as channel variation, temporal shifts, or input order [1,2]. Conversely, asymmetry introduces directionality and differentiation, allowing systems to represent complex interactions, heterogeneous modalities, and context-adaptive decision boundaries [3]. The interplay between symmetry and asymmetry is especially relevant to speaker verification (SV), a task that determines whether two utterances belong to the same individual, where robustness, discriminability, and fairness must coexist under open-set and potentially noisy conditions.
Historically, speaker verification has been dominated by audio-only models based on embeddings such as the x-vector [4] and ECAPA-TDNN [5], which leverage Time-Delay Neural Networks (TDNNs), channel attention, and attentive or statistical pooling mechanisms to convert speech signals into fixed-dimensional speaker embeddings. These models are typically trained using margin-based objectives such as Additive Angular Margin Softmax (AAM-Softmax) [6], which enforce intra-class compactness and inter-class separation in angular space. However, purely audio-based systems are constrained by the unimodal nature of speaker identity, ignoring the complementary biometric signal present in visual cues such as lip motion, facial geometry, and articulatory dynamics [7,8].
Motivated by this limitation, recent work has explored audio-visual speaker verification (AV-SV) by integrating speech and face-derived embeddings using late fusion, shared encoders, or multimodal transformers [9,10]. Yet, many existing AV-SV systems either rely on static fusion (e.g., feature concatenation) or treat audio and visual modalities symmetrically regardless of noise conditions, lacking the flexibility to adaptively prioritize one modality when the other is unreliable. Such designs fail to exploit the inherent asymmetry in modality reliability. For example, visual cues remain intact under acoustic interference, while speech features are more reliable than visual motion when lip visibility is poor.
To address these issues, we propose a unified audio-visual speaker verification framework that balances structural symmetry in its architecture with controlled asymmetry in its optimization and fusion mechanisms. The system consists of a TDNN-BiLSTM (Bidirectional Long Short-Term Memory) speech encoder that extracts temporal speech embeddings from 80-dimensional FBANK features, and a parallel ResNet-based visual encoder that processes cropped lip ROI frames sampled at 25 FPS. These two modalities are fused using a learnable Gated Fusion Module, which dynamically adjusts the contribution of audio and visual features at each time step based on context, allowing robustness to noise, blur, or occlusion. The fused representation is passed to a shared Conformer encoder that models cross-modal temporal dependencies through joint convolution and self-attention. A Multi-Head Attention (MHA) pooling layer aggregates the encoded sequence into a fixed-length vector, which is optimized using AAM-Softmax (ArcFace) to enforce angular margin separation between speaker embeddings.
Symmetry in the proposed framework is reinforced through a shared temporal backbone, parallel dual-branch processing for paired utterances, and cosine-based scoring, which together yield consistent, order-invariant verification decisions across different modality pairings. In contrast, asymmetry is intentionally introduced through a learnable gated-fusion mechanism that dynamically re-weights audio and visual contributions, attention-based pooling that emphasizes the most informative time regions, and an angular-margin objective that sharpens inter-speaker separation in the embedding space. This combination enables the model to rely more heavily on visual cues under severe acoustic corruption, while still benefiting from speech cues when visual information is degraded by blur, occlusion, or poor illumination. Experiments on VoxCeleb2 [11] validate these design choices, achieving an Equal Error Rate (EER) of 3.419% and a minDCF of 0.342 on the VoxCeleb2 test set, and consistently outperforming audio-only baselines and static multimodal fusion strategies. Collectively, these results highlight that controlled asymmetry, when embedded within a symmetry-preserving architectural backbone, improves robustness to real-world degradations while maintaining strong discriminability and interpretability in audio–visual speaker verification.

2. Related Work

Early research in speaker verification primarily focused on audio-only embeddings learned from handcrafted or statistical models, such as i-vectors and Gaussian Mixture Models (GMMs), which were later surpassed by deep neural architectures, including TDNNs and x-vectors [4]. The introduction of ECAPA-TDNN further strengthened audio-based verification by incorporating channel attention, Res2Net modules, and multi-layer aggregation to enhance local and global acoustic feature modeling [5]. Despite their success, audio-only systems can degrade significantly under adverse conditions such as noise, reverberation, and channel mismatch [7].
To mitigate these limitations, multimodal speaker verification has increasingly incorporated visual information, such as lip motion and face appearance, which can complement speech cues when audio is unreliable. In this context, a growing body of work has explored audio-visual speaker identification and verification, with early systems relying on late fusion of independently learned face and voice embeddings [8], and more recent systems adopting joint encoder–decoder or cross-attentional fusion strategies [9,10]. These approaches highlight the potential of visual signals to improve robustness in real-world settings.
Self-supervised learning models such as AV-HuBERT have demonstrated the feasibility of using masked multimodal pretraining to learn transferable speech–vision representations without labeled data, opening pathways toward robust multimodal embedding learning. Similarly, transformer-based architectures have been applied to multimodal speech tasks to improve long-range temporal modeling and cross-modal alignment [12,13].
Beyond architectural choices, adaptive fusion strategies have been proposed to address modality imbalance and missing or corrupted inputs. These include confidence-aware fusion, attention-based weighting, and modality dropout mechanisms, which aim to dynamically adjust reliance on each modality depending on quality and availability [14,15]. Such designs are particularly important in practical deployments where acoustic noise or visual occlusion may occur frequently.
In addition, margin-based classification objectives have become standard in both speaker and face recognition due to their strong embedding discrimination. Methods such as ArcFace (AAM-Softmax) [6] and SphereFace [16] enforce angular margins that promote compact intra-class clustering and separable inter-class boundaries, often outperforming Triplet Loss for open-set verification. Motivated by these developments, recent audio–visual speaker verification studies have explored adaptive lip–voice fusion using gated mechanisms or cross-modal attention, reporting consistent gains over audio-only baselines, especially in noisy and unconstrained video scenarios [17,18]. Cross-modal discriminative learning has been explored to strengthen the correspondence between voice and visual cues through modality interaction [19]. Self-supervised multimodal learning frameworks such as AV-HuBERT [20] have also been proposed to learn lip-based audio–visual speaker embeddings via masked prediction, improving robustness and transferability without heavy reliance on labeled data.
Overall, existing work indicates a clear trend toward hybrid frameworks that combine robust symmetric embedding extractors with controlled asymmetric fusion and optimization, improving reliability and interpretability in open-set multimodal speaker verification.

3. Proposed Model

The proposed architecture, as shown in Figure 1, is designed to balance symmetric multimodal encoding with asymmetric fusion and optimization, to address the limitations of both unimodal speaker verification [4,5] and static multimodal fusion systems.

3.1. Audio Stream Encoder

Our architecture builds on the x-vector framework by integrating TDNN layers with BiLSTM units, forming a hybrid encoder that captures both local and global temporal characteristics of speech. This design also reflects key symmetry principles in deep speaker verification, where structural symmetry improves robustness and controlled asymmetry enhances discriminative ability. An example FBANK feature map (log-Mel Filterbank Energies) used as input to the audio encoder is shown in Figure 2.
TDNN layers operate as one-dimensional convolutions across a fixed temporal context window. By observing a small set of past and future frames, TDNNs extract short-term dependencies such as phoneme transitions and spectral variations. This operation is symmetric across time within the context window; each frame is processed using the same convolutional kernel, ensuring time-shift invariance and encouraging a consistent feature space regardless of where a phonetic event occurs. The shared-weight structure of TDNNs also contributes to architectural symmetry, promoting stable learning and robustness to temporal distortions. Additionally, TDNNs are highly parallelizable, making them efficient for processing long utterances.
While TDNNs capture localized acoustic cues, they are limited in modeling global temporal structure. To overcome this, we incorporate a BiLSTM module after the TDNN layers. The BiLSTM processes the sequence in both forward and backward directions, producing two hidden states that are concatenated. This bidirectional mechanism introduces an explicit form of temporal symmetry, as the network treats the past and future with equal importance. Such symmetry ensures that the representation remains invariant to whether discriminative cues appear earlier or later in the utterance. At the same time, the BiLSTM introduces asymmetry in information flow, because forward and backward recurrent paths learn complementary patterns, reflecting differences in how past and future dependencies contribute to speaker identity. The ordering of TDNN followed by BiLSTM is deliberate. TDNN layers first generate clean, localized, and time-invariant features. The BiLSTM then operates on these higher-level representations, allowing it to focus on global temporal characteristics such as rhythm, prosody, and speaking style. This separation of local and global modeling reflects an interplay of symmetric feature extraction (TDNN) and asymmetric temporal aggregation (BiLSTM), which improves training stability and enhances representational richness.
A TDNN [4] applies a 1D convolution across a temporal context window:
$h_t = \sigma\left( W \left[ x_{t-c}, \ldots, x_t, \ldots, x_{t+c} \right] + b \right)$
where $x_t$ is the input feature vector at time $t$, $c$ is the scalar context size, $W$ is the weight matrix, $b$ is the bias vector, and $\sigma(\cdot)$ is a non-linear activation (e.g., ReLU). This operation captures short-term temporal dependencies such as phoneme transitions and spectral dynamics.
The BiLSTM [21] captures long-range dependencies by processing the sequence in both directions:
$\overrightarrow{h}_t = \mathrm{LSTM}_f\!\left(x_t, \overrightarrow{h}_{t-1}\right)$
$\overleftarrow{h}_t = \mathrm{LSTM}_b\!\left(x_t, \overleftarrow{h}_{t+1}\right)$
The BiLSTM output is:
$h_t^{\mathrm{BiLSTM}} = \left[\overrightarrow{h}_t;\, \overleftarrow{h}_t\right]$
where $x_t$ is the input vector at time $t$, $\overrightarrow{h}_t$ is the forward hidden state vector (past → future), $\overleftarrow{h}_t$ is the backward hidden state vector (future → past), and the concatenation combines both directions.
Together, the TDNN and BiLSTM components form a hybrid encoder that produces robust frame-level representations containing both fine-grained phonetic cues and long-term speaker-specific patterns. This combination of structural symmetry (shared-weight TDNN and bidirectional recurrence) and functional asymmetry (direction-specific LSTM gating and sequential learning) provides a strong foundation for high-quality speaker embeddings.
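As a concrete illustration of this hybrid encoder, the sketch below shows a minimal PyTorch implementation of TDNN layers followed by a BiLSTM. The layer widths, dilations, and number of layers are illustrative assumptions, not the exact configuration used in this work.

```python
# Minimal sketch of the TDNN-BiLSTM audio encoder (assumed PyTorch
# implementation; layer sizes and context widths are illustrative only).
import torch
import torch.nn as nn

class TDNNBiLSTMEncoder(nn.Module):
    def __init__(self, feat_dim=80, tdnn_dim=512, lstm_dim=256):
        super().__init__()
        # TDNN layers: 1D convolutions over time with a shared, symmetric context window.
        self.tdnn = nn.Sequential(
            nn.Conv1d(feat_dim, tdnn_dim, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(tdnn_dim, tdnn_dim, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(tdnn_dim, tdnn_dim, kernel_size=3, dilation=3), nn.ReLU(),
        )
        # BiLSTM aggregates global temporal structure over the TDNN outputs.
        self.bilstm = nn.LSTM(tdnn_dim, lstm_dim, batch_first=True, bidirectional=True)

    def forward(self, x):                       # x: (batch, time, feat_dim)
        h = self.tdnn(x.transpose(1, 2))        # -> (batch, tdnn_dim, time')
        h, _ = self.bilstm(h.transpose(1, 2))   # -> (batch, time', 2 * lstm_dim)
        return h                                # frame-level audio features a_t

# Example: ~3 s of 80-dim FBANK frames at a 10 ms hop (~300 frames).
frames = torch.randn(1, 300, 80)
print(TDNNBiLSTMEncoder()(frames).shape)        # torch.Size([1, 286, 512])
```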

3.2. Video Stream Encoder

The visual stream is formed by extracting a sequence of lip Region-of-Interest (ROI) frames from speaker videos. Frames are sampled at 25 frames per second, and each frame is independently processed through a ResNet-18 backbone [22,23,24], pre-trained on large-scale face or lip datasets, to extract spatially encoded feature vectors representing lip shape, motion trajectories, and articulation patterns [8,12]. By maintaining temporal correspondence between speech and lip features, the model preserves multimodal symmetry while preparing both streams for time-synchronized fusion.
$v_t = \phi_v(I_t; \theta_v), \qquad t = 1, \ldots, T$
$\tilde{v}_t = W_p v_t + b_p$
where $I_t$ is the lip ROI frame at time $t$, $\phi_v(\cdot)$ denotes the ResNet-18 encoder with parameters $\theta_v$, $v_t$ is the extracted visual feature vector, $W_p$ and $b_p$ are the projection parameters, and $\tilde{v}_t \in \mathbb{R}^d$ is the projected visual feature vector.
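A minimal sketch of this frame-wise visual encoder and projection is given below, assuming a torchvision ResNet-18 backbone whose first convolution is adapted to single-channel 112 × 112 lip ROIs; the embedding dimension and adaptation details are assumptions rather than the exact configuration of this work.

```python
# Sketch of the frame-wise ResNet-18 lip encoder with linear projection
# (assumed PyTorch/torchvision implementation).
import torch
import torch.nn as nn
from torchvision.models import resnet18

class LipEncoder(nn.Module):
    def __init__(self, embed_dim=512):
        super().__init__()
        backbone = resnet18(weights=None)                 # pretrained weights optional
        # Adapt the stem to single-channel grayscale lip ROIs (assumption).
        backbone.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
        backbone.fc = nn.Identity()                       # keep the 512-d pooled feature
        self.backbone = backbone
        self.proj = nn.Linear(512, embed_dim)             # W_p, b_p projection

    def forward(self, rois):                              # rois: (batch, T, 1, 112, 112)
        b, t = rois.shape[:2]
        v = self.backbone(rois.flatten(0, 1))             # per-frame features (b*t, 512)
        v = self.proj(v)                                  # projected features \tilde{v}_t
        return v.view(b, t, -1)                           # (batch, T, embed_dim)

clip = torch.randn(1, 25, 1, 112, 112)                    # one second of lip ROIs at 25 FPS
print(LipEncoder()(clip).shape)                           # torch.Size([1, 25, 512])
```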

3.3. Gated Fusion Module

To overcome the limitations of static fusion methods (e.g., concatenation, averaging) that fail under modality imbalance, we employ a learnable Gated Fusion Module [25], inspired by multimodal gating in adaptive audio-visual integration for speaker verification. Given audio features $a_t \in \mathbb{R}^d$ and visual features $v_t \in \mathbb{R}^d$ at timestep $t$, a fusion gate $g_t \in [0, 1]$ is computed as:
$g_t = \sigma\!\left(w^{\top}[a_t; v_t] + b_g\right)$
$f_t = g_t \odot a_t + (1 - g_t) \odot v_t$
where $a_t$, $v_t$, and $f_t$ are vectors, $g_t$ is a scalar gate, $w$ is a parameter vector, $b_g$ is a scalar bias, and $\odot$ denotes elementwise (broadcast) multiplication. This mechanism introduces controlled asymmetry, enabling the model to dynamically prioritize the more informative modality under noise, blur, or occlusion, while preserving representational symmetry through uniform embedding dimensionality and shared downstream processing.
In practice, $a_t$ and $v_t$ are first projected to a common dimensionality $d$ (if needed) using modality-specific linear layers to ensure that fusion does not favor a modality simply due to scale or dimension mismatch. The gate $g_t$ is computed per time step, enabling locally adaptive fusion; this is important because audio and visual reliability often vary within an utterance (e.g., transient noise in audio or momentary blur/occlusion in video). Since $g_t$ is a scalar, it is broadcast across the feature dimension when applied to vectors, yielding a convex combination that preserves the embedding space $\mathbb{R}^d$ and keeps the downstream network interface unchanged.
Compared to static fusion (averaging) or concatenation, the proposed gating provides a controlled and smooth mechanism to handle modality imbalance while avoiding dimensionality expansion. Concatenation increases feature dimension and forces the subsequent backbone to learn modality weighting implicitly, which may increase parameters and overfitting risk. Fixed averaging assumes equal reliability across modalities and time, which is often violated under realistic conditions. Our gated fusion explicitly models sample- and time-dependent modality contribution, allowing the network to suppress a degraded modality and rely more on the informative modality when necessary.
Although more sophisticated fusion mechanisms exist (e.g., transformer-based cross-attention or uncertainty-aware fusion), we adopt a lightweight gate to preserve training stability, computational efficiency, and reproducibility in verification settings. Cross-attention introduces substantial cross-modal parameters and higher compute, and may be more sensitive to limited paired audio–visual training data or intermittent modality corruption. In contrast, the proposed gate introduces asymmetry only at the fusion stage while maintaining a consistent fused representation for shared downstream temporal modeling. We further support this choice empirically through fusion ablations (e.g., fixed average and concatenation baselines, and an attention-based fusion variant when feasible), showing that the proposed gating achieves a strong accuracy–complexity trade-off under noisy or corrupted conditions.
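The gating computation itself is compact; the sketch below shows an assumed PyTorch implementation of the scalar gate and convex combination described above, assuming both streams have already been projected to a common dimensionality and resampled to a shared frame rate.

```python
# Minimal sketch of the Gated Fusion Module: a scalar gate per timestep
# yields a convex combination of audio and visual features, broadcast
# across the feature dimension (assumed implementation).
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.gate = nn.Linear(2 * dim, 1)                 # w^T [a_t; v_t] + b_g

    def forward(self, a, v):                              # a, v: (batch, T, dim)
        g = torch.sigmoid(self.gate(torch.cat([a, v], dim=-1)))  # (batch, T, 1)
        return g * a + (1.0 - g) * v                      # scalar gate broadcast over dim

a = torch.randn(2, 300, 512)                              # audio stream features a_t
v = torch.randn(2, 300, 512)                              # visual stream features v_t
f = GatedFusion()(a, v)                                   # fused sequence f_t, same shape
print(f.shape)                                            # torch.Size([2, 300, 512])
```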

3.4. Shared Temporal Encoder

The fused multimodal sequence $F = \{f_t\}_{t=1}^{T}$ is processed by a Conformer encoder, which integrates convolutional layers (local pattern modeling) with multi-head self-attention (global temporal dependency) in a symmetric architecture. The Conformer has demonstrated superiority over pure transformers or TDNNs in modeling continuous speech sequences and multimodal speech–vision alignment. By using a shared temporal encoder for both modalities, the system maintains invariance to input order and enforces joint temporal modeling symmetry across audio and visual features.
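One possible way to instantiate such a shared backbone is torchaudio's Conformer module, as sketched below; the depth, number of heads, feed-forward width, and kernel size shown are assumed values for illustration, not the configuration reported here.

```python
# Sketch of the shared temporal backbone using torchaudio's Conformer
# (hyperparameters below are assumptions).
import torch
from torchaudio.models import Conformer

backbone = Conformer(
    input_dim=512,                   # dimensionality of the fused features f_t (assumed)
    num_heads=8,                     # self-attention heads (assumed)
    ffn_dim=1024,                    # feed-forward width (assumed)
    num_layers=6,                    # encoder depth (assumed)
    depthwise_conv_kernel_size=31,   # local convolutional context (assumed)
)

fused = torch.randn(2, 300, 512)                          # fused sequence F = {f_t}
lengths = torch.full((2,), 300, dtype=torch.long)         # valid frames per utterance
encoded, _ = backbone(fused, lengths)                     # -> (2, 300, 512)
```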

3.5. Multi-Head Attention Pooling and Embedding Learning

Following temporal modeling, we apply Multi-Head Attention (MHA) pooling to aggregate the output sequence into a fixed-length embedding. This replaces statistical pooling (e.g., mean + std) used in x-vector [4] and ECAPA-TDNN [5], and allows the network to learn complementary attention maps, improving speaker discriminability by emphasizing temporally relevant frames.
This mechanism computes several independent attention distributions over the sequence of frame-level features, with each head focusing on a different subset of acoustic cues. As a result, one head may attend to phonetically rich regions where spectral patterns are most distinct, another may capture energy dynamics or intensity fluctuations, and another may emphasize prosodic features such as rhythm or intonation that are crucial for distinguishing speakers. By learning multiple complementary perspectives, the MHA layer produces a richer and more balanced representation of the input speech signal.
$z^{(k)} = \sum_{t=1}^{T} \alpha_t^{(k)}\, h_t W_k$
where $h_t \in \mathbb{R}^{C}$ is the frame-level feature vector at time $t$, $\alpha_t^{(k)}$ is the scalar attention weight, $W_k$ is a projection matrix, and $z^{(k)}$ is the pooled vector for head $k$. The final pooled representation, typically 192 dimensions in our framework, serves as the fixed-length speaker embedding.
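A minimal sketch of this pooling operation is shown below, assuming per-head attention scores and per-head value projections followed by a linear layer that produces the 192-dimensional embedding; the number of heads and projection shapes are assumptions.

```python
# Minimal sketch of multi-head attention pooling: each head learns its own
# attention distribution over frames, pooled head outputs are concatenated,
# and a final projection gives the fixed-length speaker embedding.
import torch
import torch.nn as nn

class MHAPooling(nn.Module):
    def __init__(self, in_dim=512, num_heads=4, embed_dim=192):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = in_dim // num_heads
        self.score = nn.Linear(in_dim, num_heads)                     # attention logits, one per head
        self.value = nn.Linear(in_dim, num_heads * self.head_dim)     # per-head projections W_k
        self.out = nn.Linear(num_heads * self.head_dim, embed_dim)    # final 192-d embedding

    def forward(self, h):                                             # h: (batch, T, in_dim)
        b, t, _ = h.shape
        alpha = torch.softmax(self.score(h), dim=1)                   # (batch, T, heads)
        v = self.value(h).view(b, t, self.num_heads, self.head_dim)   # (batch, T, heads, head_dim)
        # z^(k) = sum_t alpha_t^(k) * (h_t W_k) for each head k
        z = torch.einsum("btk,btkd->bkd", alpha, v)                   # (batch, heads, head_dim)
        return self.out(z.flatten(1))                                 # (batch, embed_dim)

emb = MHAPooling()(torch.randn(2, 300, 512))
print(emb.shape)                                                      # torch.Size([2, 192])
```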
The pooled embedding is passed through a projection layer and trained using Additive Angular Margin Softmax (AAM-Softmax) [6], also known as ArcFace, which imposes an angular margin m between speaker classes in embedding space:
$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{e^{\,s\left(\cos(\theta_{y_i} + m)\right)}}{e^{\,s\left(\cos(\theta_{y_i} + m)\right)} + \sum_{j \neq y_i} e^{\,s \cos \theta_j}},$
where $s$ is the scaling factor and $m$ is the angular margin; $\theta_{y_i}$ denotes the target angle for the ground-truth class $y_i$, and $\theta_j$ denotes the angle for a non-target class $j$. This loss has been shown to outperform Triplet Loss and Softmax for open-set speaker and face verification [6,16]. During inference, embeddings are L2-normalized, and speaker similarity is computed using cosine similarity, preserving symmetry in final verification scoring.
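For clarity, a minimal sketch of the AAM-Softmax objective is given below, using the margin m = 0.2 and scale s = 30 reported in Section 4.3; the class count corresponds to the VoxCeleb2 development speakers, and the numerical-stability refinements of production ArcFace implementations are omitted.

```python
# Minimal sketch of AAM-Softmax (ArcFace): normalized embeddings and class
# weights give cos(theta); the angular margin m is added to the target class
# before scaling by s (assumed simplified implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class AAMSoftmax(nn.Module):
    def __init__(self, embed_dim=192, num_classes=5994, s=30.0, m=0.2):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, embed_dim))
        self.s, self.m = s, m

    def forward(self, emb, labels):
        cos = F.linear(F.normalize(emb), F.normalize(self.weight))       # cos(theta_j)
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))               # recover angles
        target = F.one_hot(labels, cos.size(1)).bool()
        logits = torch.where(target, torch.cos(theta + self.m), cos)     # margin on target class only
        return F.cross_entropy(self.s * logits, labels)

loss = AAMSoftmax()(torch.randn(8, 192), torch.randint(0, 5994, (8,)))
```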
After optimizing the embedding space using the angular-margin objective, we evaluate verification performance using the Equal Error Rate (EER), a standard metric in open-set speaker verification. Given a decision threshold $\tau$ applied to the cosine similarity score, the False Acceptance Rate (FAR) and False Rejection Rate (FRR) are defined as
$\mathrm{FAR}(\tau) = \frac{\left|\{(i,k) \in I : \mathrm{score}(i,k) \geq \tau\}\right|}{|I|}, \qquad \mathrm{FRR}(\tau) = \frac{\left|\{(i,k) \in G : \mathrm{score}(i,k) < \tau\}\right|}{|G|},$
$\mathrm{EER} = \mathrm{FAR}(\tau^{*}) = \mathrm{FRR}(\tau^{*}), \qquad \tau^{*} = \arg\min_{\tau}\left|\mathrm{FAR}(\tau) - \mathrm{FRR}(\tau)\right|$
where $\tau$ is the decision threshold (scalar), $G$ and $I$ denote the sets of genuine (same-speaker) and impostor (different-speaker) trials, respectively, $\mathrm{score}(i,k)$ is the similarity score computed for the trial pair $(i,k)$, and $\tau^{*}$ is the operating threshold at which $\mathrm{FAR}(\tau)$ and $\mathrm{FRR}(\tau)$ are equal. The EER is obtained at this operating point, where the two error rates coincide.
A lower EER indicates better verification performance, reflecting a more discriminative and robust embedding space under a balanced trade-off between false accepts and false rejects.
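The EER computation itself reduces to a threshold sweep over the trial scores; a minimal NumPy sketch under these definitions is shown below (the synthetic scores are used purely for illustration).

```python
# Minimal sketch of EER computation from cosine-similarity trial scores:
# sweep the threshold over the observed scores and take the point where
# FAR and FRR cross (assumed implementation).
import numpy as np

def compute_eer(scores, labels):
    """scores: similarity per trial; labels: 1 = genuine, 0 = impostor."""
    scores, labels = np.asarray(scores, float), np.asarray(labels, int)
    thresholds = np.unique(scores)
    far = np.array([(scores[labels == 0] >= t).mean() for t in thresholds])
    frr = np.array([(scores[labels == 1] < t).mean() for t in thresholds])
    idx = np.argmin(np.abs(far - frr))            # operating point tau* where FAR ~= FRR
    return (far[idx] + frr[idx]) / 2.0, thresholds[idx]

scores = np.concatenate([np.random.normal(0.7, 0.1, 500),    # genuine trials
                         np.random.normal(0.3, 0.1, 500)])   # impostor trials
labels = np.concatenate([np.ones(500), np.zeros(500)])
eer, tau = compute_eer(scores, labels)
print(f"EER = {eer:.3%} at threshold {tau:.3f}")
```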

3.6. Theoretical Interpretation of the Symmetry/Asymmetry Principle

In this study, the notions of symmetry and asymmetry are not used as informal labels for popular modules, but as a principled way to describe two complementary design forces in multimodal speaker verification: enforcing representation consistency where speaker identity should remain unchanged (symmetry) and enabling selective specialization when modality reliability or temporal saliency is uneven (asymmetry). We define symmetry as any modeling or scoring behavior that remains stable under admissible changes that should not affect identity, such as different temporal ordering/aggregation within an utterance or swapping the order of the two embeddings during verification. Under this interpretation, symmetry is primarily reflected in choices that encourage a consistent embedding geometry and reduce arbitrary order effects. For example, bidirectional temporal processing integrates evidence from both past and future context without privileging a single direction, and cosine-based verification scoring with normalized embeddings produces a symmetric decision function in the sense that the similarity between two speakers does not depend on which embedding is treated as the “query” versus the “reference.” These symmetric components act as regularizers that make representations more comparable across samples and conditions, thereby improving stability and reproducibility.
In contrast, we define asymmetry as an intentional mechanism that allows the model to break strict balance when the data demands it, particularly in multimodal settings where one modality may be degraded, missing, or less informative at certain times. In our architecture, this is realized through adaptive fusion and attention-driven aggregation. Gated fusion introduces a data-dependent preference that can emphasize the more reliable modality while suppressing a corrupted one, and multi-head attention assigns non-uniform importance across temporal frames (and potentially across modalities), capturing inherently uneven saliency patterns. The key contribution is therefore not the isolated use of gating or attention, but the explicit symmetry–asymmetry coupling as a general design paradigm: maintain symmetric invariances to stabilize identity-related representation geometry and comparability, while introducing controlled asymmetry to adaptively weight modality contributions and temporal evidence under real-world variability. This formulation clarifies the novelty of the proposed “symmetry/asymmetry principle” as a reusable guideline for multimodal verification systems rather than a post-hoc description of standard components.

4. Experimental Setup

4.1. Datasets

We conduct experiments using the large-scale VoxCeleb2 audio-visual speaker verification dataset [11], which contains 6112 speakers and 2442 h of audio-visual speech collected from YouTube. The "dev" split is used for model training, and the "test" split for additional evaluation. Because VoxCeleb2 does not include visual mouth crops, we extract lip ROI frames from all videos to enable multimodal evaluation. Noise robustness is tested by augmenting audio with samples from the MUSAN [26] noise corpus (Babble, Speech, Music, Misc.). Table 1 shows the noise-robustness results under Babble/Speech/Music at fixed SNR levels.

4.2. Feature Extraction

The acoustic and visual features are extracted independently for the audio and video streams to maintain a dual-modality input pipeline. Each speech segment is converted into 80-dimensional FBANKs using a 25 ms window and 10 ms hop, including first- and second-order delta coefficients, resulting in a 240-dimensional feature vector per frame. FBANKs are widely used in state-of-the-art speaker recognition systems due to their robustness to channel variations and compatibility with convolutional or temporal encoders [4,5]. Video frames are sampled at 25 frames per second (FPS). For each frame, a face detector and a 68-point landmark model are used to extract a lip ROI centered on the speaker's mouth, similar to the pipeline used in AV-HuBERT [20] and related audio-visual speech models. ROIs are resized to 112 × 112 grayscale images and normalized before being passed into the ResNet-based visual encoder.
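A sketch of the acoustic front-end is shown below, assuming torchaudio's Kaldi-compatible filterbank and delta computation; the file path is a placeholder, and the exact frontend used in this work may differ in implementation details.

```python
# Sketch of the 80-dim FBANK + delta + delta-delta pipeline
# (25 ms window / 10 ms hop), assuming torchaudio's Kaldi-compatible frontend.
import torch
import torchaudio
import torchaudio.functional as AF

waveform, sr = torchaudio.load("utterance.wav")           # placeholder path
fbank = torchaudio.compliance.kaldi.fbank(
    waveform, num_mel_bins=80, frame_length=25.0,
    frame_shift=10.0, sample_frequency=sr)                # (T, 80) log-Mel energies
spec = fbank.t()                                          # (80, T): time last for delta computation
d1 = AF.compute_deltas(spec)                              # first-order deltas
d2 = AF.compute_deltas(d1)                                # second-order deltas
feats = torch.cat([spec, d1, d2], dim=0).t()              # (T, 240) per-frame feature vectors
```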

4.3. Training Protocol

The multimodal architecture is trained end-to-end [27] in two phases. Each encoder (TDNN-BiLSTM for audio and ResNet-18 for lip frames) is first pre-trained independently using labeled speaker classification on the VoxCeleb2 dev set. This allows each stream to learn modality-specific representations before fusion. The full architecture (including Gated Fusion, Conformer encoder, attention pooling, and ArcFace classifier) is trained using the VoxCeleb2 development split. AAM-Softmax loss with an angular margin m = 0.2 and scale parameter s = 30 is used for speaker classification optimization [6].

4.4. Evaluation Metrics

The performance of the proposed audio–visual speaker verification system is evaluated using standard benchmarks employed in VoxCeleb and NIST SRE, with a focus on Equal Error Rate (EER), minimum Detection Cost Function (minDCF), and cosine similarity scoring. EER represents the operating point at which the false acceptance rate equals the false rejection rate, serving as a threshold-independent measure of how well the system separates genuine and impostor trials. A lower EER indicates stronger discriminative power of the learned embeddings. Complementing this, minDCF provides an application-driven evaluation by incorporating predefined costs for false accepts and false rejects.
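For reference, the detection cost underlying minDCF is commonly computed as shown below, where the miss/false-alarm costs and the target prior are fixed by the evaluation protocol; the specific parameter values are protocol-dependent and are not restated here.

```latex
% Standard detection cost used for minDCF, with the usual normalization by
% the best cost achievable without using the system (NIST convention).
\[
C_{\mathrm{det}}(\tau) = C_{\mathrm{miss}}\, P_{\mathrm{miss}}(\tau)\, P_{\mathrm{target}}
 + C_{\mathrm{fa}}\, P_{\mathrm{fa}}(\tau)\, \bigl(1 - P_{\mathrm{target}}\bigr),
\qquad
\mathrm{minDCF} = \min_{\tau} \frac{C_{\mathrm{det}}(\tau)}
 {\min\!\bigl(C_{\mathrm{miss}} P_{\mathrm{target}},\; C_{\mathrm{fa}} (1 - P_{\mathrm{target}})\bigr)}.
\]
```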

4.5. Computational Cost

To assess practical efficiency, we report evaluation-time runtime on the GPU, including embedding extraction throughput and trial scoring throughput. All experiments were conducted on an NVIDIA RTX A6000 with CUDA 12.2. The proposed model contains approximately 120M trainable parameters. During evaluation, the system extracts enrollment and test embeddings for 589 audio–visual segments (utterances) in approximately 41 s and 38 s, respectively, corresponding to about 14–16 segments/s (≈65–70 ms/segment in this run). Verification scoring and EER computation over 50,000 trials takes 52 min 19 s, corresponding to 15.92 trials/s (≈62.8 ms/trial). These results provide a concrete estimate of end-to-end evaluation cost and support the practical feasibility of the proposed AV-SV system.

5. Results

Table 2 presents a comparative analysis of the proposed audio-visual speaker verification (AV-SV) system against several baseline and state-of-the-art multimodal approaches. The simple A + V fusion method, which combines audio and visual embeddings via static concatenation, achieves an EER of 6.135%, demonstrating that naïve fusion offers only limited benefit under cross-modal variability. Cross-modal systems such as VFNet and DIMNet, which attempt voice–face matching, perform poorly for verification, with EERs of 22.52% and 24.56%, confirming that identity-mapping across modalities is significantly more difficult than using parallel modalities like speech and lip movements. Self-supervised approaches such as AV-HuBERT obtain a stronger score (3.7% EER with lip video and 2.8% with face video), but rely on massive multimodal pretraining on hundreds of hours of data, making them computationally expensive to reproduce.
In contrast, our proposed TDNN-BiLSTM, ResNet18, Conformer architecture with Gated Fusion and MHA pooling achieves an EER of 3.419% and minDCF of 0.342 on VoxCeleb2—the best performance among lip-based AV-SV models.

5.1. Ablation Study

In Table 3, we intentionally begin with a fully symmetric reference design (S0) to provide a clean, reproducible baseline before introducing any adaptive mechanisms. The ablation study results are visualized in Figure 3. In S0, fusion is performed by equal-weight averaging of the audio and visual streams, which is symmetric because swapping the two modalities does not change the fused representation, and the aggregation over time uses statistics pooling (mean/std), which is also symmetric with respect to temporal order because it treats all frames equally. Together with L2-normalized embeddings and cosine scoring, S0 represents the most “stable and fairness-driven” configuration: both modalities contribute identically, and no frame or modality is preferred. However, this symmetry also becomes a limitation in real conditions because audio–visual quality is rarely balanced; for example, background noise can corrupt audio while poor illumination, blur, or imperfect lip cropping can degrade the visual stream. Since S0 cannot adapt its weighting, it averages reliable and unreliable evidence uniformly, which explains its higher EER (5.765%).
After establishing this symmetric point, we introduce asymmetry in a controlled order to show why adaptive fusion is necessary. In S1, we add temporal asymmetry through BiLSTM and incorporate residual fusion. BiLSTM breaks the “treat all frames equally” assumption by modeling directional temporal dependencies, which is important because both speech acoustics and lip motion contain structured dynamics (coarticulation, syllable timing, speaking rate) that cannot be captured by purely order-invariant pooling. We include residual fusion as a lightweight asymmetric baseline that still enables cross-modal interaction through a learned correction term. This provides a fair, efficient comparison to gated fusion while keeping the overall architecture stable and interpretable. Residual fusion further introduces asymmetry by treating fusion as a base representation, allowing cross-modal information to refine the fused embedding rather than enforcing identical contributions from both modalities at every point. This reduces EER to 4.860%, indicating that simply allowing temporal structure and cross-modal refinement already improves discriminability.
In S2, we move from residual refinement to a more explicit and interpretable asymmetry mechanism: gated fusion. Instead of applying an implicit correction, gated fusion predicts a sample-dependent gate that directly controls how much the fused embedding relies on audio versus visual information. This is especially suitable for robust AV-SV because it aligns with the practical failure modes of the modalities: when the audio stream is noisy or reverberant, the gate can shift weight toward visual speech cues; when the lip ROI is blurred or partially occluded, it can shift weight back toward audio. In other words, gating implements the “asymmetry principle” as controlled imbalance based on reliability, which leads to a stronger improvement (EER 3.652%) compared to residual fusion.
Finally, in the full model (S3), we extend this asymmetry from the modality level to the temporal level by adding MHA pooling on top of gated fusion. While gated fusion decides “which modality to trust” for a given trial, MHA pooling decides “which frames to trust” within that trial. This is important because modality reliability can vary across time, even within the same utterances such as silence, mouth closure, non-speech frames, and transient noise segments, which should not contribute equally to the final speaker embedding. MHA pooling assigns higher weights to informative segments and suppresses uninformative or corrupted regions, producing a cleaner, more discriminative representation. The combination of modality-level adaptivity (gating) and frame-level adaptivity (MHA) yields the best result (EER 3.419%), demonstrating that the full model benefits from two complementary asymmetry mechanisms: dynamic modality weighting to handle cross-modal quality imbalance and attention-based temporal weighting to handle within-utterance variability.

5.2. Discussion

This work positions symmetry and asymmetry as complementary design forces for audio–visual speaker verification, and the results highlight how these forces relate to both classical fusion strategies and recent self-supervised audio–visual models. Prior multimodal verification systems often rely on dynamic fusion or modality weighting, including cross-attention and transformer-based fusion blocks, as well as quality-aware weighting that adapts to noise or visual degradation. Many of these mechanisms can be interpreted as introducing controlled asymmetry, because they allow the model to treat modalities unequally depending on reliability. In contrast, methods such as AV-HuBERT are primarily designed for representation learning via large-scale self-supervised pretraining, where symmetry is expressed through shared objectives that encourage cross-modal alignment and consistency, and asymmetry appears mainly through architectural choices such as masked prediction and modality-conditioned reconstruction. While AV-HuBERT achieves strong performance, it typically benefits from extensive pretraining and high computational budgets, whereas the proposed approach focuses on a practical supervised AV-SV pipeline that explicitly couples symmetric embedding comparability (through normalized embeddings and symmetric scoring) with lightweight, controlled asymmetry (through adaptive fusion and attention-based aggregation). This discussion clarifies that our contribution is not the mere inclusion of common modules, but the systematic framing of how invariance-preserving components stabilize identity representation while selective asymmetry improves robustness under uneven modality quality.
A second important consideration is the decision to restrict the visual stream to the lip ROI. Lip motion contains strong complementary identity cues that remain informative even when the acoustic channel is corrupted, and it is closely linked to speech production, making it particularly relevant for noise-robust verification. In addition, lip ROI processing reduces sensitivity to nuisance factors such as hairstyle, accessories, and background, and it offers practical benefits for privacy and computational cost compared to full-face pipelines. However, limiting the visual input to the mouth region may omit useful biometric information available in the full face, such as facial geometry, skin texture, and static appearance, which can improve discrimination in clean conditions and may reduce error rates for systems that exploit face embeddings. Therefore, compared to state-of-the-art systems that use full-face cues, a lip-only design may trade some identity-specific richness for robustness, privacy, and tighter coupling to speech-related motion. This trade-off motivates our stated future direction of extending the visual pathway to full-face embeddings or multi-region visual features, which would allow a more complete comparison between “speech-driven” visual cues (lips) and “biometric-rich” cues (full face) within the same symmetry/asymmetry framework.
A strong audio-only baseline such as ECAPA-TDNNv2 or CAM++ is indeed important for assessing the absolute gain obtained by adding the visual modality; however, this comparison is primarily a matter of baseline fairness rather than a direct test of the proposed symmetry/asymmetry principle. Our symmetry/asymmetry contribution is formulated around multimodal coupling, which maintains identity-consistent, comparable embeddings while allowing controlled imbalance through adaptive modality weighting when reliability differs, which is not exercised in purely audio-only architectures that contain no cross-modal fusion. In addition, reported results for ECAPA/CAM++ in the literature are often not directly comparable to our setting because they depend strongly on differences in training scale (e.g., additional data), augmentation recipes, segmentation/chunking, loss hyperparameters, and scoring backends. Therefore, to ensure a fair and interpretable measurement, we will include ECAPA-TDNN/CAM++ audio-only baselines trained and evaluated under the same protocol as our system in the future, and we will report the incremental improvement of our audio–visual model over these matched audio-only references.
A limitation of this study is that all experiments are conducted on VoxCeleb2, which is the most widely used benchmark for audio–visual speaker verification and enables direct comparison with prior work, but does not fully cover the diversity of real-world acquisition conditions. In addition, our robustness analysis focuses on MUSAN-based additive noise, which captures an important but restricted subset of acoustic degradations and does not model other practical challenges such as device and microphone mismatch, speech codec artifacts, reverberation changes, or far-field recording effects. On the visual side, using lip ROIs on VoxCeleb2 does not systematically stress the model under severe motion blur, partial occlusion (e.g., masks), extreme pose, or low-light scenarios. Therefore, while the reported gains demonstrate the benefit of controlled asymmetry for noise-robust AV-SV under the evaluated setting, broader generalizability should be interpreted with caution. As future work, we plan to extend evaluation to additional datasets and more realistic stress tests, including cross-device and codec-distorted audio conditions and controlled visual corruptions such as blur, occlusion, and illumination shifts, to more comprehensively validate robustness and cross-domain transfer.

6. Conclusions

This paper introduced a symmetry-aware and robustness-enhanced audio-visual speaker verification (AV-SV) framework that integrates acoustic and visual information through dynamic gated fusion, shared temporal modeling, and angular-margin optimization. By combining a TDNN-BiLSTM speech encoder, a ResNet-based lip encoder, a learnable Gated Fusion Module, and a shared Conformer backbone, the proposed architecture successfully balances structural symmetry in representation learning with controlled asymmetry in modality fusion and optimization. Extensive experiments on VoxCeleb2 demonstrated that the proposed model consistently outperforms strong audio-only baselines such as ECAPA-TDNN, as well as several multimodal fusion systems, especially under noisy or corrupted audio conditions. The proposed method achieved an EER of 3.419% and remained effective in both clean and adverse environments, demonstrating the value of visual speech cues for improved speaker verification robustness. Ablation studies further confirmed the importance of gated fusion and attention-based pooling in improving discriminative speaker representation.
Overall, the results validate that integrating complementary modalities via learnable and temporally aligned mechanisms leads to more reliable and noise-resilient speaker verification systems. This work also highlights the significance of combining symmetric model components (e.g., shared encoders, cosine scoring) with asymmetric learning strategies (e.g., gated fusion, margin-based loss), offering a principled approach to multimodal verification.
Several directions can further extend this symmetry-aware AV-SV framework. First, we plan to explore self-supervised and semi-supervised pretraining for both audio and visual branches (e.g., AV-style masked prediction or contrastive learning) to reduce reliance on labeled VoxCeleb2 data and improve cross-domain generalization, and we will evaluate the model on additional datasets. We also intend to investigate quality-aware and uncertainty-aware fusion, where the gated fusion weights are conditioned on estimated modality quality (e.g., SNR, blur, occlusion) and the system can down-weight unreliable channels at run time.

Author Contributions

Conceptualization, S.T. and D.-H.K.; methodology, S.T. and D.-H.K.; software, S.T.; validation, S.T. and D.-H.K.; formal analysis, S.T.; investigation, S.T.; resources, D.-H.K.; data curation, S.T.; writing—original draft preparation, S.T.; writing, review and editing, S.T. and D.-H.K.; visualization, S.T.; supervision, D.-H.K.; project administration, D.-H.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research work was supported in part by the National Research Foundation of Korea (NRF) funded by the Korean government (MSIT) under Grant (No. RS-2024-00336286) and in part by the Inha University Research Grant.

Data Availability Statement

The audio–visual speaker verification experiments were conducted on the publicly available VoxCeleb2 dataset, which can be accessed from the official project website (https://www.robots.ox.ac.uk/~vgg/data/voxceleb/ (accessed on 15 November 2025)).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Cohen, T.S.; Welling, M. Group Equivariant Convolutional Networks. In Proceedings of the 33rd International Conference on Machine Learning (ICML), New York, NY, USA, 19–24 June 2016; pp. 2990–2999. [Google Scholar]
  2. Bronstein, M.M.; Bruna, J.; LeCun, Y.; Szlam, A.; Vandergheynst, P. Geometric Deep Learning: Going beyond Euclidean Data. IEEE Signal Process. Mag. 2017, 34, 18–42. [Google Scholar] [CrossRef]
  3. Lin, Q.; Yang, L.; Wang, X.; Qin, X.; Wang, J.; Li, M. Towards Lightweight Applications: Asymmetric Enroll-Verify Structure for Speaker Verification. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; pp. 7067–7071. [Google Scholar] [CrossRef]
  4. Snyder, D.; Garcia-Romero, D.; Sell, G.; Povey, D.; Khudanpur, S. X-Vectors: Robust DNN Embeddings for Speaker Recognition. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 5329–5333. [Google Scholar]
  5. Desplanques, B.; Thienpondt, J.; Demuynck, K. ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification. In Proceedings of the Interspeech 2020, ISCA, Shanghai, China, 25–29 October 2020; pp. 3830–3834. [Google Scholar]
  6. Deng, J.; Guo, J.; Xue, N.; Zafeiriou, S. ArcFace: Additive Angular Margin Loss for Deep Face Recognition. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 4685–4694. [Google Scholar]
  7. Rajasekhar, G.P.; Alam, J. Audio-Visual Speaker Verification via Joint Cross-Attention. In Speech and Computer; Karpov, A., Samudravijaya, K., Deepak, K.T., Hegde, R.M., Agrawal, S.S., Prasanna, S.R.M., Eds.; Lecture Notes in Computer Science; Springer Nature: Cham, Switzerland, 2023; Volume 14339, pp. 18–31. ISBN 978-3-031-48311-0. [Google Scholar]
  8. Chetty, G.; Wagner, M. Audiovisual Speaker Identity Verification Based on Lip Motion Features. In Proceedings of the Interspeech 2007, ISCA, Antwerp, Belgium, 27–31 August 2007; pp. 2045–2048. [Google Scholar]
  9. Liu, M.; Lee, K.A.; Wang, L.; Zhang, H.; Zeng, C.; Dang, J. Cross-Modal Audio-Visual Co-Learning for Text-Independent Speaker Verification. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar]
  10. Bando, Y.; Aizawa, T.; Itoyama, K.; Nakadai, K. Weakly-Supervised Neural Full-Rank Spatial Covariance Analysis for a Front-End System of Distant Speech Recognition. In Proceedings of the Interspeech 2022, ISCA, Incheon, Republic of Korea, 18–22 September 2022; pp. 3824–3828. [Google Scholar]
  11. Chung, J.S.; Nagrani, A.; Zisserman, A. VoxCeleb2: Deep Speaker Recognition. In Proceedings of the INTERSPEECH 2018, Hyderabad, India, 2–6 September 2018; pp. 1086–1090. [Google Scholar] [CrossRef]
  12. Tsai, Y.-H.H.; Bai, S.; Liang, P.P.; Kolter, J.Z.; Morency, L.-P.; Salakhutdinov, R. Multimodal Transformer for Unaligned Multimodal Language Sequences. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), Florence, Italy, 28 July–2 August 2019; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 6558–6569. [Google Scholar]
  13. Shi, B.; Hsu, W.-N.; Lakhotia, K.; Mohamed, A. Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction. In Proceedings of the International Conference on Learning Representations (ICLR), Online, 25–29 April 2022. [Google Scholar]
  14. Ephrat, A.; Halperin, T.; Peleg, S. Improved Speech Reconstruction from Silent Video. In Proceedings of the 2017 IEEE International Conference on Computer Vision Workshops (ICCVW), Venice, Italy, 22–29 October 2017; pp. 455–462. [Google Scholar]
  15. Arevalo, J.; Solorio, T.; Montes-y-Gómez, M.; González, F.A. Gated Multimodal Units for Information Fusion. In Proceedings of the International Conference on Learning Representations (ICLR) Workshop, Toulon, France, 24–26 April 2017. [Google Scholar]
  16. Liu, W.; Wen, Y.; Yu, Z.; Li, M.; Raj, B.; Song, L. SphereFace: Deep Hypersphere Embedding for Face Recognition. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6738–6746. [Google Scholar]
  17. Yang, Z.; Wang, X.; Xia, D.; Wang, W.; Dai, H. Sequence-Based Device-Free Gesture Recognition Framework for Multi-Channel Acoustic Signals. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar]
  18. Jeon, S.; Lee, J.; Lee, Y.-J. Dual-Stream Former: A Dual-Branch Transformer Architecture for Visual Speech Recognition. AI 2025, 6, 222. [Google Scholar] [CrossRef]
  19. Tao, R.; Das, R.K.; Li, H. Audio-Visual Speaker Recognition with a Cross-Modal Discriminative Network. In Proceedings of the Interspeech 2020, ISCA, Shanghai, China, 25–29 October 2020; pp. 2242–2246. [Google Scholar]
  20. Shi, B.; Mohamed, A.; Hsu, W.-N. Learning Lip-Based Audio-Visual Speaker Embeddings with AV-HuBERT. In Proceedings of the Interspeech 2022, ISCA, Incheon, Republic of Korea, 18–22 September 2022; pp. 4785–4789. [Google Scholar]
  21. Graves, A.; Schmidhuber, J. Framewise Phoneme Classification with Bidirectional LSTM and Other Neural Network Architectures. Neural Netw. 2005, 18, 602–610. [Google Scholar] [CrossRef] [PubMed]
  22. Rajasekhar, G.P.; Alam, J. SSAVSV: Towards Unified Model for Self-Supervised Audio–Visual Speaker Verification. arXiv 2025, arXiv:2506.17694. [Google Scholar] [CrossRef]
  23. Wu, Z.; Shen, C.; van den Hengel, A. Wider or Deeper: Revisiting the ResNet Model for Visual Recognition. Pattern Recognit. 2019, 90, 119–133. [Google Scholar] [CrossRef]
  24. Sarı, L.; Singh, K.; Zhou, J.; Torresani, L.; Singhal, N.; Saraf, Y. A Multi-View Approach to Audio–Visual Speaker Verification. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing, Toronto, ON, Canada, 6–11 June 2021; pp. 6194–6198. [Google Scholar]
  25. Ren, W.; Ma, L.; Zhang, J.; Pan, J.; Cao, X.; Liu, W.; Yang, M.-H. Gated Fusion Network for Single Image Dehazing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 3253–3261. [Google Scholar]
  26. Snyder, D.; Chen, G.; Povey, D. MUSAN: A Music, Speech, and Noise Corpus. In Proceedings of the 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), Scottsdale, AZ, USA, 13–17 December 2015; pp. 348–352. [Google Scholar]
  27. Hu, C.; Hao, Y.; Zhang, F.; Luo, X.; Shen, Y.; Gao, Y.; Deng, C.; Zhang, S.; Feng, J. Privacy-Preserving Speaker Verification via End-to-End Secure Representation Learning. In Proceedings of the INTERSPEECH 2025, Rotterdam, The Netherlands, 17–21 August 2025; pp. 1508–1512. [Google Scholar]
Figure 1. Proposed architecture for audio-visual speaker verification.
Figure 2. Visualization of the FBANK feature map (log-Mel filterbank energies) computed from the input speech signal.
Figure 3. Visual representation of ablation study performance for symmetric and asymmetric components in the proposed audio–visual speaker verification model.
Table 1. Performance under Babble/Speech/Music noise at fixed SNR levels (VoxCeleb2 test, audio-only noise injection).

Noise | SNR (dB) | EER (%) | minDCF
Babble | 5 | 4.01 | 0.44
Babble | 0 | 5.17 | 0.56
Speech | 5 | 4.78 | 0.43
Speech | 0 | 4.98 | 0.54
Music | 5 | 3.91 | 0.42
Music | 0 | 4.85 | 0.52
Table 2. Performance comparison (lower EER indicates better performance).

System | Modality | Dataset | EER (%) | minDCF
SSAVSV [22] | Audio + Visual | VoxCeleb1 | 6.135 | 0.472
VFNet [19] | Audio + Visual | VoxCeleb2 | 22.52 | -
AV-HuBERT (audio + lip) [20] | Audio + Lip | VoxCeleb2 | 3.7 | -
Ours (proposed model) | Audio + Lip | VoxCeleb2 | 3.419 | 0.342
Table 3. Ablation study performance comparison.

Variant | Symmetric Layers Kept | Asymmetric Layers Kept | EER (%)
(S0) Average fusion and statistics pooling (maximally symmetric baseline) | TDNN, ResNet, shared Conformer, L2-norm, statistics pooling, average fusion, cosine scoring | - | 5.765
(S1) Residual fusion + statistics pooling | TDNN, ResNet, shared Conformer, L2-norm, statistics pooling, cosine scoring | BiLSTM, residual fusion | 4.860
(S2) Gated fusion and statistics pooling | TDNN, ResNet, shared Conformer, L2-norm, statistics pooling, cosine scoring | BiLSTM, gated fusion | 3.652
(S3) Full model (ours) | TDNN, ResNet, shared Conformer, L2-norm, cosine scoring | BiLSTM, gated fusion, MHA pooling | 3.419
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
