Search Results (71)

Search Parameters:
Keywords = wav2vec 2.0

28 pages, 1299 KB  
Review
Multimodal Deep Learning Approaches for Lung Disease Detection: A Review
by Bastian Estay Zamorano, Ali Dehghan Firoozabadi, Pablo Adasme, Wanda Montiel Piña, Mauricio Chávez Muñoz, David Zabala-Blanco, Pablo Palacios Játiva and Cesar A. Azurdia-Meza
Medicina 2026, 62(7), 1223; https://doi.org/10.3390/medicina62071223 (registering DOI) - 24 Jun 2026
Abstract
Lung diseases are among the leading global causes of morbidity and mortality, and existing reviews on deep learning (DL) for pulmonary diagnosis rarely integrate imaging, acoustic, and electronic health record (EHR) modalities within a single framework. We aimed to synthesize the state of the art (2019–2024) in multimodal DL for lung disease detection and classification, identifying dominant architectures, performance benchmarks, and translational barriers across chest X-rays, CT scans, respiratory sounds, and EHRs. A structured narrative review was conducted using PubMed, Scopus, IEEE Xplore, and Web of Science, applying explicit inclusion criteria for peer-reviewed studies; performance metrics, dataset characteristics, and reported limitations were extracted. Studies using convolutional neural networks (CNNs) and more recent models such as Transformers have reported high performance in chest X-ray classification, whereas acoustic approaches based on spectrograms and self-supervised representations (e.g., Wav2Vec 2.0) show promising but dataset-dependent results.

17 pages, 4283 KB  
Article
A Hybrid Semantic-Acoustic Transformer for Vocal Burst Emotion Recognition Using Wav2Vec 2.0 and Whisper ASR
by Suryakant Tyagi and Sándor Szénási
Algorithms 2026, 19(5), 416; https://doi.org/10.3390/a19050416 - 21 May 2026
Viewed by 290
Abstract
Finding emotions in human speech is a difficult task. It is even harder for sounds without words, like laughs, gasps, and sighs. Normal audio models fail at this task because these sounds are very short and the audio patterns are complex. To fix this problem, we created a new model called the Hybrid Semantic-Acoustic Transformer. Our system uses a Wav2Vec 2.0 model to get acoustic features. At the same time, it uses a Whisper ASR model to get phonetic features. We mix these two types of data together using a Cross-Attention layer. We tested our model on the EmoGator dataset. This dataset has 32,130 audio files across 30 different emotion classes. We split the data strictly into 80% for training, 10% for validation, and 10% for testing. Our new model achieved an overall accuracy of 74.8%. We also did an ablation study. This study proves that using cross-attention is much better than simply adding the features together. Our final result is a 6.4% increase in the F1-score compared to the original EmoGator baseline model. This sets a new high score for classifying non-speech sounds in different noisy environments. Our model also reached over 90% precision when telling the difference between a ‘Sigh’ and a ‘Gasp’. Standard speech models usually fail at this specific task. Full article
(This article belongs to the Special Issue Bio-Inspired Algorithms: 2nd Edition)

22 pages, 421 KB  
Article
Frame-Level Audio Forgery Localization Using Handcrafted and Neural Features
by Mostafa Moallim, Taqwa A. Alhaj, Fatin A. Elhaj, Inshirah Idris and Tasneem Darwish
Signals 2026, 7(3), 42; https://doi.org/10.3390/signals7030042 - 7 May 2026
Viewed by 722
Abstract
Audio forgery has emerged as a significant security and forensic challenge, driven by rapid advances in generative artificial intelligence and the widespread availability of audio editing tools, which enable the creation of highly realistic manipulated speech with minimal technical expertise. Existing approaches predominantly operate at the file level, providing only coarse binary decisions without identifying when or where manipulation occurs. This study addresses fine-grained temporal localization through a unified frame-level localization framework. We introduce a controlled forgery generation framework derived from the TIMIT speech corpus, applying atomic, localized manipulations under strict temporal constraints and producing precise frame-level annotations across diverse manipulation types. Building on this dataset, we then propose a transform-agnostic localization-driven detection approach using temporal inconsistency modeling, enabling unified analysis across heterogeneous manipulations at frame-level resolution. To analyze forensic evidence, we present an evidence-stratified modeling paradigm comparing three complementary strategies: a handcrafted anomaly-based method, a deep localization model leveraging pretrained wav2vec 2.0 representations, and a hybrid approach combining both through confidence-aware fusion and temporal consistency reinforcement. A systematic experimental analysis evaluates the effects of representation adaptation, hybrid fusion, and manipulation type on detection and localization performance. Results show that handcrafted features are insufficient for reliable frame-level localization, while task-adapted wav2vec 2.0 achieves strong and consistent performance. The hybrid approach does not consistently improve frame-level accuracy but yields substantial gains in segment-level localization by enforcing temporal coherence. Per-transform analysis confirms robust performance across most manipulations, with deletion-based operations remaining the most challenging.

18 pages, 527 KB  
Article
An Empirical Comparison of Cascade and Direct End-to-End Speech Translation for Low-Resource Language Pair
by Zhanibek Kozhirbayev
Computers 2026, 15(4), 222; https://doi.org/10.3390/computers15040222 - 2 Apr 2026
Viewed by 1449
Abstract
Speech-to-text translation (S2TT) for low-resource languages remains challenging due to the scarcity of parallel speech translation data and the susceptibility of modular pipelines to error propagation. This paper presents a controlled empirical comparison of cascade and end-to-end approaches for Kazakh–Russian speech translation using the ST-kk-ru dataset (≈332 h, 140 k triplets). The cascade framework is strengthened with recent pre-trained models for automatic speech recognition and neural machine translation, achieving 21.3 BLEU on the test set. Three representative end-to-end architectures are evaluated under identical data conditions. The strongest direct model, combining a Wav2Vec 2.0 encoder with an mBART decoder augmented by a length adaptor and adapter modules, reaches 17.97 BLEU, compared with 15.35 BLEU for FAIRSEQ S2T and 16.3 BLEU for ESPnet-ST. Automatic evaluation is complemented by expert manual assessment and targeted linguistic analysis. Results indicate that, under current low-resource conditions, cascade systems provide higher translation accuracy and better morpho-syntactic fidelity, while end-to-end models remain competitive and offer advantages in architectural simplicity and potentially reduced inference latency (due to single-pass processing), although empirical measurements were not conducted in this study. This study establishes a reproducible benchmark for Kazakh–Russian speech translation and highlights practical trade-offs between modeling paradigms in low-resource, morphologically rich settings. Full article

16 pages, 950 KB  
Article
A CTC-Based Speech Recognition Network Fusing Local Convolution and Global Attention
by Huijuan Hu, Chenyang Tang, Ping Tan and He Xu
Sensors 2026, 26(6), 1865; https://doi.org/10.3390/s26061865 - 16 Mar 2026
Viewed by 665
Abstract
Integrating wav2vec 2.0 with Connectionist Temporal Classification (CTC) for automatic speech recognition (ASR) often involves a trade-off between capturing global semantic consistency and maintaining local feature discriminability. This study proposes DBA-wav2vec 2.0, an architecture designed to manage these modeling requirements by decoupling temporal modeling into parallel local and global streams at the encoder–decoder interface. Depthwise separable convolutions are utilized to capture local acoustic structures, while a self-attention path is retained for long-range dependencies. A task-aware gating mechanism is introduced to integrate these heterogeneous features. By adjusting fusion weights based on acoustic input characteristics, the gate facilitates the refinement of posterior probability distributions, leading to more distinct alignment points. Experimental results on AISHELL-1 and ST-CMDS datasets show relative Character Error Rate (CER) reductions of 6.4% and 7.4%, respectively, compared to a baseline wav2vec 2.0 model. Further evaluations under varying speaking rates demonstrate a 15.3% relative improvement in fast-speech scenarios, suggesting that structural adaptation at the decoding interface can enhance the robustness of CTC-based systems against temporal variations. Full article
(This article belongs to the Section Intelligent Sensors)
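The gating idea in the abstract above (an input-dependent weight that blends a local convolutional stream with a global attention stream) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `gated_fusion`, the single scalar gate, and the linear gate parameters `gate_weights` and `gate_bias` are all assumptions made for illustration, and the DBA-wav2vec 2.0 gate may be structured quite differently.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(local_feat, global_feat, gate_weights, gate_bias=0.0):
    """Blend two equal-length feature vectors with one input-dependent gate.

    Hypothetical sketch: the gate is a sigmoid over a linear function of the
    concatenated features, so the mixing weight changes with the input.
    """
    if len(local_feat) != len(global_feat):
        raise ValueError("local and global features must have the same length")
    concat = list(local_feat) + list(global_feat)
    if len(gate_weights) != len(concat):
        raise ValueError("gate_weights must match the concatenated length")
    # Scalar gate in (0, 1): near 1 favours the local stream, near 0 the global one.
    gate = sigmoid(sum(w * x for w, x in zip(gate_weights, concat)) + gate_bias)
    fused = [gate * l + (1.0 - gate) * g for l, g in zip(local_feat, global_feat)]
    return fused, gate
```

With all-zero gate weights the gate is exactly 0.5, so the fused vector is the plain average of the two streams; a trained gate would move away from that value depending on the input.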

17 pages, 306 KB  
Article
Multimodal AI Screening of Developmental Language Disorder in Tunisian Arabic Children: Clinical Markers and Computational Detection
by Faten Bouhajeb, Redha Touati and Selçuk Güven
Behav. Sci. 2026, 16(3), 375; https://doi.org/10.3390/bs16030375 - 6 Mar 2026
Viewed by 742
Abstract
Developmental Language Disorder (DLD) is a common neurodevelopmental condition that affects language acquisition in children. However, standardized diagnostic tools for Tunisian Arabic, a widely spoken yet underrepresented dialect, are still lacking. This study presents a multimodal biomedical informatics framework that integrates clinical assessments, speech recordings, and artificial intelligence (AI) for early DLD detection. Three linguistic tasks (the CLT Task, the Arabic Verb Evaluation Task, and the Nonword Repetition Task) were adapted for Tunisian Arabic, and spontaneous speech samples were collected from children with typical development and those with DLD. Statistical analyses revealed significant deficits in verb production, past-tense morphology, and phonological memory in the DLD group. For automated screening, we developed two systems: a Random Forest classifier based on structured clinical and linguistic features and a multimodal deep learning model using Wav2Vec2 acoustic embeddings. The best model achieved an F1 score of 0.85, demonstrating the feasibility of AI-assisted DLD screening. This work introduces the first standardized dataset and computational baseline for DLD in Tunisian Arabic, providing clinically relevant tools for early identification and supporting research on underrepresented Arabic dialects. This work also highlights future implications, including potential applications in early screening, the integration of acoustic markers, and the development of culturally adapted assessment tools for underrepresented languages.
24 pages, 11178 KB  
Article
FLAMA: Frame-Level Alignment Margin Attack for Scene Text and Automatic Speech Recognition
by Yikun Xu, Zhiheng Xu and Pengwen Dai
Electronics 2026, 15(5), 1064; https://doi.org/10.3390/electronics15051064 - 4 Mar 2026
Cited by 1 | Viewed by 545
Abstract
Scene text recognition (STR) and automatic speech recognition (ASR) translate visual or acoustic signals into linguistic sequences and underpin many modern perception systems. Although their front-ends and decoders differ (e.g., CTC-based, attention-based, or variants), both tasks ultimately rely on aligning input frames to output tokens using deep learning techniques, which exposes a shared vulnerability to adversarial perturbations. Existing attacks commonly optimize global sequence-level objectives. As a result, decisive frames are treated implicitly, and optimization can become unnecessarily diffuse over long input sequences, hindering convergence and perceptual quality. To address these issues, we propose FLAMA, a unified Frame-Level Alignment Margin Attack that can be used for both STR and ASR models. FLAMA explicitly targets alignment by maximizing per-frame (or per-step) recognition margins. The design is decoder-agnostic and applies to both CTC-based and attention-based pipelines. It employs a recognition-score-aware Step/Halt gate that concentrates updates on the most critical frames, and a stabilization stage that suppresses late-iteration oscillations to improve optimization stability and perceptual control. Ablation analyses show that stabilization consistently enhances attack success and reduces distortion. We evaluate FLAMA on STR benchmarks (SVT, CUTE80, and IC13) with CRNN, STAR, and TRBA, and on the ASR benchmark (LibriSpeech) with a Wav2Vec 2.0 model. Across modalities and architectures, FLAMA achieves near-100% attack success while substantially reducing ℓ2 distortion and improving perceptual metrics compared with FGSM/PGD baselines. These results highlight frame-level alignment as a shared weak point across visual and audio sequence recognizers and suggest localized margin objectives as a principled route to effective sequence attacks.

15 pages, 669 KB  
Article
Dementia Detection from Spontaneous Speech Using Cross-Attention Fusion
by Felix Agbavor and Hualou Liang
J. Dement. Alzheimer's Dis. 2026, 3(1), 12; https://doi.org/10.3390/jdad3010012 - 2 Mar 2026
Viewed by 1099
Abstract
Background/Objectives: Alzheimer’s disease (AD) is a progressive neurodegenerative disorder that affects the daily lives of older adults, impacting their cognitive abilities as well as speech and language communication. Early detection is crucial, as it enables timely intervention and helps improve the quality of life for those affected. While large language models (LLMs) have shown promise for dementia detection from spontaneous speech, most studies are unimodal and miss complementary signals across modalities. Methods: We present an LLM-powered multimodal cross-attention framework that integrates lexical (text), acoustic (speech), and visual (image) information for dementia detection using the ADReSSo 2021 picture-description dataset. Within this framework, text data are encoded using ModernBERT, audio features are extracted using wav2vec 2.0-base-960, and the Cookie Theft image is represented through CLIP ViT-L/14. These embeddings are linearly projected to a shared space and then combined via Transformer-based cross-attention, yielding a fused vector for AD detection. Results: Our results show that the trimodal model achieved the best overall performance when paired with an SVC classifier, reaching an accuracy of 0.8732 and an F1 score of 0.8571, surpassing both the top-performing unimodal and bimodal configurations. For interpretability, a sensitivity analysis of modality contributions reveals that text plays the primary role, audio provides complementary improvements, and image offers modest yet stabilizing contextual support. Conclusions: These results highlight that the method of multimodal embedding fusion significantly influences performance: a cross-attention block achieves an effective balance between accuracy and simplicity, producing integrated representations that align well with interpretable downstream classifiers.

16 pages, 434 KB  
Article
Modern Speech Recognition for Romanian Language
by Remus-Dan Ungureanu and Mihai Dascalu
Appl. Sci. 2026, 16(4), 1928; https://doi.org/10.3390/app16041928 - 14 Feb 2026
Viewed by 1191
Abstract
Despite having approximately 24 million native speakers, Romanian remains a low-resource language for automatic speech recognition (ASR), with few accurate and publicly available systems. To address this gap, this study explores the challenges of adapting modern speech recognition models, such as wav2vec 2.0 and Conformer, to Romanian. Our investigation is a comprehensive analysis of the two models, their capabilities to adapt to Romanian data, and the performance of the trained models. The research also focuses on unique attributes of the Romanian language, data collection techniques, including weakly supervised learning, and processing methodologies. Building on the previously introduced Echo dataset of 378 h, we release CRoWL (Crawled Romanian Weakly Labeled), a weakly supervised dataset of 9000 h created via automatic transcription. We obtain strong results that, to the best of our knowledge, are competitive with or exceed publicly reported results for Romanian under comparable open evaluation settings, with Conformer attaining 3.01% WER on Echo + CRoWL and wav2vec 2.0 reaching 4.04% (Echo) and 4.17% (Echo + CRoWL). In addition to the datasets, we also release our most capable models as open source, along with their training plans, thereby providing a solid foundation for researchers interested in languages with limited representation. Full article

12 pages, 356 KB  
Article
When Data Is Scarce: Training a Kazakh Speech Language Model from Discrete Units
by Bauyrzhan Kairatuly and Madina Mansurova
Appl. Sci. 2026, 16(4), 1773; https://doi.org/10.3390/app16041773 - 11 Feb 2026
Viewed by 521
Abstract
This research explores the development of a decoder-only speech language model (SLM) for Kazakh, a language currently characterized by limited computational resources. Our approach leverages discrete acoustic units synthesized from self-supervised speech representations. Specifically, we utilize a pretrained Wav2Vec 2.0 model to extract continuous latent features, which are then transformed into discrete semantic tokens via the k-means clustering algorithm. These tokens serve as the foundation for training a generative model designed to predict and maximize the likelihood of speech-unit sequences. To facilitate this study, we curated a specialized Kazakh speech corpus by synthesizing and refining multiple publicly available audio datasets. Given the constrained hardware resources available, we conducted large-scale feature extraction and tokenization to train the unit-based model. We evaluated the system’s efficacy using negative log-likelihood and perplexity metrics on independent test sets. The model captures Kazakh vowel harmony but struggles with long-range agglutinative chains. Key observations include the model’s high sensitivity to data quality, tokenization techniques, and specific training hyperparameters. Although constrained by data volume and training time relative to global benchmarks, the model successfully captures the underlying structural patterns in Kazakh speech. This work establishes a vital empirical baseline and suggests future improvements through refined unit discovery and integrated speech-text modeling. Full article
(This article belongs to the Section Computing and Artificial Intelligence)
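Two steps named in the abstract above lend themselves to a small worked sketch: turning continuous frame features into discrete units by assigning each frame to its nearest k-means centroid, and converting per-token negative log-likelihoods into perplexity. The sketch assumes the centroids have already been fitted elsewhere; the helper names `nearest_centroid`, `quantize`, and `perplexity` are illustrative and do not come from the paper.

```python
import math


def nearest_centroid(frame, centroids):
    """Index of the centroid closest to `frame` under squared Euclidean distance."""
    best_index, best_dist = 0, math.inf
    for index, centroid in enumerate(centroids):
        dist = sum((f - c) ** 2 for f, c in zip(frame, centroid))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def quantize(frames, centroids):
    """Map each continuous feature frame to a discrete unit id.

    Assumes `centroids` were fitted beforehand (e.g. by k-means on features
    from a pretrained speech encoder); fitting is out of scope here.
    """
    return [nearest_centroid(frame, centroids) for frame in frames]


def perplexity(token_nlls):
    """Perplexity as exp of the mean per-token NLL (natural-log units)."""
    if not token_nlls:
        raise ValueError("need at least one token NLL")
    return math.exp(sum(token_nlls) / len(token_nlls))
```

As a sanity check on the metric, a model that assigns probability 1/4 to every token has a per-token NLL of ln 4 and therefore a perplexity of 4, which matches the intuition of choosing uniformly among four units.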

19 pages, 1393 KB  
Article
Multimodal Emotion Recognition Model Based on Dynamic Heterogeneous Graph Temporal Network
by Bulaga Da and Feilong Bao
Appl. Sci. 2026, 16(4), 1731; https://doi.org/10.3390/app16041731 - 10 Feb 2026
Viewed by 863
Abstract
To address the semantic gap and complex feature entanglement inherent in multimodal emotion recognition, we propose the Dynamic Heterogeneous Graph Temporal Network (DHGTN), an end-to-end framework designed to model dynamic cross-modal interactions effectively. Utilizing a robust backbone of Wav2vec 2.0, VideoMAE, and BERT, we introduce a “Shared Private” subspace projection mechanism that explicitly disentangles emotion common features from modality-specific noise through contrastive learning to ensure strict semantic alignment. Furthermore, our collaborative Dynamic Heterogeneous Graph and Transformer module overcomes static fusion limitations by constructing time-varying graphs for instantaneous associations and employing global attention to capture long-range temporal dependencies. Extensive experiments on the IEMOCAP and MELD benchmarks demonstrate that DHGTN significantly outperforms state-of-the-art baselines, achieving weighted F1-scores of 73.86% and 66.87%, respectively, which confirms the method’s effectiveness and robustness. Full article
(This article belongs to the Section Computing and Artificial Intelligence)

26 pages, 712 KB  
Article
Comparing Multi-Scale and Pipeline Models for Speaker Change Detection
by Alymzhan Toleu, Gulmira Tolegen and Bagashar Zhumazhanov
Acoustics 2026, 8(1), 5; https://doi.org/10.3390/acoustics8010005 - 25 Jan 2026
Viewed by 1714
Abstract
Speaker change detection (SCD) in long, multi-party meetings is essential for diarization, automatic speech recognition (ASR), and summarization, and is now often performed in the space of pre-trained speech embeddings. However, unsupervised approaches remain dominant when timely labeled audio is scarce, and their behavior under a unified modeling setup is still not well understood. In this paper, we systematically compare two representative unsupervised approaches on the multi-talker audio meeting corpus: (i) a clustering-based pipeline that segments and clusters embeddings/features and scores boundaries via cluster changes and jump magnitude, and (ii) a multi-scale jump-based detector that measures embedding discontinuities at several window lengths and fuses them via temporal clustering and voting. Using a shared front-end and protocol, we vary the underlying features (ECAPA, WavLM, wav2vec 2.0, MFCC, and log-Mel) and test the model’s robustness under additive noise. The results show that embedding choice is crucial and that the two methods offer complementary trade-offs: the pipeline yields low false alarm rates but higher misses, while the multi-scale detector achieves relatively high recall at the cost of many false alarms.
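The multi-scale jump-based detector described in the abstract above can be sketched roughly as follows. This is a simplified illustration and not the paper's method: the jump score is assumed to be the cosine distance between the mean embeddings of the windows before and after each frame, the fusion step is reduced to same-index voting (the temporal clustering the abstract mentions is omitted), and the names `jump_scores` and `detect_changes` along with the default `threshold` and `min_votes` values are hypothetical.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_distance(a, b):
    """1 - cosine similarity; defined as 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return 1.0 - dot / (norm_a * norm_b)


def jump_scores(embeddings, window):
    """Distance between the mean of `window` frames before and after each index."""
    scores = [0.0] * len(embeddings)
    for t in range(window, len(embeddings) - window + 1):
        left = mean_vector(embeddings[t - window:t])
        right = mean_vector(embeddings[t:t + window])
        scores[t] = cosine_distance(left, right)
    return scores


def detect_changes(embeddings, windows=(1, 2), threshold=0.5, min_votes=2):
    """Return frame indices flagged as change points by at least `min_votes` scales."""
    votes = [0] * len(embeddings)
    for window in windows:
        for t, score in enumerate(jump_scores(embeddings, window)):
            if score > threshold:
                votes[t] += 1
    return [t for t, count in enumerate(votes) if count >= min_votes]
```

Requiring agreement across scales is one way to trade recall for fewer false alarms; lowering `min_votes` or `threshold` moves the sketch toward the high-recall, high-false-alarm behaviour the abstract reports for the multi-scale detector.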

20 pages, 5606 KB  
Article
Heart Sound Classification for Early Detection of Cardiovascular Diseases Using XGBoost and Engineered Acoustic Features
by P. P. Satya Karthikeya, P. Rohith, B. Karthikeya, M. Karthik Reddy, Akhil V M, Andrea Tigrini, Agnese Sbrollini and Laura Burattini
Sensors 2026, 26(2), 630; https://doi.org/10.3390/s26020630 - 17 Jan 2026
Cited by 2 | Viewed by 1513
Abstract
Heart sound-based detection of cardiovascular diseases is a critical task in clinical diagnostics, where early and accurate identification can significantly improve patient outcomes. In this study, we investigate the effectiveness of combining traditional acoustic features and transformer-based Wav2Vec embeddings with advanced machine learning models for multi-class classification of five heart sound categories. Ten engineered acoustic features, i.e., Log Mel, MFCC, delta, delta-delta, chroma, discrete wavelet transform, zero-crossing rate, energy, spectral centroid, and temporal flatness, were extracted as regular features. Four model configurations were evaluated: a hybrid CNN + LSTM and XGBoost trained with either regular features or Wav2Vec embeddings. Models were assessed using a held-out test set with hyperparameter tuning and cross-validation. Results demonstrate that models trained on regular features consistently outperform Wav2Vec-based models, with XGBoost achieving the highest accuracy of 99%, surpassing the hybrid model at 98%. These findings highlight the importance of domain-specific feature engineering and the effectiveness of ensemble learning with XGBoost for robust and accurate heart sound classification, offering a promising approach for early detection and intervention in cardiovascular diseases. Full article
(This article belongs to the Section Biomedical Sensors)

30 pages, 6201 KB  
Article
AFAD-MSA: Dataset and Models for Arabic Fake Audio Detection
by Elsayed Issa
Computation 2026, 14(1), 20; https://doi.org/10.3390/computation14010020 - 14 Jan 2026
Viewed by 1581
Abstract
As generative speech synthesis produces near-human synthetic voices and reliance on online media grows, robust audio-deepfake detection is essential to fight misuse and misinformation. In this study, we introduce the Arabic Fake Audio Dataset for Modern Standard Arabic (AFAD-MSA), a curated corpus of authentic and synthetic Arabic speech designed to advance research on Arabic deepfake and spoofed-speech detection. The synthetic subset is generated with four state-of-the-art proprietary text-to-speech and voice-conversion models. Rich metadata—covering speaker attributes and generation information—is provided to support reproducibility and benchmarking. To establish reference performance, we trained three AASIST models and compared their performance to two baseline transformer detectors (Wav2Vec 2.0 and Whisper). On the AFAD-MSA test split, AASIST-2 achieved perfect accuracy, surpassing the baseline models. However, its performance declined under cross-dataset evaluation. These results underscore the importance of data construction. Detectors generalize best when exposed to diverse attack types. In addition, continual or contrastive training that interleaves bona fide speech with large, heterogeneous spoofed corpora will further improve detectors’ robustness. Full article

26 pages, 29009 KB  
Article
Quantifying the Relationship Between Speech Quality Metrics and Biometric Speaker Recognition Performance Under Acoustic Degradation
by Ajan Ahmed and Masudul H. Imtiaz
Signals 2026, 7(1), 7; https://doi.org/10.3390/signals7010007 - 12 Jan 2026
Cited by 1 | Viewed by 2026
Abstract
Self-supervised learning (SSL) models have achieved remarkable success in speaker verification tasks, yet their robustness to real-world audio degradation remains insufficiently characterized. This study presents a comprehensive analysis of how audio quality degradation affects three prominent SSL-based speaker verification systems (WavLM, Wav2Vec2, and HuBERT) across three diverse datasets: TIMIT, CHiME-6, and Common Voice. We systematically applied 21 degradation conditions spanning noise contamination (SNR levels from 0 to 20 dB), reverberation (RT60 from 0.3 to 1.0 s), and codec compression (various bit rates), then measured both objective audio quality metrics (PESQ, STOI, SNR, SegSNR, fwSNRseg, jitter, shimmer, HNR) and speaker verification performance metrics (EER, AUC-ROC, d-prime, minDCF). At the condition level, multiple regression with all eight quality metrics explained up to 80% of the variance in minDCF for HuBERT and 78% for WavLM, but only 35% for Wav2Vec2; EER predictability was lower (69%, 67%, and 28%, respectively). PESQ was the strongest single predictor for WavLM and HuBERT, while shimmer showed the highest single-metric correlation for Wav2Vec2; fwSNRseg yielded the top single-metric R² for WavLM, and PESQ for HuBERT and Wav2Vec2 (with much smaller gains for Wav2Vec2). WavLM and HuBERT exhibited more predictable quality-performance relationships compared to Wav2Vec2. These findings establish quantitative relationships between measurable audio quality and speaker verification accuracy at the condition level, though substantial within-condition variability limits utterance-level prediction accuracy.
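Of the verification metrics listed in the abstract above, the equal error rate (EER) is the operating point at which the false acceptance rate equals the false rejection rate. A minimal sketch of one common way to approximate it from trial scores follows; the function name `equal_error_rate` and the simple threshold sweep over observed scores are assumptions for illustration, and the study's own evaluation code may interpolate differently.

```python
import math


def equal_error_rate(genuine_scores, impostor_scores):
    """Approximate the EER by sweeping thresholds over all observed scores.

    A trial is accepted when its score is >= the threshold. The returned
    value averages FAR and FRR at the threshold where they are closest.
    """
    if not genuine_scores or not impostor_scores:
        raise ValueError("both score lists must be non-empty")
    best_gap, eer = math.inf, 0.0
    for threshold in sorted(set(genuine_scores) | set(impostor_scores)):
        # False acceptance: impostor trials that pass the threshold.
        far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
        # False rejection: genuine trials that fall below the threshold.
        frr = sum(s < threshold for s in genuine_scores) / len(genuine_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2.0
    return eer
```

For example, with genuine scores `[0.9, 0.8, 0.7, 0.4]` and impostor scores `[0.6, 0.3, 0.2, 0.1]`, a threshold of 0.6 accepts one impostor and rejects one genuine trial, so both error rates are 0.25 and the function returns 0.25.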