Search Results (60)

Search Parameters:
Keywords = Wav2Vec 2.0

26 pages, 712 KB  
Article
Comparing Multi-Scale and Pipeline Models for Speaker Change Detection
by Alymzhan Toleu, Gulmira Tolegen and Bagashar Zhumazhanov
Acoustics 2026, 8(1), 5; https://doi.org/10.3390/acoustics8010005 - 25 Jan 2026
Viewed by 177
Abstract
Speaker change detection (SCD) in long, multi-party meetings is essential for diarization, automatic speech recognition (ASR), and summarization, and is now often performed in the space of pre-trained speech embeddings. However, unsupervised approaches remain dominant when timely labeled audio is scarce, and their behavior under a unified modeling setup is still not well understood. In this paper, we systematically compare two representative unsupervised approaches on the multi-talker audio meeting corpus: (i) a clustering-based pipeline that segments and clusters embeddings/features and scores boundaries via cluster changes and jump magnitude, and (ii) a multi-scale jump-based detector that measures embedding discontinuities at several window lengths and fuses them via temporal clustering and voting. Using a shared front-end and protocol, we vary the underlying features (ECAPA, WavLM, wav2vec 2.0, MFCC, and log-Mel) and test both methods' robustness under additive noise. The results show that embedding choice is crucial and that the two methods offer complementary trade-offs: the pipeline yields low false alarm rates but higher misses, while the multi-scale detector achieves relatively high recall at the cost of many false alarms.
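
The multi-scale jump detector described above can be approximated in a few lines: compute an embedding-discontinuity score at several window lengths and keep frames that a majority of scales flag. The sketch below is a minimal numpy illustration of that idea, with made-up thresholds and placeholder embeddings rather than the authors' implementation.

```python
import numpy as np

def jump_scores(emb, win):
    """Cosine distance between the mean embedding of the `win` frames
    before and after each candidate boundary (one score per frame)."""
    T = emb.shape[0]
    scores = np.zeros(T)
    for t in range(win, T - win):
        left = emb[t - win:t].mean(axis=0)
        right = emb[t:t + win].mean(axis=0)
        cos = left @ right / (np.linalg.norm(left) * np.linalg.norm(right) + 1e-9)
        scores[t] = 1.0 - cos
    return scores

def multi_scale_boundaries(emb, wins=(10, 25, 50), z_thresh=2.0):
    """Flag frames whose jump score exceeds a per-scale z-score threshold
    at a majority of window lengths (simple voting as a stand-in for the
    temporal clustering/fusion step described in the paper)."""
    votes = np.zeros(emb.shape[0])
    for w in wins:
        s = jump_scores(emb, w)
        z = (s - s.mean()) / (s.std() + 1e-9)
        votes += (z > z_thresh).astype(float)
    return np.where(votes >= len(wins) / 2)[0]

# emb: (T, D) matrix of frame-level speech embeddings (e.g. wav2vec 2.0 or ECAPA)
emb = np.random.randn(1000, 192)          # placeholder embeddings
print(multi_scale_boundaries(emb)[:10])   # candidate speaker-change frames
```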

20 pages, 5606 KB  
Article
Heart Sound Classification for Early Detection of Cardiovascular Diseases Using XGBoost and Engineered Acoustic Features
by P. P. Satya Karthikeya, P. Rohith, B. Karthikeya, M. Karthik Reddy, Akhil V M, Andrea Tigrini, Agnese Sbrollini and Laura Burattini
Sensors 2026, 26(2), 630; https://doi.org/10.3390/s26020630 - 17 Jan 2026
Viewed by 283
Abstract
Heart sound-based detection of cardiovascular diseases is a critical task in clinical diagnostics, where early and accurate identification can significantly improve patient outcomes. In this study, we investigate the effectiveness of combining traditional acoustic features and transformer-based Wav2Vec embeddings with advanced machine learning models for multi-class classification of five heart sound categories. Ten engineered acoustic features, i.e., Log Mel, MFCC, delta, delta-delta, chroma, discrete wavelet transform, zero-crossing rate, energy, spectral centroid, and temporal flatness, were extracted as regular features. Four model configurations were evaluated: a hybrid CNN + LSTM and XGBoost, each trained with either regular features or Wav2Vec embeddings. Models were assessed using a held-out test set with hyperparameter tuning and cross-validation. Results demonstrate that models trained on regular features consistently outperform Wav2Vec-based models, with XGBoost achieving the highest accuracy of 99%, surpassing the hybrid model at 98%. These findings highlight the importance of domain-specific feature engineering and the effectiveness of ensemble learning with XGBoost for robust and accurate heart sound classification, offering a promising approach for early detection and intervention in cardiovascular diseases.
(This article belongs to the Section Biomedical Sensors)
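
As a rough illustration of the engineered-feature route, the sketch below extracts a subset of the listed descriptors with librosa and feeds them to an XGBoost classifier. The 4 kHz sampling rate, mean-pooled feature summaries, and hyperparameters are assumptions, not the paper's configuration.

```python
import numpy as np
import librosa
from xgboost import XGBClassifier

def engineered_features(path, sr=4000):
    """A subset of the paper's ten descriptors, summarised by their means:
    MFCC, delta, delta-delta, chroma, ZCR, energy, spectral centroid."""
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    feats = [
        mfcc.mean(axis=1),
        librosa.feature.delta(mfcc).mean(axis=1),
        librosa.feature.delta(mfcc, order=2).mean(axis=1),
        librosa.feature.chroma_stft(y=y, sr=sr).mean(axis=1),
        librosa.feature.zero_crossing_rate(y).mean(axis=1),
        np.array([float(np.mean(y ** 2))]),                    # signal energy
        librosa.feature.spectral_centroid(y=y, sr=sr).mean(axis=1),
    ]
    return np.concatenate(feats)

# In practice: X = np.stack([engineered_features(p) for p in wav_paths]); y = labels
X, y = np.random.randn(200, 40), np.random.randint(0, 5, 200)  # placeholders for 5 classes
clf = XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1)
clf.fit(X, y)
print(clf.predict(X[:5]))
```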

30 pages, 6201 KB  
Article
AFAD-MSA: Dataset and Models for Arabic Fake Audio Detection
by Elsayed Issa
Computation 2026, 14(1), 20; https://doi.org/10.3390/computation14010020 - 14 Jan 2026
Viewed by 284
Abstract
As generative speech synthesis produces near-human synthetic voices and reliance on online media grows, robust audio-deepfake detection is essential to fight misuse and misinformation. In this study, we introduce the Arabic Fake Audio Dataset for Modern Standard Arabic (AFAD-MSA), a curated corpus of authentic and synthetic Arabic speech designed to advance research on Arabic deepfake and spoofed-speech detection. The synthetic subset is generated with four state-of-the-art proprietary text-to-speech and voice-conversion models. Rich metadata—covering speaker attributes and generation information—is provided to support reproducibility and benchmarking. To establish reference performance, we trained three AASIST models and compared their performance to two baseline transformer detectors (Wav2Vec 2.0 and Whisper). On the AFAD-MSA test split, AASIST-2 achieved perfect accuracy, surpassing the baseline models. However, its performance declined under cross-dataset evaluation. These results underscore the importance of data construction. Detectors generalize best when exposed to diverse attack types. In addition, continual or contrastive training that interleaves bona fide speech with large, heterogeneous spoofed corpora will further improve detectors’ robustness.

26 pages, 29009 KB  
Article
Quantifying the Relationship Between Speech Quality Metrics and Biometric Speaker Recognition Performance Under Acoustic Degradation
by Ajan Ahmed and Masudul H. Imtiaz
Signals 2026, 7(1), 7; https://doi.org/10.3390/signals7010007 - 12 Jan 2026
Viewed by 423
Abstract
Self-supervised learning (SSL) models have achieved remarkable success in speaker verification tasks, yet their robustness to real-world audio degradation remains insufficiently characterized. This study presents a comprehensive analysis of how audio quality degradation affects three prominent SSL-based speaker verification systems (WavLM, Wav2Vec2, and HuBERT) across three diverse datasets: TIMIT, CHiME-6, and Common Voice. We systematically applied 21 degradation conditions spanning noise contamination (SNR levels from 0 to 20 dB), reverberation (RT60 from 0.3 to 1.0 s), and codec compression (various bit rates), then measured both objective audio quality metrics (PESQ, STOI, SNR, SegSNR, fwSNRseg, jitter, shimmer, HNR) and speaker verification performance metrics (EER, AUC-ROC, d-prime, minDCF). At the condition level, multiple regression with all eight quality metrics explained up to 80% of the variance in minDCF for HuBERT and 78% for WavLM, but only 35% for Wav2Vec2; EER predictability was lower (69%, 67%, and 28%, respectively). PESQ was the strongest single predictor for WavLM and HuBERT, while shimmer showed the highest single-metric correlation for Wav2Vec2; fwSNRseg yielded the top single-metric R² for WavLM, and PESQ for HuBERT and Wav2Vec2 (with much smaller gains for Wav2Vec2). WavLM and HuBERT exhibited more predictable quality–performance relationships compared to Wav2Vec2. These findings establish quantitative relationships between measurable audio quality and speaker verification accuracy at the condition level, though substantial within-condition variability limits utterance-level prediction accuracy.
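
The condition-level analysis amounts to regressing a verification metric on the eight quality metrics and comparing multiple-regression against single-metric R². A toy sklearn version on synthetic data (not the paper's measurements):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# One row per degradation condition: eight quality metrics as predictors,
# minDCF as the target. Values are synthetic placeholders.
rng = np.random.default_rng(0)
metrics = ["PESQ", "STOI", "SNR", "SegSNR", "fwSNRseg", "jitter", "shimmer", "HNR"]
quality = rng.normal(size=(21, len(metrics)))
min_dcf = 0.3 * quality[:, 0] + 0.1 * rng.normal(size=21)

reg = LinearRegression().fit(quality, min_dcf)
print("multiple-regression R^2:", round(r2_score(min_dcf, reg.predict(quality)), 2))

# Single-metric predictors, as in the per-metric comparison
for i, name in enumerate(metrics):
    single = LinearRegression().fit(quality[:, [i]], min_dcf)
    print(f"{name}: R^2 = {r2_score(min_dcf, single.predict(quality[:, [i]])):.2f}")
```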

36 pages, 1309 KB  
Article
Listen Closely: Self-Supervised Phoneme Tracking for Children’s Reading Assessment
by Philipp Ollmann, Erik Sonnleitner, Marc Kurz, Jens Krösche and Stephan Selinger
Information 2026, 17(1), 40; https://doi.org/10.3390/info17010040 - 4 Jan 2026
Viewed by 425
Abstract
Reading proficiency in early childhood is crucial for academic success and intellectual development. However, more and more children are struggling with reading. According to the last PISA study in Austria, one out of five children is dealing with reading difficulties. The reasons for this are diverse, but an application that tracks children while reading aloud and guides them when they experience difficulties could offer meaningful help. Therefore, this work explores a prototyping approach for a core component that tracks children’s reading using a self-supervised Wav2Vec2 model with a limited amount of data. Self-supervised learning allows models to learn general representations from large amounts of unlabeled audio, which can then be fine-tuned on smaller, task-specific datasets, making it especially useful when labeled data is limited. Our model operates at the phonetic level with the help of the International Phonetic Alphabet (IPA). To implement this, the KidsTALC dataset from the Leibniz University Hannover was used, which contains spontaneous speech recordings of German-speaking children. To enhance the training data and improve robustness, several data augmentation techniques were applied and evaluated, including pitch shifting, formant shifting, and speed variation. The models were trained using different data configurations to compare the effects of data variety and quality on recognition performance. The best model trained in this work achieved a phoneme error rate (PER) of 14.3% and a word error rate (WER) of 31.6% on unseen child speech data, demonstrating the potential of self-supervised models for such use cases.
(This article belongs to the Special Issue AI Technology-Enhanced Learning and Teaching)
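
Two building blocks mentioned above are easy to sketch: waveform augmentation (pitch shift and speed variation via librosa; formant shifting would need an external tool such as Praat/parselmouth) and the phoneme error rate used for evaluation. This is illustrative code under those assumptions, not the training pipeline itself.

```python
import numpy as np
import librosa

def augment(y, sr):
    """Two of the augmentations evaluated in the paper: pitch shift and
    speed variation. Formant shifting is omitted here."""
    return [
        librosa.effects.pitch_shift(y, sr=sr, n_steps=float(np.random.uniform(-2, 2))),
        librosa.effects.time_stretch(y, rate=float(np.random.uniform(0.9, 1.1))),
    ]

def phoneme_error_rate(ref, hyp):
    """Levenshtein distance between reference and hypothesis IPA phoneme
    sequences, normalised by reference length (standard PER definition)."""
    d = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=int)
    d[:, 0] = np.arange(len(ref) + 1)
    d[0, :] = np.arange(len(hyp) + 1)
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1, d[i, j - 1] + 1, d[i - 1, j - 1] + cost)
    return d[-1, -1] / max(len(ref), 1)

y = np.random.randn(16000)                 # placeholder 1 s utterance at 16 kHz
variants = augment(y, 16000)
print(len(variants), phoneme_error_rate(list("kato"), list("kado")))   # 2 0.25
```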

11 pages, 555 KB  
Article
Human–AI Feedback Loop for Pronunciation Training: A Mobile Application with Phoneme-Level Error Highlighting
by Aleksei Demin, Georgii Vorontsov and Dmitrii Chaikovskii
Multimodal Technol. Interact. 2026, 10(1), 2; https://doi.org/10.3390/mti10010002 - 26 Dec 2025
Viewed by 593
Abstract
This paper presents an AI-augmented pronunciation training approach for Russian language learners through a mobile application that supports an interactive learner–system feedback loop. The system combines a pre-trained Wav2Vec2Phoneme neural network with Needleman–Wunsch global sequence alignment to convert reference and learner speech into aligned phoneme sequences. Rather than producing an overall pronunciation score, the application provides localized, interpretable feedback by highlighting phoneme-level matches and mismatches in a red/green transcription, enabling learners to see where sounds were substituted, omitted, or added. Implemented as a WeChat Mini Program with a WebSocket-based backend, the design illustrates how speech-to-phoneme models and alignment procedures can be integrated into a lightweight mobile interface for autonomous pronunciation practice. We further provide a feature-level comparison with widely used commercial applications (Duolingo, HelloChinese, Babbel), emphasizing differences in feedback granularity and interpretability rather than unvalidated accuracy claims. Overall, the work demonstrates the feasibility of alignment-based phoneme-level feedback for mobile pronunciation training and motivates future evaluation of recognition reliability, latency, and learning outcomes on representative learner data.
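
The core of the feedback loop, Needleman–Wunsch global alignment of reference and learner phoneme sequences followed by match/mismatch coloring, can be sketched as follows. The scoring values and example strings are placeholders; in the actual system the phoneme sequences would come from Wav2Vec2Phoneme transcriptions.

```python
def needleman_wunsch(ref, hyp, match=1, mismatch=-1, gap=-1):
    """Global alignment of reference vs. learner phoneme sequences.
    Returns aligned pairs where None marks an omitted or added sound."""
    n, m = len(ref), len(hyp)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        score[i][0] = i * gap
    for j in range(m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if ref[i - 1] == hyp[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    pairs, i, j = [], n, m                       # traceback
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
                match if ref[i - 1] == hyp[j - 1] else mismatch):
            pairs.append((ref[i - 1], hyp[j - 1])); i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((ref[i - 1], None)); i -= 1        # omitted sound
        else:
            pairs.append((None, hyp[j - 1])); j -= 1        # added sound
    pairs.reverse()
    return pairs

# Illustrative phoneme strings; green = match, red = substitution/omission/addition
for r, h in needleman_wunsch(list("zdravstvujte"), list("zdrastvujte")):
    print(("green" if r == h else "red"), r, h)
```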

24 pages, 1716 KB  
Article
Multi-Modal Decentralized Hybrid Learning for Early Parkinson’s Detection Using Voice Biomarkers and Contrastive Speech Embeddings
by Khaled M. Alhawiti
Sensors 2025, 25(22), 6959; https://doi.org/10.3390/s25226959 - 14 Nov 2025
Cited by 1 | Viewed by 990
Abstract
Millions worldwide are affected by Parkinson’s disease, with the World Health Organization highlighting its growing prevalence. Early neuromotor speech impairments make voice analysis a promising tool for detecting Parkinson’s, aided by advances in deep speech embeddings. However, existing approaches often rely on either handcrafted acoustic features or opaque deep representations, limiting diagnostic performance and interpretability. To address this, we propose a multi-modal decentralized hybrid learning framework that combines structured voice biomarkers from the UCI Parkinson’s dataset (195 sustained-phonation samples from 31 subjects) with contrastive speech embeddings derived from the DAIC-WOZ corpus (189 interview recordings originally collected for depression detection) using Wav2Vec 2.0. This system employs an early fusion strategy followed by a dense neural classifier optimized for binary classification. By integrating both clinically interpretable and semantically rich features, the model captures complementary phonatory and affective patterns relevant to early-stage Parkinson’s detection. Extensive evaluation demonstrates that the proposed method achieves an accuracy of 96.2% and an AUC of 97.1%, outperforming unimodal and baseline fusion models. SHAP-based analysis confirms that a subset of features has disproportionately high discriminative value, enhancing interpretability. Overall, the proposed framework establishes a promising pathway toward data-driven, non-invasive screening for neurodegenerative conditions through voice analysis.
(This article belongs to the Special Issue Blockchain Technology for Internet of Things)
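
A minimal PyTorch sketch of the early-fusion idea: concatenate the structured voice biomarkers with a pooled wav2vec 2.0 utterance embedding before a small dense binary classifier. Layer sizes, dropout, and the 22-feature biomarker width are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class EarlyFusionClassifier(nn.Module):
    """Concatenate structured voice biomarkers with a pooled wav2vec 2.0
    utterance embedding (768-d for the base model), then classify with a
    dense head producing a single PD / healthy logit."""
    def __init__(self, n_biomarkers=22, n_embedding=768):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_biomarkers + n_embedding, 256), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(256, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, biomarkers, embedding):
        return self.net(torch.cat([biomarkers, embedding], dim=-1))

model = EarlyFusionClassifier()
bio = torch.randn(8, 22)        # placeholder biomarker rows
emb = torch.randn(8, 768)       # placeholder mean-pooled wav2vec 2.0 embeddings
prob = torch.sigmoid(model(bio, emb))
print(prob.shape)               # torch.Size([8, 1])
```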

38 pages, 2282 KB  
Article
Cross-Lingual Bimodal Emotion Recognition with LLM-Based Label Smoothing
by Elena Ryumina, Alexandr Axyonov, Timur Abdulkadirov, Darya Koryakovskaya and Dmitry Ryumin
Big Data Cogn. Comput. 2025, 9(11), 285; https://doi.org/10.3390/bdcc9110285 - 12 Nov 2025
Viewed by 2072
Abstract
Bimodal emotion recognition based on audio and text is widely adopted in video-constrained real-world applications such as call centers and voice assistants. However, existing systems suffer from limited cross-domain generalization and monolingual bias. To address these limitations, a cross-lingual bimodal emotion recognition method is proposed, integrating Mamba-based temporal encoders for audio (Wav2Vec2.0) and text (Jina-v3) with a Transformer-based cross-modal fusion architecture (BiFormer). Three corpus-adaptive augmentation strategies are introduced: (1) Stacked Data Sampling, in which short utterances are concatenated to stabilize sequence length; (2) Label Smoothing Generation based on Large Language Model, where the Qwen3-4B model is prompted to detect subtle emotional cues missed by annotators, producing soft labels that reflect latent emotional co-occurrences; and (3) Text-to-Utterance Generation, in which emotionally labeled utterances are generated by ChatGPT-5 and synthesized into speech using the DIA-TTS model, enabling controlled creation of affective audio–text pairs without human annotation. BiFormer is trained jointly on the English Multimodal EmotionLines Dataset and the Russian Emotional Speech Dialogs corpus, enabling cross-lingual transfer without parallel data. Experimental results show that the optimal data augmentation strategy is corpus-dependent: Stacked Data Sampling achieves the best performance on short, noisy English utterances, while Label Smoothing Generation based on Large Language Model better captures nuanced emotional expressions in longer Russian utterances. Text-to-Utterance Generation does not yield a measurable gain due to current limitations in expressive speech synthesis. When combined, the two best-performing strategies produce complementary improvements, establishing new state-of-the-art performance in both monolingual and cross-lingual settings.
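
The LLM-based label smoothing strategy can be pictured as blending the annotator's hard label with emotions the LLM flags as latently present, then training against the resulting soft distribution. The sketch below assumes a fixed emotion inventory and a mixing weight of 0.8; it is an illustration, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

EMOTIONS = ["neutral", "happy", "sad", "angry", "surprise", "fear", "disgust"]

def soft_label(annotated, llm_cues, alpha=0.8):
    """Blend the annotator's hard label with emotions an LLM detects as
    latently present, producing a smoothed target distribution."""
    target = torch.zeros(len(EMOTIONS))
    if llm_cues:
        target[EMOTIONS.index(annotated)] = alpha
        for e in llm_cues:
            target[EMOTIONS.index(e)] += (1 - alpha) / len(llm_cues)
    else:
        target[EMOTIONS.index(annotated)] = 1.0        # no extra cues: keep hard label
    return target

# Training against soft targets: cross-entropy with the smoothed distribution.
logits = torch.randn(1, len(EMOTIONS))                  # fusion-model output (placeholder)
target = soft_label("sad", ["fear"]).unsqueeze(0)       # LLM also detected fear
loss = -(target * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
print(float(loss))
```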

14 pages, 1602 KB  
Article
Frame and Utterance Emotional Alignment for Speech Emotion Recognition
by Seounghoon Byun and Seok-Pil Lee
Future Internet 2025, 17(11), 509; https://doi.org/10.3390/fi17110509 - 5 Nov 2025
Viewed by 900
Abstract
Speech Emotion Recognition (SER) is important for applications such as Human–Computer Interaction (HCI) and emotion-aware services. Traditional SER models rely on utterance-level labels, aggregating frame-level representations through pooling operations. However, emotional states can vary across frames within an utterance, making it difficult for models to learn consistent and robust representations. To address this issue, we propose two auxiliary loss functions, Emotional Attention Loss (EAL) and Frame-to-Utterance Alignment Loss (FUAL). The proposed approach uses a Classification token (CLS) self-attention pooling mechanism, where the CLS summarizes the entire utterance sequence. EAL encourages frames of the same emotion to align closely with the CLS while separating frames of different classes, and FUAL enforces consistency between frame-level and utterance-level predictions to stabilize training. Model training proceeds in two stages: Stage 1 fine-tunes the wav2vec 2.0 backbone with Cross-Entropy (CE) loss to obtain stable frame embeddings, and Stage 2 jointly optimizes CE, EAL, and FUAL within the CLS-based pooling framework. Experiments on the IEMOCAP four-class dataset demonstrate that our method consistently outperforms baseline models, showing that the proposed losses effectively address representation inconsistencies and improve SER performance. This work advances Artificial Intelligence by improving the ability of models to understand human emotions through speech.
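
A compact reading of the CLS-based pooling and the frame-to-utterance consistency idea is sketched below in PyTorch. The single-head attention and the KL-divergence form of FUAL are assumptions made for brevity, not the paper's exact losses, and EAL (pulling same-emotion frames toward the CLS summary) is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CLSAttentionPooling(nn.Module):
    """A learnable [CLS] vector attends over wav2vec 2.0 frame embeddings
    to summarise the utterance; separate heads give frame- and
    utterance-level emotion logits."""
    def __init__(self, dim=768, n_classes=4):
        super().__init__()
        self.cls = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)
        self.frame_head = nn.Linear(dim, n_classes)
        self.utt_head = nn.Linear(dim, n_classes)

    def forward(self, frames):                            # frames: (B, T, dim)
        cls = self.cls.expand(frames.size(0), -1, -1)
        utt, _ = self.attn(cls, frames, frames)           # (B, 1, dim)
        return self.frame_head(frames), self.utt_head(utt.squeeze(1))

def frame_utterance_alignment_loss(frame_logits, utt_logits):
    """One plausible consistency term: KL divergence pushing the average
    frame prediction toward the utterance prediction (an assumption)."""
    frame_dist = F.softmax(frame_logits, dim=-1).mean(dim=1).log()
    utt_dist = F.softmax(utt_logits, dim=-1)
    return F.kl_div(frame_dist, utt_dist, reduction="batchmean")

model = CLSAttentionPooling()
frame_logits, utt_logits = model(torch.randn(2, 100, 768))
print(float(frame_utterance_alignment_loss(frame_logits, utt_logits)))
```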

18 pages, 1138 KB  
Article
Speech-Based Depression Recognition in Hikikomori Patients Undergoing Cognitive Behavioral Therapy
by Samara Soares Leal, Stavros Ntalampiras, Maria Gloria Rossetti, Antonio Trabacca, Marcella Bellani and Roberto Sassi
Appl. Sci. 2025, 15(21), 11750; https://doi.org/10.3390/app152111750 - 4 Nov 2025
Viewed by 685
Abstract
Major depressive disorder (MDD) affects approximately 4.4% of the global population. Its prevalence is increasing among adolescents and has led to the psychosocial condition known as hikikomori. MDD is typically assessed by self-report questionnaires, which, although informative, are subject to evaluator bias and subjectivity. To address these limitations, recent studies have explored machine learning (ML) for automated MDD detection. Among the input data used, speech signals stand out due to their low cost and minimal intrusiveness. However, many speech-based approaches lack integration with cognitive behavioral therapy (CBT) and adherence to evidence-based, patient-centered care—often aiming to replace rather than support clinical monitoring. In this context, we propose ML models to assess MDD in hikikomori patients using speech data from a real-world clinical trial. The trial is conducted in Italy, supervised by physicians, and comprises an eight-session CBT plan that is clinical evidence-based and follows patient-centered practices. Patients’ speech is recorded during therapy, and Mel-Frequency Cepstral Coefficients (MFCCs) and wav2vec 2.0 embeddings are extracted to train the models. The results show that the Multi-Layer Perceptron (MLP) predicted depression outcomes with a Root Mean Squared Error (RMSE) of 0.064 using only MFCCs from the first session, suggesting that early-session speech may be valuable for outcome prediction. When considering the entire CBT treatment (i.e., all sessions), the MLP achieved an RMSE of 0.063 using MFCCs and a lower RMSE of 0.057 with wav2vec 2.0, indicating approximately a 9.5% performance improvement. To aid the interpretability of the treatment outcomes, a binary task was conducted, where Logistic Regression (LR) achieved 70% recall in predicting depression improvement among young adults using wav2vec 2.0. These findings position speech as a valuable predictive tool in clinical informatics, potentially supporting clinicians in anticipating treatment response.
(This article belongs to the Special Issue Advances in Audio Signal Processing)
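
As a schematic of the regression setup, the sketch below fits an sklearn MLP on pooled utterance-level features and reports RMSE on a held-out split. The data are synthetic placeholders and the layer sizes are assumptions, not the study's configuration.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

# One row per patient/session: either averaged MFCCs or a mean-pooled
# wav2vec 2.0 embedding (768-d); targets are normalised depression scores.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 768))                 # placeholder wav2vec 2.0 embeddings
y = rng.uniform(0, 1, size=40)                 # placeholder outcome scores

model = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=2000, random_state=0),
)
model.fit(X[:30], y[:30])
rmse = mean_squared_error(y[30:], model.predict(X[30:])) ** 0.5
print(f"RMSE: {rmse:.3f}")
```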

26 pages, 2931 KB  
Review
Prospects of AI-Powered Bowel Sound Analytics for Diagnosis, Characterization, and Treatment Management of Inflammatory Bowel Disease
by Divyanshi Sood, Zenab Muhammad Riaz, Jahnavi Mikkilineni, Narendra Nath Ravi, Vineeta Chidipothu, Gayathri Yerrapragada, Poonguzhali Elangovan, Mohammed Naveed Shariff, Thangeswaran Natarajan, Jayarajasekaran Janarthanan, Naghmeh Asadimanesh, Shiva Sankari Karuppiah, Keerthy Gopalakrishnan and Shivaram P. Arunachalam
Med. Sci. 2025, 13(4), 230; https://doi.org/10.3390/medsci13040230 - 13 Oct 2025
Cited by 4 | Viewed by 2423
Abstract
Background: This narrative review examines the role of artificial intelligence (AI) in bowel sound analysis for the diagnosis and management of inflammatory bowel disease (IBD). Inflammatory bowel disease (IBD), encompassing Crohn’s disease and ulcerative colitis, presents a significant clinical burden due to its unpredictable course, variable symptomatology, and reliance on invasive procedures for diagnosis and disease monitoring. Despite advances in imaging and biomarkers, tools such as colonoscopy and fecal calprotectin remain costly, uncomfortable, and impractical for frequent or real-time assessment. Meanwhile, bowel sounds—an overlooked physiologic signal—reflect underlying gastrointestinal motility and inflammation but have historically lacked objective quantification. With recent advances in artificial intelligence (AI) and acoustic signal processing, there is growing interest in leveraging bowel sound analysis as a novel, non-invasive biomarker for detecting IBD, monitoring disease activity, and predicting disease flares. This approach holds the promise of continuous, low-cost, and patient-friendly monitoring, which could transform IBD management. Objectives: This narrative review assesses the clinical utility, methodological rigor, and potential future integration of artificial intelligence (AI)-driven bowel sound analysis in inflammatory bowel disease (IBD), with a focus on its potential as a non-invasive biomarker for disease activity, flare prediction, and differential diagnosis. Methods: This manuscript reviews the potential of AI-powered bowel sound analysis as a non-invasive tool for diagnosing, monitoring, and managing inflammatory bowel disease (IBD), including Crohn’s disease and ulcerative colitis. Traditional diagnostic methods, such as colonoscopy and biomarkers, are often invasive, costly, and impractical for real-time monitoring. The manuscript explores bowel sounds, which reflect gastrointestinal motility and inflammation, as an alternative biomarker by utilizing AI techniques like convolutional neural networks (CNNs), transformers, and gradient boosting. We analyze data on acoustic signal acquisition (e.g., smart T-shirts, smartphones), signal processing methodologies (e.g., MFCCs, spectrograms, empirical mode decomposition), and validation metrics (e.g., accuracy, F1 scores, AUC). Studies were assessed for clinical relevance, methodological rigor, and translational potential. Results: Across studies enrolling 16–100 participants, AI models achieved diagnostic accuracies of 88–96%, with AUCs ≥ 0.83 and F1 scores ranging from 0.71 to 0.85 for differentiating IBD from healthy controls and IBS. Transformer-based approaches (e.g., HuBERT, Wav2Vec 2.0) consistently outperformed CNNs and tabular models, yielding F1 scores of 80–85%, while gradient boosting on wearable multi-microphone recordings demonstrated robustness to background noise. Distinct acoustic signatures were identified, including prolonged sound-to-sound intervals in Crohn’s disease (mean 1232 ms vs. 511 ms in IBS) and high-pitched tinkling in stricturing phenotypes. Despite promising performance, current models remain below established biomarkers such as fecal calprotectin (~90% sensitivity for active disease), and generalizability is limited by small, heterogeneous cohorts and the absence of prospective validation. Conclusions: AI-powered bowel sound analysis represents a promising, non-invasive tool for IBD monitoring. However, widespread clinical integration requires standardized data acquisition protocols, large multi-center datasets with clinical correlates, explainable AI frameworks, and ethical data governance. Future directions include wearable-enabled remote monitoring platforms and multi-modal decision support systems that integrate bowel sounds with biomarker and symptom data, aiming to transform IBD care into a more personalized and proactive model.

46 pages, 7346 KB  
Review
Integrating Speech Recognition into Intelligent Information Systems: From Statistical Models to Deep Learning
by Chaoji Wu, Yi Pan, Haipan Wu and Lei Ning
Informatics 2025, 12(4), 107; https://doi.org/10.3390/informatics12040107 - 4 Oct 2025
Viewed by 5365
Abstract
Automatic speech recognition (ASR) has advanced rapidly, evolving from early template-matching systems to modern deep learning frameworks. This review systematically traces ASR’s technological evolution across four phases: the template-based era, statistical modeling approaches, the deep learning revolution, and the emergence of large-scale models under diverse learning paradigms. We analyze core technologies such as hidden Markov models (HMMs), Gaussian mixture models (GMMs), recurrent neural networks (RNNs), and recent architectures including Transformer-based models and Wav2Vec 2.0. Beyond algorithmic development, we examine how ASR integrates into intelligent information systems, analyzing real-world applications in healthcare, education, smart homes, enterprise systems, and automotive domains with attention to deployment considerations and system design. We also address persistent challenges—noise robustness, low-resource adaptation, and deployment efficiency—while exploring emerging solutions such as multimodal fusion, privacy-preserving modeling, and lightweight architectures. Finally, we outline future research directions to guide the development of robust, scalable, and intelligent ASR systems for complex, evolving environments.
(This article belongs to the Section Machine Learning)

19 pages, 7222 KB  
Article
Multi-Channel Spectro-Temporal Representations for Speech-Based Parkinson’s Disease Detection
by Hadi Sedigh Malekroodi, Nuwan Madusanka, Byeong-il Lee and Myunggi Yi
J. Imaging 2025, 11(10), 341; https://doi.org/10.3390/jimaging11100341 - 1 Oct 2025
Viewed by 1090
Abstract
Early, non-invasive detection of Parkinson’s Disease (PD) using speech analysis offers promise for scalable screening. In this work, we propose a multi-channel spectro-temporal deep-learning approach for PD detection from sentence-level speech, a clinically relevant yet underexplored modality. We extract and fuse three complementary time–frequency representations—mel spectrogram, constant-Q transform (CQT), and gammatone spectrogram—into a three-channel input analogous to an RGB image. This fused representation is evaluated across CNNs (ResNet, DenseNet, and EfficientNet) and Vision Transformer using the PC-GITA dataset, under 10-fold subject-independent cross-validation for robust assessment. Results show that fusion consistently improves performance over single representations across architectures. EfficientNet-B2 achieves the highest accuracy (84.39% ± 5.19%) and F1-score (84.35% ± 5.52%), outperforming recent methods using handcrafted features or pretrained models (e.g., Wav2Vec2.0, HuBERT) on the same task and dataset. Performance varies with sentence type, with emotionally salient and prosodically emphasized utterances yielding higher AUC, suggesting that richer prosody enhances discriminability. Our findings indicate that multi-channel fusion enhances sensitivity to subtle speech impairments in PD by integrating complementary spectral information. Our approach implies that multi-channel fusion could enhance the detection of discriminative acoustic biomarkers, potentially offering a more robust and effective framework for speech-based PD screening, though further validation is needed before clinical application.
(This article belongs to the Special Issue Celebrating the 10th Anniversary of the Journal of Imaging)
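
The three-channel input can be assembled roughly as below: compute several time-frequency representations, resize them to a common image size, and stack them like RGB channels. librosa has no gammatone spectrogram, so a second mel spectrogram explicitly stands in for that channel in this sketch; the image size and normalization are also assumptions.

```python
import numpy as np
import librosa
import torch
import torch.nn.functional as F

def three_channel_input(y, sr=16000, size=(224, 224)):
    """Stack mel, CQT, and a stand-in third representation as an RGB-like
    tensor suitable for a CNN/ViT backbone."""
    mel = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128))
    cqt = librosa.amplitude_to_db(np.abs(librosa.cqt(y=y, sr=sr, n_bins=84)))
    gamma_standin = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64))
    channels = []
    for spec in (mel, cqt, gamma_standin):
        t = torch.tensor(spec, dtype=torch.float32)[None, None]        # (1, 1, F, T)
        t = F.interpolate(t, size=size, mode="bilinear", align_corners=False)
        t = (t - t.min()) / (t.max() - t.min() + 1e-9)                 # per-channel min-max norm
        channels.append(t[0, 0])
    return torch.stack(channels)                                       # (3, 224, 224)

y = np.random.randn(3 * 16000).astype(np.float32)     # placeholder 3 s utterance
print(three_channel_input(y).shape)                    # torch.Size([3, 224, 224])
```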

14 pages, 839 KB  
Article
MMFA: Masked Multi-Layer Feature Aggregation for Speaker Verification Using WavLM
by Uijong Lee and Seok-Pil Lee
Electronics 2025, 14(19), 3857; https://doi.org/10.3390/electronics14193857 - 29 Sep 2025
Viewed by 1516
Abstract
Speaker verification (SV) is a core technology for security and personalized services, and its importance has been growing with the spread of wearables such as smartwatches, earbuds, and AR/VR headsets, where privacy-preserving on-device operation under limited compute and power budgets is required. Recently, self-supervised learning (SSL) models such as WavLM and wav2vec 2.0 have been widely adopted as front ends that provide multi-layer speech representations without labeled data. Lower layers contain fine-grained acoustic information, whereas higher layers capture phonetic and contextual features. However, conventional SV systems typically use only the final layer or a single-step temporal attention over a simple weighted sum of layers, implicitly assuming that frame importance is shared across layers and thus failing to fully exploit the hierarchical diversity of SSL embeddings. We argue that frame relevance is layer dependent, as the frames most critical for speaker identity differ across layers. To address this, we propose Masked Multi-layer Feature Aggregation (MMFA), which first applies independent frame-wise attention within each layer, then performs learnable layer-wise weighting to suppress irrelevant frames such as silence and noise while effectively combining complementary information across layers. On VoxCeleb1, MMFA achieves consistent improvements over strong baselines in both EER and minDCF, and attention-map analysis confirms distinct selection patterns across layers, validating MMFA as a robust SV approach even in short-utterance and noisy conditions.
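
A minimal PyTorch sketch of the MMFA idea, independent frame-wise attention per SSL layer followed by learnable layer weighting, is given below. Dimensions follow WavLM-Base (12 transformer layers plus the CNN output, 768-d), and the exact attention form is an assumption rather than the paper's module.

```python
import torch
import torch.nn as nn

class MMFA(nn.Module):
    """Each SSL layer gets its own frame-wise attention (so frame importance
    can differ per layer); a learnable softmax weight then mixes the
    per-layer summaries into one speaker embedding."""
    def __init__(self, n_layers=13, dim=768):
        super().__init__()
        self.frame_attn = nn.ModuleList(nn.Linear(dim, 1) for _ in range(n_layers))
        self.layer_weights = nn.Parameter(torch.zeros(n_layers))

    def forward(self, hidden_states, pad_mask=None):        # list of (B, T, dim)
        pooled = []
        for h, attn in zip(hidden_states, self.frame_attn):
            scores = attn(h).squeeze(-1)                     # (B, T) frame relevance per layer
            if pad_mask is not None:
                scores = scores.masked_fill(~pad_mask, float("-inf"))  # mask padding/silence
            w = torch.softmax(scores, dim=-1).unsqueeze(-1)
            pooled.append((w * h).sum(dim=1))                # (B, dim) per-layer summary
        pooled = torch.stack(pooled, dim=1)                  # (B, L, dim)
        lw = torch.softmax(self.layer_weights, dim=0)
        return (lw[None, :, None] * pooled).sum(dim=1)       # (B, dim) speaker embedding

# hidden_states would come from WavLM with output_hidden_states=True
states = [torch.randn(2, 200, 768) for _ in range(13)]
mask = torch.ones(2, 200, dtype=torch.bool)                  # True = real frame
print(MMFA()(states, mask).shape)                            # torch.Size([2, 768])
```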

18 pages, 3632 KB  
Article
Multilingual Mobility: Audio-Based Language ID for Automotive Systems
by Joowon Oh and Jeaho Lee
Appl. Sci. 2025, 15(16), 9209; https://doi.org/10.3390/app15169209 - 21 Aug 2025
Viewed by 1113
Abstract
With the growing demand for natural and intelligent human–machine interaction in multilingual environments, automatic language identification (LID) has emerged as a crucial component in voice-enabled systems, particularly in the automotive domain. This study proposes an audio-based LID model that identifies the spoken language directly from voice input without requiring manual language selection. The model architecture leverages two types of feature extraction pipelines: a Variational Autoencoder (VAE) and a pre-trained Wav2Vec model, both used to obtain latent speech representations. These embeddings are then fed into a multi-layer perceptron (MLP)-based classifier to determine the speaker’s language among five target languages: Korean, Japanese, Chinese, Spanish, and French. The model is trained and evaluated using a dataset preprocessed into Mel-Frequency Cepstral Coefficients (MFCCs) and raw waveform inputs. Experimental results demonstrate the effectiveness of the proposed approach in achieving accurate and real-time language detection, with potential applications in in-vehicle systems, speech translation platforms, and multilingual voice assistants. By eliminating the need for predefined language settings, this work contributes to more seamless and user-friendly multilingual voice interaction systems.
(This article belongs to the Section Computing and Artificial Intelligence)
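
A bare-bones version of the Wav2Vec branch, mean-pooling pre-trained wav2vec features into an utterance embedding and classifying with a small MLP, might look like the sketch below. The checkpoint name and head sizes are assumptions, and the classifier head is untrained here, so the prediction is arbitrary until it is fit on labelled utterances from the five target languages.

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

LANGS = ["ko", "ja", "zh", "es", "fr"]

# Pre-trained wav2vec front end producing frame features, mean-pooled into an
# utterance embedding, followed by a small MLP language classifier.
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
classifier = torch.nn.Sequential(
    torch.nn.Linear(768, 128), torch.nn.ReLU(), torch.nn.Linear(128, len(LANGS))
)

def predict_language(waveform_16k):
    inputs = extractor(waveform_16k, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        frames = encoder(**inputs).last_hidden_state        # (1, T, 768)
    logits = classifier(frames.mean(dim=1))                 # pooled utterance embedding
    return LANGS[int(logits.argmax(dim=-1))]

print(predict_language(torch.randn(16000 * 2).numpy()))     # placeholder 2 s input
```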
