Search Results (173)

Search Parameters:
Keywords = noisy speech

34 pages, 1876 KiB  
Article
The Interaction of Target and Masker Speech in Competing Speech Perception
by Sheyenne Fishero, Joan A. Sereno and Allard Jongman
Brain Sci. 2025, 15(8), 834; https://doi.org/10.3390/brainsci15080834 - 4 Aug 2025
Abstract
Background/Objectives: Speech perception typically takes place against a background of other speech or noise. The present study investigates the effectiveness of segregating speech streams within a competing speech signal, examining whether cues such as pitch, which typically denote a difference in talker, behave in the same way as cues such as speaking rate, which typically do not denote the presence of a new talker. Methods: Native English speakers listened to English target speech within English two-talker babble of a similar or different pitch and/or a similar or different speaking rate to identify whether mismatched properties between target speech and masker babble improve speech segregation. Additionally, Dutch and French masker babble was tested to identify whether an unknown language masker improves speech segregation capacity and whether the rhythm patterns of the unknown language modulate the improvement. Results: Results indicated that a difference in pitch or speaking rate between target and masker improved speech segregation, but when both pitch and speaking rate differed, only a difference in pitch improved speech segregation. Results also indicated improved speech segregation for an unknown language masker, with little to no role of rhythm pattern of the unknown language. Conclusions: This study increases the understanding of speech perception in a noisy ecologically valid context and suggests that there is a link between a cue’s potential to denote a new speaker and its ability to aid in speech segregation during competing speech perception. Full article
(This article belongs to the Special Issue Language Perception and Processing)
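As a rough illustration of the kind of masker manipulation described in this abstract (not the authors' actual stimulus pipeline), the sketch below shifts a babble masker's pitch and changes its rate with librosa; the file names, sampling rate, and parameter values are assumptions.

```python
# Hypothetical masker manipulation: not the authors' stimulus pipeline.
import librosa
import soundfile as sf

babble, sr = librosa.load("two_talker_babble.wav", sr=16000)  # assumed file name

# Different-pitch condition: raise the masker's pitch by 4 semitones.
babble_high = librosa.effects.pitch_shift(babble, sr=sr, n_steps=4)

# Different-rate condition: speed the masker up by 20%.
babble_fast = librosa.effects.time_stretch(babble, rate=1.2)

sf.write("babble_high_pitch.wav", babble_high, sr)
sf.write("babble_fast_rate.wav", babble_fast, sr)
```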

12 pages, 1196 KiB  
Article
DNN-Based Noise Reduction Significantly Improves Bimodal Benefit in Background Noise for Cochlear Implant Users
by Courtney Kolberg, Sarah O. Holbert, Jamie M. Bogle and Aniket A. Saoji
J. Clin. Med. 2025, 14(15), 5302; https://doi.org/10.3390/jcm14155302 - 27 Jul 2025
Abstract
Background/Objectives: Traditional hearing aid noise reduction algorithms offer no additional benefit in noisy situations for bimodal cochlear implant (CI) users with a CI in one ear and a hearing aid (HA) in the other. Recent breakthroughs in deep neural network (DNN)-based noise reduction have improved speech understanding for hearing aid users in noisy environments. These advancements could also boost speech perception in noise for bimodal CI users. This study investigated the effectiveness of DNN-based noise reduction in the HAs used by bimodal CI patients. Methods: Eleven bimodal CI patients, aged 71–89 years, were fitted with a Phonak Audéo Sphere Infinio 90 HA in their non-implanted ear and were provided with a Calm Situation program and a Spheric Speech in Loud Noise program that uses DNN-based noise reduction. Sentence recognition scores were measured using AzBio sentences in quiet and in noise with the CI alone, the hearing aid alone, and bimodally with both the Calm Situation and DNN HA programs. Results: The DNN program in the hearing aid significantly improved bimodal performance in noise, with sentence recognition scores reaching 79% compared to 60% with Calm Situation (a 19% average benefit, p < 0.001). When compared to the CI-alone condition in multi-talker babble, the DNN HA program offered a 40% bimodal benefit, significantly higher than the 21% benefit seen with the Calm Situation program. Conclusions: DNN-based noise reduction in the HA significantly improves speech understanding in noise for bimodal CI users and is a promising option for addressing patients’ common complaint of poor speech understanding in noise. Full article
(This article belongs to the Section Otolaryngology)
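A minimal sketch of the kind of within-subject comparison reported above, using made-up per-listener scores and a paired t-test from SciPy; the numbers are placeholders, not the study's data.

```python
# Hypothetical per-listener sentence scores (%) in noise for two bimodal HA programs.
import numpy as np
from scipy import stats

calm = np.array([55, 62, 48, 70, 58, 66, 51, 64, 60, 57, 69])  # assumed data
dnn  = np.array([78, 81, 70, 88, 74, 85, 72, 83, 80, 76, 86])  # assumed data

benefit = dnn - calm                   # DNN benefit per listener
t, p = stats.ttest_rel(dnn, calm)      # paired comparison across listeners
print(f"mean benefit = {benefit.mean():.1f} pts, t = {t:.2f}, p = {p:.4f}")
```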

18 pages, 697 KiB  
Review
Lip-Reading: Advances and Unresolved Questions in a Key Communication Skill
by Martina Battista, Francesca Collesei, Eva Orzan, Marta Fantoni and Davide Bottari
Audiol. Res. 2025, 15(4), 89; https://doi.org/10.3390/audiolres15040089 - 21 Jul 2025
Abstract
Lip-reading, i.e., the ability to recognize speech using only visual cues, plays a fundamental role in audio-visual speech processing, intelligibility, and comprehension. This capacity is integral to language development and functioning; it emerges early in development and evolves slowly thereafter. By linking psycholinguistics, psychophysics, and neurophysiology, the present narrative review explores the development and significance of lip-reading across different stages of life, highlighting its role in human communication in both typical and atypical development, e.g., in the presence of hearing or language impairments. We examine how lip-reading becomes crucial when communication occurs in noisy environments and, conversely, the impact that visual barriers can have on speech perception. Finally, this review highlights individual differences and the role of cultural and social contexts in better understanding the visual counterpart of speech. Full article

17 pages, 1467 KiB  
Article
Confidence-Based Knowledge Distillation to Reduce Training Costs and Carbon Footprint for Low-Resource Neural Machine Translation
by Maria Zafar, Patrick J. Wall, Souhail Bakkali and Rejwanul Haque
Appl. Sci. 2025, 15(14), 8091; https://doi.org/10.3390/app15148091 - 21 Jul 2025
Abstract
The transformer-based deep learning approach represents the current state-of-the-art in machine translation (MT) research. Large-scale pretrained transformer models produce state-of-the-art performance across a wide range of MT tasks for many languages. However, such deep neural network (NN) models are often data-, compute-, space-, power-, and energy-hungry, typically requiring powerful GPUs or large-scale clusters to train and deploy. As a result, they are often regarded as “non-green” and “unsustainable” technologies. Distilling knowledge from large deep NN models (teachers) to smaller NN models (students) is a widely adopted sustainable development approach in MT as well as in broader areas of natural language processing (NLP), including speech and image processing. However, distilling large pretrained models presents several challenges. First, training time and cost increase with the volume of data used to train a student model. This can pose a challenge for translation service providers (TSPs), which may have limited budgets for training. Moreover, CO2 emissions generated during model training are typically proportional to the amount of data used, contributing to environmental harm. Second, when querying teacher models, including encoder–decoder models such as NLLB, the translations they produce for low-resource languages may be noisy or of low quality. This can undermine sequence-level knowledge distillation (SKD), as student models may inherit and reinforce errors from inaccurate labels. In this study, the teacher model’s confidence estimation is employed to filter from the distilled training data those instances for which the teacher exhibits low confidence. We tested our methods on a low-resource Urdu-to-English translation task operating within a constrained training budget in an industrial translation setting. Our findings show that confidence estimation-based filtering can significantly reduce the cost and CO2 emissions associated with training a student model without a drop in translation quality, making it a practical and environmentally sustainable solution for TSPs. Full article
(This article belongs to the Special Issue Deep Learning and Its Applications in Natural Language Processing)
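The core idea, filtering distilled sentence pairs by the teacher's confidence, can be sketched as below. This is a hedged illustration that assumes each pair already carries the teacher's mean token log-probability (how that score is obtained depends on the teacher toolkit); the threshold and data are placeholders, not the paper's settings.

```python
# Sequence-level confidence filtering for distillation data (illustrative only).
import math
from typing import List, Tuple

Pair = Tuple[str, str, float]  # (source sentence, teacher translation, mean log-prob)

def filter_by_confidence(pairs: List[Pair], threshold: float = math.log(0.5)) -> List[Pair]:
    """Keep only pairs whose teacher confidence exceeds the threshold."""
    return [p for p in pairs if p[2] >= threshold]

# Example: the low-confidence second pair is dropped from the student's training set.
distilled = [
    ("urdu sentence 1", "english translation 1", math.log(0.82)),
    ("urdu sentence 2", "english translation 2", math.log(0.21)),
]
student_data = filter_by_confidence(distilled)
```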

21 pages, 1118 KiB  
Review
Integrating Large Language Models into Robotic Autonomy: A Review of Motion, Voice, and Training Pipelines
by Yutong Liu, Qingquan Sun and Dhruvi Rajeshkumar Kapadia
AI 2025, 6(7), 158; https://doi.org/10.3390/ai6070158 - 15 Jul 2025
Abstract
This survey provides a comprehensive review of the integration of large language models (LLMs) into autonomous robotic systems, organized around four key pillars: locomotion, navigation, manipulation, and voice-based interaction. We examine how LLMs enhance robotic autonomy by translating high-level natural language commands into low-level control signals, supporting semantic planning and enabling adaptive execution. Systems like SayTap improve gait stability through LLM-generated contact patterns, while TrustNavGPT achieves a 5.7% word error rate (WER) under noisy voice-guided conditions by modeling user uncertainty. Frameworks such as MapGPT, LLM-Planner, and 3D-LOTUS++ integrate multi-modal data—including vision, speech, and proprioception—for robust planning and real-time recovery. We also highlight the use of physics-informed neural networks (PINNs) to model object deformation and support precision in contact-rich manipulation tasks. To bridge the gap between simulation and real-world deployment, we synthesize best practices from benchmark datasets (e.g., RH20T, Open X-Embodiment) and training pipelines designed for one-shot imitation learning and cross-embodiment generalization. Additionally, we analyze deployment trade-offs across cloud, edge, and hybrid architectures, emphasizing latency, scalability, and privacy. The survey concludes with a multi-dimensional taxonomy and cross-domain synthesis, offering design insights and future directions for building intelligent, human-aligned robotic systems powered by LLMs. Full article

25 pages, 2093 KiB  
Article
Deep Learning-Based Speech Enhancement for Robust Sound Classification in Security Systems
by Samuel Yaw Mensah, Tao Zhang, Nahid AI Mahmud and Yanzhang Geng
Electronics 2025, 14(13), 2643; https://doi.org/10.3390/electronics14132643 - 30 Jun 2025
Abstract
Deep learning has emerged as a powerful technique for speech enhancement, particularly in security systems where audio signals are often degraded by non-stationary noise. Traditional signal processing methods struggle in such conditions, making it difficult to detect critical sounds like gunshots, alarms, and unauthorized speech. This study investigates a hybrid deep learning framework that combines Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Generative Adversarial Networks (GANs) to enhance speech quality and improve sound classification accuracy in noisy security environments. The proposed model is trained and validated using real-world datasets containing diverse noise distortions, including VoxCeleb for benchmarking speech enhancement and UrbanSound8K and ESC-50 for sound classification. Performance is evaluated using industry-standard metrics such as Perceptual Evaluation of Speech Quality (PESQ), Short-Time Objective Intelligibility (STOI), and Signal-to-Noise Ratio (SNR). The architecture includes multi-layered neural networks, residual connections, and dropout regularization to ensure robustness and generalizability. Additionally, the paper addresses key challenges in deploying deep learning models for security applications, such as computational complexity, latency, and vulnerability to adversarial attacks. Experimental results demonstrate that the proposed DNN + GAN-based approach significantly improves speech intelligibility and classification performance in high-interference scenarios, offering a scalable solution for enhancing the reliability of audio-based security systems. Full article
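For readers unfamiliar with the metrics named above, the sketch below computes PESQ, STOI, and SNR for a clean/enhanced signal pair, assuming the third-party pesq and pystoi packages and 16 kHz mono signals; it is not the authors' evaluation code.

```python
# Illustrative metric computation; not the paper's evaluation pipeline.
import numpy as np
from pesq import pesq      # pip install pesq
from pystoi import stoi    # pip install pystoi

def snr_db(clean: np.ndarray, enhanced: np.ndarray) -> float:
    noise = clean - enhanced
    return 10.0 * np.log10(np.sum(clean**2) / (np.sum(noise**2) + 1e-12))

def evaluate(clean: np.ndarray, enhanced: np.ndarray, fs: int = 16000) -> dict:
    return {
        "PESQ": pesq(fs, clean, enhanced, "wb"),           # wide-band PESQ
        "STOI": stoi(clean, enhanced, fs, extended=False),
        "SNR_dB": snr_db(clean, enhanced),
    }
```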

23 pages, 2410 KiB  
Article
A Semi-Automatic Framework for Practical Transcription of Foreign Person Names in Lithuanian
by Gailius Raškinis, Darius Amilevičius, Danguolė Kalinauskaitė, Artūras Mickus, Daiva Vitkutė-Adžgauskienė, Antanas Čenys and Tomas Krilavičius
Mathematics 2025, 13(13), 2107; https://doi.org/10.3390/math13132107 - 27 Jun 2025
Abstract
We present a semi-automatic framework for transcribing foreign personal names into Lithuanian, aimed at reducing pronunciation errors in text-to-speech systems. Focusing on noisy, web-crawled data, the pipeline combines rule-based filtering, morphological normalization, and manual stress annotation—the only non-automated step—to generate training data for character-level transcription models. We evaluate three approaches: a weighted finite-state transducer (WFST), an LSTM-based sequence-to-sequence model with attention, and a Transformer model optimized for character transduction. Results show that word-pair models outperform single-word models, with the Transformer achieving the best performance (19.04% WER) on a cleaned and augmented dataset. Data augmentation via word order reversal proved effective, while combining single-word and word-pair training offered limited gains. Despite filtering, residual noise persists, with 54% of outputs showing some error, though only 11% were perceptually significant. Full article
(This article belongs to the Section E1: Mathematics and Computer Science)
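Word error rate (WER), the headline metric above, is edit distance over word sequences normalized by reference length; a generic sketch (not the paper's scorer) follows.

```python
# Generic WER via Levenshtein alignment over words.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("jonas petras smith", "jonas petras smyth"))  # one substitution -> 0.333...
```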

27 pages, 4737 KiB  
Article
Context-Aware Multimodal Fusion with Sensor-Augmented Cross-Modal Learning: The BLAF Architecture for Robust Chinese Homophone Disambiguation in Dynamic Environments
by Yu Sun, Yihang Qin, Wenhao Chen, Xuan Li and Chunlian Li
Appl. Sci. 2025, 15(13), 7068; https://doi.org/10.3390/app15137068 - 23 Jun 2025
Abstract
Chinese, a tonal language with inherent homophonic ambiguity, poses significant challenges for semantic disambiguation in natural language processing (NLP), hindering applications like speech recognition, dialog systems, and assistive technologies. Traditional static disambiguation methods suffer from poor adaptability in dynamic environments and low-frequency scenarios, limiting their real-world utility. To address these limitations, we propose BLAF, a novel MacBERT-BiLSTM hybrid architecture that synergizes global semantic understanding with local sequential dependencies through dynamic multimodal feature fusion. This framework incorporates innovative mechanisms for the principled weighting of heterogeneous features, effective alignment of representations, and sensor-augmented cross-modal learning to enhance robustness, particularly in noisy environments. Employing a staged optimization strategy, BLAF achieves state-of-the-art performance on the SIGHAN 2015 benchmark (with data fine-tuning and supplementation): 93.37% accuracy and a 93.25% F1 score, surpassing pure BERT by 15.74% in accuracy. Ablation studies confirm the critical contributions of the integrated components. Furthermore, the sensor-augmented module significantly improves robustness under noise, raising speech SNR to 18.6 dB at 75 dB ambient noise and reducing word error rates by 12.7%. By bridging gaps among tonal phonetics, contextual semantics, and computational efficiency, BLAF establishes a scalable paradigm for robust Chinese homophone disambiguation in industrial NLP applications. This work advances cognitive intelligence in Chinese NLP and provides a blueprint for adaptive disambiguation in resource-constrained and dynamic scenarios. Full article
(This article belongs to the Special Issue Natural Language Processing (NLP) and Applications—2nd Edition)
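As a schematic of the encoder-plus-BiLSTM pairing described above, not the BLAF architecture itself, the sketch below stacks a BiLSTM head on a pretrained Chinese encoder; the checkpoint name, layer sizes, and label count are assumptions.

```python
# Schematic encoder + BiLSTM classifier; not BLAF's actual configuration.
import torch
import torch.nn as nn
from transformers import AutoModel

class EncoderBiLSTM(nn.Module):
    def __init__(self, checkpoint="hfl/chinese-macbert-base", hidden=256, num_labels=2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(checkpoint)   # global semantics
        self.bilstm = nn.LSTM(self.encoder.config.hidden_size, hidden,
                              batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_labels)

    def forward(self, input_ids, attention_mask):
        tokens = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        seq, _ = self.bilstm(tokens)        # local sequential dependencies
        return self.classifier(seq[:, 0])   # classify from the [CLS] position
```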

18 pages, 1577 KiB  
Article
CAs-Net: A Channel-Aware Speech Network for Uyghur Speech Recognition
by Jiang Zhang, Miaomiao Xu, Lianghui Xu and Yajing Ma
Sensors 2025, 25(12), 3783; https://doi.org/10.3390/s25123783 - 17 Jun 2025
Abstract
This paper proposes a Channel-Aware Speech Network (CAs-Net) for low-resource speech recognition tasks, aiming to improve recognition performance for languages such as Uyghur under complex noisy conditions. The proposed model consists of two key components: (1) the Channel Rotation Module (CIM), which reconstructs each frame’s channel vector into a spatial structure and applies a rotation operation to explicitly model the local structural relationships within the channel dimension, thereby enhancing the encoder’s contextual modeling capability; and (2) the Multi-Scale Depthwise Convolution Module (MSDCM), integrated within the Transformer framework, which leverages multi-branch depthwise separable convolutions and a lightweight self-attention mechanism to jointly capture multi-scale temporal patterns, thus improving the model’s perception of compact articulation and complex rhythmic structures. Experiments conducted on a real Uyghur speech recognition dataset demonstrate that CAs-Net achieves the best performance across multiple subsets, with an average Word Error Rate (WER) of 5.72%, significantly outperforming existing approaches. These results validate the robustness and effectiveness of the proposed model under low-resource and noisy conditions. Full article
(This article belongs to the Section Intelligent Sensors)
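The multi-scale depthwise-convolution idea can be illustrated roughly as follows; kernel sizes and channel counts are assumptions, and this is not the MSDCM as implemented in CAs-Net.

```python
# Rough multi-branch depthwise-separable convolution over a frame sequence.
import torch
import torch.nn as nn

class MultiScaleDWConv(nn.Module):
    def __init__(self, channels=256, kernel_sizes=(3, 7, 15)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv1d(channels, channels, k, padding=k // 2, groups=channels),  # depthwise
                nn.Conv1d(channels, channels, 1),                                    # pointwise
                nn.GELU(),
            )
            for k in kernel_sizes
        )
        self.proj = nn.Conv1d(channels * len(kernel_sizes), channels, 1)

    def forward(self, x):              # x: (batch, channels, time)
        return self.proj(torch.cat([b(x) for b in self.branches], dim=1))

feats = torch.randn(4, 256, 100)           # 100 frames of 256-dim features
print(MultiScaleDWConv()(feats).shape)     # torch.Size([4, 256, 100])
```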

17 pages, 1071 KiB  
Article
Empirical Analysis of Learning Improvements in Personal Voice Activity Detection Frameworks
by Yu-Tseng Yeh, Chia-Chi Chang and Jeih-Weih Hung
Electronics 2025, 14(12), 2372; https://doi.org/10.3390/electronics14122372 - 10 Jun 2025
Abstract
Personal Voice Activity Detection (PVAD) has emerged as a critical technology for enabling speaker-specific detection in multi-speaker environments, surpassing the limitations of conventional Voice Activity Detection (VAD) systems that merely distinguish speech from non-speech. PVAD systems are essential for applications such as personalized voice assistants and robust speech recognition, where accurately identifying a target speaker’s voice amidst background speech and noise is crucial for both user experience and computational efficiency. Despite significant progress, PVAD frameworks still face challenges related to temporal modeling, integration of speaker information, class imbalance, and deployment on resource-constrained devices. In this study, we present a systematic enhancement of the PVAD framework through four key innovations: (1) a Bi-GRU (Bidirectional Gated Recurrent Unit) layer for improved temporal modeling of speech dynamics, (2) a cross-attention mechanism for context-aware speaker embedding integration, (3) a hybrid CE-AUROC (Cross-Entropy and Area Under Receiver Operating Characteristic) loss function to address class imbalance, and (4) Cosine Annealing Learning Rate (CALR) for optimized training convergence. Evaluated on LibriSpeech datasets under varied acoustic conditions, the proposed modifications demonstrate significant performance gains over the baseline PVAD framework, achieving 87.59% accuracy (vs. 86.18%) and 0.9481 mean Average Precision (vs. 0.9378) while maintaining real-time processing capabilities. These advancements address critical challenges in PVAD deployment, including robustness to noisy environments, with the hybrid loss function reducing false negatives by 12% in imbalanced scenarios. The work provides practical insights for implementing personalized voice interfaces on resource-constrained devices. Future extensions will explore quantized inference and multi-modal sensor fusion to further bridge the gap between laboratory performance and real-world deployment requirements. Full article
(This article belongs to the Special Issue Emerging Trends in Generative-AI Based Audio Processing)
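Items (3) and (4) above can be sketched as a cross-entropy term blended with a simple pairwise AUROC surrogate plus PyTorch's cosine-annealing scheduler; the weighting, margin, surrogate form, and stand-in model are assumptions, not the paper's configuration.

```python
# Illustrative hybrid CE + pairwise AUROC-surrogate loss with cosine-annealed LR.
import torch
import torch.nn as nn

def hybrid_ce_auroc(logits, labels, alpha=0.7, margin=1.0):
    ce = nn.functional.cross_entropy(logits, labels)
    scores = logits[:, 1] - logits[:, 0]               # target-speaker score per frame
    pos, neg = scores[labels == 1], scores[labels == 0]
    if len(pos) and len(neg):
        # Hinge over positive/negative score pairs approximates 1 - AUROC.
        auc_loss = torch.clamp(margin - (pos[:, None] - neg[None, :]), min=0).mean()
    else:
        auc_loss = torch.zeros((), device=logits.device)
    return alpha * ce + (1 - alpha) * auc_loss

model = nn.Linear(40, 2)                               # stand-in for the PVAD network
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=50, eta_min=1e-5)
# Training loop: opt.step() per batch, then sched.step() once per epoch.
```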

17 pages, 439 KiB  
Article
MultiAVSR: Robust Speech Recognition via Supervised Multi-Task Audio–Visual Learning
by Shad Torrie, Kimi Wright and Dah-Jye Lee
Electronics 2025, 14(12), 2310; https://doi.org/10.3390/electronics14122310 - 6 Jun 2025
Abstract
Speech recognition approaches typically fall into three categories: audio, visual, and audio–visual. Visual speech recognition, or lip reading, is the most difficult because visual cues are ambiguous and data is scarce. To address these challenges, we present a new multi-task audio–visual speech recognition framework, MultiAVSR, for training a model on all three types of speech recognition simultaneously, primarily to improve visual speech recognition. Unlike prior works, which use separate models or complex semi-supervision, our framework employs a supervised multi-task hybrid Connectionist Temporal Classification/Attention loss, cutting training exaFLOPs to just 18% of those required by semi-supervised multi-task models. MultiAVSR achieves a state-of-the-art visual speech recognition word error rate of 21.0% on the LRS3-TED dataset. Furthermore, it exhibits robust generalization, achieving a remarkable 44.7% word error rate on the WildVSR dataset. Our framework also demonstrates reduced dependency on external language models, which is critical for real-time visual speech recognition. For the audio and audio–visual tasks, our framework improves robustness in various noisy environments, with average relative word error rate improvements of 16% and 31%, respectively. These improvements across the three tasks illustrate the robust results our supervised multi-task speech recognition framework enables. Full article
(This article belongs to the Special Issue Advances in Information, Intelligence, Systems and Applications)
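A hedged sketch of the hybrid CTC/attention objective named above: a weighted sum of a CTC loss over encoder outputs and a cross-entropy loss over decoder outputs. The interpolation weight and tensor shapes are assumptions, not MultiAVSR's values.

```python
# Illustrative hybrid CTC/attention loss.
import torch
import torch.nn as nn

ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)
att_loss = nn.CrossEntropyLoss(ignore_index=-100)

def hybrid_loss(enc_log_probs, enc_lens, dec_logits, targets, target_lens, lam=0.3):
    # enc_log_probs: (T, batch, vocab) log-softmax encoder outputs for CTC
    # dec_logits:    (batch, L, vocab) attention-decoder outputs
    # targets:       (batch, L) token ids, padded with -100 for the CE term
    ctc_targets = targets.clamp(min=0)   # CTC ignores entries beyond target_lens
    l_ctc = ctc_loss(enc_log_probs, ctc_targets, enc_lens, target_lens)
    l_att = att_loss(dec_logits.transpose(1, 2), targets)
    return lam * l_ctc + (1 - lam) * l_att
```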

18 pages, 1264 KiB  
Article
GazeMap: Dual-Pathway CNN Approach for Diagnosing Alzheimer’s Disease from Gaze and Head Movements
by Hyuntaek Jung, Shinwoo Ham, Hyunyoung Kil, Jung Eun Shin and Eun Yi Kim
Mathematics 2025, 13(11), 1867; https://doi.org/10.3390/math13111867 - 3 Jun 2025
Abstract
Alzheimer’s disease (AD) is a progressive neurodegenerative disorder that impairs cognitive function, making early detection crucial for timely intervention. This study proposes a novel AD detection framework integrating gaze and head movement analysis via a dual-pathway convolutional neural network (CNN). Unlike conventional methods relying on linguistic, speech, or neuroimaging data, our approach leverages non-invasive video-based tracking, offering a more accessible and cost-effective solution to early AD detection. To enhance feature representation, we introduce GazeMap, a novel transformation converting 1D gaze and head pose time-series data into 2D spatial representations, effectively capturing both short- and long-term temporal interactions while mitigating missing or noisy data. The dual-pathway CNN processes gaze and head movement features separately before fusing them to improve diagnostic accuracy. We validated our framework using a clinical dataset (112 participants) from Konkuk University Hospital and an out-of-distribution dataset from senior centers and nursing homes. Our method achieved 91.09% accuracy on in-distribution data collected under controlled clinical settings, and 83.33% on out-of-distribution data from real-world scenarios, outperforming several time-series baseline models. Model performance was validated through cross-validation on in-distribution data and tested on an independent out-of-distribution dataset. Additionally, our gaze-saliency maps provide interpretable visualizations, revealing distinct AD-related gaze patterns. Full article
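To make the 1D-to-2D idea concrete, the sketch below turns a gaze trace into a pairwise-difference map a CNN could consume; this is a hypothetical illustration of the general idea, not the paper's GazeMap transform.

```python
# Hypothetical 1D time series -> 2D map; not the paper's GazeMap.
import numpy as np

def series_to_map(series: np.ndarray, size: int = 64) -> np.ndarray:
    # Resample to a fixed length so every recording yields the same map size.
    idx = np.linspace(0, len(series) - 1, size)
    resampled = np.interp(idx, np.arange(len(series)), series)
    # Pairwise differences expose both short- and long-range temporal relations.
    return np.abs(resampled[:, None] - resampled[None, :])   # (size, size)

gaze_x = np.cumsum(np.random.randn(500))   # assumed 1D gaze trace
print(series_to_map(gaze_x).shape)         # (64, 64)
```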

15 pages, 8737 KiB  
Article
A Piezoelectric Micromachined Ultrasonic Transducer-Based Bone Conduction Microphone System for Enhancing Speech Recognition Accuracy
by Chongbin Liu, Xiangyang Wang, Jianbiao Xiao, Jun Zhou and Guoqiang Wu
Micromachines 2025, 16(6), 613; https://doi.org/10.3390/mi16060613 - 23 May 2025
Abstract
Speech recognition in noisy environments has long posed a challenge. Air conduction microphones (ACMs), the devices typically used, are susceptible to environmental noise. In this work, a customized bone conduction microphone (BCM) system based on a piezoelectric micromachined ultrasonic transducer is developed to capture speech through real-time bone conduction (BC), while a commercial ACM is integrated for simultaneous capture of speech through air conduction (AC). The system enables simpler and more robust BC speech capture, achieving a signal-to-noise amplitude ratio over five times greater than that of AC speech capture in an environment with a noise level of 68 dB. Instead of using only AC-captured speech, both BC- and AC-captured speech are input into a speech enhancement (SE) module. The noise-insensitive BC-captured speech serves as a reference to adapt the SE backbone for AC-captured speech. The two types of speech are fused, and noise suppression is applied to generate enhanced speech. Compared with the original noisy speech, the enhanced speech achieves a character error rate reduction of over 20%, approaching the speech recognition accuracy of clean speech. The results indicate that this speech enhancement method, based on the fusion of BC- and AC-captured speech, efficiently integrates the features of both types of speech, thereby improving speech recognition accuracy in noisy environments. This work presents an innovative system designed to efficiently capture BC speech and enhance speech recognition in noisy environments. Full article
(This article belongs to the Special Issue Advances in Piezoelectric Sensors)

12 pages, 1391 KiB  
Article
Speech Intelligibility in Virtual Avatars: Comparison Between Audio and Audio–Visual-Driven Facial Animation
by Federico Cioffi, Massimiliano Masullo, Aniello Pascale and Luigi Maffei
Acoustics 2025, 7(2), 30; https://doi.org/10.3390/acoustics7020030 - 23 May 2025
Abstract
Speech intelligibility (SI) is critical to effective communication across various settings, although it is often compromised by adverse acoustic conditions. In noisy environments, visual cues such as lip movements and facial expressions, when congruent with auditory information, can significantly enhance speech perception and reduce cognitive effort. With the ever-growing diffusion of virtual environments, communicating through virtual avatars is becoming increasingly prevalent, requiring a comprehensive understanding of these dynamics to ensure effective interactions. The present study used Unreal Engine’s MetaHuman technology to compare four methodologies for creating facial animation: MetaHuman Animator (MHA), MetaHuman LiveLink (MHLL), Audio-Driven MetaHuman (ADMH), and Synthetized Audio-Driven MetaHuman (SADMH). Thirty-six word pairs from the Diagnostic Rhyme Test (DRT) were used as input stimuli to create the animations and to compare them in terms of intelligibility. Moreover, to simulate challenging background noise, the animations were mixed with babble noise at a signal-to-noise ratio of −13 dB(A). Participants assessed a total of 144 facial animations. Results showed the ADMH condition to be the most intelligible of the four methodologies, probably because its generated facial animations were clearer and more consistent and were free of distractions such as micro-expressions and the natural variability of human articulation. Full article
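Mixing stimuli with babble at a fixed SNR, as in the −13 dB condition above, can be sketched as follows; the signals are placeholders and this is not the study's stimulus-generation code.

```python
# Scale babble noise so the mixture hits a target SNR relative to the speech.
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    noise = noise[: len(speech)]                       # trim noise to speech length
    p_speech = np.mean(speech**2)
    p_noise = np.mean(noise**2) + 1e-12
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return speech + gain * noise                       # speech kept at original level

speech = np.random.randn(16000)                        # placeholder 1 s signal
babble = np.random.randn(32000)                        # placeholder babble
mixture = mix_at_snr(speech, babble, snr_db=-13.0)
```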

18 pages, 2345 KiB  
Article
SGM-EMA: Speech Enhancement Method Score-Based Diffusion Model and EMA Mechanism
by Yuezhou Wu, Zhiri Li and Hua Huang
Appl. Sci. 2025, 15(10), 5243; https://doi.org/10.3390/app15105243 - 8 May 2025
Abstract
Score-based diffusion models have made significant progress in computer vision, surpassing generative models such as variational autoencoders, and have been extended to applications such as speech enhancement and recognition. This paper proposes a U-Net architecture using a score-based diffusion model and an efficient multi-scale attention (EMA) mechanism for the speech enhancement task. The model leverages the symmetric structure of U-Net to extract speech features and uses the EMA mechanism to capture contextual information and local details across different scales, improving speech quality in noisy environments. We evaluate the method on the VoiceBank-DEMAND (VB-DMD) dataset and the DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus–TUT Sound Events 2017 (TIMIT-TUT) dataset. The experimental results show that the proposed model performed well in terms of perceptual evaluation of speech quality (PESQ), extended short-time objective intelligibility (ESTOI), and scale-invariant signal-to-distortion ratio (SI-SDR). In particular, when processing out-of-dataset noisy speech, the proposed method achieved excellent enhancement results compared to other methods, demonstrating strong generalization capability. We also conducted an ablation study on the SDE solver and the EMA mechanism; the results show that the reverse diffusion method outperformed the Euler–Maruyama method and that the EMA strategy improves model performance, demonstrating the effectiveness of these two techniques in our system. Nevertheless, since the model is specifically designed for Gaussian noise, its performance under non-Gaussian or complex noise conditions may be limited. Full article
(This article belongs to the Special Issue Application of Deep Learning in Speech Enhancement Technology)
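Scale-invariant SDR (SI-SDR), one of the metrics listed above, in its standard zero-mean form; a generic reference sketch, not the authors' implementation.

```python
# Standard SI-SDR: project the estimate onto the reference, compare target vs. residual.
import numpy as np

def si_sdr(reference: np.ndarray, estimate: np.ndarray) -> float:
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + 1e-12)
    target = alpha * reference          # scaled projection of the estimate
    noise = estimate - target           # everything not explained by the reference
    return 10.0 * np.log10(np.sum(target**2) / (np.sum(noise**2) + 1e-12))
```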
