
Search Results (660)

Search Parameters:
Keywords = speech acoustics

37 pages, 12169 KB  
Article
Perceptual Evaluation of Acoustic Level of Detail in Virtual Acoustic Environments
by Stefan Fichna, Steven van de Par, Bernhard U. Seeber and Stephan D. Ewert
Acoustics 2026, 8(1), 9; https://doi.org/10.3390/acoustics8010009 - 30 Jan 2026
Abstract
Virtual acoustics enables the creation and simulation of realistic and ecologically valid indoor environments vital for hearing research and audiology. For real-time applications, room acoustics simulation requires simplifications. However, the acoustic level of detail (ALOD) necessary to capture all perceptually relevant effects remains unclear. This study examines the impact of varying ALOD in simulations of three real environments: a living room with a coupled kitchen, a pub, and an underground station. ALOD was varied by generating different numbers of image sources for early reflections, or by excluding geometrical room details specific for each environment. Simulations were perceptually evaluated using headphones in comparison to measured, real binaural room impulse responses, or by using loudspeakers. The perceived overall difference, spatial audio quality differences, plausibility, speech intelligibility, and externalization were assessed. A transient pulse, an electric bass, and a speech token were used as stimuli. The results demonstrate that considerable reductions in acoustic level of detail are perceptually acceptable for communication-oriented scenarios. Speech intelligibility was robust across ALOD levels, whereas broadband transient stimuli revealed increased sensitivity to simplifications. High-ALOD simulations yielded plausibility and externalization ratings comparable to real-room recordings under both headphone and loudspeaker reproduction. Full article
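The number of image sources for early reflections is the knob that sets ALOD in this study. As a rough illustration only, the sketch below builds a first-order image-source impulse response for a shoebox room in NumPy; the geometry, broadband wall coefficient `beta`, and sampling rate are invented placeholders, not the simulation framework used in the paper.

```python
# Minimal first-order image-source sketch (NumPy only). Illustrates how the
# number of image sources controls the level of detail of early reflections.
import numpy as np

def first_order_impulse_response(src, rcv, room, fs=48000, c=343.0, beta=0.8, length=4096):
    """src/rcv positions and room dimensions are 3-vectors in metres;
    beta is a single broadband wall reflection coefficient (an assumption)."""
    src, rcv, room = map(np.asarray, (src, rcv, room))
    images = [(src, 1.0)]                        # direct sound
    for axis in range(3):                        # six first-order reflections
        for wall in (0.0, room[axis]):
            img = src.copy()
            img[axis] = 2.0 * wall - src[axis]   # mirror the source at the wall
            images.append((img, beta))
    h = np.zeros(length)
    for pos, gain in images:
        dist = np.linalg.norm(pos - rcv)
        n = int(round(dist / c * fs))            # propagation delay in samples
        if n < length:
            h[n] += gain / max(dist, 1e-3)       # 1/r spreading loss
    return h

h = first_order_impulse_response([2, 3, 1.5], [4, 1, 1.2], room=[6, 5, 2.8])
```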

25 pages, 2358 KB  
Article
Near-Merger and Contextual Sensitivity in the Perception of /n-l/ in Sichuan Mandarin
by Minghao Zheng, Allen Shamsi and Ratree Wayland
Brain Sci. 2026, 16(2), 155; https://doi.org/10.3390/brainsci16020155 - 29 Jan 2026
Abstract
Background/Objectives: Sichuan Mandarin is often described as exhibiting overlap or merger between word-initial /n/ and /l/, but perceptual sensitivity across phonetic contexts remains underexplored. This study examines whether perception of the /n-l/ contrast varies by vowel context and listener experience. Methods: Thirty-two Sichuan Mandarin listeners completed categorical identification and same–different AX discrimination tasks using seven-step /n/ → /l/ continua derived from native-speaker productions in /i/ and /a/ contexts. Sensitivity, response bias, accuracy, and response times were analyzed alongside individual differences. Acoustic properties of the stimuli were quantified using spectral and amplitude-based measures. Results: Listeners showed overall reduced sensitivity to the /n-l/ contrast, with substantially stronger perceptual differentiation in /i/ than in /a/ contexts. Bias patterns were comparable across contexts, indicating sensitivity-driven effects. Acoustic analyses showed more robust cue structure in the /i/ continuum. Age, education, and Standard Mandarin experience modulated response efficiency but did not eliminate the vowel asymmetry. Conclusions: Results support a context-dependent near-merger of /n/ and /l/, shaped by acoustic cue availability and experience-based cue exploitation. Full article
(This article belongs to the Special Issue Language Perception and Processing)
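For readers unfamiliar with the sensitivity and bias measures reported above, here is a minimal sketch of d′ and the criterion c using the standard yes/no approximation with a log-linear correction; the study's exact AX discrimination model is not reproduced, and the trial counts are invented.

```python
# Hedged sketch: sensitivity (d-prime) and response bias (criterion) from
# hit/false-alarm counts; illustrative only, not the paper's analysis.
from scipy.stats import norm

def dprime(hits, misses, false_alarms, correct_rejections):
    # Log-linear correction avoids infinite z-scores at rates of 0 or 1.
    hr = (hits + 0.5) / (hits + misses + 1.0)
    fa = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1.0)
    d = norm.ppf(hr) - norm.ppf(fa)
    c = -0.5 * (norm.ppf(hr) + norm.ppf(fa))   # response bias (criterion)
    return d, c

print(dprime(hits=42, misses=8, false_alarms=12, correct_rejections=38))
```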

19 pages, 795 KB  
Article
A Confidence-Gated Hybrid CNN Ensemble for Accurate Detection of Parkinson’s Disease Using Speech Analysis
by Salem Titouni, Nadhir Djeffal, Massinissa Belazzoug, Boualem Hammache, Idris Messaoudene and Abdallah Hedir
Electronics 2026, 15(3), 587; https://doi.org/10.3390/electronics15030587 - 29 Jan 2026
Abstract
Parkinson’s Disease (PD) is a progressive neurodegenerative disorder for which early and reliable diagnosis remains challenging. To address this challenge, the key innovation of this work is a confidence-gated fusion mechanism that dynamically weights classifier outputs based on per-sample prediction certainty, overcoming the limitations of static ensemble strategies. Building on this idea, we propose a Confidence-Gated Hybrid CNN Ensemble that integrates CNN-based acoustic feature extraction with heterogeneous classifiers, including XGBoost, Support Vector Machines, and Random Forest. By adaptively modulating the contribution of each classifier at the sample level, the proposed framework enhances robustness against data imbalance, inter-speaker variability, and feature complexity. The method is evaluated on two benchmark PD speech datasets, where it consistently outperforms conventional machine learning and ensemble approaches, achieving a best classification accuracy of up to 97.9% while maintaining computational efficiency compatible with real-time deployment. These results highlight the effectiveness and clinical potential of confidence-aware ensemble learning for non-invasive PD detection. Full article
(This article belongs to the Section Bioelectronics)
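A minimal sketch of the per-sample confidence-gating idea described above, using two scikit-learn classifiers on synthetic data. The gate here is simply each model's maximum predicted probability normalised across models; the paper's CNN feature extractor, XGBoost member, and exact gating function are not reproduced.

```python
# Confidence-gated soft voting over probabilistic classifiers (illustrative).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=40, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

models = [RandomForestClassifier(random_state=0).fit(Xtr, ytr),
          SVC(probability=True, random_state=0).fit(Xtr, ytr)]

probs = np.stack([m.predict_proba(Xte) for m in models])   # (models, samples, classes)
conf = probs.max(axis=2)                                   # per-sample certainty
weights = conf / conf.sum(axis=0, keepdims=True)           # gate: normalise over models
fused = (weights[..., None] * probs).sum(axis=0)           # confidence-weighted soft vote
print(f"gated ensemble accuracy: {(fused.argmax(axis=1) == yte).mean():.3f}")
```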

19 pages, 1724 KB  
Article
Speech Impairment in Early Parkinson’s Disease Is Associated with Nigrostriatal Dopaminergic Dysfunction
by Sotirios Polychronis, Grigorios Nasios, Efthimios Dardiotis, Rayo Akande and Gennaro Pagano
J. Clin. Med. 2026, 15(3), 1006; https://doi.org/10.3390/jcm15031006 - 27 Jan 2026
Viewed by 107
Abstract
Background/Objectives: Speech difficulties are an early and disabling manifestation of Parkinson’s disease (PD), affecting communication and quality of life. This study aimed to examine demographic, clinical, dopaminergic imaging and cerebrospinal fluid (CSF) correlates of speech difficulties in early PD, comparing treatment-naïve and levodopa-treated patients. Methods: A cross-sectional analysis was conducted using data from the Parkinson’s Progression Markers Initiative (PPMI). The sample included 376 treatment-naïve and 133 levodopa-treated early PD participants. Speech difficulties were defined by Movement Disorder Society—Unified Parkinson’s Disease Rating Scale (MDS-UPDRS) Part III, with Item 3.1 ≥ 1. Group comparisons and binary logistic regression identified predictors among demographic, clinical, dopaminergic and CSF biomarker variables, including [123I]FP-CIT specific binding ratios (SBRs). All analyses were cross-sectional, and findings reflect associative relationships rather than treatment effects or causal mechanisms. Results: Speech difficulties were present in 44% of treatment-naïve and 57% of levodopa-treated participants. In both cohorts, higher MDS-UPDRS Part III ON scores—reflecting greater motor severity—and lower mean putamen SBR values were significant independent predictors of speech impairment. Age was an additional predictor in the treatment-naïve group. No significant differences were found in CSF biomarkers (α-synuclein, amyloid-β, tau, phosphorylated tau). These findings indicate that striatal dopaminergic loss, particularly in the putamen, and motor dysfunction relate to early PD-related speech difficulties, whereas CSF neurodegeneration markers do not differentiate affected patients. Conclusions: Speech difficulties in early PD are primarily linked to dopaminergic and motor dysfunction rather than global neurodegenerative biomarker changes. Longitudinal and multimodal studies integrating acoustic, neuroimaging, and cognitive measures are warranted to elucidate the neural basis of speech decline and inform targeted interventions. Full article
(This article belongs to the Special Issue Innovations in Parkinson’s Disease)
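As an illustration of the kind of binary logistic regression reported above (speech impairment as the outcome, demographic, motor, and imaging measures as predictors), here is a hedged sketch on simulated data; the column names `age`, `updrs3_on`, and `putamen_sbr` and all coefficients are assumptions, and no PPMI data are involved.

```python
# Illustrative logistic regression on synthetic predictors of speech impairment.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.normal(62, 9, 300),
    "updrs3_on": rng.normal(22, 8, 300),       # motor severity stand-in
    "putamen_sbr": rng.normal(0.9, 0.25, 300), # dopaminergic imaging stand-in
})
logit = -4 + 0.02 * df.age + 0.08 * df.updrs3_on - 1.5 * df.putamen_sbr
df["speech_impaired"] = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X = sm.add_constant(df[["age", "updrs3_on", "putamen_sbr"]])
model = sm.Logit(df["speech_impaired"], X).fit(disp=False)
print(np.exp(model.params))   # odds ratios per unit change in each predictor
```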

23 pages, 12389 KB  
Article
Possible Merits of the Orchestra Pit Covering for Speech Activities in Baroque Theatres
by Silvana Sukaj, Umberto Derme and Gino Iannace
Appl. Sci. 2026, 16(2), 819; https://doi.org/10.3390/app16020819 - 13 Jan 2026
Viewed by 135
Abstract
Acoustically, Baroque theatres have proved remarkably appropriate for opera, and, in the past, little distinction was drawn in design between drama and opera use, except for the inclusion of an orchestra pit, because both music and words were audible and balanced, with reverberation times shorter than in concert halls but longer than in speech auditoria. In a drama configuration, scenery is set in the fly tower on stage, while for opera productions, in most cases, the orchestra pit platform is raised to the main floor level of the stalls to accommodate additional rows of seats. Considering the characteristics of the case study, the Opera di Roma (IT), the main physical parameters that contribute to sound quality are evaluated and compared in relation to the pit level, in order to understand the possible merits of covering the pit surface with seats for drama performances and, more generally, for speech activities. Eight different configurations are compared and, to evaluate the sensitivity of the acoustic parameters, the JND (just noticeable difference) is analyzed. The trends of the parameters are described. Full article
(This article belongs to the Special Issue Acoustics Analysis and Noise Control for Buildings)
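To make the JND analysis concrete, the sketch below expresses differences between two hypothetical pit configurations in JND units using commonly cited ISO 3382-1 difference limens; the parameter values are placeholders, not measurements from the Opera di Roma.

```python
# Express parameter differences between two configurations in JND units.
jnd = {"T30_s": 0.05, "EDT_s": 0.05, "C80_dB": 1.0, "D50": 0.05, "G_dB": 1.0}
relative = {"T30_s", "EDT_s"}          # JND given as a fraction of the value

config_open = {"T30_s": 1.42, "EDT_s": 1.35, "C80_dB": 1.8, "D50": 0.48, "G_dB": 4.1}
config_covered = {"T30_s": 1.30, "EDT_s": 1.21, "C80_dB": 2.6, "D50": 0.55, "G_dB": 4.6}

for p in jnd:
    delta = config_covered[p] - config_open[p]
    step = jnd[p] * config_open[p] if p in relative else jnd[p]
    print(f"{p}: {delta:+.2f} -> {delta / step:+.1f} JND")
```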

26 pages, 29009 KB  
Article
Quantifying the Relationship Between Speech Quality Metrics and Biometric Speaker Recognition Performance Under Acoustic Degradation
by Ajan Ahmed and Masudul H. Imtiaz
Signals 2026, 7(1), 7; https://doi.org/10.3390/signals7010007 - 12 Jan 2026
Viewed by 362
Abstract
Self-supervised learning (SSL) models have achieved remarkable success in speaker verification tasks, yet their robustness to real-world audio degradation remains insufficiently characterized. This study presents a comprehensive analysis of how audio quality degradation affects three prominent SSL-based speaker verification systems (WavLM, Wav2Vec2, and HuBERT) across three diverse datasets: TIMIT, CHiME-6, and Common Voice. We systematically applied 21 degradation conditions spanning noise contamination (SNR levels from 0 to 20 dB), reverberation (RT60 from 0.3 to 1.0 s), and codec compression (various bit rates), then measured both objective audio quality metrics (PESQ, STOI, SNR, SegSNR, fwSNRseg, jitter, shimmer, HNR) and speaker verification performance metrics (EER, AUC-ROC, d-prime, minDCF). At the condition level, multiple regression with all eight quality metrics explained up to 80% of the variance in minDCF for HuBERT and 78% for WavLM, but only 35% for Wav2Vec2; EER predictability was lower (69%, 67%, and 28%, respectively). PESQ was the strongest single predictor for WavLM and HuBERT, while shimmer showed the highest single-metric correlation for Wav2Vec2; fwSNRseg yielded the top single-metric R² for WavLM, and PESQ for HuBERT and Wav2Vec2 (with much smaller gains for Wav2Vec2). WavLM and HuBERT exhibited more predictable quality-performance relationships compared to Wav2Vec2. These findings establish quantitative relationships between measurable audio quality and speaker verification accuracy at the condition level, though substantial within-condition variability limits utterance-level prediction accuracy. Full article
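For reference, here is a minimal sketch of how one of the reported verification metrics, the equal error rate (EER), is obtained from trial scores; the genuine/impostor score distributions below are synthetic stand-ins rather than outputs of WavLM, Wav2Vec2, or HuBERT.

```python
# Equal error rate (EER) from verification scores (illustrative data).
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(1)
scores = np.concatenate([rng.normal(1.0, 1.0, 500),    # genuine trials
                         rng.normal(-1.0, 1.0, 500)])  # impostor trials
labels = np.concatenate([np.ones(500), np.zeros(500)])

fpr, tpr, _ = roc_curve(labels, scores)
fnr = 1 - tpr
eer = fpr[np.nanargmin(np.abs(fnr - fpr))]   # operating point where FPR ~ FNR
print(f"EER ~ {eer:.3%}")
```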

16 pages, 282 KB  
Review
Dysphagia and Dysarthria in Neurodegenerative Diseases: A Multisystem Network Approach to Assessment and Management
by Maria Luisa Fiorella, Luca Ballini, Valentina Lavermicocca, Maria Sterpeta Ragno, Domenico A. Restivo and Rosario Marchese-Ragona
Audiol. Res. 2026, 16(1), 9; https://doi.org/10.3390/audiolres16010009 - 12 Jan 2026
Viewed by 353
Abstract
Dysphagia and dysarthria are common, co-occurring manifestations in neurodegenerative diseases, resulting from damage to distributed neural networks involving cortical, subcortical, cerebellar, and brainstem regions. These disorders profoundly affect patient health and quality of life through complex sensorimotor impairments. Objective: The aim was to provide a comprehensive, evidence-based review of the neuroanatomical substrates, pathophysiology, diagnostic approaches, and management strategies for dysphagia and dysarthria in neurodegenerative diseases, with an emphasis on their multisystem nature and integrated treatment approaches. Methods: A narrative literature review was conducted using PubMed, Scopus, and Web of Science databases (2000–2024), focusing on Parkinson’s disease (PD), amyotrophic lateral sclerosis (ALS), progressive supranuclear palsy (PSP), and multiple system atrophy (MSA). Search terms included “dysphagia”, “dysarthria”, “neurodegenerative diseases”, “neural networks”, “swallowing control” and “speech production.” Studies on neuroanatomy, pathophysiology, diagnostic tools, and therapeutic interventions were included. Results: Contemporary neuroscience demonstrates that swallowing and speech control involve extensive neural networks beyond the brainstem, including bilateral sensorimotor cortex, insula, cingulate gyrus, basal ganglia, and cerebellum. Disease-specific patterns reflect multisystem involvement: PD affects basal ganglia and multiple brainstem nuclei; ALS involves cortical and brainstem motor neurons; MSA causes widespread autonomic and motor degeneration; PSP produces tau-related damage across multiple brain regions. Diagnostic approaches combining fiberoptic endoscopic evaluation, videofluoroscopy, acoustic analysis, and neuroimaging enable precise characterization. Management requires integrated multidisciplinary teams implementing coordinated speech-swallowing therapy, pharmacological interventions, and assistive technologies. Conclusions: Dysphagia and dysarthria in neurodegenerative diseases result from multifocal brain damage affecting distributed neural networks. Understanding this multisystem pathophysiology enables more effective integrated assessment and treatment approaches, enhancing patient outcomes and quality of life. Full article
19 pages, 3791 KB  
Article
A Machine Learning Framework for Cognitive Impairment Screening from Speech with Multimodal Large Models
by Shiyu Chen, Ying Tan, Wenyu Hu, Yingxi Chen, Lihua Chen, Yurou He, Weihua Yu and Yang Lü
Bioengineering 2026, 13(1), 73; https://doi.org/10.3390/bioengineering13010073 - 8 Jan 2026
Viewed by 411
Abstract
Background: Early diagnosis of Alzheimer’s disease (AD) is essential for slowing disease progression and mitigating cognitive decline. However, conventional diagnostic methods are often invasive, time-consuming, and costly, limiting their utility in large-scale screening. There is an urgent need for scalable, non-invasive, and accessible screening tools. Methods: We propose a novel screening framework combining a pre-trained multimodal large language model with structured MMSE speech tasks. An artificial intelligence-assisted multilingual Mini-Mental State Examination system (AAM-MMSE) was utilized to collect voice data from 1098 participants in Sichuan and Chongqing. CosyVoice2 was used to extract speaker embeddings, speech labels, and acoustic features, which were converted into statistical representations. Fourteen machine learning models were developed for subject classification into three diagnostic categories: Healthy Control (HC), Mild Cognitive Impairment (MCI), and Alzheimer’s Disease (AD). SHAP analysis was employed to assess the importance of the extracted speech features. Results: Among the evaluated models, LightGBM and Gradient Boosting classifiers exhibited the highest performance, achieving an average AUC of 0.9501 across classification tasks. SHAP-based analysis revealed that spectral complexity, energy dynamics, and temporal features were the most influential in distinguishing cognitive states, aligning with known speech impairments in early-stage AD. Conclusions: This framework offers a non-invasive, interpretable, and scalable solution for cognitive screening. It is suitable for both clinical and telemedicine applications, demonstrating the potential of speech-based AI models in early AD detection. Full article
(This article belongs to the Section Biosignal Processing)
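A hedged sketch of the classification step described above: gradient boosting on statistical speech-feature vectors with a three-class target (HC/MCI/AD). The paper attributes feature influence with SHAP; permutation importance is used here purely as a dependency-light stand-in, and the data and feature indices are synthetic.

```python
# Gradient boosting on statistical speech-feature vectors + feature ranking.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=12, n_informative=6,
                           n_classes=3, n_clusters_per_class=1, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, stratify=y, random_state=0)

clf = GradientBoostingClassifier(random_state=0).fit(Xtr, ytr)
imp = permutation_importance(clf, Xte, yte, n_repeats=10, random_state=0)
ranking = imp.importances_mean.argsort()[::-1]
print("most influential synthetic features:", ranking[:5])
```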

14 pages, 1392 KB  
Article
AirSpeech: Lightweight Speech Synthesis Framework for Home Intelligent Space Service Robots
by Xiugong Qin, Fenghu Pan, Jing Gao, Shilong Huang, Yichen Sun and Xiao Zhong
Electronics 2026, 15(1), 239; https://doi.org/10.3390/electronics15010239 - 5 Jan 2026
Viewed by 304
Abstract
Text-to-Speech (TTS) methods typically employ a sequential approach with an Acoustic Model (AM) and a vocoder, using a Mel spectrogram as an intermediate representation. However, in home environments, TTS systems often struggle with issues such as inadequate robustness against environmental noise and limited adaptability to diverse speaker characteristics. The quality of the Mel spectrogram directly affects the performance of TTS systems, yet existing methods overlook the potential of enhancing Mel spectrogram quality through more comprehensive speech features. To address the complex acoustic characteristics of home environments, this paper introduces AirSpeech, a post-processing model for Mel-spectrogram synthesis. We adopt a Generative Adversarial Network (GAN) to improve the accuracy of Mel spectrogram prediction and enhance the expressiveness of synthesized speech. By incorporating additional conditioning extracted from synthesized audio using specified speech feature parameters, our method significantly enhances the expressiveness and emotional adaptability of synthesized speech in home environments. Furthermore, we propose a global normalization strategy to stabilize the GAN training process. Through extensive evaluations, we demonstrate that the proposed method significantly improves the signal quality and naturalness of synthesized speech, providing a more user-friendly speech interaction solution for smart home applications. Full article
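The global normalisation strategy mentioned above can be pictured as normalising Mel spectrograms with corpus-level rather than per-utterance statistics, so the GAN sees inputs on a consistent scale. The sketch below assumes librosa for feature extraction and uses random signals as stand-in utterances; it is not the AirSpeech implementation.

```python
# Corpus-level (global) normalisation of Mel spectrograms, illustrative only.
import numpy as np
import librosa

rng = np.random.default_rng(0)
corpus = [rng.standard_normal(22050) for _ in range(3)]   # 1-s synthetic "utterances"
mels = [librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=22050, n_mels=80))
        for y in corpus]

stacked = np.concatenate(mels, axis=1)        # pool frames across the whole corpus
mu, sigma = stacked.mean(), stacked.std()     # single pair of global statistics
normalized = [(m - mu) / (sigma + 1e-8) for m in mels]
```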

26 pages, 7667 KB  
Article
GRU-Based Deep Multimodal Fusion of Speech and Head-IMU Signals in Mixed Reality for Parkinson’s Disease Detection
by Daria Hemmerling, Milosz Dudek, Justyna Krzywdziak, Magda Żbik, Wojciech Szecowka, Mateusz Daniol, Marek Wodzinski, Monika Rudzinska-Bar and Magdalena Wojcik-Pedziwiatr
Sensors 2026, 26(1), 269; https://doi.org/10.3390/s26010269 - 1 Jan 2026
Viewed by 496
Abstract
Parkinson’s disease (PD) alters both speech and movement, yet most automated assessments still treat these signals separately. We examined whether combining voice with head motion improves discrimination between patients and healthy controls (HC). Synchronous measurements of acoustic and inertial signals were collected using a HoloLens 2 headset. Data were obtained from 165 participants (72 PD/93 HC), following a standardized mixed-reality (MR) protocol. We benchmarked single-modality models against fusion strategies under 5-fold stratified cross-validation. Voice alone was robust (pooled AUC ≈ 0.865), while the inertial channel alone was near chance (AUC ≈ 0.497). Fusion provided a modest but repeatable improvement: gated early-fusion achieved the highest AUC (≈0.875), while cross-attention fusion was comparable (≈0.873). Gains were task-dependent. While speech-dominated tasks were already well captured by audio, tasks that embed movement benefited from complementary inertial data. The proposed MR capture proved feasible within a single session and showed that motion acts as a conditional improvement factor rather than a sole predictor. The results outline a practical path to multimodal screening and monitoring for PD, preserving the reliability of acoustic biomarkers while integrating kinematic features when they matter. Full article
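A toy PyTorch sketch of gated early fusion followed by a GRU, in the spirit of the best-performing strategy above: a learned gate mixes projected audio and head-IMU features at each time step before the recurrent layer. All dimensions are arbitrary placeholders and the block is not the authors' architecture.

```python
# Gated early fusion of audio and IMU streams feeding a GRU classifier (toy).
import torch
import torch.nn as nn

class GatedFusionGRU(nn.Module):
    def __init__(self, d_audio=64, d_imu=12, d_model=32):
        super().__init__()
        self.pa = nn.Linear(d_audio, d_model)
        self.pi = nn.Linear(d_imu, d_model)
        self.gate = nn.Linear(d_audio + d_imu, d_model)
        self.gru = nn.GRU(d_model, d_model, batch_first=True)
        self.head = nn.Linear(d_model, 1)          # PD vs HC logit

    def forward(self, audio, imu):                 # (B, T, d_audio), (B, T, d_imu)
        g = torch.sigmoid(self.gate(torch.cat([audio, imu], dim=-1)))
        fused = g * self.pa(audio) + (1 - g) * self.pi(imu)
        _, h = self.gru(fused)                     # final hidden state (1, B, d_model)
        return self.head(h[-1]).squeeze(-1)

model = GatedFusionGRU()
logits = model(torch.randn(4, 100, 64), torch.randn(4, 100, 12))
```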

17 pages, 899 KB  
Article
Exploring Bidirectional Associations Between Voice Acoustics and Objective Motor Metrics in Parkinson’s Disease
by Anna Carolyna Gianlorenço, Paulo Eduardo Portes Teixeira, Valton Costa, Walter Fabris-Moraes, Paola Gonzalez-Mego, Ciro Ramos-Estebanez, Arianna Di Stadio, Deniz Doruk Camsari, Mirret M. El-Hagrassy, Felipe Fregni, Tim Wagner and Laura Dipietro
Brain Sci. 2026, 16(1), 48; https://doi.org/10.3390/brainsci16010048 - 29 Dec 2025
Viewed by 313
Abstract
Background/Objectives: Speech and motor control share overlapping neural mechanisms, yet their quantitative relationships in Parkinson’s disease (PD) remain underexplored. This study investigated bidirectional associations between acoustic voice features and objective motor metrics to better understand how vocal and motor systems relate in PD. Methods: Cross-sectional baseline data from participants in a randomized neuromodulation trial were analyzed (n = 13). Motor performance was captured using an Integrated Motion Analysis Suite (IMAS), which enabled quantitative, objective characterization of motor performance during balance, gait, and upper- and lower-limb tasks. Acoustic analyses included harmonic-to-noise ratio (HNR), smoothed cepstral peak prominence (CPPS), jitter, shimmer, median fundamental frequency (F0), F0 standard deviation (SD F0), and voice intensity. Univariate linear regressions were conducted in both directions (voice ↔ motor), as well as partial correlations controlling for PD motor symptom severity. Results: When modeling voice outcomes, faster motor performance and shorter movement durations were associated with acoustically clearer voice features (e.g., higher elbow flexion-extension peak speed with higher voice HNR, β = 8.5, R2 = 0.56, p = 0.01). Similarly, when modeling motor outcomes, clearer voice measures were linked with faster movement speed and shorter movement durations (e.g., higher voice HNR with higher peak movement speed in elbow flexion/extension, β = 0.07, R2 = 0.56, p = 0.01). Conclusions: Voice and motor measures in PD showed significant bidirectional associations, suggesting shared sensorimotor control. These exploratory findings, while limited by sample size, support the feasibility of integrated multimodal assessment for future longitudinal studies. Full article
(This article belongs to the Special Issue Computational Intelligence and Brain Plasticity)
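The partial correlations controlling for motor symptom severity can be illustrated with the residual method, as in the hedged sketch below; the voice (HNR) and motor (peak speed) variables are simulated, so the numbers carry no clinical meaning.

```python
# Partial correlation via residuals, controlling for a severity covariate.
import numpy as np
from scipy import stats

def partial_corr(x, y, z):
    """Correlation of x and y after removing the linear effect of covariate z."""
    rx = x - np.polyval(np.polyfit(z, x, 1), z)
    ry = y - np.polyval(np.polyfit(z, y, 1), z)
    return stats.pearsonr(rx, ry)

rng = np.random.default_rng(0)
severity = rng.normal(25, 8, 13)                        # UPDRS-like covariate
hnr = 20 - 0.2 * severity + rng.normal(0, 1.5, 13)      # voice clarity stand-in
peak_speed = 3 - 0.05 * severity + 0.05 * hnr + rng.normal(0, 0.3, 13)
print(partial_corr(hnr, peak_speed, severity))
```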

19 pages, 1187 KB  
Article
Dual-Pipeline Machine Learning Framework for Automated Interpretation of Pilot Communications at Non-Towered Airports
by Abdullah All Tanvir, Chenyu Huang, Moe Alahmad, Chuyang Yang and Xin Zhong
Aerospace 2026, 13(1), 32; https://doi.org/10.3390/aerospace13010032 - 28 Dec 2025
Viewed by 319
Abstract
Accurate estimation of aircraft operations, such as takeoffs and landings, is critical for airport planning and resource allocation, yet it remains particularly challenging at non-towered airports, where no dedicated surveillance infrastructure exists. Existing solutions, including video analytics, acoustic sensors, and transponder-based systems, are often costly, incomplete, or unreliable in environments with mixed traffic and inconsistent radio usage, highlighting the need for a scalable, infrastructure-free alternative. To address this gap, this study proposes a novel dual-pipeline machine learning framework that classifies pilot radio communications using both textual and spectral features to infer operational intent. A total of 2489 annotated pilot transmissions collected from a U.S. non-towered airport were processed through automatic speech recognition (ASR) and Mel-spectrogram extraction. We benchmarked multiple traditional classifiers and deep learning models, including ensemble methods, long short-term memory (LSTM) networks, and convolutional neural networks (CNNs), across both feature pipelines. Results show that spectral features paired with deep architectures consistently achieved the highest performance, with F1-scores exceeding 91% despite substantial background noise, overlapping transmissions, and speaker variability. These findings indicate that operational intent can be inferred reliably from existing communication audio alone, offering a practical, low-cost path toward scalable aircraft operations monitoring and supporting emerging virtual tower and automated air traffic surveillance applications. Full article
(This article belongs to the Special Issue AI, Machine Learning and Automation for Air Traffic Control (ATC))
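As a simplified illustration of the textual pipeline only, the sketch below classifies operational intent from ASR-style transcripts with TF-IDF features and logistic regression; the phrases, labels, and model choice are invented, and the spectral/CNN branch benchmarked in the paper is omitted.

```python
# Intent classification from ASR-style transcripts (toy text pipeline).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

transcripts = ["cessna one two three departing runway three one",
               "entering left downwind runway three one full stop",
               "taking off runway one three departing to the north",
               "final runway three one touch and go"]
labels = ["takeoff", "landing", "takeoff", "landing"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(transcripts, labels)
print(clf.predict(["short final runway one three"]))
```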

18 pages, 4190 KB  
Article
Acoustic Characteristics of Vowel Production in Children with Cochlear Implants Using a Multi-View Fusion Model
by Qingqing Xie, Jing Wang, Ling Du, Lifang Zhang and Yanan Li
Algorithms 2026, 19(1), 9; https://doi.org/10.3390/a19010009 - 22 Dec 2025
Viewed by 316
Abstract
This study aims to examine the acoustic characteristics of Mandarin vowels produced by children with cochlear implants and to explore the differences in their speech production compared with those of children with normal hearing. We propose a multiview model-based method for vowel feature analysis. This approach involves extracting and fusing formant features, Mel-frequency cepstral coefficients (MFCCs), and linear predictive coding coefficients (LPCCs) to comprehensively represent vowel articulation. We conducted k-means clustering on individual features and applied multiview clustering to the fused features. The results showed that children with cochlear implants formed discernible vowel clusters in the formant space, though with lower compactness than those of normal-hearing children. Furthermore, the MFCCs and LPCCs features revealed significant inter-group differences. Most importantly, the multiview model, utilizing fused features, achieved superior clustering performance compared to any single feature. These findings demonstrated that effective fusion of frequency domain features provided a more comprehensive representation of phonetic characteristics, offering potential value for clinical assessment and targeted speech intervention in children with hearing impairment. Full article
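One of the single-feature views described above can be sketched as k-means clustering of vowel tokens on mean MFCC vectors, as below; random signals stand in for child recordings, and the fusion of formant, MFCC, and LPCC views used in the paper is not reproduced.

```python
# K-means clustering of vowel tokens on mean MFCC vectors (single view, toy data).
import numpy as np
import librosa
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
tokens = [rng.standard_normal(16000) for _ in range(12)]   # 1-s placeholder vowels
feats = np.array([librosa.feature.mfcc(y=y, sr=16000, n_mfcc=13).mean(axis=1)
                  for y in tokens])                         # one vector per token
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(feats)
print(labels)
```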

13 pages, 284 KB  
Article
Two-Stage Domain Adaptation for LLM-Based ASR by Decoupling Linguistic and Acoustic Factors
by Lin Zheng, Xuyang Wang, Qingwei Zhao and Ta Li
Appl. Sci. 2026, 16(1), 60; https://doi.org/10.3390/app16010060 - 20 Dec 2025
Viewed by 382
Abstract
Large language models (LLMs) have been increasingly applied in Automatic Speech Recognition (ASR), achieving significant advancements. However, the performance of LLM-based ASR (LLM-ASR) models remains unsatisfactory when applied across domains due to domain shifts between acoustic and linguistic conditions. To address this challenge, we propose a decoupled two-stage domain adaptation framework that separates the adaptation process into text-only and audio-only stages. In the first stage, we leverage abundant text data from the target domain to refine the LLM component, thereby improving its contextual and linguistic alignment with the target domain. In the second stage, we employ a pseudo-labeling method with unlabeled audio data in the target domain and introduce two key enhancements: (1) incorporating decoupled auxiliary Connectionist Temporal Classification (CTC) loss to improve the robustness of the speech encoder under different acoustic conditions; (2) adopting a synchronous LLM tuning strategy, allowing the LLM to continuously learn linguistic alignment from pseudo-labeled transcriptions enriched with domain textual knowledge. The experimental results demonstrate that our proposed methods significantly improve the performance of LLM-ASR in the target domain, achieving a relative word error rate reduction of 19.2%. Full article
(This article belongs to the Special Issue Speech Recognition: Techniques, Applications and Prospects)
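A minimal PyTorch sketch of attaching an auxiliary CTC objective to speech-encoder outputs, one ingredient of the audio-only stage described above; the shapes, pseudo-label targets, and the 0.3 auxiliary weight are placeholders, and the LLM decoder and pseudo-labelling loop are omitted.

```python
# Auxiliary CTC loss on speech-encoder outputs, combined with a main loss (toy).
import torch
import torch.nn as nn

T, B, C = 50, 4, 30                        # time steps, batch, vocab (blank = 0)
encoder_logits = torch.randn(T, B, C)      # stand-in for speech-encoder outputs
log_probs = encoder_logits.log_softmax(dim=-1)

targets = torch.randint(1, C, (B, 12))                 # pseudo-label token ids
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), 12, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)
aux_ctc_loss = ctc(log_probs, targets, input_lengths, target_lengths)
main_loss = torch.tensor(2.3)              # placeholder for the LLM cross-entropy
total = main_loss + 0.3 * aux_ctc_loss     # 0.3 is an assumed auxiliary weight
```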

16 pages, 1381 KB  
Article
Dual Routing Mixture-of-Experts for Multi-Scale Representation Learning in Multimodal Emotion Recognition
by Da-Eun Chae and Seok-Pil Lee
Electronics 2025, 14(24), 4972; https://doi.org/10.3390/electronics14244972 - 18 Dec 2025
Viewed by 318
Abstract
Multimodal emotion recognition (MER) often relies on single-scale representations that fail to capture the hierarchical structure of emotional signals. This paper proposes a Dual Routing Mixture-of-Experts (MoE) model that dynamically selects between local (fine-grained) and global (contextual) representations extracted from speech and text encoders. The framework first obtains local–global embeddings using WavLM and RoBERTa, then employs a scale-aware routing mechanism to activate the most informative expert before bidirectional cross-attention fusion. Experiments on the IEMOCAP dataset show that the proposed model achieves stable performance across all folds, reaching an average unweighted accuracy (UA) of 75.27% and weighted accuracy (WA) of 74.09%. The model consistently outperforms single-scale baselines and simple concatenation methods, confirming the importance of dynamic multi-scale cue selection. Ablation studies highlight that neither local-only nor global-only representations are sufficient, while routing behavior analysis reveals emotion-dependent scale preferences—such as strong reliance on local acoustic cues for anger and global contextual cues for low-arousal emotions. These findings demonstrate that emotional expressions are inherently multi-scale and that scale-aware expert activation provides a principled approach beyond conventional single-scale fusion. Full article
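A toy sketch of scale-aware routing between a "local" frame-level expert and a "global" pooled expert, illustrating the dual-routing idea above; the dimensions, mean pooling, and linear experts are simplifications, not the paper's WavLM/RoBERTa-based model.

```python
# Two-expert routing between local (frame-level) and global (pooled) features (toy).
import torch
import torch.nn as nn

class DualRoutingBlock(nn.Module):
    def __init__(self, d=64):
        super().__init__()
        self.local_expert = nn.Linear(d, d)
        self.global_expert = nn.Linear(d, d)
        self.router = nn.Linear(d, 2)

    def forward(self, frames):                      # (B, T, d) frame features
        pooled = frames.mean(dim=1)                 # global (contextual) summary
        w = torch.softmax(self.router(pooled), -1)  # scale-aware routing weights
        local = self.local_expert(frames).mean(dim=1)
        globl = self.global_expert(pooled)
        return w[:, :1] * local + w[:, 1:] * globl  # per-sample mixture of scales

out = DualRoutingBlock()(torch.randn(8, 120, 64))
```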
