MDPI - Publisher of Open Access Journals

30 pages, 1349 KB

Open Access Article

A Lightweight Multimodal Architecture for Punctuation Restoration in Kazakh ASR

by Aidana Karibayeva, Oleg Myssov, Balzhan Abduali, Dina Amirova and Adina Karybayeva

Computers 2026, 15(6), 345; https://doi.org/10.3390/computers15060345 - 28 May 2026

Viewed by 516

In this paper, we first present a multimodal architecture called CrossAttn-v1. This model is designed to recover punctuation marks in Kazakh and combines contextual XLM-RoBERTa-large text embeddings with the Whisper large-v3 encoder states via a cross-attention mechanism. In addition, a 4-dimensional prosodic vector and a CRF output layer are used. The model was trained using an adapted Whisper ASR model on 33,332 utterances from the KazakhTTS2 corpus. After adaptation, the word error rate decreased from 45.7% to 4.25%. On the in-domain test set (56,396 tokens), CrossAttn-v1 achieved F1-macro = 0.8485 for recovering five-class punctuation marks. Furthermore, CrossAttn-v1 outperformed the GPT-4o zero-shot model by +0.294 F1 and the M3 Hybrid model based on prosody alone by +0.070 F1. The class analysis showed that the Whisper encoder states were particularly useful for prosody-dependent punctuation. For example, it outperformed M3 Hybrid by +9.5 percentage points on the QUESTION mark and by +20.2 percentage points on the EXCLAIM mark. On 883 out-of-domain natural speech recordings, the model performed similarly to the text-only baseline model (Δ = −0.041, not significant), suggesting that domain mismatch in the Whisper training corpus was a major factor limiting generalization. Full article

(This article belongs to the Special Issue Advances in Multimodal Learning and Representation)
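The F1-macro figure reported in the abstract above is an unweighted average of per-class F1 scores, so rare marks such as QUESTION and EXCLAIM weigh as much as frequent ones. A minimal sketch of that metric follows. The label names other than QUESTION and EXCLAIM, and the toy sequences, are assumptions made for illustration and are not taken from the article.

```python
from collections import Counter


def macro_f1(y_true, y_pred, labels):
    """Return the unweighted mean of per-class F1 scores."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for true_label, pred_label in zip(y_true, y_pred):
        if true_label == pred_label:
            tp[true_label] += 1
        else:
            fp[pred_label] += 1  # predicted class gains a false positive
            fn[true_label] += 1  # true class gains a false negative
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the class never occurs
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(labels)


# Hypothetical five-class label set and invented toy data.
LABELS = ["O", "COMMA", "PERIOD", "QUESTION", "EXCLAIM"]
y_true = ["O", "COMMA", "O", "PERIOD", "O", "QUESTION", "O", "EXCLAIM"]
y_pred = ["O", "COMMA", "O", "PERIOD", "O", "PERIOD", "O", "EXCLAIM"]

score = macro_f1(y_true, y_pred, LABELS)
print(f"macro F1 = {score:.4f}")
```

In this toy case a single missed question mark drives the QUESTION class to an F1 of zero and pulls the macro average down sharply, which is why per-class gains on rare marks matter for this metric.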

24 pages, 1038 KB

Open Access Article

Avant-Garde Poetry and the Tékhnē of Traditional Versification

by Evgenii Kazartsev and Nikita Kirichenko

Arts 2026, 15(5), 97; https://doi.org/10.3390/arts15050097 - 2 May 2026

Viewed by 490

Abstract

This article offers a theoretically nuanced and empirically grounded investigation into the paradoxical afterlife of classical versification within the poetic practices of the Russian and Soviet avant-garde. Challenging the persistent historiographic narrative that equates avant-garde poetics with an unequivocal rupture from tradition, the study demonstrates that canonical metrical forms—most notably iambic tetrameter—continued to operate as structurally productive, albeit critically reconfigured, elements within experimental verse. Drawing on a broad corpus encompassing poetic manifestos, verse texts, and prose writings by Vladimir Maiakovskii, Ilia Sel’vinskii, Semen Kirsanov, and Nikolai Aseev, the authors combine close formal analysis with quantitative prosodic modeling, including linguistic and speech models derived from Kolmogorov–Taranovsky verse theory. The article argues that avant-garde poets did not simply negate inherited metrics but subjected them to a process of internal recomposition, shifting attention from meter as a fixed scheme to rhythm as a dynamic, semantically charged construct. While rhythmic innovation is shown to be consciously engineered in verse, the analysis of verse-like fragments in prose reveals persistent, unconscious attachments to “classical” rhythmic patterns, particularly the Pushkinian alternating rhythm. This tension between declarative rejection and latent continuity illuminates the avant-garde’s distinctive mode of negotiating tradition: not abolishing it, but instrumentalizing it within a broader project of total artistic reorganization. The study thus reframes avant-garde prosody as a site where innovation and inheritance coexist in a state of productive contradiction, reshaping our understanding of modernist poetic technique. Full article

(This article belongs to the Special Issue The European Avant-Garde(s) and Technologies: Unfinished Modernity and the Idea of Tékhnē—the One Hundred Years’ Revolution, 1850–1950)

62 pages, 7579 KB

Open Access Article

Phonological Choices Drive F0 Range Expansion and Lengthening in Bengali and English Infant-Directed Speech

by Kristine M. Yu, Sameer ud Dowla Khan and Megha Sundara

Languages 2026, 11(4), 68; https://doi.org/10.3390/languages11040068 - 1 Apr 2026

Viewed by 1124

Abstract

This study builds on a small body of work, all on Japanese, demonstrating how intonational phonology is critical for understanding prosodic modifications in infant-directed speech (IDS) relative to adult-directed speech. We performed similar analyses on simulated infant-directed speech vs. reading of a story in English and Bengali: two languages that – unlike Japanese – both have stress and do not use fundamental frequency (F0) to signal changes in word-level meaning, but that have two very different intonational grammars. These differences allowed us to disentangle previous hypotheses about intonational exaggeration in IDS being concentrated in a particular part of the melody. We tested hypotheses that state this locus of exaggeration is either at: the final position in the melody (final in the intonational phrase), the most unpredictable part of the melody, or in pragmatically informative tones. Our results support the first hypothesis. We found that the phonological choices of speakers to chunk the story into shorter, larger prosodic constituents drive intonational exaggeration in IDS. This is because the intonational phrase-final position in both languages is the site of greatest pre-boundary lengthening and F0 range expansion. We also demonstrate: (i) quantification of predictability in intonational melodies using probabilistic finite state automaton representations of intonational grammars and (ii) F0 statistical analyses that are robust and scalable to large, naturalistic IDS corpora. Full article

(This article belongs to the Special Issue Advances in the Acquisition of Prosody)

21 pages, 2194 KB

Open Access Article

Joint Modeling and KAFusion Feature Fusion for Prosody-Controllable Speech Synthesis

by Dongfeng Ye, Lin Jiang, Nianxin Ni and Wei Wan

Electronics 2026, 15(7), 1354; https://doi.org/10.3390/electronics15071354 - 25 Mar 2026

Viewed by 421

Abstract

To address the limited expressiveness in current speech synthesis caused by coarse-grained prosody modeling and simplistic feature fusion strategies, a joint prosody modeling framework and a nonlinear fusion method named KAFusion are proposed, based on the Kolmogorov–Arnold (KA) representation theorem. The joint modeling integrates pitch and energy as prosodic priors with text encodings to jointly guide duration prediction, enabling explicit control over speech rate and tone. During feature fusion, KAFusion facilitates nonlinear interactions among features through its nested inner and outer functions. Information entropy serves as the quantitative metric, and both theoretical and experimental results demonstrate the fusion module’s efficacy in suppressing redundancy while preserving task-critical content. Evaluations on the AISHELL3 dataset show a 5.8% improvement in MOS over the baseline. Ablation studies further validate the effectiveness of the proposed components, where KAFusion achieves an output entropy of 3.47, which is 18.4% higher than that of linear fusion (2.93) and indicates richer information content. Full article
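The entropy comparison in the abstract above can be checked with simple arithmetic. The sketch below uses the standard Shannon entropy in bits, which is an assumption: the abstract does not state how the entropy was estimated or in which units. Only the two reported values (3.47 and 2.93) come from the source.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy in bits of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def relative_increase(new, baseline):
    """Percentage by which `new` exceeds `baseline`."""
    return (new - baseline) / baseline * 100.0


# Values reported in the abstract: KAFusion 3.47 vs. linear fusion 2.93.
gain = relative_increase(3.47, 2.93)
print(f"relative increase = {gain:.1f}%")

# Sanity check of the entropy helper on a uniform 4-way distribution.
uniform_bits = shannon_entropy([0.25] * 4)
print(f"uniform 4-way entropy = {uniform_bits} bits")
```

The computed relative increase agrees with the 18.4% stated in the abstract.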

20 pages, 770 KB

Open Access Systematic Review

Speech and Language Changes During Rapid Eye Movement (REM) Sleep with Potential Diagnostic Markers: A Systematic Review

by Maria Pagano, Francesco Corallo, Anna Anselmo, Davide Cardile, Rosaria De Luca, Angelo Quartarone, Rocco Salvatore Calabrò and Irene Cappadona

Brain Sci. 2026, 16(2), 216; https://doi.org/10.3390/brainsci16020216 - 11 Feb 2026

Viewed by 840

Abstract

Background: Rapid Eye Movement (REM) sleep behavior disorder (RBD) is a parasomnia resulting from degeneration of pontine and medullary circuits responsible for muscle atonia during REM sleep, leading to dream-enactment behaviors and vocalizations. It is strongly linked to α-synucleinopathies, particularly Parkinson’s disease. Current biomarkers such as neurophysiological measures and imaging support diagnosis and monitoring, but remain invasive or costly. Aim: This study aims to evaluate vocal and speech alterations as exploratory, non-validated candidate biomarkers of REM sleep behavior disorder. Methods: A systematic review was conducted according to PRISMA 2020 guidelines. PubMed, IEEE Digital Library, Web of Science, Embase and the Cochrane Library were systematically searched for studies published from database inception to November 2025, as preregistered on the Open Science Framework. Studies were selected through a multi-step screening process and underwent qualitative quality assessment. Results: Twelve studies met inclusion criteria. Individuals with RBD exhibited abnormal nocturnal vocalizations and early lexical, syntactic, and narrative disruptions despite preserved perceptual speech. Quantitative analyses identified consistent deficits in prosody, phonation stability, timing, and articulation, with significant group differences and diagnostic accuracy up to 96% sensitivity. Multilingual cohorts demonstrated progression over time, while digital phenotyping detected emerging Parkinsonian signs with AUC > 0.70. Conclusions: Speech and vocal abnormalities in iRBD reflect early neurodegenerative changes and show promising but still exploratory diagnostic and prognostic potential. Integrating vocal markers with established biomarkers may enhance early detection; however, further research is required to validate a reliable and reproducible vocal signature of prodromal synucleinopathies. Full article

(This article belongs to the Special Issue The Relationship Between Sleep and the Brain: New Insights into Sleep Disorders and Their Effects on Mental Health, Cognition, and Quality of Life)

38 pages, 6181 KB

Open Access Article

An AIoT-Based Framework for Automated English-Speaking Assessment: Architecture, Benchmarking, and Reliability Analysis of Open-Source ASR

by Paniti Netinant, Rerkchai Fooprateepsiri, Ajjima Rukhiran and Meennapa Rukhiran

Informatics 2026, 13(2), 19; https://doi.org/10.3390/informatics13020019 - 26 Jan 2026

Viewed by 2102

Abstract

The emergence of low-cost edge devices has enabled the integration of automatic speech recognition (ASR) into IoT environments, creating new opportunities for real-time language assessment. However, achieving reliable performance on resource-constrained hardware remains a significant challenge, especially on the Artificial Intelligence of Things (AIoT). This study presents an AIoT-based framework for automated English-speaking assessment that integrates architecture and system design, ASR benchmarking, and reliability analysis on edge devices. The proposed AIoT-oriented architecture incorporates a lightweight scoring framework capable of analyzing pronunciation, fluency, prosody, and CEFR-aligned speaking proficiency within an automated assessment system. Seven open-source ASR models—four Whisper variants (tiny, base, small, and medium) and three Vosk models—were systematically benchmarked in terms of recognition accuracy, inference latency, and computational efficiency. Experimental results indicate that Whisper-medium deployed on the Raspberry Pi 5 achieved the strongest overall performance, reducing inference latency by 42–48% compared with the Raspberry Pi 4 and attaining the lowest Word Error Rate (WER) of 6.8%. In contrast, smaller models such as Whisper-tiny, with a WER of 26.7%, exhibited two- to threefold higher scoring variability, demonstrating how recognition errors propagate into automated assessment reliability. System-level testing revealed that the Raspberry Pi 5 can sustain near real-time processing with approximately 58% CPU utilization and around 1.2 GB of memory, whereas the Raspberry Pi 4 frequently approaches practical operational limits under comparable workloads. Validation using real learner speech data (approximately 100 sessions) confirmed that the proposed system delivers accurate, portable, and privacy-preserving speaking assessment using low-power edge hardware. Overall, this work introduces a practical AIoT-based assessment framework, provides a comprehensive benchmark of open-source ASR models on edge platforms, and offers empirical insights into the trade-offs among recognition accuracy, inference latency, and scoring stability in edge-based ASR deployments. Full article

20 pages, 1415 KB

Open Access Article

Decoding How Articulation and Pauses Influence Pronunciation Proficiency in Korean Learners of English

by Tae-Jin Yoon, Seunghee Han and Seunghee Ha

Behav. Sci. 2026, 16(2), 179; https://doi.org/10.3390/bs16020179 - 26 Jan 2026

Viewed by 724

Abstract

This study investigates how temporal fluency cues shape human ratings of L2 English pronunciation in Korean learners, using a large read-speech corpus annotated with five-point pronunciation scores. We focus on two timing-derived measures—articulation rate (AR) and mean silence duration (SilMean)—and examine whether these cues predict (i) articulation-accuracy ratings and (ii) prosody/fluency ratings. To account for dependencies in corpus data and to control for key learner- and task-level covariates, we fitted cumulative link mixed models with random intercepts for speakers and scripts, including proficiency band (ability), age, gender, and test type as fixed effects. Across models, faster articulation and shorter silent intervals were associated with higher articulation ratings, and a combined model including both AR and SilMean provided the best fit (lowest AIC). Temporal cues were even more strongly associated with prosody ratings, supporting construct alignment between timing measures and the prosody dimension of the rubric. Marginal predicted probabilities illustrate how the likelihood of receiving high ratings (score ≥ 4) increases with AR across proficiency and linguistic-complexity strata (with SilMean held constant), and how long silent intervals reduce these probabilities when AR is held constant. These findings indicate that temporal organization provides robust information about perceived pronunciation quality in read L2 speech and underscore the importance of construct-aware modeling when developing AI-based scoring and feedback systems trained on human-labeled data. Full article

(This article belongs to the Special Issue Factors Influencing the Effect of Bilingualism on Cognitive Control and Cognitive Reserve)
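The two timing measures named in the abstract above, articulation rate (AR) and mean silence duration (SilMean), can be derived from a time-aligned segmentation. The sketch below assumes one common operationalization: AR as syllables per second of speech with silent intervals excluded, and SilMean as the average length of the silent intervals. The article's exact definitions, the interval format, the `sil` label, and the toy numbers are all assumptions made for illustration.

```python
def timing_measures(intervals, n_syllables, silence_label="sil"):
    """Compute (articulation rate, mean silence duration).

    intervals: list of (start_s, end_s, label) tuples covering the utterance.
    n_syllables: number of syllables produced in the utterance.
    """
    speech_time = sum(end - start for start, end, label in intervals
                      if label != silence_label)
    silences = [end - start for start, end, label in intervals
                if label == silence_label]
    # Syllables per second of actual articulation (pauses excluded).
    ar = n_syllables / speech_time if speech_time > 0 else 0.0
    # Average pause length in seconds; 0.0 if the utterance has no pauses.
    sil_mean = sum(silences) / len(silences) if silences else 0.0
    return ar, sil_mean


# Invented toy segmentation: three speech stretches and two pauses.
intervals = [
    (0.0, 1.5, "speech"),
    (1.5, 2.0, "sil"),
    (2.0, 3.5, "speech"),
    (3.5, 3.75, "sil"),
    (3.75, 4.75, "speech"),
]
ar, sil_mean = timing_measures(intervals, n_syllables=16)
print(f"AR = {ar:.2f} syll/s, SilMean = {sil_mean:.3f} s")
```

Keeping the two measures separate, rather than folding pauses into a single speech-rate figure, mirrors the abstract's design of holding one cue constant while varying the other.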

33 pages, 3147 KB

Open Access Review

Perception–Production of Second-Language Mandarin Tones Based on Interpretable Computational Methods: A Review

by Yujiao Huang, Zhaohong Xu, Xianming Bei and Huakun Huang

Mathematics 2026, 14(1), 145; https://doi.org/10.3390/math14010145 - 30 Dec 2025

Cited by 1 | Viewed by 1888

Abstract

We survey recent advances in second-language (L2) Mandarin lexical tones research and show how an interpretable computational approach can deliver parameter-aligned feedback across perception–production (P ↔ P). We synthesize four strands: (A) conventional evaluations and tasks (identification, same–different, imitation/read-aloud) that reveal robust tone-pair asymmetries and early P ↔ P decoupling; (B) physiological and behavioral instrumentation (e.g., EEG, eye-tracking) that clarifies cue weighting and time course; (C) audio-only speech analysis, from classic F0 tracking and MFCC–prosody fusion to CNN/RNN/CTC and self-supervised pipelines; and (D) interpretable learning, including attention and relational models (e.g., graph neural networks, GNNs) opened with explainable AI (XAI). Across strands, evidence converges on tones as time-evolving F0 trajectories, so movement, turning-point timing, and local F0 range are more diagnostic than height alone, and the contrast between Tone 2 (rising) and Tone 3 (dipping/low) remains the persistent difficulty; learners with tonal vs. non-tonal language backgrounds weight these cues differently. Guided by this synthesis, we outline a tool-oriented framework that pairs perception and production on the same items, jointly predicts tone labels and parameter targets, and uses XAI to generate local attributions and counterfactual edits, making feedback classroom-ready. Full article

(This article belongs to the Section E1: Mathematics and Computer Science)

14 pages, 639 KB

Open Access Article

Recognising Emotions from the Voice: A tDCS and fNIRS Double-Blind Study on the Role of the Cerebellum in Emotional Prosody

by Sharon Mara Luciano, Laura Sagliano, Alessia Salzillo, Luigi Trojano and Francesco Panico

Brain Sci. 2025, 15(12), 1327; https://doi.org/10.3390/brainsci15121327 - 13 Dec 2025

Cited by 2 | Viewed by 941

Abstract

Background: Emotional prosody refers to the variations in pitch, pause, melody, rhythm, and stress of pronunciation conveying emotional meaning during speech. Although several studies demonstrated that the cerebellum is involved in the network subserving recognition of emotional facial expressions, there is only preliminary evidence suggesting its possible contribution to recognising emotional prosody by modulating the activity of cerebello-prefrontal circuits. The present study aims to further explore the role of the left and right cerebellum in the recognition of emotional prosody in a sample of healthy individuals who were required to identify emotions (happiness, anger, sadness, surprise, disgust, and neutral) from vocal stimuli selected from a validated database (EMOVO corpus). Methods: Anodal transcranial Direct Current Stimulation (tDCS) was used in offline mode to modulate cerebellar activity before the emotional prosody recognition task, and functional near-infrared spectroscopy (fNIRS) was used to monitor stimulation-related changes in oxy- and deoxy-haemoglobin (O2HB and HHB) in prefrontal areas (PFC). Results: Right cerebellar stimulation reduced reaction times in the recognition of all emotions (except neutral and disgust) as compared to both the sham and left cerebellar stimulation, while accuracy was not affected by the stimulation. Haemodynamic data revealed that right cerebellar stimulation reduced O2HB and increased HHB in the PFC bilaterally relative to the other stimulation conditions. Conclusions: These findings are consistent with the involvement of the right cerebellum in modulating emotional processing and in regulating cerebello-prefrontal circuits. Full article

(This article belongs to the Topic The Relationship Between Bodily, Autonomic, and Communicative Behaviors and the Experiential and Cognitive Aspects of Emotion)

24 pages, 3980 KB

Open Access Article

Bridging Text and Speech for Emotion Understanding: An Explainable Multimodal Transformer Fusion Framework with Unified Audio–Text Attribution

by Ashutosh Pandey, Jasmeet Singh and Maninder Kaur

J. Intell. 2025, 13(12), 159; https://doi.org/10.3390/jintelligence13120159 - 3 Dec 2025

Cited by 4 | Viewed by 1841

Abstract

Conversational interactions, rich in both linguistic and vocal cues, provide a natural context for studying these processes. In this work, we propose an explainable multimodal transformer framework that integrates textual semantics (via RoBERTa) and acoustic prosody (via WavLM) to advance emotion understanding. By projecting both modalities into a shared latent space, our model captures the complementary contributions of language and speech to affective communication, achieving an accuracy of 0.83 across five emotion categories. Crucially, we embed explainable AI (XAI) techniques, including Integrated Gradients and Occlusion, to attribute predictions to specific linguistic tokens and prosodic patterns, thereby aligning computational mechanisms with human cognitive processes of emotion perception. Beyond performance gains, this work demonstrates how multimodal AI systems can support transparent, human-centered emotion recognition. Full article

(This article belongs to the Special Issue Social Cognition and Emotions)

14 pages, 2974 KB

Open Access Data Descriptor

Articulatory Data on Preboundary Lengthening Across Prominence Conditions in American English

by Jiyoung Jang, Sahyang Kim and Taehong Cho

Data 2025, 10(12), 197; https://doi.org/10.3390/data10120197 - 1 Dec 2025

Viewed by 618

Abstract

This article presents articulatory–kinematic data on preboundary lengthening (Intonational Phrase-final lengthening) from the productions of ten native speakers of American English—a relatively rare class of phonetic data compared with the more widely available acoustic data. The dataset includes three trisyllabic nonce words (bábaba, babába, bababá), each designed to manipulate the location of lexical stress. These were produced under prosodic conditions that varied in boundary position and focus-induced phrasal prominence, enabling analysis of how preboundary lengthening is distributed across words with different lexical stress locations and how it interacts with prosodic prominence. Articulatory data were collected using electromagnetic articulography (EMA, Carstens AG200), providing kinematic measurements such as movement duration, peak velocity, and displacement of articulatory gestures. The accompanying files allow examination of individual speaker variation in these measures as modulated by prosodic structure, including boundary and prominence effects. While theoretical findings have been reported in a previous study, the full dataset, including detailed descriptions of individual speaker patterns, is made available here. By making these less commonly available articulatory data publicly available, we aim to promote broad reuse and support further research in prosody, articulatory phonetics, and speech production. Full article

25 pages, 2990 KB

Open Access Article

Declination and Segmentation in Children with Childhood Apraxia of Speech

by Jill C. Thorson, Rachel T. Babcock, Julia M. Fisher, Kirrie J. Ballard and Donald A. Robin

Languages 2025, 10(12), 296; https://doi.org/10.3390/languages10120296 - 30 Nov 2025

Viewed by 1559

Abstract

Childhood apraxia of speech (CAS) is characterized by atypical timing between segments, leading to prosodic disruption at the lexical level. This study tested whether prosodic impairment in CAS extends to the intonational level by examining declination of fundamental frequency (f0). Eleven children with CAS and ten typically developing (TD) peers aged 5 to 11 years old produced real and nonce multisyllabic words embedded in carrier phrases. Acoustic measures of inter-segment duration (within-word, between-word) and average f0 across segments were extracted. Children with CAS exhibited significantly longer inter-segment durations both within and between words, influenced by lexical stress position (first syllable, second syllable) and word status (real, nonce). They also showed shallower f0 declination slopes than TD peers, indicating reduced overall pitch fall. Segmentation and declination were not significantly correlated, suggesting distinct mechanisms underlying timing and pitch organization. Consistent with prior work, segmentation was greatest for nonce words with non-initial stress. Reduced declination in CAS may reflect limitations in prosodic planning or programming at the intonational level. These findings highlight dissociable disruptions in timing and pitch patterning in CAS, contributing to a more comprehensive understanding of prosodic control in motor speech disorders. Full article

(This article belongs to the Special Issue Advances in the Acquisition of Prosody)
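The abstract above compares f0 declination slopes between groups. One simple way to obtain such a slope is an ordinary least-squares line through per-segment mean f0 values over time, sketched below. This is an illustrative assumption, not the authors' stated procedure; the function name and the toy values are invented.

```python
def declination_slope(times, f0_values):
    """Least-squares slope of f0 over time, in Hz per second."""
    n = len(times)
    if n < 2 or n != len(f0_values):
        raise ValueError("need at least two paired (time, f0) points")
    mean_t = sum(times) / n
    mean_f = sum(f0_values) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    if sxx == 0:
        raise ValueError("time points must not all be identical")
    sxy = sum((t - mean_t) * (f - mean_f) for t, f in zip(times, f0_values))
    return sxy / sxx


# Invented per-segment mean f0 values (Hz) at segment midpoints (s).
times = [0.0, 0.5, 1.0, 1.5]
steep = declination_slope(times, [260.0, 250.0, 240.0, 230.0])
shallow = declination_slope(times, [260.0, 258.0, 256.0, 254.0])
print(f"steep = {steep:.1f} Hz/s, shallow = {shallow:.1f} Hz/s")
```

A slope closer to zero, as in the second toy contour, corresponds to the "shallower" declination the abstract describes: the pitch falls less over the course of the phrase.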

27 pages, 1695 KB

Open Access Review

Overcoming the Challenge of Singing Among Cochlear Implant Users: An Analysis of the Disrupted Feedback Loop and Strategies for Improvement

by Stephanie M. Younan, Emmeline Y. Lin, Brooke Barry, Arjun Kurup, Karen C. Barrett and Nicole T. Jiam

Brain Sci. 2025, 15(11), 1192; https://doi.org/10.3390/brainsci15111192 - 4 Nov 2025

Cited by 1 | Viewed by 2416

Abstract

Background: Cochlear implants (CIs) are transformative neuroprosthetics that restore speech perception for individuals with severe-to-profound hearing loss. However, temporal envelope cues are well-represented within the signal processing, while spectral envelope cues are poorly accessed by CI users, resulting in substantial deficits compared to normal-hearing individuals. This profoundly impairs the perception of complex auditory stimuli like music and vocal prosody, significantly impacting users’ quality of life, social engagement, and artistic expression. Methods: This narrative review synthesizes research on CI signal-processing limitations, perceptual and production challenges in music and singing, the role of the auditory–motor feedback loop, and strategies for improvement, including rehabilitation, technology, and the influence of neuroplasticity and sensitive developmental periods. Results: The degraded signal causes marked deficits in pitch, timbre, and vocal emotion perception. Critically, this impoverished input functionally breaks the high-fidelity auditory–motor feedback loop essential for vocal control, transforming it from a precise fine-tuner into a gross error detector sensitive only to massive pitch shifts (~6 semitones). This neurophysiological breakdown directly causes pervasive pitch inaccuracies and melodic distortion in singing. Despite these challenges, improvements are possible through advanced sound-processing strategies, targeted auditory–motor training that leverages neuroplasticity, and capitalizing on sensitive periods for auditory development. Conclusions: The standard CI signal creates a fundamental neurophysiological barrier to singing. Overcoming this requires a paradigm shift toward holistic, patient-centered care that moves beyond speech-centric goals. Integrating personalized, music-based rehabilitation with advanced CI programming is essential for improving vocal production, fostering musical engagement, and ultimately enhancing the overall quality of life for CI users. Full article

(This article belongs to the Special Issue Language, Communication and the Brain—2nd Edition)

19 pages, 7222 KB

Open Access Article

Multi-Channel Spectro-Temporal Representations for Speech-Based Parkinson’s Disease Detection

by Hadi Sedigh Malekroodi, Nuwan Madusanka, Byeong-il Lee and Myunggi Yi

J. Imaging 2025, 11(10), 341; https://doi.org/10.3390/jimaging11100341 - 1 Oct 2025

Cited by 1 | Viewed by 1798

Abstract

Early, non-invasive detection of Parkinson’s Disease (PD) using speech analysis offers promise for scalable screening. In this work, we propose a multi-channel spectro-temporal deep-learning approach for PD detection from sentence-level speech, a clinically relevant yet underexplored modality. We extract and fuse three complementary time–frequency representations—mel spectrogram, constant-Q transform (CQT), and gammatone spectrogram—into a three-channel input analogous to an RGB image. This fused representation is evaluated across CNNs (ResNet, DenseNet, and EfficientNet) and Vision Transformer using the PC-GITA dataset, under 10-fold subject-independent cross-validation for robust assessment. Results showed that fusion consistently improves performance over single representations across architectures. EfficientNet-B2 achieves the highest accuracy (84.39% ± 5.19%) and F1-score (84.35% ± 5.52%), outperforming recent methods using handcrafted features or pretrained models (e.g., Wav2Vec2.0, HuBERT) on the same task and dataset. Performance varies with sentence type, with emotionally salient and prosodically emphasized utterances yielding higher AUC, suggesting that richer prosody enhances discriminability. Our findings indicate that multi-channel fusion enhances sensitivity to subtle speech impairments in PD by integrating complementary spectral information. Our approach implies that multi-channel fusion could enhance the detection of discriminative acoustic biomarkers, potentially offering a more robust and effective framework for speech-based PD screening, though further validation is needed before clinical application. Full article

(This article belongs to the Special Issue Celebrating the 10th Anniversary of the Journal of Imaging)

29 pages, 2669 KB

Open Access Article

How Has Poets’ Reading Style Changed? A Phonetic Analysis of the Effects of Historical Phases and Gender on 20th Century Spanish Poetry Reading

by Valentina Colonna

Languages 2025, 10(10), 255; https://doi.org/10.3390/languages10100255 - 30 Sep 2025

Viewed by 1836

Abstract

Poetry reading remains a largely underexplored area in phonetic research. While previous studies have highlighted its potential and challenges, experimental research in the Spanish context is still limited. This study aims to examine the evolution of Spanish poetry reading over time, focusing on its main prosodic features. Applying the VIP-VSP phonetic model to 40 poetry recordings, we analyzed the organizational and prosodic indices that characterize poetry reading. Mean speech rate, plenus (the ratio of speaking time to pausing), and pitch span emerged as key parameters for capturing change. The results identified two distinct historical phases—first and second radio-television—showing significant effects on speech rate, plenus, and pitch span: speech rate and pitch span increased over time, while plenus decreased. Gender also played a key role, with female voices exhibiting significantly higher values in both pitch span and plenus. Variability and recurring strategies were observed within and across authors. This study confirms that poetry reading has evolved along a ‘stylistic-chronological’ trajectory, while also reflecting gender-based distinctions. These findings underscore the need for interdisciplinary analytical approaches and diversified classification groupings to fully capture the complexity of this mode of speech. Full article

Search Results (88)