Search Results (60)

Search Parameters:
Keywords = phoneme recognition

14 pages, 2521 KB  
Article
Longitudinal Correlation of Frequency-to-Place Mismatch and Postoperative Speech Perception Outcomes in Cochlear Implant Recipients: Monosyllable, Consonant, Word, and Sentence
by Toshihito Sahara, Yujiro Hoshi, Anjin Mori, Hajime Koyama, Yasuhiro Osaki, Waki Nakajima, Takeshi Fujita, Akinori Kashio and Katsumi Doi
Audiol. Res. 2026, 16(2), 56; https://doi.org/10.3390/audiolres16020056 - 10 Apr 2026
Viewed by 212
Abstract
Background/Objectives: Frequency-to-place mismatch between cochlear implant (CI) electrodes and cochlear tonotopy has been suggested to affect postoperative speech perception. This study aimed to examine the associations between frequency-to-place mismatch and speech perception outcomes across multiple linguistic levels in patients with CI and to assess how these associations change over time using postoperative computed tomography. Methods: This retrospective cohort study included 44 postlingually deafened adults who underwent unilateral cochlear implantation with a Flex28 electrode by a single surgeon at a tertiary care hospital. Speech perception was assessed using CI-2004, a Japanese speech perception test consisting of monosyllables, consonants, words, and sentences, in quiet settings at 3, 6, and 12 months after CI activation. Partial correlation analyses between frequency-to-place mismatch and postoperative speech perception scores were performed in 35 of the 44 patients, controlling for age and mean preoperative pure-tone thresholds. Results: Negative associations were observed between frequency-to-place mismatch and CI-2004 scores, particularly for monosyllable and consonant perception in uncorrected analyses. After correction for multiple comparisons, only consonant perception at 3 months after CI activation remained significant (r = −0.52, p = 0.002). Similar patterns were observed for other speech measures and at later time points, although these did not remain significant after correction. Conclusions: Frequency-to-place mismatch was associated with postoperative speech perception outcomes, particularly those involving phoneme-level recognition. After correction for multiple comparisons, only consonant perception at 3 months after CI activation remained significant. Full article
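The partial correlation analysis described in this abstract can be sketched by regressing both variables on the covariates and correlating the residuals. The snippet below is an illustrative sketch only, not the study's code; the sample size, variable names, and simulated values are all assumptions.

```python
# Minimal residual-based partial correlation sketch (not the study's analysis).
# Assumed inputs: per-patient frequency-to-place mismatch, one CI-2004 score,
# and the two covariates named in the abstract (age, preoperative PTA).
import numpy as np

def partial_corr(x, y, covariates):
    """Correlate x and y after removing the linear effect of the covariates."""
    Z = np.column_stack([np.ones(len(x)), covariates])    # design matrix with intercept
    x_res = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]  # residualize x
    y_res = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]  # residualize y
    return np.corrcoef(x_res, y_res)[0, 1]

rng = np.random.default_rng(0)
n = 35                                            # patients with usable CT data
mismatch = rng.normal(size=n)                     # hypothetical mismatch values
age = rng.normal(65, 10, size=n)
pta = rng.normal(90, 15, size=n)                  # hypothetical preoperative PTA (dB HL)
score = -0.5 * mismatch + rng.normal(size=n)      # hypothetical consonant scores
print(partial_corr(mismatch, score, np.column_stack([age, pta])))
```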

23 pages, 1714 KB  
Article
Enhancing Korean-Accented English ASR with Transliteration-Based Data Synthesis
by Hana Jang, Taehwa Kim, Hyungwoo Choi and Youngbeom Jung
Electronics 2026, 15(7), 1380; https://doi.org/10.3390/electronics15071380 - 26 Mar 2026
Viewed by 450
Abstract
Despite recent advances in automatic speech recognition (ASR), performance remains limited for Korean-accented English due to the limited availability of accent-specific speech data, including pronunciation and prosodic variations. To address this limitation, we propose a synthetic data generation framework for improving Whisper-based ASR performance. Synthetic speech is generated by converting English text into Hangul-based phonetic transcriptions using an intermediate IPA representation to reflect the phonological characteristics of Korean-accented English. The ASR model is fine-tuned using Low-Rank Adaptation with a mixture of synthetic and authentic speech data. Experimental results demonstrate relative reductions of up to 16.40% in the character error rate, 14.93% in the word error rate, and 14.81% in the phoneme error rate compared to the pretrained baseline. Full article
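The character, word, and phoneme error rates cited in this abstract are all edit-distance ratios computed over different token units, and the quoted figures are relative reductions against the pretrained baseline. A small self-contained sketch of that arithmetic, not tied to the paper's pipeline:

```python
# Generic error-rate sketch: WER/CER/PER = Levenshtein distance / reference length,
# with the sequence tokenized into words, characters, or phonemes respectively.
def edit_distance(ref, hyp):
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[-1]

def error_rate(ref_tokens, hyp_tokens):
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)

wer = error_rate("the cat sat".split(), "the cat sad".split())   # word level
per = error_rate(list("ðəkætsæt"), list("ðəkætsæd"))             # toy phoneme strings
relative_reduction = (0.30 - 0.25) / 0.30        # e.g. baseline 30% WER -> 25% WER
print(wer, per, f"{relative_reduction:.1%} relative reduction")
```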

17 pages, 391 KB  
Article
Phoneme Automaticity: A Test of the Phonemic Proficiency Hypothesis
by David D. Paige and William H. Rupley
Educ. Sci. 2026, 16(2), 286; https://doi.org/10.3390/educsci16020286 - 10 Feb 2026
Viewed by 552
Abstract
Readers use letter-sound knowledge and blending skills to consciously decode unfamiliar words, gradually building a large inventory of recognizable words. The storage of these words in long-term memory depends on forming connections between phonological sounds and their written forms, which results in a highly accurate and automatic recall of word pronunciations. For words to be read instantly, these phoneme-grapheme connections need to operate automatically, allowing for rapid, effortless recognition. In this study, we examine the connections between rapid access to phonemic sounds, spelling ability, and word reading among first- and second-grade students. Using a quantitative, correlational design, we examined first- and second-grade students’ performance on timed phoneme substitution, spelling, and sight-word reading and tested predictive relations using ordinary least squares regression. The results show that rapid skill in substituting phonemes directly predicts both automatic word reading and spelling. This finding highlights the importance of developing quick phonemic manipulation skills, which are essential for achieving fluent, precise word reading. Full article
(This article belongs to the Special Issue Advances in Evidence-Based Literacy Instructional Practices)
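The ordinary least squares regression described in this abstract, predicting word reading from timed phoneme substitution and spelling, can be outlined with statsmodels; the variable names, sample size, and random data below are placeholders, not the study's measures.

```python
# Illustrative OLS regression: does timed phoneme substitution predict sight-word reading?
# All data are random placeholders standing in for the study's measures.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 120                                                # hypothetical number of students
phoneme_substitution = rng.normal(size=n)              # timed phoneme-substitution score (z)
spelling = 0.6 * phoneme_substitution + rng.normal(scale=0.8, size=n)
word_reading = 0.5 * phoneme_substitution + 0.3 * spelling + rng.normal(scale=0.7, size=n)

X = sm.add_constant(np.column_stack([phoneme_substitution, spelling]))
model = sm.OLS(word_reading, X).fit()
print(model.params)        # intercept and slopes for the two predictors
print(model.rsquared)      # variance in word reading explained by the predictors
```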

36 pages, 1309 KB  
Article
Listen Closely: Self-Supervised Phoneme Tracking for Children’s Reading Assessment
by Philipp Ollmann, Erik Sonnleitner, Marc Kurz, Jens Krösche and Stephan Selinger
Information 2026, 17(1), 40; https://doi.org/10.3390/info17010040 - 4 Jan 2026
Viewed by 871
Abstract
Reading proficiency in early childhood is crucial for academic success and intellectual development, yet a growing number of children struggle with reading: according to the most recent PISA study in Austria, one in five children has reading difficulties. The reasons are diverse, but an application that tracks children while they read aloud and guides them when they run into difficulties could offer meaningful help. This work therefore explores a prototype of a core component that tracks children’s reading using a self-supervised Wav2Vec2 model and a limited amount of data. Self-supervised learning allows models to learn general representations from large amounts of unlabeled audio, which can then be fine-tuned on smaller, task-specific datasets, making it especially useful when labeled data is scarce. The model operates at the phonetic level using the International Phonetic Alphabet (IPA). The KidsTALC dataset from Leibniz University Hannover, which contains spontaneous speech recordings of German-speaking children, was used for training. To enlarge the training data and improve robustness, several data augmentation techniques were applied and evaluated, including pitch shifting, formant shifting, and speed variation. Models were trained on different data configurations to compare the effects of data variety and quality on recognition performance. The best model achieved a phoneme error rate (PER) of 14.3% and a word error rate (WER) of 31.6% on unseen child speech, demonstrating the potential of self-supervised models for this use case. Full article
(This article belongs to the Special Issue AI Technology-Enhanced Learning and Teaching)
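Two of the augmentations named in this abstract, pitch shifting and speed variation, can be applied offline to the waveforms before fine-tuning. The librosa-based sketch below is an illustration of those two operations only, not the authors' pipeline; the file name is a placeholder.

```python
# Illustrative waveform augmentation: pitch shift and speed (tempo) variation,
# two of the techniques named in the abstract. Not the study's implementation.
import librosa

y, sr = librosa.load("child_utterance.wav", sr=16000)          # placeholder file name

y_pitched = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)   # shift up 2 semitones
y_slower = librosa.effects.time_stretch(y, rate=0.9)           # 10% slower, pitch unchanged

# Each augmented copy keeps the original phoneme transcript as its label,
# multiplying the effective amount of labelled child speech.
```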

11 pages, 555 KB  
Article
Human–AI Feedback Loop for Pronunciation Training: A Mobile Application with Phoneme-Level Error Highlighting
by Aleksei Demin, Georgii Vorontsov and Dmitrii Chaikovskii
Multimodal Technol. Interact. 2026, 10(1), 2; https://doi.org/10.3390/mti10010002 - 26 Dec 2025
Viewed by 1471
Abstract
This paper presents an AI-augmented pronunciation training approach for Russian language learners through a mobile application that supports an interactive learner–system feedback loop. The system combines a pre-trained Wav2Vec2Phoneme neural network with Needleman–Wunsch global sequence alignment to convert reference and learner speech into aligned phoneme sequences. Rather than producing an overall pronunciation score, the application provides localized, interpretable feedback by highlighting phoneme-level matches and mismatches in a red/green transcription, enabling learners to see where sounds were substituted, omitted, or added. Implemented as a WeChat Mini Program with a WebSocket-based backend, the design illustrates how speech-to-phoneme models and alignment procedures can be integrated into a lightweight mobile interface for autonomous pronunciation practice. We further provide a feature-level comparison with widely used commercial applications (Duolingo, HelloChinese, Babbel), emphasizing differences in feedback granularity and interpretability rather than unvalidated accuracy claims. Overall, the work demonstrates the feasibility of alignment-based phoneme-level feedback for mobile pronunciation training and motivates future evaluation of recognition reliability, latency, and learning outcomes on representative learner data. Full article
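Needleman–Wunsch global alignment of a reference and a learner phoneme sequence is the core step this abstract describes; aligned positions can then be rendered as matches, substitutions, insertions, or omissions. The sketch below is a generic implementation of the algorithm, not the app's code, and the scoring values are assumptions.

```python
# Generic Needleman-Wunsch global alignment of reference vs. learner phonemes.
# Scores (match=+1, mismatch=-1, gap=-1) are illustrative assumptions.
def needleman_wunsch(ref, hyp, match=1, mismatch=-1, gap=-1):
    n, m = len(ref), len(hyp)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if ref[i - 1] == hyp[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    # Traceback into aligned pairs; None marks a gap (omission or insertion).
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
                match if ref[i - 1] == hyp[j - 1] else mismatch):
            pairs.append((ref[i - 1], hyp[j - 1])); i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((ref[i - 1], None)); i -= 1      # learner omitted a phoneme
        else:
            pairs.append((None, hyp[j - 1])); j -= 1      # learner inserted a phoneme
    return pairs[::-1]

for r, h in needleman_wunsch(list("prʲivʲet"), list("privet")):
    status = ("match" if r == h else "omitted" if h is None
              else "added" if r is None else "substituted")
    print(r, h, status)
```

In the application described above, the match/substitution/gap labels would drive the red/green phoneme highlighting shown to the learner.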

34 pages, 1422 KB  
Article
Orthographic Learning of Inconsistent Non-Words in Good and Poor Spellers: Linking Dictation and Eye-Tracking Measures
by Julie Robidoux, Antonin Rossier-Bisaillon, Boutheina Jemel and Brigitte Stanké
Behav. Sci. 2026, 16(1), 22; https://doi.org/10.3390/bs16010022 - 22 Dec 2025
Viewed by 542
Abstract
The French writing system contains numerous phoneme-to-grapheme inconsistencies that vary in their properties and distribution across words. These inconsistencies represent a major challenge for children learning to spell, especially for poor spellers or children with dyslexia-dysorthographia. To our knowledge, no study has examined how inconsistencies shape orthographic learning using both eye-movement data and dictation performance, in children with good and poor spelling skills. In this eye-tracking study, twenty French-speaking children aged 9 to 12 (good spellers: n = 10; poor spellers: n = 10) learned the spelling of six bisyllabic non-words containing an inconsistent syllable across three learning cycles while we recorded their eye movements. One week later, children completed delayed dictation and recognition tasks assessing long-term consolidation and retrieval. Both groups improved their spelling accuracy and exhibited shorter and fewer fixations across learning cycles, reflecting progressive orthographic learning. However, poor spellers fixated more often and longer on the inconsistent syllable and demonstrated weaker long-term retention, suggesting a less holistic encoding and difficulties consolidating orthographic representations over time. Future research should examine whether these learning patterns generalize to real words, classroom contexts, and to children with dyslexia-dysorthographia across broader learning conditions. Full article
(This article belongs to the Special Issue Understanding Dyslexia and Developmental Language Disorders)

29 pages, 1708 KB  
Article
Speech Recognition and Synthesis Models and Platforms for the Kazakh Language
by Aidana Karibayeva, Vladislav Karyukin, Balzhan Abduali and Dina Amirova
Information 2025, 16(10), 879; https://doi.org/10.3390/info16100879 - 10 Oct 2025
Cited by 1 | Viewed by 5015
Abstract
With the rapid development of artificial intelligence and machine learning technologies, automatic speech recognition (ASR) and text-to-speech (TTS) have become key components of the digital transformation of society. The Kazakh language, as a representative of the Turkic language family, remains a low-resource language with limited audio corpora, language models, and high-quality speech synthesis systems. This study provides a comprehensive analysis of existing speech recognition and synthesis models, emphasizing their applicability and adaptation to the Kazakh language. Special attention is given to linguistic and technical barriers, including the agglutinative structure, rich vowel system, and phonemic variability. Both open-source and commercial solutions were evaluated, including Whisper, GPT-4 Transcribe, ElevenLabs, OpenAI TTS, Voiser, KazakhTTS2, and TurkicTTS. Speech recognition systems were assessed using BLEU, WER, TER, chrF, and COMET, while speech synthesis was evaluated with MCD, PESQ, STOI, and DNSMOS, thus covering both lexical–semantic and acoustic–perceptual characteristics. The results demonstrate that, for speech-to-text (STT), the strongest performance was achieved by Soyle on domain-specific data (BLEU 74.93, WER 18.61), while Voiser showed balanced accuracy (WER 40.65–37.11, chrF 80.88–84.51) and GPT-4 Transcribe achieved robust semantic preservation (COMET up to 1.02). In contrast, Whisper performed weakest (WER 77.10, BLEU 13.22), requiring further adaptation for Kazakh. For text-to-speech (TTS), KazakhTTS2 delivered the most natural perceptual quality (DNSMOS 8.79–8.96), while OpenAI TTS achieved the best spectral accuracy (MCD 123.44–117.11, PESQ 1.14). TurkicTTS offered reliable intelligibility (STOI 0.15, PESQ 1.16), and ElevenLabs produced natural but less spectrally accurate speech. Full article
(This article belongs to the Section Artificial Intelligence)

31 pages, 3671 KB  
Article
Research on Wu Dialect Recognition and Regional Variations Based on Deep Learning
by Xinyi Yue, Lizhi Miao and Jiahao Ding
Appl. Sci. 2025, 15(18), 10227; https://doi.org/10.3390/app151810227 - 19 Sep 2025
Cited by 2 | Viewed by 2756
Abstract
Wu dialects carry deep regional culture, but due to significant internal variations, automated recognition faces considerable challenges. This study focuses on speech recognition and semantic feedback for Wu dialects, proposing a deep learning system with regional adaptability and semantic feedback capabilities. First, a corpus covering multiple Wu dialect regions (WXDPC) is constructed, and a two-level phoneme mapping and regional difference modeling mechanism is introduced. By incorporating geographical region labels and transfer learning, the model’s performance in non-central regions is improved. Experimental results show that as the training corpus increases, the model’s CER significantly decreases. After the introduction of regional labels, the CER in non-central Wu dialect regions decreased by 4.5%, demonstrating the model’s effectiveness in complex dialect environments. This system provides technical support for the preservation and application of Wu dialects and offers valuable experience for the promotion of other dialect recognition systems. Full article

25 pages, 4660 KB  
Article
Dual-Stream Former: A Dual-Branch Transformer Architecture for Visual Speech Recognition
by Sanghun Jeon, Jieun Lee and Yong-Ju Lee
AI 2025, 6(9), 222; https://doi.org/10.3390/ai6090222 - 9 Sep 2025
Cited by 1 | Viewed by 2908
Abstract
This study proposes Dual-Stream Former, a novel architecture that integrates a Video Swin Transformer and Conformer designed to address the challenges of visual speech recognition (VSR). The model captures spatiotemporal dependencies, achieving a state-of-the-art character error rate (CER) of 3.46%, surpassing traditional convolutional neural network (CNN)-based models, such as 3D-CNN + DenseNet-121 (CER: 5.31%), and transformer-based alternatives, such as vision transformers (CER: 4.05%). The Video Swin Transformer captures multiscale spatial representations with high computational efficiency, whereas the Conformer back-end enhances temporal modeling across diverse phoneme categories. Evaluation of a high-resolution dataset comprising 740,000 utterances across 185 classes highlighted the effectiveness of the model in addressing visually confusing phonemes, such as diphthongs (/ai/, /au/) and labio-dental sounds (/f/, /v/). Dual-Stream Former achieved phoneme recognition error rates of 10.39% for diphthongs and 9.25% for labiodental sounds, surpassing those of CNN-based architectures by more than 6%. Although the model’s large parameter count (168.6 M) poses resource challenges, its hierarchical design ensures scalability. Future work will explore lightweight adaptations and multimodal extensions to increase deployment feasibility. These findings underscore the transformative potential of Dual-Stream Former for advancing VSR applications such as silent communication and assistive technologies by achieving unparalleled precision and robustness in diverse settings. Full article

18 pages, 2065 KB  
Article
Phoneme-Aware Augmentation for Robust Cantonese ASR Under Low-Resource Conditions
by Lusheng Zhang, Shie Wu and Zhongxun Wang
Symmetry 2025, 17(9), 1478; https://doi.org/10.3390/sym17091478 - 8 Sep 2025
Cited by 2 | Viewed by 1597
Abstract
Cantonese automatic speech recognition (ASR) faces persistent challenges due to its nine lexical tones, extensive phonological variation, and the scarcity of professionally transcribed corpora. To address these issues, we propose a lightweight and data-efficient framework that leverages weak phonetic supervision (WPS) in conjunction with two phoneme-aware augmentation strategies. (1) Dynamic Boundary-Aligned Phoneme Dropout progressively removes entire IPA segments according to a curriculum schedule, simulating real-world phenomena such as elision, lenition, and tonal drift while ensuring training stability. (2) Phoneme-Aware SpecAugment confines all time- and frequency-masking operations within phoneme boundaries and prioritizes high-attention regions, thereby preserving intra-phonemic contours and formant integrity. Built on the Whistle encoder—which integrates a Conformer backbone, Connectionist Temporal Classification–Conditional Random Field (CTC-CRF) alignment, and a multilingual phonetic space—the approach requires only a grapheme-to-phoneme lexicon and Montreal Forced Aligner outputs, without any additional manual labeling. Experiments on the Cantonese subset of Common Voice demonstrate consistent gains: Dynamic Dropout alone reduces phoneme error rate (PER) from 17.8% to 16.7% with 50 h of speech and from 16.4% to 15.1% with 100 h, while the combination of the two augmentations further lowers PER to 15.9%/14.4%. These results confirm that structure-aware phoneme-level perturbations provide an effective and low-cost solution for building robust Cantonese ASR systems under low-resource conditions. Full article
(This article belongs to the Section Computer)
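Confining SpecAugment masks to phoneme boundaries, as this abstract describes, means each time mask is drawn inside a single aligned segment rather than across segment edges. The numpy sketch below is schematic only; the spectrogram, phoneme spans, and mask sizes are invented, not the paper's configuration.

```python
# Schematic phoneme-bounded time masking on a log-mel spectrogram.
# Boundaries would come from a forced aligner; all values here are invented.
import numpy as np

rng = np.random.default_rng(0)
spec = rng.normal(size=(80, 300))                                    # (mel bins, frames)
boundaries = [(0, 40), (40, 95), (95, 160), (160, 230), (230, 300)]  # assumed phoneme spans

def phoneme_bounded_time_mask(spec, boundaries, max_width=15, n_masks=2):
    """Apply n_masks time masks, each confined inside a single phoneme span."""
    out = spec.copy()
    for idx in rng.choice(len(boundaries), size=n_masks, replace=False):
        start, end = boundaries[idx]
        width = int(rng.integers(1, min(max_width, end - start) + 1))
        t0 = int(rng.integers(start, end - width + 1))
        out[:, t0:t0 + width] = out.mean()        # mask never crosses the phoneme boundary
    return out

masked = phoneme_bounded_time_mask(spec, boundaries)
```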

37 pages, 618 KB  
Systematic Review
Interaction, Artificial Intelligence, and Motivation in Children’s Speech Learning and Rehabilitation Through Digital Games: A Systematic Literature Review
by Chra Abdoulqadir and Fernando Loizides
Information 2025, 16(7), 599; https://doi.org/10.3390/info16070599 - 12 Jul 2025
Cited by 4 | Viewed by 5159
Abstract
The integration of digital serious games into speech learning (rehabilitation) has demonstrated significant potential in enhancing accessibility and inclusivity for children with speech disabilities. This review of the state of the art examines the role of serious games, Artificial Intelligence (AI), and Natural Language Processing (NLP) in speech rehabilitation, with a particular focus on interaction modalities, engagement autonomy, and motivation. We have reviewed 45 selected studies. Our key findings show how intelligent tutoring systems, adaptive voice-based interfaces, and gamified speech interventions can empower children to engage in self-directed speech learning, reducing dependence on therapists and caregivers. The diversity of interaction modalities, including speech recognition, phoneme-based exercises, and multimodal feedback, demonstrates how AI and Assistive Technology (AT) can personalise learning experiences to accommodate diverse needs. Furthermore, the incorporation of gamification strategies, such as reward systems and adaptive difficulty levels, has been shown to enhance children’s motivation and long-term participation in speech rehabilitation. The gaps identified show that despite advancements, challenges remain in achieving universal accessibility, particularly regarding speech recognition accuracy, multilingual support, and accessibility for users with multiple disabilities. This review advocates for interdisciplinary collaboration across educational technology, special education, cognitive science, and human–computer interaction (HCI). Our work contributes to the ongoing discourse on lifelong inclusive education, reinforcing the potential of AI-driven serious games as transformative tools for bridging learning gaps and promoting speech rehabilitation beyond clinical environments. Full article

15 pages, 1359 KB  
Article
Phoneme-Aware Hierarchical Augmentation and Semantic-Aware SpecAugment for Low-Resource Cantonese Speech Recognition
by Lusheng Zhang, Shie Wu and Zhongxun Wang
Sensors 2025, 25(14), 4288; https://doi.org/10.3390/s25144288 - 9 Jul 2025
Cited by 2 | Viewed by 1949
Abstract
Cantonese Automatic Speech Recognition (ASR) is hindered by tonal complexity, acoustic diversity, and a lack of labelled data. This study proposes a phoneme-aware hierarchical augmentation framework that enhances performance without additional annotation. A Phoneme Substitution Matrix (PSM), built from Montreal Forced Aligner alignments and Tacotron-2 synthesis, injects adversarial phoneme variants into both transcripts and their aligned audio segments, enlarging pronunciation diversity. Concurrently, a semantic-aware SpecAugment scheme exploits wav2vec 2.0 attention heat maps and keyword boundaries to adaptively mask informative time–frequency regions; a reinforcement-learning controller tunes the masking schedule online, forcing the model to rely on a wider context. On the Common Voice Cantonese 50 h subset, the combined strategy reduces the character error rate (CER) from 26.17% to 16.88% with wav2vec 2.0 and from 38.83% to 23.55% with Zipformer. At 100 h, the CER further drops to 4.27% and 2.32%, yielding relative gains of 32–44%. Ablation studies confirm that phoneme-level and masking components provide complementary benefits. The framework offers a practical, model-independent path toward accurate ASR for Cantonese and other low-resource tonal languages. This paper presents an intelligent sensing-oriented modeling framework for speech signals, which is suitable for deployment on edge or embedded systems to process input from audio sensors (e.g., microphones) and shows promising potential for voice-interactive terminal applications. Full article
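A phoneme substitution matrix of the kind this abstract describes can be read as, for each phoneme, a probability distribution over plausible replacements; sampling from it yields pronunciation variants of a transcript. The toy sketch below uses an invented two-entry matrix, not the matrix built in the paper.

```python
# Toy phoneme-substitution sampling: each phoneme maps to candidate replacements
# with probabilities. Matrix entries are invented for illustration only.
import random

psm = {
    "n": [("n", 0.85), ("l", 0.15)],           # common Cantonese n-/l- variation
    "ts": [("ts", 0.90), ("tʃ", 0.10)],
}

def sample_variant(phonemes, psm, rng=random):
    variant = []
    for p in phonemes:
        candidates = psm.get(p, [(p, 1.0)])    # unknown phonemes stay unchanged
        symbols, weights = zip(*candidates)
        variant.append(rng.choices(symbols, weights=weights, k=1)[0])
    return variant

print(sample_variant(["n", "ei", "ts", "ɔ"], psm))
```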

15 pages, 1545 KB  
Article
Speech Recognition in Noise: Analyzing Phoneme, Syllable, and Word-Based Scoring Methods and Their Interaction with Hearing Loss
by Saransh Jain, Vijaya Kumar Narne, Bharani, Hema Valayutham, Thejaswini Madan, Sunil Kumar Ravi and Chandni Jain
Diagnostics 2025, 15(13), 1619; https://doi.org/10.3390/diagnostics15131619 - 26 Jun 2025
Viewed by 2503
Abstract
Introduction: This study aimed to compare scoring methods (phoneme-, syllable-, and word-based) for word recognition in noise testing and to examine their interaction with hearing loss severity. These scoring methods provide a structured framework for refining clinical audiological diagnosis by revealing underlying auditory processing at multiple linguistic levels. We highlight how scoring differences inform differential diagnosis and guide targeted audiological interventions. Methods: Pure tone audiometry and word-in-noise testing were conducted on 100 subjects with a wide range of hearing loss severity. Speech recognition was scored using phoneme-, syllable-, and word-based methods, with all procedures designed to reflect standard diagnostic protocols in clinical audiology. Discriminant function analysis examined how well these scoring methods differentiate degrees of hearing loss. Results: Each method provided unique information about auditory processing: phoneme-based scoring indexed basic auditory discrimination, syllable-based scoring captured temporal and phonological processing, and word-based scoring reflected real-world listening conditions by incorporating contextual knowledge. These findings emphasize the diagnostic value of each scoring approach in clinical settings, aiding differential diagnosis and treatment planning. Conclusions: The study showed how the choice of scoring method affects the differentiation of hearing loss by severity. We recommend integrating phoneme-based scoring into standard diagnostic batteries to enhance early detection and personalize rehabilitation strategies. Future research should examine integration with other speech perception tests and applicability across different clinical settings. Full article
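The difference between the scoring methods compared in this study can be seen on a single toy trial: a listener who repeats /kæp/ for the target /kæt/ earns partial credit under phoneme scoring but none under word scoring (syllable scoring falls between the two for multisyllabic items). The stimuli below are invented examples, not the study's materials.

```python
# Toy comparison of phoneme- vs word-based scoring for one word-in-noise trial.
target_phonemes = ["k", "æ", "t"]      # target word "cat"
response_phonemes = ["k", "æ", "p"]    # listener reported "cap"

phoneme_score = sum(t == r for t, r in zip(target_phonemes, response_phonemes)) / len(target_phonemes)
word_score = float(target_phonemes == response_phonemes)

print(f"phoneme scoring: {phoneme_score:.2f}, word scoring: {word_score:.0f}")
# phoneme scoring: 0.67, word scoring: 0 -> phoneme scoring preserves partial information
```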

17 pages, 3439 KB  
Article
A Novel Approach for Visual Speech Recognition Using the Partition-Time Masking and Swin Transformer 3D Convolutional Model
by Xiangliang Zhang, Yu Hu, Xiangzhi Liu, Yu Gu, Tong Li, Jibin Yin and Tao Liu
Sensors 2025, 25(8), 2366; https://doi.org/10.3390/s25082366 - 8 Apr 2025
Cited by 9 | Viewed by 3060
Abstract
Visual speech recognition is a technology that relies on visual information, offering unique advantages in noisy environments or when communicating with individuals with speech impairments. However, this technology still faces challenges, such as limited generalization ability due to different speech habits, high recognition error rates caused by confusable phonemes, and difficulties adapting to complex lighting conditions and facial occlusions. This paper proposes a lip reading data augmentation method—Partition-Time Masking (PTM)—to address these challenges and improve lip reading models’ performance and generalization ability. Applying nonlinear transformations to the training data enhances the model’s generalization ability when handling diverse speakers and environmental conditions. A lip-reading recognition model architecture, Swin Transformer and 3D Convolution (ST3D), was designed to overcome the limitations of traditional lip-reading models that use ResNet-based front-end feature extraction networks. By adopting a strategy that combines Swin Transformer and 3D convolution, the proposed model enhances performance. To validate the effectiveness of the Partition-Time Masking data augmentation method, experiments were conducted on the LRW video dataset using the DC-TCN model, achieving a peak accuracy of 92.15%. The ST3D model was validated on the LRW and LRW1000 video datasets, achieving a maximum accuracy of 56.1% on the LRW1000 dataset and 91.8% on the LRW dataset, outperforming current mainstream lip reading models and demonstrating superior performance on challenging easily confused samples. Full article
(This article belongs to the Special Issue Sensors for Biomechanical and Rehabilitation Engineering)

20 pages, 1420 KB  
Article
A Survey of Grapheme-to-Phoneme Conversion Methods
by Shiyang Cheng, Pengcheng Zhu, Jueting Liu and Zehua Wang
Appl. Sci. 2024, 14(24), 11790; https://doi.org/10.3390/app142411790 - 17 Dec 2024
Cited by 6 | Viewed by 9912
Abstract
Grapheme-to-phoneme conversion (G2P) is the task of converting letters (grapheme sequences) into their pronunciations (phoneme sequences). It plays a crucial role in natural language processing, text-to-speech synthesis, and automatic speech recognition systems. This paper provides a systematic overview of G2P conversion from different perspectives. The conversion methods are presented first, with detailed discussion of methods based on deep learning; for each method, the key ideas, advantages, disadvantages, and representative models are summarized. The paper then covers learning strategies and multilingual G2P conversion. Finally, it summarizes the commonly used monolingual and multilingual datasets, including Mandarin, Japanese, Arabic, and others, with two tables comparing the performance of various methods on the respective datasets. The overview concludes with current issues and future directions for deep learning-based G2P conversion. Full article
(This article belongs to the Collection Trends and Prospects in Multimedia)
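The dictionary-lookup baseline that G2P systems typically fall back on can be shown with the CMU Pronouncing Dictionary; out-of-vocabulary words are then handed to a trained (e.g. seq2seq) G2P model. The sketch below uses NLTK's copy of CMUdict and is illustrative only, not drawn from the survey.

```python
# Dictionary-based G2P lookup with CMUdict; out-of-vocabulary words would be
# deferred to a trained G2P model instead of returning None.
import nltk

nltk.download("cmudict", quiet=True)
from nltk.corpus import cmudict

pron = cmudict.dict()                  # word -> list of ARPAbet pronunciations

def g2p(word):
    entries = pron.get(word.lower())
    if entries:
        return entries[0]              # first listed pronunciation
    return None                        # OOV: defer to a neural G2P model

print(g2p("phoneme"))                  # ARPAbet phoneme list, e.g. ['F', 'OW1', ...]
print(g2p("qwertyuiop"))               # None -> out of vocabulary in CMUdict
```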