Search Results (36)

Search Parameters:
Keywords = multilingual speech recognition

10 pages, 873 KB  
Proceeding Paper
Utilizing Residual Network 50 Convolutional Neural Network Architecture for Enhanced Philippine Regional Language Classification on Jetson Orin Nano
by John Paul T. Cruz, Aaron B. Abadiano, FP O. Sangilan, Emmy Grace T. Requillo and Roben A. Juanatas
Eng. Proc. 2026, 134(1), 2; https://doi.org/10.3390/engproc2026134002 - 26 Mar 2026
Viewed by 338
Abstract
Visual speech recognition systems encounter significant challenges in multilingual nations such as the Philippines, where numerous regional languages, including Cebuano and Ilocano, feature distinct phonetic-visual characteristics. Deep learning models such as the Lip Reading Network and the Lightweight Crowd Segmentation Network have demonstrated strong performance with 3D Convolutional Neural Networks (CNNs). However, their substantial computational requirements restrict deployment on portable edge devices. We introduce a more efficient alternative that integrates a 2D Residual Network 50 architecture with a Long Short-Term Memory network and Connectionist Temporal Classification for lip-reading classification of Philippine regional languages. The proposed model is deployed on the Jetson Orin Nano, a high-performance edge device optimized for real-time inference through Compute Unified Device Architecture acceleration. The model's effectiveness was evaluated on a dataset of 2000 annotated videos encompassing 10 lexicons each for Cebuano and Ilocano. It achieved a regional language classification accuracy of 90%, with lexicon-level accuracies of 74% for Cebuano and 66% for Ilocano. This work, which leverages transfer learning on pretrained models, represents a step toward developing accessible and scalable communication aids for deaf communities in linguistically diverse environments. Full article
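As a concrete reference point, here is a minimal PyTorch sketch of a 2D ResNet-50 + LSTM + CTC lip-reading pipeline of the kind this abstract describes; the layer sizes, frame count, and 20-lexicon vocabulary are illustrative assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class LipReadingNet(nn.Module):
    def __init__(self, num_classes: int = 21):   # 20 lexicons + CTC blank (assumed)
        super().__init__()
        backbone = resnet50(weights="IMAGENET1K_V2")                # transfer learning
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # drop the FC head
        self.lstm = nn.LSTM(2048, 256, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(512, num_classes)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, 3, H, W) mouth-region crops
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).flatten(1)   # (b*t, 2048)
        seq, _ = self.lstm(feats.view(b, t, -1))            # (b, t, 512)
        return self.fc(seq).log_softmax(-1)                 # CTC expects log-probs

model = LipReadingNet()
x = torch.randn(2, 16, 3, 112, 112)              # 2 clips of 16 frames (assumed sizes)
log_probs = model(x).permute(1, 0, 2)            # (time, batch, classes) for CTCLoss
targets = torch.randint(1, 21, (2, 5))           # dummy lexicon-label sequences
loss = nn.CTCLoss(blank=0)(log_probs, targets,
                           torch.full((2,), 16), torch.full((2,), 5))
```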

39 pages, 1016 KB  
Article
The Development and Experimental Evaluation of a Multilingual Speech Corpus for Low-Resource Turkic Languages
by Aidana Karibayeva, Vladislav Karyukin, Ualsher Tukeyev, Balzhan Abduali, Dina Amirova, Diana Rakhimova, Rashid Aliyev and Assem Shormakova
Appl. Sci. 2025, 15(24), 12880; https://doi.org/10.3390/app152412880 - 5 Dec 2025
Viewed by 2317
Abstract
The development of parallel audio corpora for Turkic languages, such as Kazakh, Uzbek, and Tatar, remains a significant challenge for multilingual speech synthesis, recognition systems, and machine translation. These languages are low-resource in speech technologies, lacking the sufficiently large audio datasets with aligned transcriptions that are crucial for modern recognition, synthesis, and understanding systems. This article presents the development and experimental evaluation of a speech corpus focused on Turkic languages, intended for use in speech synthesis and automatic translation tasks. The primary objective is to create parallel audio corpora using a cascade generation method, which combines artificial intelligence and text-to-speech (TTS) technologies to generate both audio and text, and to evaluate the quality and suitability of the generated data. To evaluate the quality of synthesized speech, metrics measuring naturalness, intonation, expressiveness, and linguistic adequacy were applied. As a result, a multimodal (Kazakh–Turkish, Kazakh–Tatar, Kazakh–Uzbek) corpus was created, combining high-quality natural Kazakh audio with transcription and translation, along with synthetic audio in Turkish, Tatar, and Uzbek. These corpora offer a unique resource for speech and text processing research, enabling the integration of ASR, MT, TTS, and speech-to-speech translation (STS). Full article
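The cascade generation method described above (machine translation of the Kazakh transcript followed by TTS synthesis of the translation) can be sketched roughly as follows; GoogleTranslator and gTTS are illustrative stand-ins, not the components the authors used.

```python
from deep_translator import GoogleTranslator
from gtts import gTTS

def cascade_generate(kk_text: str, tgt_lang: str = "tr") -> str:
    """Cascade step: Kazakh transcript -> translated text -> synthetic target audio."""
    # MT stage (illustrative stand-in for the paper's translation component)
    tgt_text = GoogleTranslator(source="kk", target=tgt_lang).translate(kk_text)
    # TTS stage: synthesize audio for the translated text, giving a parallel pair
    gTTS(tgt_text, lang=tgt_lang).save(f"parallel_{tgt_lang}.mp3")
    return tgt_text
```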

41 pages, 2890 KB  
Article
STREAM: A Semantic Transformation and Real-Time Educational Adaptation Multimodal Framework in Personalized Virtual Classrooms
by Leyli Nouraei Yeganeh, Yu Chen, Nicole Scarlett Fenty, Amber Simpson and Mohsen Hatami
Future Internet 2025, 17(12), 564; https://doi.org/10.3390/fi17120564 - 5 Dec 2025
Viewed by 1614
Abstract
Most adaptive learning systems personalize around content sequencing and difficulty adjustment rather than transforming instructional material within the lesson itself. This paper presents the STREAM (Semantic Transformation and Real-Time Educational Adaptation Multimodal) framework. This modular pipeline decomposes multimodal educational content into semantically tagged, pedagogically annotated units for regeneration into alternative formats while preserving source traceability. STREAM is designed to integrate automatic speech recognition, transformer-based natural language processing, and planned computer vision components to extract instructional elements from teacher explanations, slides, and embedded media. Each unit receives metadata, including time codes, instructional type, cognitive demand, and prerequisite concepts, designed to enable format-specific regeneration with explicit provenance links. For a predefined visual-learner profile, the system generates annotated path diagrams, two-panel instructional guides, and entity pictograms with complete back-link coverage. Ablation studies confirm that individual components contribute measurably to output completeness without compromising traceability. This paper reports results from a tightly scoped feasibility pilot that processes a single five-minute elementary STEM video offline under clean audio–visual conditions. We position the pilot’s limitations as testable hypotheses that require validation across diverse content domains, authentic deployments with ambient noise and bandwidth constraints, multiple learner profiles, including multilingual students and learners with disabilities, and controlled comprehension studies. The contribution is a transparent technical demonstration of feasibility and a methodological scaffold for investigating whether within-lesson content transformation can support personalized learning at scale. Full article
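A minimal sketch of what a semantically tagged, pedagogically annotated unit might look like, with field names inferred from the metadata this abstract lists (time codes, instructional type, cognitive demand, prerequisites, provenance links); the real STREAM schema may differ.

```python
from dataclasses import dataclass, field

@dataclass
class InstructionalUnit:
    unit_id: str
    start_s: float                 # time code into the source video
    end_s: float
    transcript: str                # ASR output for this span
    instructional_type: str        # e.g. "definition", "worked-example" (assumed taxonomy)
    cognitive_demand: str          # e.g. "recall", "apply"
    prerequisites: list[str] = field(default_factory=list)
    source_links: list[str] = field(default_factory=list)  # back-links for traceability

def regenerate_for_visual_learner(unit: InstructionalUnit) -> dict:
    """Format-specific regeneration stub: emit a two-panel guide spec with provenance."""
    return {
        "panel_text": unit.transcript,
        "panel_diagram": f"diagram for {unit.instructional_type}",
        "provenance": unit.source_links or [f"{unit.unit_id}@{unit.start_s:.1f}s"],
    }
```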

21 pages, 3633 KB  
Article
One System, Two Rules: Asymmetrical Coupling of Speech Production and Reading Comprehension in the Trilingual Brain
by Yuanbo Wang, Yingfang Meng, Qiuyue Yang and Ruiming Wang
Brain Sci. 2025, 15(12), 1288; https://doi.org/10.3390/brainsci15121288 - 29 Nov 2025
Viewed by 672
Abstract
Background/Objectives: The functional architecture connecting speech production and reading comprehension remains unclear in multilinguals. This study investigated the cross-modal interaction between these systems in trilinguals to resolve the debate between Age of Acquisition (AoA) and usage frequency. Methods: We recruited 144 Uyghur (L1)–Chinese (L2)–English (L3) trilinguals, a population uniquely dissociating acquisition order from social dominance. Participants completed a production-to-comprehension priming paradigm, naming pictures in one language before performing a lexical decision task on translated words. Data were analyzed using linear mixed-effects models. Results: Significant cross-language priming confirmed an integrated lexicon, yet a fundamental asymmetry emerged. The top-down influence of production was governed by AoA; earlier-acquired languages (specifically L1) generated more effective priming signals than L2. Conversely, the bottom-up efficiency of recognition was driven by social usage frequency; the socially dominant L2 was the most receptive target, surpassing the heritage L1. Conclusions: The trilingual lexicon operates via “Two Rules”: a history-driven production system (AoA) and an environment-driven recognition system (Social Usage). This asymmetrical baseline challenges simple bilingual extensions and clarifies the dynamics of multilingual language control. Full article
(This article belongs to the Topic Language: From Hearing to Speech and Writing)
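For readers unfamiliar with the analysis approach named above, here is a compact statsmodels sketch of a linear mixed-effects model with a by-participant random intercept; the variable names, formula, and synthetic data are assumptions for illustration, not the study's actual specification.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 144 * 8                                    # 144 participants, 8 trials each (assumed)
df = pd.DataFrame({
    "participant": np.repeat(np.arange(144), 8),
    "prime_language": rng.choice(["L1", "L2"], n),
    "target_language": rng.choice(["L1", "L2", "L3"], n),
})
df["log_rt"] = 6.5 + rng.normal(0, 0.1, n)     # synthetic lexical-decision latencies

# fixed effects for prime/target language and their interaction,
# random intercept grouped by participant
m = smf.mixedlm("log_rt ~ prime_language * target_language",
                data=df, groups=df["participant"]).fit()
print(m.summary())
```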

16 pages, 3476 KB  
Article
ROboMC: A Portable Multimodal System for eHealth Training and Scalable AI-Assisted Education
by Marius Cioca and Adriana-Lavinia Cioca
Inventions 2025, 10(6), 103; https://doi.org/10.3390/inventions10060103 - 11 Nov 2025
Cited by 1 | Viewed by 1310
Abstract
AI-based educational chatbots can expand access to learning, but many remain limited to text-only interfaces and fixed infrastructures, while purely generative responses raise concerns of reliability and consistency. In this context, we present ROboMC, a portable and multimodal system that combines a validated knowledge base with generative responses (OpenAI) and voice–text interaction, ensuring reliability and flexibility in diverse educational scenarios. The system, developed in Django, integrates two response pipelines: local search using normalized keywords and fuzzy matching in the LocalQuestion database, and fallback to the generative model GPT-3.5-Turbo (OpenAI, San Francisco, CA, USA) with a prompt adapted exclusively for Romanian and an explicit disclaimer. All interactions are logged in AutomaticQuestion for later analysis, supported by a semantic encoder (SentenceTransformer 'paraphrase-multilingual-MiniLM-L12-v2', Hugging Face Inc., New York, NY, USA) that makes the search tolerant to variations in phrasing. Voice output is managed through gTTS (Google LLC, Mountain View, CA, USA) with integrated audio playback, while portability is achieved through deployment on a Raspberry Pi 4B (Raspberry Pi Foundation, Cambridge, UK) with microphone, speaker, and battery power. Voice input is enabled through a cloud-based speech-to-text component, the Google Web Speech API (Google LLC, Mountain View, CA, USA; language = "ro-RO"), accessed via the Python SpeechRecognition library (Anthony Zhang, open-source project, USA), allowing users to interact by speaking. Preliminary tests showed average latencies of 120–180 ms for validated responses on a laptop and 250–350 ms on the Raspberry Pi, and 2.5–3.5 s on a laptop versus 4–6 s on the Raspberry Pi for generative responses; these timings are considered acceptable for real educational scenarios. A small-scale usability study (N ≈ 35) indicated good acceptability (SUS ~80/100), with participants valuing the balance between validated and generative responses, the voice integration, and the hardware portability. Although system validation was carried out in the eHealth context, the architecture allows extension to any educational field: depending on the content introduced into the validated database, ROboMC can be adapted to medicine, engineering, social sciences, or other disciplines, relying on ChatGPT only when no clear match is found in the local base, making it a scalable and interdisciplinary solution. Full article
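A hedged sketch of the two response pipelines this abstract describes: fuzzy matching against a validated local store, with generative fallback when no match clears a threshold. The threshold, store format, and prompt here are illustrative assumptions, and the real system additionally uses a SentenceTransformer semantic encoder not shown here.

```python
import difflib
from openai import OpenAI

# validated LocalQuestion entries (illustrative; the real store is a Django model)
LOCAL_QA = {"ce este ehealth": "eHealth înseamnă utilizarea tehnologiei în sănătate."}

def answer(question: str, threshold: float = 0.75) -> str:
    key = question.lower().strip()
    match = difflib.get_close_matches(key, list(LOCAL_QA), n=1, cutoff=threshold)
    if match:                                   # pipeline 1: validated local response
        return LOCAL_QA[match[0]]
    # pipeline 2: generative fallback with a Romanian-only prompt and disclaimer
    client = OpenAI()                           # reads OPENAI_API_KEY from the environment
    reply = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "system",
                   "content": "Răspunde doar în română și adaugă un disclaimer educațional."},
                  {"role": "user", "content": question}])
    return reply.choices[0].message.content
```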

17 pages, 2127 KB  
Article
Leveraging Large Language Models for Real-Time UAV Control
by Kheireddine Choutri, Samiha Fadloun, Ayoub Khettabi, Mohand Lagha, Souham Meshoul and Raouf Fareh
Electronics 2025, 14(21), 4312; https://doi.org/10.3390/electronics14214312 - 2 Nov 2025
Cited by 3 | Viewed by 3083
Abstract
As drones become increasingly integrated into civilian and industrial domains, the demand for natural and accessible control interfaces continues to grow. Conventional manual controllers require technical expertise and impose cognitive overhead, limiting their usability in dynamic and time-critical scenarios. To address these limitations, this paper presents a multilingual voice-driven control framework for quadrotor drones, enabling real-time operation in both English and Arabic. The proposed architecture combines offline Speech-to-Text (STT) processing with large language models (LLMs) to interpret spoken commands and translate them into executable control code. Specifically, Vosk is employed for bilingual STT, while Google Gemini provides semantic disambiguation, contextual inference, and code generation. The system is designed for continuous, low-latency operation within an edge–cloud hybrid configuration, offering an intuitive and robust human–drone interface. While speech recognition and safety validation are processed entirely offline, high-level reasoning and code generation currently rely on cloud-based LLM inference. Experimental evaluation demonstrates an average speech recognition accuracy of 95% and end-to-end command execution latency between 300 and 500 ms, validating the feasibility of reliable, multilingual, voice-based UAV control. This research advances multimodal human–robot interaction by showcasing the integration of offline speech recognition and LLMs for adaptive, safe, and scalable aerial autonomy. Full article
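A simplified sketch of the offline-STT-plus-LLM loop this abstract outlines: Vosk transcribes microphone audio offline, and a cloud LLM turns the transcript into a structured command. The Gemini model name, prompt, and command schema are assumptions for illustration.

```python
import json
from vosk import Model, KaldiRecognizer
import google.generativeai as genai

genai.configure(api_key="...")                     # cloud reasoning step (key assumed)
llm = genai.GenerativeModel("gemini-1.5-flash")    # illustrative model choice
recognizer = KaldiRecognizer(Model(lang="en-us"), 16000)  # offline STT; one per language

def handle_audio_chunk(pcm_bytes: bytes):
    """Feed 16 kHz mono PCM; returns a command dict once an utterance completes."""
    if not recognizer.AcceptWaveform(pcm_bytes):
        return None                                # utterance still in progress
    text = json.loads(recognizer.Result())["text"]
    prompt = ('Convert this spoken drone command to JSON '
              '{"action": ..., "params": ...}: ' + text)
    reply = llm.generate_content(prompt).text
    return json.loads(reply)                       # assumes the reply is bare JSON
```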

17 pages, 2618 KB  
Article
Optimizer-Aware Fine-Tuning of Whisper Small with Low-Rank Adaption: An Empirical Study of Adam and AdamW
by Hadia Arshad, Tahir Abdullah, Mariam Rehman, Afzaal Hussain, Faria Kanwal and Mehwish Parveen
Information 2025, 16(11), 928; https://doi.org/10.3390/info16110928 - 22 Oct 2025
Viewed by 1669
Abstract
Whisper is a transformer-based multilingual model that has demonstrated state-of-the-art performance in numerous languages. However, fine-tuning it efficiently remains difficult under limited computational resources. To address this issue, an experiment was performed with librispeech-train-clean-100 for training, and the test-clean set was used to evaluate performance. To enhance efficiency and meet computational constraints, a parameter-efficient fine-tuning technique, Low-Rank Adaptation, was employed to add a limited number of trainable parameters to the frozen layers of the model. The results showed that Low-Rank Adaptation attained excellent Automatic Speech Recognition results while using fewer computational resources, demonstrating its effectiveness for resource-saving adaptation. The research work emphasizes the promise of Low-Rank Adaptation as a lightweight and scalable fine-tuning strategy for large transformer-based speech models. The baseline Whisper Small model achieved a word error rate of 16.7% without any parameter-efficient adaptation. In contrast, the model fine-tuned with Low-Rank Adaptation achieved a lower word error rate of 6.08%, demonstrating the adaptability of the proposed parameter-efficient approach. Full article
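A minimal sketch of LoRA fine-tuning of Whisper Small with the Hugging Face peft library; the rank, alpha, and target modules below are common choices rather than the paper's exact hyperparameters, and data loading and the training loop are omitted.

```python
from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
lora = LoraConfig(r=8, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"])  # attention projections (assumed)
model = get_peft_model(model, lora)                     # frozen base + small adapters
model.print_trainable_parameters()                      # typically ~1% of the base model
```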

18 pages, 2065 KB  
Article
Phoneme-Aware Augmentation for Robust Cantonese ASR Under Low-Resource Conditions
by Lusheng Zhang, Shie Wu and Zhongxun Wang
Symmetry 2025, 17(9), 1478; https://doi.org/10.3390/sym17091478 - 8 Sep 2025
Cited by 2 | Viewed by 1597
Abstract
Cantonese automatic speech recognition (ASR) faces persistent challenges due to its nine lexical tones, extensive phonological variation, and the scarcity of professionally transcribed corpora. To address these issues, we propose a lightweight and data-efficient framework that leverages weak phonetic supervision (WPS) in conjunction with two phoneme-aware augmentation strategies. (1) Dynamic Boundary-Aligned Phoneme Dropout progressively removes entire IPA segments according to a curriculum schedule, simulating real-world phenomena such as elision, lenition, and tonal drift while ensuring training stability. (2) Phoneme-Aware SpecAugment confines all time- and frequency-masking operations within phoneme boundaries and prioritizes high-attention regions, thereby preserving intra-phonemic contours and formant integrity. Built on the Whistle encoder—which integrates a Conformer backbone, Connectionist Temporal Classification–Conditional Random Field (CTC-CRF) alignment, and a multilingual phonetic space—the approach requires only a grapheme-to-phoneme lexicon and Montreal Forced Aligner outputs, without any additional manual labeling. Experiments on the Cantonese subset of Common Voice demonstrate consistent gains: Dynamic Dropout alone reduces the phoneme error rate (PER) from 17.8% to 16.7% with 50 h of speech and from 16.4% to 15.1% with 100 h, while the combination of the two augmentations further lowers PER to 15.9%/14.4%. These results confirm that structure-aware phoneme-level perturbations provide an effective and low-cost solution for building robust Cantonese ASR systems under low-resource conditions. Full article
(This article belongs to the Section Computer)
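An illustrative NumPy sketch of the core idea of Phoneme-Aware SpecAugment: time masks are confined within a single phoneme's boundaries instead of spanning several phonemes. The mask sizing and boundary format are assumptions for the sketch, and the attention-prioritized region selection is omitted.

```python
import numpy as np

def phoneme_aware_time_mask(spec: np.ndarray,
                            boundaries: list[tuple[int, int]],
                            max_frac: float = 0.4,
                            rng: np.random.Generator | None = None) -> np.ndarray:
    """spec: (freq, time) log-mel; boundaries: per-phoneme (start, end) frame indices."""
    rng = rng or np.random.default_rng()
    out = spec.copy()
    for start, end in boundaries:
        width = int((end - start) * max_frac * rng.random())
        if width == 0:
            continue
        t0 = rng.integers(start, end - width)   # mask stays inside the phoneme
        out[:, t0:t0 + width] = out.mean()      # fill with the global mean value
    return out

# e.g. a 3-phoneme utterance of 60 frames with aligner-derived boundaries
masked = phoneme_aware_time_mask(np.random.randn(80, 60), [(0, 20), (20, 45), (45, 60)])
```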

23 pages, 1233 KB  
Article
Decoding the Digits: How Number Notation Influences Cognitive Effort and Performance in Chinese-to-English Sight Translation
by Xueyan Zong, Lei Song and Shanshan Yang
Behav. Sci. 2025, 15(9), 1195; https://doi.org/10.3390/bs15091195 - 1 Sep 2025
Cited by 1 | Viewed by 1315
Abstract
Numbers present persistent challenges in interpreting, yet cognitive mechanisms underlying notation-specific processing remain underexplored. While eye-tracking studies in visually-assisted simultaneous interpreting have advanced number research, they predominantly examine Arabic numerals in non-Chinese contexts—neglecting notation diversity increasingly prevalent in computer-assisted interpreting systems where Automatic Speech Recognition outputs vary across languages. Addressing these gaps, this study investigated how number notation (Arabic digits vs. Chinese character numbers) affects trainee interpreters’ cognitive effort and performance in Chinese-to-English sight translation. Employing a mixed-methods design, we measured global (task-level) and local (number-specific) eye movements alongside expert assessments, output analysis, and subjective assessments. Results show that Chinese character numbers demand significantly greater cognitive effort than Arabic digits, evidenced by more and longer fixations, more extensive saccadic movements, and a larger eye-voice span. Concurrently, sight translation quality decreased markedly with Chinese character numbers, with more processing attempts yet lower accuracy and fluency. Subjective workload ratings confirmed higher mental, physical, and temporal demands in Task 2. These findings reveal an effort-quality paradox where greater cognitive investment in processing complex notations leads to poorer outcomes, and highlight the urgent need for notation-specific training strategies and adaptive technologies in multilingual communication. Full article
(This article belongs to the Section Cognition)

37 pages, 618 KB  
Systematic Review
Interaction, Artificial Intelligence, and Motivation in Children’s Speech Learning and Rehabilitation Through Digital Games: A Systematic Literature Review
by Chra Abdoulqadir and Fernando Loizides
Information 2025, 16(7), 599; https://doi.org/10.3390/info16070599 - 12 Jul 2025
Cited by 4 | Viewed by 5159
Abstract
The integration of digital serious games into speech learning (rehabilitation) has demonstrated significant potential in enhancing accessibility and inclusivity for children with speech disabilities. This review of the state of the art examines the role of serious games, Artificial Intelligence (AI), and Natural Language Processing (NLP) in speech rehabilitation, with a particular focus on interaction modalities, engagement, autonomy, and motivation. We have reviewed 45 selected studies. Our key findings show how intelligent tutoring systems, adaptive voice-based interfaces, and gamified speech interventions can empower children to engage in self-directed speech learning, reducing dependence on therapists and caregivers. The diversity of interaction modalities, including speech recognition, phoneme-based exercises, and multimodal feedback, demonstrates how AI and Assistive Technology (AT) can personalise learning experiences to accommodate diverse needs. Furthermore, the incorporation of gamification strategies, such as reward systems and adaptive difficulty levels, has been shown to enhance children's motivation and long-term participation in speech rehabilitation. The gaps identified show that despite advancements, challenges remain in achieving universal accessibility, particularly regarding speech recognition accuracy, multilingual support, and accessibility for users with multiple disabilities. This review advocates for interdisciplinary collaboration across educational technology, special education, cognitive science, and human–computer interaction (HCI). Our work contributes to the ongoing discourse on lifelong inclusive education, reinforcing the potential of AI-driven serious games as transformative tools for bridging learning gaps and promoting speech rehabilitation beyond clinical environments. Full article

19 pages, 2212 KB  
Article
A Self-Evaluated Bilingual Automatic Speech Recognition System for Mandarin–English Mixed Conversations
by Xinhe Hai, Kaviya Aranganadin, Cheng-Cheng Yeh, Zhengmao Hua, Chen-Yun Huang, Hua-Yi Hsu and Ming-Chieh Lin
Appl. Sci. 2025, 15(14), 7691; https://doi.org/10.3390/app15147691 - 9 Jul 2025
Cited by 1 | Viewed by 4033
Abstract
Bilingual communication is increasingly prevalent in this globally connected world, where cultural exchanges and international interactions are unavoidable. Existing automatic speech recognition (ASR) systems are often limited to single languages. However, bilingual ASR has become indispensable in human–computer interactions, particularly in medical services, where demand continues to grow. This article addresses this need by creating an application programming interface (API)-based platform using VOSK, a popular open-source single-language ASR toolkit, to efficiently deploy a self-evaluated bilingual ASR system that seamlessly handles both primary and secondary languages in tasks like Mandarin–English mixed-speech recognition. The mixed error rate (MER) is used as a performance metric, and a workflow is outlined for its calculation using the edit distance algorithm. Results show a remarkable reduction in the Mandarin–English MER, dropping from ∼65% to under 13%, after implementing the self-evaluation framework and mixed-language algorithms. These findings highlight the importance of a well-designed system to manage the complexities of mixed-language speech recognition, offering a promising method for building a bilingual ASR system using existing monolingual models. The framework might be further extended to a trilingual or multilingual ASR system by preparing mixed-language datasets and further development, without involving complex training. Full article
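A small sketch of the MER computation this abstract outlines: Levenshtein edit distance over mixed-language tokens (one token per Chinese character, one per English word), normalized by reference length. The tokenization rule is a simplifying assumption, not necessarily the paper's workflow.

```python
import re

def tokenize_mixed(text: str) -> list[str]:
    # one token per CJK character, one per Latin word (assumed tokenization)
    return re.findall(r"[\u4e00-\u9fff]|[A-Za-z']+", text)

def mer(reference: str, hypothesis: str) -> float:
    ref, hyp = tokenize_mixed(reference), tokenize_mixed(hypothesis)
    # standard dynamic-programming edit distance
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[-1][-1] / max(len(ref), 1)

print(mer("我想喝 coffee", "我要喝 coffee"))  # 0.25: one substitution over four tokens
```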

28 pages, 9455 KB  
Article
Advancing Emotionally Aware Child–Robot Interaction with Biophysical Data and Insight-Driven Affective Computing
by Diego Resende Faria, Amie Louise Godkin and Pedro Paulo da Silva Ayrosa
Sensors 2025, 25(4), 1161; https://doi.org/10.3390/s25041161 - 14 Feb 2025
Cited by 9 | Viewed by 5134
Abstract
This paper investigates the integration of affective computing techniques using biophysical data to advance emotionally aware machines and enhance child–robot interaction (CRI). By leveraging interdisciplinary insights from neuroscience, psychology, and artificial intelligence, the study focuses on creating adaptive, emotion-aware systems capable of dynamically recognizing and responding to human emotional states. Through a real-world CRI pilot study involving the NAO robot, this research demonstrates how facial expression analysis and speech emotion recognition can be employed to detect and address negative emotions in real time, fostering positive emotional engagement. The emotion recognition system combines handcrafted and deep learning features for facial expressions, achieving an 85% classification accuracy during real-time CRI, while speech emotions are analyzed using acoustic features processed through machine learning models with an 83% accuracy rate. Offline evaluation of the combined emotion dataset using a Dynamic Bayesian Mixture Model (DBMM) achieved a 92% accuracy for facial expressions, and the multilingual speech dataset yielded 98% accuracy for speech emotions using the DBMM ensemble. Observations from psychological and technological aspects, coupled with statistical analysis, reveal the robot’s ability to transition negative emotions into neutral or positive states in most cases, contributing to emotional regulation in children. This work underscores the potential of emotion-aware robots to support therapeutic and educational interventions, particularly for pediatric populations, while setting a foundation for developing personalized and empathetic human–machine interactions. These findings demonstrate the transformative role of affective computing in bridging the gap between technological functionality and emotional intelligence across diverse domains. Full article
(This article belongs to the Special Issue Multisensory AI for Human-Robot Interaction)
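A hedged sketch of the posterior-fusion idea behind a Dynamic Bayesian Mixture Model ensemble as described above: base classifiers contribute class posteriors weighted by beliefs updated from their behavior. The entropy-based update shown is one common choice, an assumption rather than the paper's exact rule.

```python
import numpy as np

def dbmm_fuse(posteriors: list[np.ndarray], weights: np.ndarray) -> np.ndarray:
    """Mix per-classifier class posteriors under the current belief weights."""
    mix = sum(w * p for w, p in zip(weights, posteriors))
    return mix / mix.sum()

def update_weights(posteriors: list[np.ndarray]) -> np.ndarray:
    # confidence-based reweighting: lower-entropy classifiers earn more belief
    ent = np.array([-(p * np.log(p + 1e-12)).sum() for p in posteriors])
    w = 1.0 / (ent + 1e-6)
    return w / w.sum()

# e.g. fusing a facial-expression posterior with a speech-emotion posterior
face = np.array([0.7, 0.2, 0.1])
speech = np.array([0.5, 0.4, 0.1])
print(dbmm_fuse([face, speech], update_weights([face, speech])))
```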

20 pages, 1420 KB  
Article
A Survey of Grapheme-to-Phoneme Conversion Methods
by Shiyang Cheng, Pengcheng Zhu, Jueting Liu and Zehua Wang
Appl. Sci. 2024, 14(24), 11790; https://doi.org/10.3390/app142411790 - 17 Dec 2024
Cited by 6 | Viewed by 9912
Abstract
Grapheme-to-phoneme conversion (G2P) is the task of converting letters (grapheme sequences) into their pronunciations (phoneme sequences). It plays a crucial role in natural language processing, text-to-speech synthesis, and automatic speech recognition systems. This paper provides a systematic overview of G2P conversion from different perspectives. The conversion methods are presented first, with detailed discussions of methods based on deep learning technology. For each method, the key ideas, advantages, disadvantages, and representative models are summarized. The paper then covers learning strategies and multilingual G2P conversion. Finally, it summarizes the commonly used monolingual and multilingual datasets, including Mandarin, Japanese, Arabic, etc. Two tables illustrate the performance of various methods on the relevant datasets. After this general overview of G2P conversion, the paper concludes with the current issues and future directions of deep-learning-based G2P conversion. Full article
(This article belongs to the Collection Trends and Prospects in Multimedia)
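A toy illustration of the G2P task the survey addresses: lexicon lookup with a naive letter-to-sound fallback for out-of-vocabulary words. The tiny lexicon and rules are made up for illustration; the methods the survey covers replace the fallback with joint-sequence or neural seq2seq models.

```python
# minimal ARPAbet-style lexicon and fallback rules (illustrative only)
LEXICON = {"speech": ["S", "P", "IY1", "CH"], "read": ["R", "IY1", "D"]}
LETTER_RULES = {"a": "AE", "e": "EH", "i": "IH", "o": "AA", "u": "AH",
                "c": "K", "q": "K"}

def g2p(word: str) -> list[str]:
    word = word.lower()
    if word in LEXICON:                # exact lexicon lookup
        return LEXICON[word]
    # naive letter-to-sound fallback for out-of-vocabulary words
    return [LETTER_RULES.get(ch, ch.upper()) for ch in word]

print(g2p("speech"), g2p("vosk"))      # ['S', 'P', 'IY1', 'CH'] ['V', 'AA', 'S', 'K']
```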

12 pages, 2630 KB  
Article
Multimodal Seed Data Augmentation for Low-Resource Audio Latin Cuengh Language
by Lanlan Jiang, Xingguo Qin, Jingwei Zhang and Jun Li
Appl. Sci. 2024, 14(20), 9533; https://doi.org/10.3390/app14209533 - 18 Oct 2024
Cited by 2 | Viewed by 1829
Abstract
Latin Cuengh is a low-resource dialect that is prevalent in select ethnic minority regions in China. This language presents unique challenges for intelligent research and preservation efforts, primarily due to its oral tradition and the limited availability of textual resources. Prior research has sought to bolster intelligent processing capabilities with regard to Latin Cuengh through data augmentation techniques leveraging scarce textual data, with modest success. In this study, we introduce an innovative multimodal seed data augmentation model designed to significantly enhance the intelligent recognition and comprehension of this dialect. After supplementing the pre-trained model with extensive speech data, we fine-tune its performance with a modest corpus of multilingual textual seed data, employing both Latin Cuengh and Chinese texts as bilingual seed data to enrich its multilingual properties. We then refine its parameters through a variety of downstream tasks. The proposed model achieves a commendable performance across both multi-classification and binary classification tasks, with its average accuracy and F1 measure increasing by more than 3%. Moreover, the model’s training efficiency is substantially ameliorated through strategic seed data augmentation. Our research provides insights into the informatization of low-resource languages and contributes to their dissemination and preservation. Full article

10 pages, 585 KB  
Technical Note
Text-Independent Phone-to-Audio Alignment Leveraging SSL (TIPAA-SSL) Pre-Trained Model Latent Representation and Knowledge Transfer
by Noé Tits, Prernna Bhatnagar and Thierry Dutoit
Acoustics 2024, 6(3), 772-781; https://doi.org/10.3390/acoustics6030042 - 29 Aug 2024
Cited by 1 | Viewed by 2918
Abstract
In this paper, we present a novel approach for text-independent phone-to-audio alignment based on phoneme recognition, representation learning and knowledge transfer. Our method leverages a self-supervised model (Wav2Vec2) fine-tuned for phoneme recognition using a Connectionist Temporal Classification (CTC) loss, a dimension reduction model and a frame-level phoneme classifier trained using forced-alignment labels (from the Montreal Forced Aligner) to produce multilingual phonetic representations, thus requiring minimal additional training. We evaluate our model using synthetic native data from the TIMIT dataset and the SCRIBE dataset for American and British English, respectively. Our proposed model outperforms the state-of-the-art (charsiu) in statistical metrics and has applications in language learning and speech processing systems. We leave experiments on other languages for future work, but the design of the system makes it easily adaptable to them. Full article
(This article belongs to the Special Issue Developments in Acoustic Phonetic Research)
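A brief sketch of the first stage this abstract describes: frame-level phoneme posteriors from a Wav2Vec2 model fine-tuned for phoneme recognition with CTC. The public checkpoint named below is an illustrative stand-in, not necessarily the one the authors fine-tuned.

```python
import numpy as np
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

ckpt = "facebook/wav2vec2-lv-60-espeak-cv-ft"   # public multilingual phoneme-CTC model
processor = Wav2Vec2Processor.from_pretrained(ckpt)
model = Wav2Vec2ForCTC.from_pretrained(ckpt)

def phoneme_log_posteriors(waveform: np.ndarray, sr: int = 16000) -> torch.Tensor:
    """waveform: 1-D float array at 16 kHz; returns (1, frames, phoneme_vocab)."""
    inputs = processor(waveform, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    # one distribution per ~20 ms frame; a downstream classifier/aligner then maps
    # these frame posteriors to phone boundaries, as in the paper's pipeline
    return logits.log_softmax(-1)
```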
