Search Results (70)

Search Parameters:
Keywords = evaluation of speech synthesis

30 pages, 6201 KB  
Article
AFAD-MSA: Dataset and Models for Arabic Fake Audio Detection
by Elsayed Issa
Computation 2026, 14(1), 20; https://doi.org/10.3390/computation14010020 - 14 Jan 2026
Viewed by 63
Abstract
As generative speech synthesis produces near-human synthetic voices and reliance on online media grows, robust audio-deepfake detection is essential to fight misuse and misinformation. In this study, we introduce the Arabic Fake Audio Dataset for Modern Standard Arabic (AFAD-MSA), a curated corpus of authentic and synthetic Arabic speech designed to advance research on Arabic deepfake and spoofed-speech detection. The synthetic subset is generated with four state-of-the-art proprietary text-to-speech and voice-conversion models. Rich metadata—covering speaker attributes and generation information—is provided to support reproducibility and benchmarking. To establish reference performance, we trained three AASIST models and compared their performance to two baseline transformer detectors (Wav2Vec 2.0 and Whisper). On the AFAD-MSA test split, AASIST-2 achieved perfect accuracy, surpassing the baseline models. However, its performance declined under cross-dataset evaluation. These results underscore the importance of data construction. Detectors generalize best when exposed to diverse attack types. In addition, continual or contrastive training that interleaves bona fide speech with large, heterogeneous spoofed corpora will further improve detectors’ robustness. Full article
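
The abstract reports detection accuracy on the AFAD-MSA test split and degraded cross-dataset performance. As a point of reference only, the sketch below shows how a spoof detector's equal error rate (EER) and accuracy are commonly computed from per-utterance scores; the labels, scores, and threshold choice are toy values, not taken from the paper.

```python
# Illustrative only: scoring a spoof detector from per-utterance scores.
# Labels: 1 = bona fide, 0 = spoof; scores: higher means "more bona fide".
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    fpr, tpr, thresholds = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))            # point where FPR ~= FNR
    return 0.5 * (fpr[idx] + fnr[idx]), thresholds[idx]

# Toy example with made-up scores
labels = np.array([1, 1, 1, 0, 0, 0])
scores = np.array([0.92, 0.81, 0.75, 0.40, 0.22, 0.10])
eer, thr = equal_error_rate(labels, scores)
accuracy = np.mean((scores >= thr).astype(int) == labels)
print(f"EER = {eer:.3f}, accuracy at EER threshold = {accuracy:.3f}")
```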

14 pages, 1392 KB  
Article
AirSpeech: Lightweight Speech Synthesis Framework for Home Intelligent Space Service Robots
by Xiugong Qin, Fenghu Pan, Jing Gao, Shilong Huang, Yichen Sun and Xiao Zhong
Electronics 2026, 15(1), 239; https://doi.org/10.3390/electronics15010239 - 5 Jan 2026
Viewed by 238
Abstract
Text-to-Speech (TTS) methods typically employ a sequential approach with an Acoustic Model (AM) and a vocoder, using a Mel spectrogram as an intermediate representation. However, in home environments, TTS systems often struggle with issues such as inadequate robustness against environmental noise and limited adaptability to diverse speaker characteristics. The quality of the Mel spectrogram directly affects the performance of TTS systems, yet existing methods overlook the potential of enhancing Mel spectrogram quality through more comprehensive speech features. To address the complex acoustic characteristics of home environments, this paper introduces AirSpeech, a post-processing model for Mel-spectrogram synthesis. We adopt a Generative Adversarial Network (GAN) to improve the accuracy of Mel spectrogram prediction and enhance the expressiveness of synthesized speech. By incorporating additional conditioning extracted from synthesized audio using specified speech feature parameters, our method significantly enhances the expressiveness and emotional adaptability of synthesized speech in home environments. Furthermore, we propose a global normalization strategy to stabilize the GAN training process. Through extensive evaluations, we demonstrate that the proposed method significantly improves the signal quality and naturalness of synthesized speech, providing a more user-friendly speech interaction solution for smart home applications. Full article
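
The paper proposes a global normalization strategy to stabilize GAN training on Mel spectrograms; the abstract does not give the formulation, but a corpus-level mean/variance normalization along the lines sketched below is one common way such a scheme is implemented. Function and variable names here are hypothetical, not the paper's.

```python
# Hypothetical sketch of corpus-level ("global") mel-spectrogram normalization,
# one common way to stabilize GAN training; not the paper's exact method.
import numpy as np

def fit_global_stats(mel_spectrograms):
    """Compute one mean/std per mel bin over all frames of the corpus."""
    frames = np.concatenate([m.reshape(-1, m.shape[-1]) for m in mel_spectrograms], axis=0)
    return frames.mean(axis=0), frames.std(axis=0) + 1e-8

def normalize(mel, mean, std):
    return (mel - mean) / std

def denormalize(mel_norm, mean, std):
    return mel_norm * std + mean

# Toy usage: three utterances, 80 mel bins each
corpus = [np.random.randn(120, 80), np.random.randn(95, 80), np.random.randn(200, 80)]
mean, std = fit_global_stats(corpus)
normed = [normalize(m, mean, std) for m in corpus]
```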

33 pages, 3147 KB  
Review
Perception–Production of Second-Language Mandarin Tones Based on Interpretable Computational Methods: A Review
by Yujiao Huang, Zhaohong Xu, Xianming Bei and Huakun Huang
Mathematics 2026, 14(1), 145; https://doi.org/10.3390/math14010145 - 30 Dec 2025
Viewed by 382
Abstract
We survey recent advances in second-language (L2) Mandarin lexical tones research and show how an interpretable computational approach can deliver parameter-aligned feedback across perception–production (P ↔ P). We synthesize four strands: (A) conventional evaluations and tasks (identification, same–different, imitation/read-aloud) that reveal robust tone-pair asymmetries and early P ↔ P decoupling; (B) physiological and behavioral instrumentation (e.g., EEG, eye-tracking) that clarifies cue weighting and time course; (C) audio-only speech analysis, from classic F0 tracking and MFCC–prosody fusion to CNN/RNN/CTC and self-supervised pipelines; and (D) interpretable learning, including attention and relational models (e.g., graph neural networks, GNNs) opened with explainable AI (XAI). Across strands, evidence converges on tones as time-evolving F0 trajectories, so movement, turning-point timing, and local F0 range are more diagnostic than height alone, and the contrast between Tone 2 (rising) and Tone 3 (dipping/low) remains the persistent difficulty; learners with tonal vs. non-tonal language backgrounds weight these cues differently. Guided by this synthesis, we outline a tool-oriented framework that pairs perception and production on the same items, jointly predicts tone labels and parameter targets, and uses XAI to generate local attributions and counterfactual edits, making feedback classroom-ready. Full article
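
Since the review treats tones as time-evolving F0 trajectories, a minimal sketch of extracting such a trajectory (and a crude turning-point time) with librosa's pYIN tracker is given below; the file path and the single-syllable segmentation are placeholders, not materials from the review.

```python
# Minimal sketch: extract an F0 trajectory for a tone syllable with pYIN.
# "syllable.wav" is a placeholder path; segmentation into syllables is assumed done.
import numpy as np
import librosa

y, sr = librosa.load("syllable.wav", sr=16000)
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr
)
times = librosa.times_like(f0, sr=sr)

voiced_f0 = f0[voiced_flag]
voiced_t = times[voiced_flag]
if voiced_f0.size:
    turning_idx = np.argmin(voiced_f0)               # crude turning point (F0 minimum)
    print(f"F0 range: {voiced_f0.min():.1f}-{voiced_f0.max():.1f} Hz, "
          f"turning point at {voiced_t[turning_idx]:.2f} s")
```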
(This article belongs to the Section E1: Mathematics and Computer Science)

20 pages, 1070 KB  
Article
LJ-TTS: A Paired Real and Synthetic Speech Dataset for Single-Speaker TTS Analysis
by Viola Negroni, Davide Salvi, Luca Comanducci, Taiba Majid Wani, Madleen Uecker, Irene Amerini, Stefano Tubaro and Paolo Bestagini
Electronics 2026, 15(1), 169; https://doi.org/10.3390/electronics15010169 - 30 Dec 2025
Viewed by 383
Abstract
In this paper, we present LJ-TTS, a large-scale single-speaker dataset of real and synthetic speech designed to support research in text-to-speech (TTS) synthesis and analysis. The dataset builds upon high-quality recordings of a single English speaker, alongside outputs generated by 11 state-of-the-art TTS models, including both autoregressive and non-autoregressive architectures. By maintaining a controlled single-speaker setting, LJ-TTS enables precise comparison of speech characteristics across different generative models, isolating the effects of synthesis methods from speaker variability. Unlike multi-speaker datasets lacking alignment between real and synthetic samples, LJ-TTS provides exact utterance-level correspondence, allowing fine-grained analyses that are otherwise impractical. The dataset supports systematic evaluation of synthetic speech across multiple dimensions, including deepfake detection, source tracing, and phoneme-level analyses. LJ-TTS provides a standardized resource for benchmarking generative models, assessing the limits of current TTS systems, and developing robust detection and evaluation methods. The dataset is publicly available to the research community to foster reproducible and controlled studies in speech synthesis and synthetic speech detection. Full article
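
The key property of LJ-TTS is exact utterance-level correspondence between real and synthetic audio. A hedged sketch of iterating such a paired layout is shown below; the directory structure and file naming are assumptions for illustration, not the dataset's documented layout.

```python
# Hypothetical sketch of pairing real/synthetic audio by utterance ID;
# directory and file names are assumptions, not LJ-TTS's actual layout.
from pathlib import Path

real_dir = Path("data/real")                 # e.g. data/real/LJ001-0001.wav
synth_dir = Path("data/synthetic/model_a")   # same utterance IDs, one dir per TTS model

pairs = []
for real_path in sorted(real_dir.glob("*.wav")):
    synth_path = synth_dir / real_path.name  # exact utterance-level correspondence
    if synth_path.exists():
        pairs.append((real_path, synth_path))

print(f"{len(pairs)} real/synthetic pairs found")
```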
(This article belongs to the Special Issue Emerging Trends in Generative-AI Based Audio Processing)

19 pages, 745 KB  
Review
Two Languages and One Aphasia: A Systematic Scoping Review of Primary Progressive Aphasia in Chinese Bilingual Speakers, and Implications for Diagnosis and Clinical Care
by Weifeng Han, Lin Zhou, Juan Lu and Shane Pill
Brain Sci. 2026, 16(1), 20; https://doi.org/10.3390/brainsci16010020 - 24 Dec 2025
Viewed by 424
Abstract
Background/Objectives: Primary progressive aphasia (PPA) is characterised by progressive decline in language and communication. However, existing diagnostic frameworks and assessment tools are largely based on Indo-European languages, which limits their applicability to Chinese bilingual speakers whose linguistic profiles differ markedly in tonal phonology, logographic writing, and bilingual organisation. This review aimed to (a) describe how PPA presents in Chinese bilingual speakers, (b) evaluate how well current speech–language and neuropsychological assessments capture these impairments, and (c) identify linguistically and culturally informed strategies to improve clinical practice. Methods: A systematic review was conducted in accordance with the PRISMA-ScR guidelines. Four databases (PubMed, Scopus, Web of Science, PsycINFO) were searched, complemented by backward and forward citation chaining. Eight empirical studies met the inclusion criteria. Data were extracted on participant characteristics, PPA variant, language background, speech–language and writing profiles, and assessment tools used. Thematic analysis was applied to address the research questions. Results: Across variants, Chinese bilingual speakers demonstrated universal PPA features expressed through language-specific pathways. Mandarin speakers exhibited tone-segment integration errors, tonal substitution, and disruptions in logographic writing. Lexical-semantic degradation reflected homophony and compounding characteristics. Bilingual individuals showed parallel or asymmetric decline influenced by dominance and usage. Standard English-based naming, repetition, and writing assessments did not reliably capture tone accuracy, radical-level writing errors, or bilingual patterns. Sociocultural factors, including stigma, delayed help-seeking, and family-centred care expectations, further shaped diagnostic pathways. Conclusions: Chinese PPA cannot be meaningfully assessed using tools designed for Indo-European languages. Findings highlight the need for tone-sensitive repetition tasks, logographic writing assessments, bilingual diagnostic protocols, and culturally responsive communication-partner support. This review provides a comprehensive synthesis to date on Chinese bilingual PPA and establishes a foundation for linguistically inclusive diagnostic and clinical models. Full article

39 pages, 1016 KB  
Article
The Development and Experimental Evaluation of a Multilingual Speech Corpus for Low-Resource Turkic Languages
by Aidana Karibayeva, Vladislav Karyukin, Ualsher Tukeyev, Balzhan Abduali, Dina Amirova, Diana Rakhimova, Rashid Aliyev and Assem Shormakova
Appl. Sci. 2025, 15(24), 12880; https://doi.org/10.3390/app152412880 - 5 Dec 2025
Viewed by 1655
Abstract
The development of parallel audio corpora for Turkic languages, such as Kazakh, Uzbek, and Tatar, remains a significant challenge in the development of multilingual speech synthesis, recognition systems, and machine translation. These languages are low-resource in speech technologies, lacking sufficiently large audio datasets with aligned transcriptions that are crucial for modern recognition, synthesis, and understanding systems. This article presents the development and experimental evaluation of a speech corpus focused on Turkic languages, intended for use in speech synthesis and automatic translation tasks. The primary objective is to create parallel audio corpora using a cascade generation method, which combines artificial intelligence and text-to-speech (TTS) technologies to generate both audio and text, and to evaluate the quality and suitability of the generated data. To evaluate the quality of synthesized speech, metrics measuring naturalness, intonation, expressiveness, and linguistic adequacy were applied. As a result, a multimodal (Kazakh–Turkish, Kazakh–Tatar, Kazakh–Uzbek) corpus was created, combining high-quality natural Kazakh audio with transcription and translation, along with synthetic audio in Turkish, Tatar, and Uzbek. These corpora offer a unique resource for speech and text processing research, enabling the integration of ASR, MT, TTS, and speech-to-speech translation (STS). Full article

13 pages, 974 KB  
Case Report
Smith-Lemli-Opitz Syndrome (SLOS)—Case Description and the Impact of Therapeutic Interventions on Psychomotor Development
by Natalia Kozera, Robert Śmigiel and Anna Rozensztrauch
J. Clin. Med. 2025, 14(23), 8569; https://doi.org/10.3390/jcm14238569 - 3 Dec 2025
Viewed by 546
Abstract
Background/Objectives: Smith–Lemli–Opitz syndrome (SLOS) is a genetic metabolic disorder characterized by impaired cholesterol synthesis and a wide range of developmental anomalies. This article presents a case of a girl with SLOS, diagnosed with two pathogenic variants of the DHCR7 gene. The objective is to evaluate the impact of early, multidisciplinary therapeutic interventions on the patient’s development. Methods: Following diagnosis, a comprehensive metabolic therapy was initiated, including cholesterol and cholic acid supplementation. An interdisciplinary therapeutic approach was employed, involving physical therapy, speech therapy, and sensory integration, aimed at addressing various developmental challenges faced by the patient. Results: The therapy led to gradual improvements in the patient’s psychomotor development, although the cholesterol levels were only partially improved and the accumulation of sterol precursors (7-DHC and 8-DHC) persisted. The coordinated care model facilitated better outcomes compared to less integrated efforts. Conclusions: The results highlight the importance of early diagnosis and integrated care in optimizing developmental outcomes for children with SLOS. A multidisciplinary approach is essential for addressing the complexities of the syndrome and promoting overall development. Full article
(This article belongs to the Section Clinical Pediatrics)

16 pages, 1701 KB  
Article
Research on YOLOv5s-Based Multimodal Assistive Gesture and Micro-Expression Recognition with Speech Synthesis
by Xiaohua Li and Chaiyan Jettanasen
Computation 2025, 13(12), 277; https://doi.org/10.3390/computation13120277 - 1 Dec 2025
Viewed by 397
Abstract
Effective communication between deaf–mute and visually impaired individuals remains a challenge in the fields of human–computer interaction and accessibility technology. Current solutions mostly rely on single-modal recognition, which often leads to issues such as semantic ambiguity and loss of emotional information. To address these challenges, this study proposes a lightweight multimodal fusion framework that combines gestures and micro-expressions, which are then processed through a recognition network and a speech synthesis module. The core innovations of this research are as follows: (1) a lightweight YOLOv5s improvement structure that integrates residual modules and efficient downsampling modules, which reduces the model complexity and computational overhead while maintaining high accuracy; (2) a multimodal fusion method based on an attention mechanism, which adaptively and efficiently integrates complementary information from gestures and micro-expressions, significantly improving the semantic richness and accuracy of joint recognition; (3) an end-to-end real-time system that outputs the visual recognition results through a high-quality text-to-speech module, completing the closed-loop from “visual signal” to “speech feedback”. We conducted evaluations on the publicly available hand gesture dataset HaGRID and a curated micro-expression image dataset. The results show that, for the joint gesture and micro-expression tasks, our proposed multimodal recognition system achieves a multimodal joint recognition accuracy of 95.3%, representing a 4.5% improvement over the baseline model. The system was evaluated in a locally deployed environment, achieving a real-time processing speed of 22 FPS, with a speech output latency below 0.8 s. The mean opinion score (MOS) reached 4.5, demonstrating the effectiveness of the proposed approach in breaking communication barriers between the hearing-impaired and visually impaired populations. Full article
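
Innovation (2) describes attention-based fusion of gesture and micro-expression information. The sketch below shows one generic form of attention-weighted fusion of two modality embeddings; the dimensions, scoring scheme, and random inputs are illustrative and are not the paper's architecture.

```python
# Illustrative attention-weighted fusion of two modality embeddings
# (gesture, micro-expression); not the paper's exact mechanism.
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_fuse(gesture_feat, expr_feat, w_g, w_e):
    """Score each modality, softmax the scores, and return the weighted sum."""
    scores = np.array([gesture_feat @ w_g, expr_feat @ w_e])
    alpha = softmax(scores)                  # modality attention weights
    return alpha[0] * gesture_feat + alpha[1] * expr_feat, alpha

rng = np.random.default_rng(0)
g, e = rng.standard_normal(128), rng.standard_normal(128)
w_g, w_e = rng.standard_normal(128), rng.standard_normal(128)
fused, alpha = attention_fuse(g, e, w_g, w_e)
print("attention weights:", alpha)
```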

13 pages, 791 KB  
Review
Telehealth Delivery of Speech–Language Pathology Services for Children with Cleft Palate and Velopharyngeal Dysfunction: A Systematic Review
by Nisreen Naser Al Awaji, Alanoud Nawaf Alsinan, Raja S. Alamri, Nourah A. Bin Ruaydan, Lama S. Alharbi, Lana T. Albesher and Latifa Alrutaiq
Children 2025, 12(11), 1523; https://doi.org/10.3390/children12111523 - 11 Nov 2025
Viewed by 1062
Abstract
Background/Objectives: This systematic review aimed to evaluate the effectiveness, caregiver satisfaction, and accessibility of telehealth-delivered speech–language pathology (SLP) services for children with cleft palate and/or velopharyngeal dysfunction (VPD). Methods: Based on PRISMA 2020, we searched PubMed, CINAHL, Scopus, PsycINFO, and the Cochrane Library (2000–31 May 2025) for studies enrolling participants ≤ 18 years of age with cleft/VPD who received telehealth services (assessment, therapy, counseling/follow-up), with or without in-person comparators. Screening and data extraction were performed in duplicate. Risk of bias was appraised using RoB 2 (randomized) and CASP checklists (non-randomized/service designs). To account for heterogeneity, we conducted a SWiM-aligned narrative synthesis and summarized certainty with GRADE. Results: Eleven studies met the inclusion criteria. Telehealth delivery of SLP services was feasible and generally acceptable to families. Caregiver-mediated interventions frequently showed within-group improvements in speech outcomes, while remote assessment demonstrated moderate agreement with in-person ratings. However, the overall certainty of evidence was rated as very low to low because of small sample sizes, single-center designs, and heterogeneous outcomes. Conclusions: Telehealth is a feasible and acceptable mode for delivering SLP in pediatric cleft/VPD patients, with encouraging signals for caregiver-mediated articulation therapy and maintaining multidisciplinary follow-up. Implementation is best embedded within hybrid pathways, reserving in-person visits for complex assessments. Adequately powered comparative studies with standardized outcomes, longer follow-up, and equity-focused implementation are needed. Full article
(This article belongs to the Section Pediatric Dentistry & Oral Medicine)

7 pages, 1456 KB  
Proceeding Paper
Towards a More Natural Urdu: A Comprehensive Approach to Text-to-Speech and Voice Cloning
by Muhammad Ramiz Saud, Muhammad Romail Imran and Raja Hashim Ali
Eng. Proc. 2025, 87(1), 112; https://doi.org/10.3390/engproc2025087112 - 20 Oct 2025
Cited by 12 | Viewed by 1295
Abstract
This paper introduces a comprehensive approach to building natural-sounding Urdu Text-to-Speech (TTS) and voice cloning systems, addressing the lack of computational resources for Urdu. We developed a large-scale dataset of over 100 h of Urdu speech, carefully cleaned and phonetically aligned through an automated transcription pipeline to preserve linguistic accuracy. The dataset was then used to fine-tune Tacotron2, a neural network model originally trained for English, with modifications tailored to Urdu’s phonological and morphological features. To further enhance naturalness, we integrated voice cloning techniques that capture regional accents and produce personalized speech outputs. Model performance was evaluated through mean opinion score (MOS), word error rate (WER), and speaker similarity, showing substantial improvements compared to previous Urdu systems. The results demonstrate clear progress toward natural and intelligible Urdu speech synthesis, while also revealing challenges such as handling dialectal variation and preventing model overfitting. This work contributes an essential resource and methodology for advancing Urdu natural language processing (NLP), with promising applications in education, accessibility, entertainment, and assistive technologies. Full article
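
The Urdu system is evaluated with MOS, WER, and speaker similarity. The sketch below illustrates the two easily scriptable metrics, assuming the jiwer package for WER; the transcripts and listener ratings are toy values, not results from the paper.

```python
# WER via the jiwer package and a simple MOS average; toy data only.
import numpy as np
import jiwer

reference = "yeh aik misali jumla hai"      # toy romanized reference transcript
hypothesis = "yeh aik misal jumla hai"      # toy ASR output for the synthesized audio
print("WER:", jiwer.wer(reference, hypothesis))

# MOS: listeners rate naturalness 1-5; report mean and a 95% confidence interval.
ratings = np.array([4, 5, 4, 4, 3, 5, 4, 4])
mos = ratings.mean()
ci95 = 1.96 * ratings.std(ddof=1) / np.sqrt(len(ratings))
print(f"MOS = {mos:.2f} +/- {ci95:.2f}")
```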
(This article belongs to the Proceedings of The 5th International Electronic Conference on Applied Sciences)

29 pages, 1708 KB  
Article
Speech Recognition and Synthesis Models and Platforms for the Kazakh Language
by Aidana Karibayeva, Vladislav Karyukin, Balzhan Abduali and Dina Amirova
Information 2025, 16(10), 879; https://doi.org/10.3390/info16100879 - 10 Oct 2025
Viewed by 3743
Abstract
With the rapid development of artificial intelligence and machine learning technologies, automatic speech recognition (ASR) and text-to-speech (TTS) have become key components of the digital transformation of society. The Kazakh language, as a representative of the Turkic language family, remains a low-resource language with limited audio corpora, language models, and high-quality speech synthesis systems. This study provides a comprehensive analysis of existing speech recognition and synthesis models, emphasizing their applicability and adaptation to the Kazakh language. Special attention is given to linguistic and technical barriers, including the agglutinative structure, rich vowel system, and phonemic variability. Both open-source and commercial solutions were evaluated, including Whisper, GPT-4 Transcribe, ElevenLabs, OpenAI TTS, Voiser, KazakhTTS2, and TurkicTTS. Speech recognition systems were assessed using BLEU, WER, TER, chrF, and COMET, while speech synthesis was evaluated with MCD, PESQ, STOI, and DNSMOS, thus covering both lexical–semantic and acoustic–perceptual characteristics. The results demonstrate that, for speech-to-text (STT), the strongest performance was achieved by Soyle on domain-specific data (BLEU 74.93, WER 18.61), while Voiser showed balanced accuracy (WER 40.65–37.11, chrF 80.88–84.51) and GPT-4 Transcribe achieved robust semantic preservation (COMET up to 1.02). In contrast, Whisper performed weakest (WER 77.10, BLEU 13.22), requiring further adaptation for Kazakh. For text-to-speech (TTS), KazakhTTS2 delivered the most natural perceptual quality (DNSMOS 8.79–8.96), while OpenAI TTS achieved the best spectral accuracy (MCD 123.44–117.11, PESQ 1.14). TurkicTTS offered reliable intelligibility (STOI 0.15, PESQ 1.16), and ElevenLabs produced natural but less spectrally accurate speech. Full article
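
The abstract lists separate metric suites for STT (BLEU, WER, TER, chrF, COMET) and TTS (MCD, PESQ, STOI, DNSMOS). The sketch below shows how two of the acoustic metrics can be computed with the pesq and pystoi packages; the audio paths are placeholders, and both signals must share the same sampling rate.

```python
# Sketch: PESQ and STOI between a reference recording and synthesized speech,
# using the pesq and pystoi packages. Paths are placeholders.
import librosa
from pesq import pesq
from pystoi import stoi

sr = 16000
ref, _ = librosa.load("reference.wav", sr=sr)     # natural speech
deg, _ = librosa.load("synthesized.wav", sr=sr)   # TTS output, same utterance

n = min(len(ref), len(deg))                       # crude length alignment
ref, deg = ref[:n], deg[:n]

print("PESQ (wideband):", pesq(sr, ref, deg, "wb"))   # ITU-T P.862.2, needs 16 kHz
print("STOI:", stoi(ref, deg, sr, extended=False))    # intelligibility score in [0, 1]
```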
(This article belongs to the Section Artificial Intelligence)

19 pages, 10385 KB  
Article
All’s Well That FID’s Well? Result Quality and Metric Scores in GAN Models for Lip-Synchronization Tasks
by Carina Geldhauser, Johan Liljegren and Pontus Nordqvist
Electronics 2025, 14(17), 3487; https://doi.org/10.3390/electronics14173487 - 31 Aug 2025
Viewed by 1108
Abstract
This exploratory study investigates the usability of performance metrics for generative adversarial network (GAN)-based models for speech-driven facial animation. These models focus on the transfer of speech information from an audio file to a still image to generate talking-head videos in a small-scale “everyday usage” setting. Two models, LipGAN and a custom implementation of a Wasserstein GAN with gradient penalty (L1WGAN-GP), are examined for their visual performance and scoring according to commonly used metrics: Quantitative comparisons using FID, SSIM, and PSNR metrics on the GRIDTest dataset show mixed results, and metrics fail to capture local artifacts crucial for lip synchronization, pointing to limitations in their applicability for video animation tasks. The study points towards the inadequacy of current quantitative measures and emphasizes the continued necessity of human qualitative assessment for evaluating talking-head video quality. Full article
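
For reference, two of the frame-level metrics discussed above (PSNR, SSIM) can be computed per video frame with scikit-image as sketched below; the frames here are toy arrays. FID additionally requires Inception-v3 features and is not shown.

```python
# PSNR and SSIM between a ground-truth frame and a generated frame,
# using scikit-image; the frames here are toy arrays.
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

rng = np.random.default_rng(0)
reference = rng.integers(0, 256, size=(96, 96), dtype=np.uint8)   # toy grayscale frame
generated = np.clip(reference + rng.integers(-10, 10, size=(96, 96)), 0, 255).astype(np.uint8)

print("PSNR:", peak_signal_noise_ratio(reference, generated, data_range=255))
print("SSIM:", structural_similarity(reference, generated, data_range=255))
```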
(This article belongs to the Special Issue New Trends in AI-Assisted Computer Vision)

38 pages, 3579 KB  
Systematic Review
Integrating Artificial Intelligence and Extended Reality in Language Education: A Systematic Literature Review (2017–2024)
by Weijian Yan, Belle Li and Victoria L. Lowell
Educ. Sci. 2025, 15(8), 1066; https://doi.org/10.3390/educsci15081066 - 19 Aug 2025
Cited by 3 | Viewed by 5929
Abstract
This systematic literature review examines the integration of Artificial Intelligence (AI) and Extended Reality (XR) technologies in language education, synthesizing findings from 32 empirical studies published between 2017 and 2024. Guided by the PRISMA framework, we searched four databases—ERIC, Web of Science, Scopus, and IEEE Xplore—to identify studies that explicitly integrated both AI and XR to support language learning. The review explores publication trends, educational settings, target languages, language skills, learning outcomes, and theoretical frameworks, and analyzes how AI–XR technologies have been pedagogically integrated, and identifies affordances, challenges, design considerations, and future directions of AI–XR integration. Key integration strategies include coupling AI with XR technologies such as automatic speech recognition, natural language processing, computer vision, and conversational agents to support skills like speaking, vocabulary, writing, and intercultural competence. The reported affordances pertain to technical, pedagogical, and affective dimensions. However, challenges persist in terms of technical limitations, pedagogical constraints, scalability and generalizability, ethical and human-centered concerns, and infrastructure and cost barriers. Design recommendations and future directions emphasize the need for adaptive AI dialogue systems, broader pedagogical applications, longitudinal studies, learner-centered interaction, scalable and accessible design, and evaluation. This review offers a comprehensive synthesis to guide researchers, educators, and developers in designing effective AI–XR language learning experiences. Full article
(This article belongs to the Section Technology Enhanced Education)

25 pages, 10870 KB  
Article
XTTS-Based Data Augmentation for Profanity Keyword Recognition in Low-Resource Speech Scenarios
by Shin-Chi Lai, Yi-Chang Zhu, Szu-Ting Wang, Yen-Ching Chang, Ying-Hsiu Hung, Jhen-Kai Tang and Wen-Kai Tsai
Appl. Syst. Innov. 2025, 8(4), 108; https://doi.org/10.3390/asi8040108 - 31 Jul 2025
Viewed by 2082
Abstract
As voice cloning technology rapidly advances, the risk of personal voices being misused by malicious actors for fraud or other illegal activities has significantly increased, making the collection of speech data increasingly challenging. To address this issue, this study proposes a data augmentation method based on XText-to-Speech (XTTS) synthesis to tackle the challenges of small-sample, multi-class speech recognition, using profanity as a case study to achieve high-accuracy keyword recognition. Two models were therefore evaluated: a CNN model (Proposed-I) and a CNN-Transformer hybrid model (Proposed-II). Proposed-I leverages local feature extraction, improving accuracy on a real human speech (RHS) test set from 55.35% without augmentation to 80.36% with XTTS-enhanced data. Proposed-II integrates CNN’s local feature extraction with Transformer’s long-range dependency modeling, further boosting test set accuracy to 88.90% while reducing the parameter count by approximately 41%, significantly enhancing computational efficiency. Compared to a previously proposed incremental architecture, the Proposed-II model achieves an 8.49% higher accuracy while reducing parameters by about 98.81% and MACs by about 98.97%, demonstrating exceptional resource efficiency. By utilizing XTTS and public corpora to generate a novel keyword speech dataset, this study enhances sample diversity and reduces reliance on large-scale original speech data. Experimental analysis reveals that an optimal synthetic-to-real speech ratio of 1:5 significantly improves the overall system accuracy, effectively addressing data scarcity. Additionally, the Proposed-I and Proposed-II models achieve accuracies of 97.54% and 98.66%, respectively, in distinguishing real from synthetic speech, demonstrating their strong potential for speech security and anti-spoofing applications. Full article
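
The reported optimum is a 1:5 synthetic-to-real ratio in the training mix, i.e. one XTTS-synthesized clip for every five real clips. A hedged sketch of assembling such a mix is shown below; the file lists and counts are hypothetical.

```python
# Hypothetical sketch: build a training list with a 1:5 synthetic-to-real ratio.
# File names and pool sizes are toy values.
import random

real_clips = [f"real_{i:04d}.wav" for i in range(500)]        # placeholder real recordings
synthetic_pool = [f"xtts_{i:04d}.wav" for i in range(2000)]   # placeholder XTTS clips

random.seed(0)
n_synth = len(real_clips) // 5                                # 1 synthetic per 5 real
training_set = real_clips + random.sample(synthetic_pool, n_synth)
random.shuffle(training_set)

print(f"{len(real_clips)} real + {n_synth} synthetic = {len(training_set)} training clips")
```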
(This article belongs to the Special Issue Advancements in Deep Learning and Its Applications)

16 pages, 2365 KB  
Article
Fast Inference End-to-End Speech Synthesis with Style Diffusion
by Hui Sun, Jiye Song and Yi Jiang
Electronics 2025, 14(14), 2829; https://doi.org/10.3390/electronics14142829 - 15 Jul 2025
Viewed by 3459
Abstract
In recent years, deep learning-based end-to-end Text-To-Speech (TTS) models have made significant progress in enhancing speech naturalness and fluency. However, existing Variational Inference Text-to-Speech (VITS) models still face challenges such as insufficient pitch modeling, inadequate contextual dependency capture, and low inference efficiency in the decoder. To address these issues, this paper proposes an improved TTS framework named Q-VITS. Q-VITS incorporates Rotary Position Embedding (RoPE) into the text encoder to enhance long-sequence modeling, adopts a frame-level prior modeling strategy to optimize one-to-many mappings, and designs a style extractor based on a diffusion model for controllable style rendering. Additionally, the proposed decoder ConfoGAN integrates explicit F0 modeling, Pseudo-Quadrature Mirror Filter (PQMF) multi-band synthesis and Conformer structure. The experimental results demonstrate that Q-VITS outperforms the VITS in terms of speech quality, pitch accuracy, and inference efficiency in both subjective Mean Opinion Score (MOS) and objective Mel-Cepstral Distortion (MCD) and Root Mean Square Error (RMSE) evaluations on a single-speaker dataset, achieving performance close to ground-truth audio. These improvements provide an effective solution for efficient and controllable speech synthesis. Full article
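
MCD, one of the objective metrics cited above, compares mel-cepstral coefficients of synthesized and reference audio. A simplified sketch is shown below, using librosa MFCCs as a stand-in for true mel-cepstra and naive frame truncation instead of DTW alignment; the audio paths are placeholders, not files from the paper.

```python
# Simplified mel-cepstral distortion (MCD) between reference and synthesized speech.
# Uses librosa MFCCs and naive truncation instead of DTW; paths are placeholders.
import numpy as np
import librosa

def mcd(ref_wav, syn_wav, sr=22050, n_mfcc=13):
    ref, _ = librosa.load(ref_wav, sr=sr)
    syn, _ = librosa.load(syn_wav, sr=sr)
    c_ref = librosa.feature.mfcc(y=ref, sr=sr, n_mfcc=n_mfcc)[1:]   # drop energy term (c0)
    c_syn = librosa.feature.mfcc(y=syn, sr=sr, n_mfcc=n_mfcc)[1:]
    n = min(c_ref.shape[1], c_syn.shape[1])                         # crude alignment
    diff = c_ref[:, :n] - c_syn[:, :n]
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff**2, axis=0))
    return per_frame.mean()

print("MCD (dB):", mcd("ground_truth.wav", "synthesized_output.wav"))
```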