Search Results (136)

Search Parameters:
Keywords = speech and language errors

17 pages, 14845 KB  
Article
A Collaborative Robotic System for Autonomous Object Handling with Natural User Interaction
by Federico Neri, Gaetano Lettera, Giacomo Palmieri and Massimo Callegari
Robotics 2026, 15(3), 49; https://doi.org/10.3390/robotics15030049 - 27 Feb 2026
Abstract
In Industry 5.0, the transition from fixed traditional automation to flexible human–robot collaboration (HRC) requires interfaces that are both intuitive and efficient. This paper introduces a novel, multimodal control system for autonomous object handling, specifically designed to enhance natural user interaction in dynamic work environments. The system integrates a 6-Degree-of-Freedom (DoF) collaborative robot (UR5e) with a hand-eye RGB-D vision system to achieve robust autonomy. The core technical contribution lies in a vision pipeline utilizing deep learning for object detection and point cloud processing for accurate 6D pose estimation, enabling advanced tasks such as human-aware object handover directly onto the operator’s hand. Crucially, an Automatic Speech Recognition (ASR) module is incorporated, providing a Natural Language Understanding (NLU) layer that allows operators to issue real-time commands for task modification, error correction and object selection. Experimental results demonstrate that this multimodal approach offers a streamlined workflow aiming to improve operational flexibility compared to traditional HMIs, while enhancing the perceived naturalness of the collaborative task. The system establishes a framework for highly responsive and intuitive human–robot workspaces, advancing the state of the art in natural interaction for collaborative object manipulation. Full article
(This article belongs to the Special Issue Human–Robot Collaboration in Industry 5.0)
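For readers unfamiliar with how such a voice layer is wired up, the sketch below shows one minimal way to map an ASR transcript to a structured robot command by keyword spotting. The command schema, object vocabulary, and function names are hypothetical illustrations, not the system described in the paper.

```python
# Minimal sketch of an ASR-to-command layer (hypothetical names, not the paper's code).
from dataclasses import dataclass
from typing import Optional

KNOWN_OBJECTS = {"bolt", "bracket", "housing"}   # assumed object vocabulary

@dataclass
class RobotCommand:
    action: str                   # "pick", "handover", "stop", ...
    target: Optional[str] = None  # object name, if the action needs one

def parse_transcript(transcript: str) -> Optional[RobotCommand]:
    """Map a free-form ASR transcript to a structured command via keyword spotting."""
    words = transcript.lower().split()
    target = next((w for w in words if w in KNOWN_OBJECTS), None)
    if "stop" in words:
        return RobotCommand("stop")
    if "hand" in words or "give" in words:
        return RobotCommand("handover", target)
    if "pick" in words or "grasp" in words:
        return RobotCommand("pick", target)
    return None  # unrecognized utterance: ask the operator to repeat

print(parse_transcript("please hand me the bracket"))  # RobotCommand(action='handover', target='bracket')
```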
38 pages, 6181 KB  
Article
An AIoT-Based Framework for Automated English-Speaking Assessment: Architecture, Benchmarking, and Reliability Analysis of Open-Source ASR
by Paniti Netinant, Rerkchai Fooprateepsiri, Ajjima Rukhiran and Meennapa Rukhiran
Informatics 2026, 13(2), 19; https://doi.org/10.3390/informatics13020019 - 26 Jan 2026
Viewed by 600
Abstract
The emergence of low-cost edge devices has enabled the integration of automatic speech recognition (ASR) into IoT environments, creating new opportunities for real-time language assessment. However, achieving reliable performance on resource-constrained hardware remains a significant challenge, especially on the Artificial Internet of Things (AIoT). This study presents an AIoT-based framework for automated English-speaking assessment that integrates architecture and system design, ASR benchmarking, and reliability analysis on edge devices. The proposed AIoT-oriented architecture incorporates a lightweight scoring framework capable of analyzing pronunciation, fluency, prosody, and CEFR-aligned speaking proficiency within an automated assessment system. Seven open-source ASR models—four Whisper variants (tiny, base, small, and medium) and three Vosk models—were systematically benchmarked in terms of recognition accuracy, inference latency, and computational efficiency. Experimental results indicate that Whisper-medium deployed on the Raspberry Pi 5 achieved the strongest overall performance, reducing inference latency by 42–48% compared with the Raspberry Pi 4 and attaining the lowest Word Error Rate (WER) of 6.8%. In contrast, smaller models such as Whisper-tiny, with a WER of 26.7%, exhibited two- to threefold higher scoring variability, demonstrating how recognition errors propagate into automated assessment reliability. System-level testing revealed that the Raspberry Pi 5 can sustain near real-time processing with approximately 58% CPU utilization and around 1.2 GB of memory, whereas the Raspberry Pi 4 frequently approaches practical operational limits under comparable workloads. Validation using real learner speech data (approximately 100 sessions) confirmed that the proposed system delivers accurate, portable, and privacy-preserving speaking assessment using low-power edge hardware. Overall, this work introduces a practical AIoT-based assessment framework, provides a comprehensive benchmark of open-source ASR models on edge platforms, and offers empirical insights into the trade-offs among recognition accuracy, inference latency, and scoring stability in edge-based ASR deployments. Full article
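Word Error Rate is the central accuracy metric in this benchmark. A minimal reference implementation, using word-level Levenshtein distance, is sketched below; it is a generic definition of WER, not the authors' evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length,
    computed with word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[-1][-1] / len(ref)

print(word_error_rate("turn on the light", "turn of the light"))  # 0.25
```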

16 pages, 1736 KB  
Article
User Experience Enhancement of a Gamified Speech Therapy Program Using the Double Diamond Design Framework
by Sujin Kim, Eunjin Kwon, Jaesun Yu, Younggeun Choi, Myoung-Hwan Ko, Yun-ju Jo, Hyun-Gi Kim and Heecheon You
Appl. Sci. 2026, 16(2), 826; https://doi.org/10.3390/app16020826 - 13 Jan 2026
Viewed by 347
Abstract
The global rise in childhood speech disorders highlights the need for accessible and engaging home-based rehabilitation tools. This study applied the Double Diamond design framework to enhance the user experience (UX) of Smart Speech, a gamified functional speech therapy program. Using heuristic evaluation, expert interviews, and benchmarking, six core UX problem areas were identified, including insufficient guidance, low personalized motivation, limited feedback, and accessibility issues. Through an iterative ideation process, 78 UX improvement concepts were generated, encompassing motivational reinforcement (e.g., praise stickers and character interaction), automated training guidance, enhanced feedback mechanisms, and error-prevention features. A usability evaluation with 20 participants, including speech-language pathologists (SLPs) and parents, showed significant improvements across key dimensions, with increases of 1.1 to 2.6 points on a 7-point scale. These findings demonstrate that systematic UX design can substantially improve engagement, usability, and the potential therapeutic utility of home-based speech therapy systems. Full article
(This article belongs to the Special Issue Novel Approaches and Applications in Ergonomic Design, 4th Edition)

20 pages, 1508 KB  
Article
Bidirectional Translation of ASL and English Using Machine Vision and CNN and Transformer Networks
by Stefanie Amiruzzaman, Md Amiruzzaman, Raga Mouni Batchu, James Dracup, Alexander Pham, Benjamin Crocker, Linh Ngo and M. Ali Akber Dewan
Computers 2026, 15(1), 20; https://doi.org/10.3390/computers15010020 - 4 Jan 2026
Viewed by 647
Abstract
This study presents a real-time, bidirectional system for translating American Sign Language (ASL) to and from English using computer vision and transformer-based models to enhance accessibility for deaf and hard of hearing users. Leveraging publicly available sign language and text–to-gloss datasets, the system integrates MediaPipe-based holistic landmark extraction with CNN- and transformer-based architectures to support translation across video, text, and speech modalities within a web-based interface. In the ASL-to-English direction, the sign-to-gloss model achieves a 25.17% word error rate (WER) on the RWTH-PHOENIX-Weather 2014T benchmark, which is competitive with recent continuous sign language recognition systems, and the gloss-level translation attains a ROUGE-L score of 79.89, indicating strong preservation of sign content and ordering. In the reverse English-to-ASL direction, the English-to-Gloss transformer trained on ASLG-PC12 achieves a ROUGE-L score of 96.00, demonstrating high-fidelity gloss sequence generation suitable for landmark-based ASL animation. These results highlight a favorable accuracy-efficiency trade-off achieved through compact model architectures and low-latency decoding, supporting practical real-time deployment. Full article
(This article belongs to the Section AI-Driven Innovations)
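ROUGE-L, used above to score gloss translation quality, is derived from the longest common subsequence between reference and hypothesis. The sketch below is a simplified sentence-level version for illustration; production evaluations typically use an established ROUGE package.

```python
def rouge_l(reference: str, hypothesis: str, beta: float = 1.0) -> float:
    """Sentence-level ROUGE-L F-score from the longest common subsequence (LCS)."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic programming table for LCS length.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            dp[i][j] = dp[i-1][j-1] + 1 if ref[i-1] == hyp[j-1] else max(dp[i-1][j], dp[i][j-1])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    recall, precision = lcs / len(ref), lcs / len(hyp)
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)

print(round(rouge_l("the cat sat on the mat", "the cat is on the mat"), 3))  # 0.833
```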

19 pages, 745 KB  
Review
Two Languages and One Aphasia: A Systematic Scoping Review of Primary Progressive Aphasia in Chinese Bilingual Speakers, and Implications for Diagnosis and Clinical Care
by Weifeng Han, Lin Zhou, Juan Lu and Shane Pill
Brain Sci. 2026, 16(1), 20; https://doi.org/10.3390/brainsci16010020 - 24 Dec 2025
Viewed by 598
Abstract
Background/Objectives: Primary progressive aphasia (PPA) is characterised by progressive decline in language and communication. However, existing diagnostic frameworks and assessment tools are largely based on Indo-European languages, which limits their applicability to Chinese bilingual speakers whose linguistic profiles differ markedly in tonal phonology, logographic writing, and bilingual organisation. This review aimed to (a) describe how PPA presents in Chinese bilingual speakers, (b) evaluate how well current speech–language and neuropsychological assessments capture these impairments, and (c) identify linguistically and culturally informed strategies to improve clinical practice. Methods: A systematic review was conducted in accordance with the PRISMA-ScR guidelines. Four databases (PubMed, Scopus, Web of Science, PsycINFO) were searched, complemented by backward and forward citation chaining. Eight empirical studies met the inclusion criteria. Data were extracted on participant characteristics, PPA variant, language background, speech–language and writing profiles, and assessment tools used. Thematic analysis was applied to address the research questions. Results: Across variants, Chinese bilingual speakers demonstrated universal PPA features expressed through language-specific pathways. Mandarin speakers exhibited tone-segment integration errors, tonal substitution, and disruptions in logographic writing. Lexical-semantic degradation reflected homophony and compounding characteristics. Bilingual individuals showed parallel or asymmetric decline influenced by dominance and usage. Standard English-based naming, repetition, and writing assessments did not reliably capture tone accuracy, radical-level writing errors, or bilingual patterns. Sociocultural factors, including stigma, delayed help-seeking, and family-centred care expectations, further shaped diagnostic pathways. Conclusions: Chinese PPA cannot be meaningfully assessed using tools designed for Indo-European languages. Findings highlight the need for tone-sensitive repetition tasks, logographic writing assessments, bilingual diagnostic protocols, and culturally responsive communication-partner support. This review provides a comprehensive synthesis to date on Chinese bilingual PPA and establishes a foundation for linguistically inclusive diagnostic and clinical models. Full article

20 pages, 3334 KB  
Article
The Development of Northern Thai Dialect Speech Recognition System
by Jakramate Bootkrajang, Papangkorn Inkeaw, Jeerayut Chaijaruwanich, Supawat Taerungruang, Adisorn Boonyawisit, Bak Jong Min Sutawong, Vataya Chunwijitra and Phimphaka Taninpong
Appl. Sci. 2026, 16(1), 160; https://doi.org/10.3390/app16010160 - 23 Dec 2025
Cited by 1 | Viewed by 612
Abstract
This study investigated the necessary ingredients for the development of an automatic speech recognition (ASR) system for the Northern Thai language. Building an ASR model for such an arguably low-resource language poses challenges both in terms of the quantity and the quality of the corpus. The experimental results demonstrated that the current state-of-the-art deep neural network trained in an end-to-end manner, and pre-trained from a closely related language, such as Standard Thai, often outperformed its traditional HMM-based counterparts. The results also suggested that incorporating northern Thai-specific tonal information and augmenting the character-based end-to-end model with an n-gram language model further improves the recognition performance. Surprisingly, the quality of the transcription of the speech corpus was not found to positively correlate with the recognition performance in the case of the end-to-end system. The results show that the end-to-end ASR system was able to achieve the best word error rate (WER) of 0.94 on out-of-sample data. This is equivalent to 77.02% and 60.34% relative word error rate reduction over the 4.09 and 2.37 WERs of the traditional TDNN-HMM and the vanilla deep neural network baselines. Full article
(This article belongs to the Special Issue Speech Recognition and Natural Language Processing)
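The relative word error rate reductions quoted above follow directly from the reported WERs; the short check below reproduces the arithmetic.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction in percent: how much of the baseline error is removed."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer

print(round(relative_wer_reduction(4.09, 0.94), 2))  # 77.02  (vs. TDNN-HMM baseline)
print(round(relative_wer_reduction(2.37, 0.94), 2))  # 60.34  (vs. vanilla DNN baseline)
```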

13 pages, 284 KB  
Article
Two-Stage Domain Adaptation for LLM-Based ASR by Decoupling Linguistic and Acoustic Factors
by Lin Zheng, Xuyang Wang, Qingwei Zhao and Ta Li
Appl. Sci. 2026, 16(1), 60; https://doi.org/10.3390/app16010060 - 20 Dec 2025
Viewed by 482
Abstract
Large language models (LLMs) have been increasingly applied in Automatic Speech Recognition (ASR), achieving significant advancements. However, the performance of LLM-based ASR (LLM-ASR) models remains unsatisfactory when applied across domains due to domain shifts between acoustic and linguistic conditions. To address this challenge, we propose a decoupled two-stage domain adaptation framework that separates the adaptation process into text-only and audio-only stages. In the first stage, we leverage abundant text data from the target domain to refine the LLM component, thereby improving its contextual and linguistic alignment with the target domain. In the second stage, we employ a pseudo-labeling method with unlabeled audio data in the target domain and introduce two key enhancements: (1) incorporating decoupled auxiliary Connectionist Temporal Classification (CTC) loss to improve the robustness of the speech encoder under different acoustic conditions; (2) adopting a synchronous LLM tuning strategy, allowing the LLM to continuously learn linguistic alignment from pseudo-labeled transcriptions enriched with domain textual knowledge. The experimental results demonstrate that our proposed methods significantly improve the performance of LLM-ASR in the target domain, achieving a relative word error rate reduction of 19.2%. Full article
(This article belongs to the Special Issue Speech Recognition: Techniques, Applications and Prospects)
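One common way to realize the auxiliary CTC idea described above is to add a weighted CTC term, computed on the speech encoder's outputs, to the main training objective. The PyTorch sketch below illustrates that pattern with dummy tensors; the shapes, vocabulary size, and weighting factor are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

T, N, C, S = 50, 4, 32, 12          # frames, batch, vocab (0 = blank), target length
log_probs = torch.randn(T, N, C).log_softmax(dim=-1)   # stand-in for encoder outputs
targets = torch.randint(1, C, (N, S))                   # pseudo-label token ids
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
aux_ctc_loss = ctc(log_probs, targets, input_lengths, target_lengths)

main_loss = torch.tensor(2.3)        # placeholder for the LLM decoder's own loss
lam = 0.3                            # assumed weighting of the auxiliary term
total_loss = main_loss + lam * aux_ctc_loss
print(total_loss.item())
```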

17 pages, 4452 KB  
Article
SAUCF: A Framework for Secure, Natural-Language-Guided UAS Control
by Nihar Shah, Varun Aggarwal and Dharmendra Saraswat
Drones 2025, 9(12), 860; https://doi.org/10.3390/drones9120860 - 14 Dec 2025
Viewed by 561
Abstract
Precision agriculture increasingly recognizes the transformative potential of unmanned aerial systems (UASs) for crop monitoring and field assessment, yet research consistently highlights significant usability barriers as the main constraints to widespread adoption. Complex mission planning processes, including detailed flight plan creation and waypoint management, pose substantial technical challenges that mainly affect non-expert operators. Farmers and their teams generally prefer user-friendly, straightforward tools, as evidenced by the rapid adoption of GPS guidance systems, which underscores the need for simpler mission planning in UAS operations. To enhance accessibility and safety in UAS control, especially for non-expert operators in agriculture and related fields, we propose a Secure UAS Control Framework (SAUCF): a comprehensive system for natural-language-driven UAS mission management with integrated dual-factor biometric authentication. The framework converts spoken user instructions into executable flight plans by leveraging a language-model-powered mission planner that interprets transcribed voice commands and generates context-aware operational directives, including takeoff, location monitoring, return-to-home, and landing operations. Mission orchestration is performed through a large language model (LLM) agent, coupled with a human-in-the-loop supervision mechanism that enables operators to review, adjust, or confirm mission plans before deployment. Additionally, SAUCF offers a manual override feature, allowing users to assume direct control or interrupt missions at any stage, ensuring safety and adaptability in dynamic environments. Proof-of-concept demonstrations on a UAS platform with on-board computing validated reliable speech-to-text transcription, biometric verification via voice matching and face authentication, and effective Sim2Real transfer of natural-language-driven mission plans from simulation environments to physical UAS operations. Initial evaluations showed that SAUCF reduced mission planning time, minimized command errors, and simplified complex multi-objective workflows compared to traditional waypoint-based tools, though comprehensive field validation remains necessary to confirm these preliminary findings. The integration of natural-language-based interaction, real-time identity verification, human-in-the-loop LLM orchestration, and manual override capabilities allows SAUCF to significantly lower the technical barrier to UAS operation while ensuring mission security, operational reliability, and operator agency in real-world conditions. These findings lay the groundwork for systematic field trials and suggest that prioritizing ease of operation in mission planning can drive broader deployment of UAS technologies. Full article
(This article belongs to the Section Artificial Intelligence in Drones (AID))
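As a rough illustration of the natural-language-to-mission-plan step with a human-in-the-loop confirmation gate, the sketch below drafts a reviewable plan from a transcribed instruction. All names, fields, and defaults are hypothetical and are not SAUCF's actual interfaces.

```python
# Hypothetical sketch of turning a transcribed instruction into a reviewable
# mission plan, mirroring the takeoff / monitor / return-to-home / land steps
# mentioned above. Names and structure are illustrative only.
def draft_mission(transcript: str, takeoff_alt_m: float = 30.0) -> list[dict]:
    plan = [{"step": "takeoff", "altitude_m": takeoff_alt_m}]
    text = transcript.lower()
    if "monitor" in text or "check" in text:
        plan.append({"step": "monitor_location", "duration_s": 120})
    plan += [{"step": "return_to_home"}, {"step": "land"}]
    return plan

def confirm_with_operator(plan: list[dict]) -> bool:
    """Human-in-the-loop gate: the operator reviews the plan before execution."""
    for step in plan:
        print(step)
    return input("Execute this plan? [y/N] ").strip().lower() == "y"

plan = draft_mission("take off and monitor the north field, then come back")
if confirm_with_operator(plan):
    print("dispatching mission...")   # hand off to the flight controller here
```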

22 pages, 929 KB  
Article
Low-Resource Speech Recognition by Fine-Tuning Whisper with Optuna-LoRA
by Huan Wang, Jie Bin, Chunyan Gou, Lian Yang, Baolin Hou and Mingwei Qin
Appl. Sci. 2025, 15(24), 13090; https://doi.org/10.3390/app152413090 - 12 Dec 2025
Viewed by 1532
Abstract
In low-resource speech recognition, the performance of the Whisper model is often limited by the size of the available training data. To address this challenge, this paper proposes a training optimization method for the Whisper model that integrates Low-Rank Adaptation (LoRA) with the Optuna hyperparameter optimization framework. This combined approach enables efficient fine-tuning and enhances model performance. A dual-metric early stopping strategy, based on validation loss and relative word error rate improvement, is introduced to ensure robust convergence during training. Experimental data were collected from three low-resource languages in Xinjiang, China: Uyghur, Kazakh, and Kyrgyz. Compared to baseline LoRA fine-tuning, the proposed optimization method reduces WER by 20.98%, 6.46%, and 8.72%, respectively, across the three languages. The dual-metric early stopping strategy effectively shortens optimization time while preventing overfitting. Overall, these results demonstrate that the proposed method significantly reduces both WER and computational costs, making it highly effective for low-resource speech recognition tasks. Full article
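The general pattern of searching LoRA hyperparameters with Optuna can be sketched as below. The search space, target modules, and the train_and_eval_wer() stub are placeholders for illustration only; the paper's exact setup, early-stopping strategy, and datasets are not reproduced here.

```python
import optuna
from peft import LoraConfig, get_peft_model
from transformers import WhisperForConditionalGeneration

def train_and_eval_wer(model, lr: float) -> float:
    """Placeholder: fine-tune `model` at learning rate `lr` on your data and
    return the validation WER. Returns a dummy value so the sketch runs."""
    return 1.0

def objective(trial: optuna.Trial) -> float:
    lora = LoraConfig(
        r=trial.suggest_categorical("r", [4, 8, 16, 32]),
        lora_alpha=trial.suggest_categorical("lora_alpha", [16, 32, 64]),
        lora_dropout=trial.suggest_float("lora_dropout", 0.0, 0.2),
        target_modules=["q_proj", "v_proj"],        # assumed attention projections
    )
    model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
    model = get_peft_model(model, lora)             # only LoRA weights stay trainable
    lr = trial.suggest_float("lr", 1e-5, 1e-3, log=True)
    return train_and_eval_wer(model, lr)            # Optuna minimizes this value

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=20)
print(study.best_params)
```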

15 pages, 828 KB  
Article
N-Gram and RNN-LM Language Model Integration for End-to-End Amazigh Speech Recognition
by Meryam Telmem, Naouar Laaidi, Youssef Ghanou and Hassan Satori
Mach. Learn. Knowl. Extr. 2025, 7(4), 164; https://doi.org/10.3390/make7040164 - 10 Dec 2025
Viewed by 769
Abstract
This work investigates how different language modeling techniques affect the performance of an end-to-end automatic speech recognition (ASR) system for the Amazigh language. A (CNN-BiLSTM-CTC) model enhanced with an attention mechanism was used as the baseline. During decoding, two external language models were integrated using shallow fusion: a trigram N-gram model built with KenLM and a recurrent neural network language model (RNN-LM) trained on the same Tifdigit corpus. Four decoding methods were compared: greedy decoding; beam search; beam search with an N-gram language model; and beam search with a compact recurrent neural network language model. Experimental results on the Tifdigit dataset reveal a clear trade-off: the N-gram language model produces the best results compared to RNN-LM, with a phonetic error rate (PER) of 0.0268, representing a relative improvement of 4.0% over the greedy baseline model, and translates into an accuracy of 97.32%. This suggests that N-gram models can outperform neural approaches when reliable, limited data and lexical resources are available. The improved N-gram approach notably outperformed both simple beam search and the RNN neural language model. This improvement is due to higher-order context modeling, its optimized interpolation weights, and its adaptive lexical weighting tailored to the phonotactic structure of the Amazigh language. Full article
(This article belongs to the Section Learning)
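Shallow fusion, as used above, simply adds a weighted language-model log-probability to the acoustic score when ranking beam hypotheses. The toy example below illustrates the scoring rule; the weight value is illustrative and would normally be tuned on a development set.

```python
import math

def fused_score(acoustic_logprob: float, lm_prob: float, lm_weight: float = 0.5) -> float:
    """Shallow fusion: score = log P_am(y|x) + lm_weight * log P_lm(y).
    The weight is illustrative; in practice it is tuned on a dev set."""
    return acoustic_logprob + lm_weight * math.log(lm_prob)

# Two beam hypotheses with identical acoustic scores: the LM breaks the tie.
print(fused_score(-4.2, lm_prob=0.03))    # plausible word sequence
print(fused_score(-4.2, lm_prob=0.0005))  # implausible sequence scores lower
```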

24 pages, 1289 KB  
Systematic Review
Electrical Cortical Stimulation for Language Mapping in Epilepsy Surgery—A Systematic Review
by Honglin Zhu, Efthymia Korona, Sepehr Shirani, Fatemeh Samadian, Gonzalo Alarcon, Antonio Valentin and Ioannis Stavropoulos
Brain Sci. 2025, 15(12), 1267; https://doi.org/10.3390/brainsci15121267 - 26 Nov 2025
Viewed by 858
Abstract
Background: Language mapping is a critical component of epilepsy surgery, as postoperative language deficits can significantly impact patients’ quality of life. Electrical stimulation mapping has emerged as a valuable tool for identifying eloquent areas of the brain and minimising post-surgical language deficits. However, recent studies have shown that language deficits can occur despite language mapping, potentially due to variability in stimulation techniques and language task selection. The validity of specific linguistic tasks for mapping different cortical regions remains inadequately characterised. Objective: To systematically evaluate the validity of linguistic tasks used during electrical cortical stimulation (ECS) for language mapping in epilepsy surgery, analyse task-specific responses across cortical regions, and assess current evidence supporting optimal task selection for different brain areas. Methods: Following PRISMA 2020 guidelines, a systematic literature search was conducted in PubMed and Scopus covering articles published from January 2013 to November 2025. Studies on language testing with electrical cortical stimulation in epilepsy surgery cases were screened. Two reviewers independently screened 956 articles, with 45 meeting the inclusion criteria. Data extraction included language tasks, stimulation modalities (ECS, SEEG, ECoG, DECS), cortical regions and language error types. Results: Heterogeneity in language testing techniques across various centres was identified. Visual naming deficits were primarily associated with stimulation of the posterior and basal temporal regions, fusiform gyrus, and parahippocampal gyrus. Auditory naming elicited impairments in the posterior superior and middle temporal gyri, angular gyrus, and fusiform gyrus. Spontaneous speech errors varied, with phonemic dysphasic errors linked to the inferior frontal and supramarginal gyri, and semantic errors arising from superior temporal and perisylvian parietal regions. Conclusions: Task-specific language mapping reveals distinct cortical specialisations, with systematic patterns emerging across studies. However, marked variability in testing protocols and inadequate standardisation limit reproducibility and cross-centre comparisons. Overall, refining and standardising the language task implementation process could lead to improved outcomes, ultimately minimising resection-related language impairment. Future research should validate task–region associations through prospective multicentre studies with long-term outcome assessment. Full article
(This article belongs to the Topic Language: From Hearing to Speech and Writing)

38 pages, 6745 KB  
Article
Tongan Speech Recognition Based on Layer-Wise Fine-Tuning Transfer Learning and Lexicon Parameter Enhancement
by Junhao Geng, Dongyao Jia, Ziqi Li, Zihao He, Nengkai Wu, Weijia Zhang and Rongtao Cui
Appl. Sci. 2025, 15(21), 11412; https://doi.org/10.3390/app152111412 - 24 Oct 2025
Viewed by 823
Abstract
Speech recognition, as a key driver of artificial intelligence and global communication, has advanced rapidly in major languages, while studies on low-resource languages remain limited. Tongan, a representative Polynesian language, carries significant cultural value. However, Tongan speech recognition faces three main challenges: data scarcity, limited adaptability of transfer learning, and weak dictionary modeling. This study proposes improvements in adaptive transfer learning and NBPE-based dictionary modeling to address these issues. An adaptive transfer learning strategy with layer-wise unfreezing and dynamic learning rate adjustment is introduced, enabling effective adaptation of pretrained models to the target language while improving accuracy and efficiency. In addition, the MEA-AGA is developed by combining the Mind Evolutionary Algorithm (MEA) with the Adaptive Genetic Algorithm (AGA) to optimize the number of byte-pair encoding (NBPE) parameters, thereby enhancing recognition accuracy and speed. The collected Tongan speech data were expanded and preprocessed, after which the experiments were conducted on an NVIDIA RTX 4070 GPU (16 GB) using CUDA 11.8 under the Ubuntu 18.04 operating system. Experimental results show that the proposed method achieved a word error rate (WER) of 26.18% and a word-per-second (WPS) rate of 68, demonstrating clear advantages over baseline methods and confirming its effectiveness for low-resource language applications. Although the proposed approach demonstrates promising performance, this study is still limited by the relatively small corpus size and the early stage of research exploration. Future work will focus on expanding the dataset, refining adaptive transfer strategies, and enhancing cross-lingual generalization to further improve the robustness and scalability of the model. Full article
(This article belongs to the Special Issue Techniques and Applications of Natural Language Processing)
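Layer-wise unfreezing, one of the strategies described above, can be implemented by progressively enabling gradients for the top layers of a pretrained model while lowering the learning rate at each stage. The sketch below shows the general mechanism on a toy model; layer names and the schedule are assumptions, not the paper's settings.

```python
import torch.nn as nn

def unfreeze_top_layers(model: nn.Module, layer_order: list[str], n_unfrozen: int) -> None:
    """Freeze all parameters, then re-enable gradients for the top `n_unfrozen`
    layers. `layer_order` lists parameter-name prefixes from bottom to top."""
    for p in model.parameters():
        p.requires_grad = False
    active = set(layer_order[-n_unfrozen:])
    for name, p in model.named_parameters():
        if any(name.startswith(prefix) for prefix in active):
            p.requires_grad = True

# Toy model standing in for a pretrained acoustic encoder.
model = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 8), nn.Linear(8, 4))
# Assumed schedule: unfreeze more layers while lowering the learning rate each stage.
schedule = [(1, 1e-3), (2, 5e-4), (3, 1e-4)]
for n_layers, lr in schedule:
    unfreeze_top_layers(model, layer_order=["0", "1", "2"], n_unfrozen=n_layers)
    trainable = [n for n, p in model.named_parameters() if p.requires_grad]
    print(f"stage lr={lr}: trainable -> {trainable}")
```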

17 pages, 2618 KB  
Article
Optimizer-Aware Fine-Tuning of Whisper Small with Low-Rank Adaption: An Empirical Study of Adam and AdamW
by Hadia Arshad, Tahir Abdullah, Mariam Rehman, Afzaal Hussain, Faria Kanwal and Mehwish Parveen
Information 2025, 16(11), 928; https://doi.org/10.3390/info16110928 - 22 Oct 2025
Viewed by 1396
Abstract
Whisper is a transformer-based multilingual model that has demonstrated state-of-the-art performance in numerous languages. However, efficiency remains a persistent challenge when computational resources are limited. To address this issue, an experiment was performed on librispeech-train-clean-100 for training purposes. The test-clean set was utilized to evaluate its performance. To enhance efficiency and to meet the computational constraints, a parameter-efficient fine-tuning technique, i.e., Low-Rank Adaptation, was employed to add a limited number of trainable parameters into the frozen layers of the model. The results showed that Low-Rank Adaptation attained excellent Automatic Speech Recognition results while using fewer computational resources, showing its effectiveness for resource-saving adaptation. The research work emphasizes the promise of Low-Rank Adaptation as a lightweight and scalable fine-tuning strategy for large speech models using a transformer architecture. The baseline Whisper Small model achieved a word error rate of 16.7% without any parameter-efficient adaptation. In contrast, the model fine-tuned with Low-Rank Adaptation achieved a lower word error rate of 6.08%, demonstrating the adaptability of the proposed parameter-efficient approach. Full article
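The practical difference between the two optimizers compared in this study is that AdamW applies weight decay directly to the parameters (decoupled), whereas Adam folds it into the gradient as L2 regularization. A minimal setup over only the trainable (LoRA) parameters might look like the sketch below; learning rate and decay values are illustrative.

```python
import torch

def build_optimizer(model: torch.nn.Module, name: str = "adamw",
                    lr: float = 1e-4, weight_decay: float = 0.01):
    """Optimize only the trainable (e.g., LoRA) parameters.
    AdamW decouples weight decay from the adaptive update; Adam applies it
    through the gradient as L2 regularization."""
    params = [p for p in model.parameters() if p.requires_grad]
    if name == "adamw":
        return torch.optim.AdamW(params, lr=lr, weight_decay=weight_decay)
    return torch.optim.Adam(params, lr=lr, weight_decay=weight_decay)

model = torch.nn.Linear(16, 16)          # stand-in for a LoRA-wrapped Whisper model
opt = build_optimizer(model, "adamw")
print(type(opt).__name__)                # AdamW
```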

7 pages, 1456 KB  
Proceeding Paper
Towards a More Natural Urdu: A Comprehensive Approach to Text-to-Speech and Voice Cloning
by Muhammad Ramiz Saud, Muhammad Romail Imran and Raja Hashim Ali
Eng. Proc. 2025, 87(1), 112; https://doi.org/10.3390/engproc2025087112 - 20 Oct 2025
Cited by 13 | Viewed by 1567
Abstract
This paper introduces a comprehensive approach to building natural-sounding Urdu Text-to-Speech (TTS) and voice cloning systems, addressing the lack of computational resources for Urdu. We developed a large-scale dataset of over 100 h of Urdu speech, carefully cleaned and phonetically aligned through an automated transcription pipeline to preserve linguistic accuracy. The dataset was then used to fine-tune Tacotron2, a neural network model originally trained for English, with modifications tailored to Urdu’s phonological and morphological features. To further enhance naturalness, we integrated voice cloning techniques that capture regional accents and produce personalized speech outputs. Model performance was evaluated through mean opinion score (MOS), word error rate (WER), and speaker similarity, showing substantial improvements compared to previous Urdu systems. The results demonstrate clear progress toward natural and intelligible Urdu speech synthesis, while also revealing challenges such as handling dialectal variation and preventing model overfitting. This work contributes an essential resource and methodology for advancing Urdu natural language processing (NLP), with promising applications in education, accessibility, entertainment, and assistive technologies. Full article
(This article belongs to the Proceedings of The 5th International Electronic Conference on Applied Sciences)
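Speaker similarity, one of the evaluation metrics above, is commonly computed as the cosine similarity between speaker embeddings of the reference and the cloned speech. The sketch below shows that computation with synthetic embeddings; how the embeddings are obtained (e.g., a pretrained speaker encoder) is left open.

```python
import numpy as np

def speaker_similarity(emb_ref: np.ndarray, emb_synth: np.ndarray) -> float:
    """Cosine similarity between speaker embeddings of reference and cloned speech."""
    emb_ref = emb_ref / np.linalg.norm(emb_ref)
    emb_synth = emb_synth / np.linalg.norm(emb_synth)
    return float(np.dot(emb_ref, emb_synth))

rng = np.random.default_rng(0)
ref = rng.normal(size=256)                      # stand-in reference embedding
clone = ref + 0.3 * rng.normal(size=256)        # a cloned voice close to the reference
print(round(speaker_similarity(ref, clone), 3)) # close to 1.0
```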

16 pages, 2489 KB  
Article
Sentence-Level Silent Speech Recognition Using a Wearable EMG/EEG Sensor System with AI-Driven Sensor Fusion and Language Model
by Nicholas Satterlee, Xiaowei Zuo, Kee Moon, Sung Q. Lee, Matthew Peterson and John S. Kang
Sensors 2025, 25(19), 6168; https://doi.org/10.3390/s25196168 - 5 Oct 2025
Viewed by 2781
Abstract
Silent speech recognition (SSR) enables communication without vocalization by interpreting biosignals such as electromyography (EMG) and electroencephalography (EEG). Most existing SSR systems rely on high-density, non-wearable sensors and focus primarily on isolated word recognition, limiting their practical usability. This study presents a wearable SSR system capable of accurate sentence-level recognition using single-channel EMG and EEG sensors with real-time wireless transmission. A moving window-based few-shot learning model, implemented with a Siamese neural network, segments and classifies words from continuous biosignals without requiring pauses or manual segmentation between word signals. A novel sensor fusion model integrates both EMG and EEG modalities, enhancing classification accuracy. To further improve sentence-level recognition, a statistical language model (LM) is applied as post-processing to correct syntactic and lexical errors. The system was evaluated on a dataset of four military command sentences containing ten unique words, achieving 95.25% sentence-level recognition accuracy. These results demonstrate the feasibility of sentence-level SSR using wearable sensors through a window-based few-shot learning model, sensor fusion, and ML applied to limited simultaneous EMG and EEG signals. Full article
(This article belongs to the Special Issue Advanced Sensing Techniques in Biomedical Signal Processing)
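The moving-window, few-shot idea above can be approximated, for illustration, by sliding a fixed-length window over the fused feature stream and labelling each window with its nearest stored word prototype, a crude stand-in for the Siamese network's learned similarity. Everything in the sketch below (window length, threshold, prototypes) is hypothetical.

```python
import numpy as np

def classify_stream(signal: np.ndarray, prototypes: dict[str, np.ndarray],
                    window: int, step: int, threshold: float = 0.8) -> list[str]:
    """Slide a fixed-length window over a fused EMG/EEG feature stream and label
    each window by its most similar word prototype; windows below the similarity
    threshold are treated as non-speech and skipped."""
    labels = []
    for start in range(0, len(signal) - window + 1, step):
        seg = signal[start:start + window]
        seg = seg / (np.linalg.norm(seg) + 1e-8)
        best_word, best_sim = None, -1.0
        for word, proto in prototypes.items():
            sim = float(seg @ (proto / (np.linalg.norm(proto) + 1e-8)))
            if sim > best_sim:
                best_word, best_sim = word, sim
        if best_sim >= threshold:
            labels.append(best_word)
    return labels

# Toy usage: two stored word templates, one noisy stream built from them.
rng = np.random.default_rng(1)
protos = {"advance": rng.normal(size=64), "halt": rng.normal(size=64)}
stream = np.concatenate([protos["advance"], protos["halt"]]) + 0.1 * rng.normal(size=128)
print(classify_stream(stream, protos, window=64, step=64))  # ['advance', 'halt']
```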
