Search Results (4)

Search Parameters:
Keywords = dysarthric speech recognition

25 pages, 492 KiB  
Article
A Study on Model Training Strategies for Speaker-Independent and Vocabulary-Mismatched Dysarthric Speech Recognition
by Jinzi Qi and Hugo Van hamme
Appl. Sci. 2025, 15(4), 2006; https://doi.org/10.3390/app15042006 - 14 Feb 2025
Viewed by 932
Abstract
Automatic speech recognition (ASR) systems often struggle to recognize speech from individuals with dysarthria, a speech disorder with neuromuscular causes, with accuracy declining further for unseen speakers and content. Achieving robustness in such situations requires ASR systems to address speaker-independent and vocabulary-mismatched scenarios while minimizing user adaptation effort. This study focuses on comprehensive training strategies and methods to tackle these challenges, leveraging the transformer-based Wav2Vec2.0 model. Unlike prior research, which often focuses on limited datasets, we systematically explore training data selection strategies across diverse source types (languages, canonical vs. dysarthric, and generic vs. in-domain) in a speaker-independent setting. For the under-explored vocabulary-mismatched scenarios, we evaluate conventional methods, identify their limitations, and propose a solution that uses phonological features as intermediate representations for phone recognition. Experimental results demonstrate that this approach enhances recognition across dysarthric datasets in both speaker-independent and vocabulary-mismatched settings. By integrating advanced transfer learning techniques with the innovative use of phonological features, this study addresses key challenges for dysarthric speech recognition, setting a new benchmark for robustness and adaptability in the field.
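The abstract's central idea, inserting phonological features as an intermediate layer between the acoustic encoder and the phone predictions, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the feature inventory size, phone set size, and the two linear heads are assumptions, and only the encoder-to-features-to-phones structure reflects the abstract.

```python
# Minimal sketch (PyTorch + Hugging Face transformers): predict frame-wise
# phonological features from Wav2Vec2.0 embeddings, then decode phones from
# those features. Inventory sizes and head shapes are illustrative guesses.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

N_PHONOLOGICAL_FEATURES = 24   # assumed size of the feature inventory
N_PHONES = 40                  # assumed phone set size

class PhonologicalPhoneRecognizer(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        hidden = self.encoder.config.hidden_size
        # Intermediate representation: per-frame phonological feature posteriors.
        self.feature_head = nn.Linear(hidden, N_PHONOLOGICAL_FEATURES)
        # Phones are decoded from the feature vectors rather than directly
        # from the acoustic embedding.
        self.phone_head = nn.Linear(N_PHONOLOGICAL_FEATURES, N_PHONES)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        frames = self.encoder(waveform).last_hidden_state    # (B, T, H)
        features = torch.sigmoid(self.feature_head(frames))  # (B, T, F)
        return self.phone_head(features)                     # phone logits

model = PhonologicalPhoneRecognizer()
logits = model(torch.randn(1, 16000))  # one second of 16 kHz audio
print(logits.shape)                    # (batch, frames, N_PHONES)
```

For the phone-recognition task the abstract describes, a CTC loss over the phone logits would be the natural training objective; the phonological bottleneck is plausibly what supports transfer to mismatched vocabulary, since unseen phone sequences still decompose into features seen during training.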

17 pages, 3114 KiB  
Article
Real-Time Communication Aid System for Korean Dysarthric Speech
by Kwanghyun Park and Jungpyo Hong
Appl. Sci. 2025, 15(3), 1416; https://doi.org/10.3390/app15031416 - 30 Jan 2025
Viewed by 1331
Abstract
Dysarthria is a speech disorder characterized by difficulties in articulation and vocalization due to impaired control of the articulatory system. Around 30% of individuals with speech disorders have dysarthria and face significant communication challenges. Existing assistive tools for dysarthria either require additional manipulation or provide only word-level speech support, limiting their usefulness for effective communication in real-world situations. This paper therefore proposes a real-time communication aid system that converts sentence-level Korean dysarthric speech to non-dysarthric normal speech. The proposed system consists of two main components arranged in a cascade. First, a Korean automatic speech recognition (ASR) model is trained on dysarthric utterances using a conformer-based architecture and the graph transducer network–connectionist temporal classification algorithm, significantly improving recognition performance over previous models. Second, a Korean text-to-speech (TTS) model based on JETS (Jointly Training FastSpeech2 and HiFi-GAN for end-to-end Text-to-Speech) is pipelined to synthesize high-quality non-dysarthric normal speech. These models are integrated into a single system on an app server, which receives 5–10 s of dysarthric speech and returns the converted normal speech within 2–3 s, providing a practical communication aid for people with dysarthria.
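The cascade the abstract describes (dysarthric speech to ASR transcript to TTS resynthesis) can be sketched structurally as below. The two model classes are stand-in stubs, not the paper's conformer ASR or JETS TTS; only the cascading data flow reflects the abstract.

```python
# Structural sketch of the cascade: dysarthric speech -> ASR transcript
# -> TTS resynthesis as non-dysarthric speech. Model internals are stubs.
from dataclasses import dataclass

@dataclass
class AudioClip:
    samples: list       # raw waveform samples
    sample_rate: int    # e.g., 16000 Hz

class DysarthricASR:
    """Stub standing in for the conformer-based Korean ASR model."""
    def transcribe(self, clip: AudioClip) -> str:
        return "recognized sentence"  # placeholder transcript

class NormalSpeechTTS:
    """Stub standing in for the JETS-based Korean TTS model."""
    def synthesize(self, text: str) -> AudioClip:
        return AudioClip(samples=[0.0] * 16000, sample_rate=16000)

def convert(clip: AudioClip, asr: DysarthricASR, tts: NormalSpeechTTS) -> AudioClip:
    # Cascade: the TTS consumes only the ASR transcript, so each stage
    # can be trained, evaluated, and replaced independently.
    return tts.synthesize(asr.transcribe(clip))

out = convert(AudioClip([0.0] * 80000, 16000), DysarthricASR(), NormalSpeechTTS())
print(len(out.samples) / out.sample_rate, "s of synthesized speech")
```

The design choice worth noting is the decoupling: because the interface between the stages is plain text, the app server can retrain or swap either model without touching the other.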

23 pages, 3932 KiB  
Review
A Survey of Automatic Speech Recognition for Dysarthric Speech
by Zhaopeng Qian and Kejing Xiao
Electronics 2023, 12(20), 4278; https://doi.org/10.3390/electronics12204278 - 16 Oct 2023
Cited by 15 | Viewed by 5898
Abstract
Dysarthric speech has several pathological characteristics that distinguish it from healthy speech, such as discontinuous pronunciation, uncontrolled volume, slow speech, explosive pronunciation, improper pauses, excessive nasality, and air-flow noise during pronunciation. Automatic speech recognition (ASR) can be very helpful for speakers with dysarthria. Our research provides a scoping review of ASR for dysarthric speech, covering papers in this field from 1990 to 2022. Our survey found that research on the acoustic features and the acoustic models of dysarthric speech has developed nearly in parallel. During the 2010s, deep learning technologies were widely applied to improve the performance of ASR systems. In the deep learning era, advanced methods such as convolutional neural networks, deep neural networks, and recurrent neural networks have been applied to design acoustic models as well as lexical and language models for dysarthric speech recognition, and deep learning methods have also been used to extract acoustic features from dysarthric speech. Additionally, this scoping review found that speaker dependence seriously limits the generalizability of acoustic models, and the available dysarthric speech data remain far too scarce to train models at the scale that modern data-driven methods require.
(This article belongs to the Special Issue Advanced Natural Language Processing Technology and Applications)

16 pages, 8210 KiB  
Article
A Speech Command Control-Based Recognition System for Dysarthric Patients Based on Deep Learning Technology
by Yu-Yi Lin, Wei-Zhong Zheng, Wei Chung Chu, Ji-Yan Han, Ying-Hsiu Hung, Guan-Min Ho, Chia-Yuan Chang and Ying-Hui Lai
Appl. Sci. 2021, 11(6), 2477; https://doi.org/10.3390/app11062477 - 10 Mar 2021
Cited by 29 | Viewed by 4652
Abstract
Voice control is an important way of controlling mobile devices; however, using it remains a challenge for dysarthric patients. Many approaches, such as automatic speech recognition (ASR) systems, are currently used to help dysarthric patients control mobile devices, but the large computational power required by ASR systems increases implementation costs. To alleviate this problem, this study proposed a convolutional neural network (CNN) with phonetic posteriorgram (PPG) speech features, called CNN–PPG, to recognize speech commands; a CNN model with Mel-frequency cepstral coefficients (CNN–MFCC) and an ASR-based system were used for comparison. The experimental results show that the CNN–PPG system achieved 93.49% accuracy, better than the CNN–MFCC (65.67%) and ASR-based (89.59%) systems. Additionally, the CNN–PPG model was smaller, with only 54% as many parameters as the ASR-based system; hence, the proposed system could reduce implementation costs for users. These findings suggest that the CNN–PPG system could augment a communication device to help dysarthric patients control mobile devices via speech commands in the future.
(This article belongs to the Special Issue Machine Learning and Signal Processing for IOT Applications)
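A minimal sketch of a CNN command classifier over PPG inputs, in the spirit of the CNN–PPG system described above. The PPG extractor itself (an acoustic model emitting frame-wise phone posteriors) is assumed to exist upstream, and all layer sizes, the phone count, and the command vocabulary size are illustrative assumptions, not the paper's configuration.

```python
# Sketch of a small CNN that classifies a phonetic posteriorgram (PPG)
# into a fixed command vocabulary. All dimensions are assumed values.
import torch
import torch.nn as nn

N_PHONES = 40      # assumed PPG dimensionality (phone posteriors per frame)
N_COMMANDS = 19    # assumed command vocabulary size

class PPGCommandCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),   # pool to a fixed grid
        )
        self.classifier = nn.Linear(32 * 4 * 4, N_COMMANDS)

    def forward(self, ppg: torch.Tensor) -> torch.Tensor:
        # ppg: (batch, frames, N_PHONES); add a channel axis for Conv2d
        x = self.conv(ppg.unsqueeze(1))
        return self.classifier(x.flatten(1))

model = PPGCommandCNN()
logits = model(torch.rand(2, 120, N_PHONES))  # two 120-frame PPGs
print(logits.shape)                           # torch.Size([2, 19])
```

The adaptive pooling layer makes the classifier independent of utterance length, which matters for spoken commands of varying duration; the real system's topology may of course differ.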