Applications of Speech and Language Technologies in Healthcare

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Acoustics and Vibrations".

Deadline for manuscript submissions: closed (31 October 2022) | Viewed by 22932

Special Issue Editors


Guest Editor
HiTZ Basque Center for Language Technologies—Aholab, University of the Basque Country UPV/EHU, 48013 Bilbao, Spain
Interests: speech analysis, processing, and synthesis; signal processing; prosody; spoken language resources; esophageal speech; voice conversion; speaker characterization; technologies for oral disabilities; silent speech interfaces

Guest Editor
Department of Signal Theory, Telematics and Communications, University of Granada, 18071 Granada, Spain
Interests: clinical applications of speech technology; silent speech interfaces; neuroprosthetics; speech synthesis; speech processing; automatic speaker verification; speech biometrics

Guest Editor
Department of Computer Science, University of Sheffield, Sheffield S1 4DP, UK
Interests: automatic recognition of atypical speech; detection and tracking of verbal and nonverbal traits in speech and language

Special Issue Information

Dear Colleagues,

Speech and language technologies (SLTs) have experienced a major boom in recent years. The use of voice to interact with machines is no longer a Sci-Fi dream, but a reality. We are currently experiencing an explosion in the use of virtual assistants and other speech-enabled devices that incorporate these technologies. The applications of speech technologies in different fields are innumerable. Most of the existing commercial speech-enabled systems are used at home and in cars (for example, virtual assistants and smart speakers). In addition, these technologies are already being used extensively in the entertainment industry (for subtitling and automatic translation in multimedia channels, social networks, video games, etc.).

Unfortunately, people with speech and language impairments often have difficulties when using these technologies, which aggravates the stigmatization of this population. Hence, it is important to take into account the special needs and characteristics of this sector of society when developing these technologies.

Apart from the abovementioned applications, another use of SLTs is as assistive technology to help people with speech impairments to communicate. Furthermore, speech-based diagnosis of certain voice and respiratory pathologies (e.g., dysphonia, vocal fold nodules, COVID-19) and even of certain neurological disorders, such as Alzheimer’s and Parkinson’s disease, has recently been shown to be possible. In addition, new communication interfaces with great potential are emerging, such as silent speech interfaces.

In this Special Issue, we aim to collect relevant contributions on the development of speech and language technologies focused on improving the integration of people with speech impairments in society, as well as on the detection and monitoring of pathologies and diseases. We also intend to attract studies on the development of applications for voice professionals in the clinical field.

Relevant research topics include (but are not limited to):

  • Speech and language technologies for augmentative and alternative communication (AAC);
  • Silent speech interfaces;
  • Voice conversion (VC) and text-to-speech (TTS) systems for speech restoration;
  • Automatic speech recognition for people with speech impairments;
  • Diagnosis and monitoring of voice disorders and other respiratory diseases;
  • Speech-based diagnosis and assessment of neurological disorders;
  • Personalization of speech tools for people with speech impairments;
  • Voice banking initiatives;
  • Tools and software for speech therapists and clinicians.

Dr. Inma Hernaez Rioja
Dr. José A. González-López
Dr. Heidi Christensen
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • speech analysis
  • speech processing
  • speech synthesis
  • spoken language resources
  • voice conversion
  • speaker characterization
  • technologies for oral disabilities
  • silent speech interfaces
  • automatic speech recognition
  • dysarthric speech
  • esophageal speech
  • augmentative and alternative communication
  • neuroprosthetics

Published Papers (11 papers)


Editorial

Jump to: Research, Review

3 pages, 195 KiB  
Editorial
Special Issue on Applications of Speech and Language Technologies in Healthcare
by Inma Hernáez-Rioja, Jose A. Gonzalez-Lopez and Heidi Christensen
Appl. Sci. 2023, 13(11), 6840; https://doi.org/10.3390/app13116840 - 05 Jun 2023
Viewed by 719
Abstract
In recent years, the exploration and uptake of digital health technologies have advanced rapidly, with real potential to revolutionise healthcare delivery and associated industries [...] Full article
(This article belongs to the Special Issue Applications of Speech and Language Technologies in Healthcare)

Research

Jump to: Editorial, Review

21 pages, 1431 KiB  
Article
Automated Detection of the Competency of Delivering Guided Self-Help for Anxiety via Speech and Language Processing
by Dalia Attas, Niall Power, Jessica Smithies, Charlotte Bee, Vikki Aadahl, Stephen Kellett, Chris Blackmore and Heidi Christensen
Appl. Sci. 2022, 12(17), 8608; https://doi.org/10.3390/app12178608 - 28 Aug 2022
Cited by 4 | Viewed by 1790
Abstract
Speech and language play an essential role in automatically assessing several psychotherapeutic qualities. These automation procedures require translating the manually rated qualities into speech and language features that accurately capture the assessed psychotherapeutic quality. Speech features can be determined by analysing recordings of psychotherapeutic conversations (acoustics), while language-based analyses rely on the transcriptions of such psychotherapeutic conversations (linguistics). Guided self-help is a psychotherapeutic intervention that relies mainly on the therapeutic competency of practitioners. This paper investigates the feasibility of automatically analysing guided self-help sessions for mild-to-moderate anxiety to detect and predict practitioner competence. This analysis is performed on sessions drawn from a patient preference randomised controlled trial using actual patient–practitioner conversations manually rated using a valid and reliable measure of competency. The results show the efficacy and potential of automatically detecting practitioners’ competence using a system based on acoustic and linguistic features extracted from transcripts generated by an automatic speech recogniser. Feature extraction, feature selection and classification or regression were implemented as blocks of the prediction model. The Lasso regression model achieved the best prediction results, with an R of 0.92 and lower error rates (MAE of 1.66 and RMSE of 2.25). Full article
(This article belongs to the Special Issue Applications of Speech and Language Technologies in Healthcare)
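As a rough illustration of the pipeline described in this abstract (feature extraction, feature selection and regression), the sketch below fits a Lasso model on placeholder per-session features and scores it with MAE and RMSE; the data, dimensions and hyperparameters are assumptions, not the authors' configuration.

```python
# Illustrative sketch only (placeholder data): competency prediction with
# feature selection + Lasso regression, scored with MAE and RMSE.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import mean_absolute_error, mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 200))      # one row per session: acoustic + linguistic features
y = rng.uniform(0, 24, size=60)     # manual competency ratings (placeholder scale)

model = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_regression, k=20)),  # keep the most informative features
    ("lasso", Lasso(alpha=0.1)),                  # sparse linear regressor
])

pred = cross_val_predict(model, X, y, cv=5)
print("MAE :", mean_absolute_error(y, pred))
print("RMSE:", np.sqrt(mean_squared_error(y, pred)))
```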

15 pages, 501 KiB  
Article
Intelligibility Improvement of Esophageal Speech Using Sequence-to-Sequence Voice Conversion with Auditory Attention
by Kadria Ezzine, Joseph Di Martino and Mondher Frikha
Appl. Sci. 2022, 12(14), 7062; https://doi.org/10.3390/app12147062 - 13 Jul 2022
Cited by 2 | Viewed by 1129
Abstract
Laryngectomees are individuals whose larynx has been surgically removed, usually due to laryngeal cancer. The immediate consequence of this operation is that these individuals (laryngectomees) are unable to speak. Esophageal speech (ES) remains the preferred alternative speaking method for laryngectomees. However, compared to the laryngeal voice, ES is characterized by low intelligibility and poor quality due to chaotic fundamental frequency F0, specific noises, and low intensity. Our proposal to solve these problems is to take advantage of voice conversion as an effective way to improve speech quality and intelligibility. To this end, we propose in this work a novel esophageal–laryngeal voice conversion (VC) system based on a sequence-to-sequence (Seq2Seq) model combined with an auditory attention mechanism. The originality of the proposed framework is that it adopts an auditory attention technique in our model, which leads to more efficient and adaptive feature mapping. In addition, our VC system does not require the classical DTW alignment process during the learning phase, which avoids erroneous mappings and significantly reduces the computational time. Moreover, to preserve the identity of the target speaker, the excitation and phase coefficients are estimated by querying a binary search tree. In experiments, objective and subjective tests confirmed that the proposed approach performs better even in some difficult cases in terms of speech quality and intelligibility. Full article
(This article belongs to the Special Issue Applications of Speech and Language Technologies in Healthcare)
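For readers unfamiliar with sequence-to-sequence voice conversion, the sketch below shows a generic attention-based encoder–decoder mapping source spectral frames to target frames in PyTorch. It uses ordinary additive attention and teacher forcing; it is not the paper's auditory attention mechanism or its binary-search-tree excitation estimation, and all dimensions are assumptions.

```python
# Generic attention-based Seq2Seq frame mapper (illustrative, not the authors' model).
import torch
import torch.nn as nn

class AttentionSeq2Seq(nn.Module):
    def __init__(self, in_dim=40, out_dim=40, hid=256):
        super().__init__()
        self.encoder = nn.GRU(in_dim, hid, batch_first=True, bidirectional=True)
        self.decoder = nn.GRUCell(out_dim + 2 * hid, hid)
        self.attn_q = nn.Linear(hid, hid)       # query projection (decoder state)
        self.attn_k = nn.Linear(2 * hid, hid)   # key projection (encoder frames)
        self.attn_v = nn.Linear(hid, 1)
        self.proj = nn.Linear(hid + 2 * hid, out_dim)

    def forward(self, src, tgt):
        # src: (B, T_src, in_dim) esophageal features; tgt: (B, T_tgt, out_dim) target frames
        enc, _ = self.encoder(src)
        B, T_tgt, _ = tgt.shape
        h = src.new_zeros(B, self.decoder.hidden_size)
        prev = src.new_zeros(B, tgt.size(-1))
        outs = []
        for t in range(T_tgt):
            # additive attention: score every encoder frame against the decoder state
            score = self.attn_v(torch.tanh(self.attn_q(h).unsqueeze(1) + self.attn_k(enc)))
            w = torch.softmax(score, dim=1)          # (B, T_src, 1)
            ctx = (w * enc).sum(dim=1)               # attention context vector
            h = self.decoder(torch.cat([prev, ctx], dim=-1), h)
            outs.append(self.proj(torch.cat([h, ctx], dim=-1)))
            prev = tgt[:, t]                         # teacher forcing during training
        return torch.stack(outs, dim=1)
```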

32 pages, 1711 KiB  
Article
Things to Consider When Automatically Detecting Parkinson’s Disease Using the Phonation of Sustained Vowels: Analysis of Methodological Issues
by Alex S. Ozbolt, Laureano Moro-Velazquez, Ioan Lina, Ankur A. Butala and Najim Dehak
Appl. Sci. 2022, 12(3), 991; https://doi.org/10.3390/app12030991 - 19 Jan 2022
Cited by 18 | Viewed by 2442
Abstract
Diagnosing Parkinson’s Disease (PD) necessitates monitoring symptom progression. Unfortunately, diagnostic confirmation often occurs years after disease onset. A more sensitive and objective approach is paramount to the expedient diagnosis and treatment of persons with PD (PwPDs). Recent studies have shown that we can train accurate models to detect signs of PD from audio recordings of confirmed PwPDs. However, disparities exist between studies and may be caused, in part, by differences in employed corpora or methodologies. Our hypothesis is that unaccounted covariates in methodology, experimental design, and data preparation resulted in overly optimistic results in studies of PD automatic detection employing sustained vowels. These issues include record-wise rather than subject-wise fold creation; an imbalance of age between the PwPD and control classes; using corpora that are too small relative to the dimensionality of the feature vectors; performing cross-validation without including development data; and the absence of cross-corpora testing to confirm results. In this paper, we evaluate the influence of these methodological issues in the automatic detection of PD employing sustained vowels. We perform several experiments isolating each issue to measure its influence employing three different corpora. Moreover, we analyze whether the perceived dysphonia of the speakers could be causing differences in results between the corpora. Results suggest that each independent methodological issue analyzed has an effect on classification accuracy. Consequently, we recommend a list of methodological steps to be considered in future experiments to avoid overoptimistic or misleading results. Full article
(This article belongs to the Special Issue Applications of Speech and Language Technologies in Healthcare)
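The record-wise versus subject-wise issue highlighted in this abstract is easy to reproduce; the toy sketch below (synthetic features, placeholder sizes) contrasts an ordinary KFold split, which can leak a speaker's recordings into both folds, with GroupKFold, which keeps each speaker on one side of the split.

```python
# Toy contrast between record-wise and subject-wise cross-validation folds.
import numpy as np
from sklearn.model_selection import KFold, GroupKFold, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_speakers, recs_per_speaker, n_feats = 30, 3, 20
speakers = np.repeat(np.arange(n_speakers), recs_per_speaker)
labels = np.repeat(rng.integers(0, 2, n_speakers), recs_per_speaker)     # PD vs. control
feats = rng.normal(size=(labels.size, n_feats)) + 0.3 * labels[:, None]  # synthetic features

clf = SVC()
record_wise = cross_val_score(clf, feats, labels, cv=KFold(5, shuffle=True, random_state=0))
subject_wise = cross_val_score(clf, feats, labels, cv=GroupKFold(5), groups=speakers)
print("record-wise accuracy :", record_wise.mean())   # typically optimistic
print("subject-wise accuracy:", subject_wise.mean())
```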

20 pages, 1448 KiB  
Article
Ultrasonic Doppler Based Silent Speech Interface Using Perceptual Distance
by Ki-Seung Lee
Appl. Sci. 2022, 12(2), 827; https://doi.org/10.3390/app12020827 - 14 Jan 2022
Cited by 3 | Viewed by 1291
Abstract
Moderate performance in terms of intelligibility and naturalness can be obtained using previously established silent speech interface (SSI) methods. Nevertheless, a common problem associated with SSI has involved deficiencies in estimating the spectrum details, which results in synthesized speech signals that are rough, harsh, and unclear. In this study, harmonic enhancement (HE) was used during postprocessing to alleviate this problem by emphasizing the spectral fine structure of speech signals. To improve the subjective quality of synthesized speech, the difference between synthesized and actual speech was established by calculating the distance in the perceptual domains instead of using the conventional mean square error (MSE). Two deep neural networks (DNNs) were employed to separately estimate the speech spectra and the filter coefficients of HE, connected in a cascading manner. The DNNs were trained to incrementally and iteratively minimize both the MSE and the perceptual distance (PD). A feasibility test showed that the perceptual evaluation of speech quality (PESQ) and the short-time objective intelligibility measure (STOI) were improved by 17.8% and 2.9%, respectively, compared with previous methods. Subjective listening tests revealed that the proposed method yielded perceptually preferred results compared with those of the conventional MSE-based method. Full article
(This article belongs to the Special Issue Applications of Speech and Language Technologies in Healthcare)
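The two objective scores quoted in this abstract, PESQ and STOI, can be reproduced for any reference/synthesized pair with the third-party pesq and pystoi packages; the file names below are placeholders and 16 kHz mono audio is assumed.

```python
# Hedged example: objective quality (PESQ) and intelligibility (STOI) scores.
import soundfile as sf
from pesq import pesq     # pip install pesq
from pystoi import stoi   # pip install pystoi

ref, fs = sf.read("natural_reference.wav")   # natural speech, 16 kHz mono (assumed)
syn, _ = sf.read("ssi_synthesized.wav")      # synthesized output at the same rate
n = min(len(ref), len(syn))
ref, syn = ref[:n], syn[:n]

print("PESQ:", pesq(fs, ref, syn, "wb"))           # wide-band PESQ
print("STOI:", stoi(ref, syn, fs, extended=False))
```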

11 pages, 628 KiB  
Article
Influence of TTS Systems Performance on Reaction Times in People with Aphasia
by Giorgia Cistola, Alex Peiró-Lilja, Guillermo Cámbara, Ineke van der Meulen and Mireia Farrús
Appl. Sci. 2021, 11(23), 11320; https://doi.org/10.3390/app112311320 - 29 Nov 2021
Cited by 1 | Viewed by 1507
Abstract
Text-to-speech (TTS) systems provide fundamental reading support for people with aphasia and reading difficulties. However, artificial voices are more difficult to process than natural voices. The current study is an extended analysis of the results of a clinical experiment investigating which, among three artificial voices and a digitised human voice, is more suitable for people with aphasia and reading impairments. Such results show that the voice synthesised with Ogmios TTS, a concatenative speech synthesis system, caused significantly slower reaction times than the other three voices used in the experiment. The present study explores whether and which voice quality metrics are linked to delayed reaction times. For this purpose, the voices were analysed using an automatic assessment of intelligibility, naturalness, and jitter and shimmer voice quality parameters. This analysis revealed that Ogmios TTS, in general, performed worse than the other voices in all parameters. These observations could explain the significantly delayed reaction times in people with aphasia and reading impairments when listening to Ogmios TTS, and could inform which TTS to choose for compensatory devices for these patients based on the analysis of these voice parameters. Full article
(This article belongs to the Special Issue Applications of Speech and Language Technologies in Healthcare)
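Jitter and shimmer, two of the voice quality parameters analysed in this study, can be measured with Praat through the parselmouth package; a minimal sketch follows, with placeholder file names and default Praat analysis settings rather than the study's exact configuration.

```python
# Minimal jitter/shimmer measurement per voice (assumed filenames).
import parselmouth                      # pip install praat-parselmouth
from parselmouth.praat import call

def jitter_shimmer(wav_path, f0min=75, f0max=500):
    sound = parselmouth.Sound(wav_path)
    points = call(sound, "To PointProcess (periodic, cc)", f0min, f0max)
    jitter = call(points, "Get jitter (local)", 0, 0, 0.0001, 0.02, 1.3)
    shimmer = call([sound, points], "Get shimmer (local)", 0, 0, 0.0001, 0.02, 1.3, 1.6)
    return jitter, shimmer

for voice in ["ogmios_tts.wav", "tts_2.wav", "tts_3.wav", "digitised_human.wav"]:
    print(voice, jitter_shimmer(voice))
```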

14 pages, 614 KiB  
Article
Improving Aphasic Speech Recognition by Using Novel Semi-Supervised Learning Methods on AphasiaBank for English and Spanish
by Iván G. Torre, Mónica Romero and Aitor Álvarez
Appl. Sci. 2021, 11(19), 8872; https://doi.org/10.3390/app11198872 - 24 Sep 2021
Cited by 15 | Viewed by 3150
Abstract
Automatic speech recognition in patients with aphasia is a challenging task for which studies have been published in only a few languages. Reasonably, the systems reported in the literature within this field show significantly lower performance than those focused on transcribing non-pathological clean speech. This is mainly due to the difficulty of recognizing less intelligible speech, as well as to the scarcity of annotated aphasic data. This work is mainly focused on applying novel semi-supervised learning methods to the AphasiaBank dataset in order to deal with these two major issues, reporting improvements for the English language and providing the first benchmark for the Spanish language, for which less than one hour of transcribed aphasic speech was used for training. In addition, the influence of reinforcing the training and decoding processes with out-of-domain acoustic and text data is described by using different strategies and configurations to fine-tune the hyperparameters and the final recognition systems. The interesting results obtained encourage extending this technological approach to other languages and scenarios where the scarcity of annotated data to train recognition models is a challenging reality. Full article
(This article belongs to the Special Issue Applications of Speech and Language Technologies in Healthcare)
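One common family of semi-supervised methods in this setting is pseudo-labelling: a seed recognizer transcribes the untranscribed speech and only confident hypotheses are added to the training set. The sketch below is a conceptual outline only; seed_asr.transcribe is a hypothetical interface, not an API from the paper.

```python
# Conceptual pseudo-labelling loop (hypothetical ASR interface).
def pseudo_label(seed_asr, unlabeled_utterances, min_confidence=0.9):
    """Transcribe untranscribed aphasic speech; keep only confident hypotheses."""
    selected = []
    for audio in unlabeled_utterances:
        text, confidence = seed_asr.transcribe(audio)   # hypothetical call
        if confidence >= min_confidence:
            selected.append((audio, text))              # hypothesis used as a label
    return selected

# The selected pairs are mixed with the scarce manually transcribed data and the
# acoustic model is retrained; the cycle can be repeated with the improved model.
```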

16 pages, 15172 KiB  
Article
Which Utterance Types Are Most Suitable to Detect Hypernasality Automatically?
by Ignacio Moreno-Torres, Andrés Lozano, Enrique Nava and Rosa Bermúdez-de-Alvear
Appl. Sci. 2021, 11(19), 8809; https://doi.org/10.3390/app11198809 - 22 Sep 2021
Cited by 2 | Viewed by 1676
Abstract
Automatic tools to detect hypernasality have been traditionally designed to analyze sustained vowels exclusively. This is in sharp contrast with clinical recommendations, which consider it necessary to use a variety of utterance types (e.g., repeated syllables, sustained sounds, sentences, etc.). This study explores the feasibility of detecting hypernasality automatically based on speech samples other than sustained vowels. The participants were 39 patients and 39 healthy controls. Six types of utterances were used: counting from 1 to 10, repetition of syllable sequences, sustained consonants, a sustained vowel, words, and sentences. The recordings were obtained, with the help of a mobile app, from Spain, Chile and Ecuador. Multiple acoustic features were computed from each utterance (e.g., MFCCs, formant frequencies). After a selection process, the best 20 features served to train different classification algorithms. Accuracy was the highest with syllable sequences and also with some words and sentences. Accuracy increased slightly by training the classifiers with between two and three utterances. However, the best results were obtained by combining the results of multiple classifiers. We conclude that protocols for automatic evaluation of hypernasality should include a variety of utterance types. It seems feasible to detect hypernasality automatically with mobile devices. Full article
(This article belongs to the Special Issue Applications of Speech and Language Technologies in Healthcare)
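As a rough sketch of the per-utterance feature-plus-classifier pipeline described here, the code below summarises each recording with MFCC statistics and fits a simple classifier; the file names, labels and feature set are placeholders, not the study's 20 selected features.

```python
# Placeholder sketch: MFCC summary features for a hypernasality classifier.
import numpy as np
import librosa
from sklearn.ensemble import RandomForestClassifier

def utterance_features(wav_path):
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)              # (13, n_frames)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])    # 26-dim summary

wav_files = ["patient01_syllables.wav", "control01_syllables.wav"]  # placeholder corpus
labels = [1, 0]                                                     # 1 = hypernasal

X = np.stack([utterance_features(f) for f in wav_files])
clf = RandomForestClassifier().fit(X, labels)   # in practice: many utterances + cross-validation
```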

11 pages, 288 KiB  
Communication
Speech-Based Support System to Supervise Chronic Obstructive Pulmonary Disease Patient Status
by Mireia Farrús, Joan Codina-Filbà, Elisenda Reixach, Erik Andrés, Mireia Sans, Noemí Garcia and Josep Vilaseca
Appl. Sci. 2021, 11(17), 7999; https://doi.org/10.3390/app11177999 - 29 Aug 2021
Cited by 6 | Viewed by 2602
Abstract
Patients with chronic obstructive pulmonary disease (COPD) suffer from voice changes with respect to the healthy population. However, two issues remain to be studied: how long-term speech elements such as prosody are affected; and whether physical effort and medication also affect the speech of patients with COPD, and if so, how an automatic speech-based detection system of COPD measurements can be influenced by these changes. The aim of the current study is to address both issues. To this end, long read speech from COPD and control groups was recorded, and the following experiments were performed: (a) a statistical analysis over the study and control groups to analyse the effects of physical effort and medication on speech; and (b) an automatic classification experiment to analyse how different recording conditions can affect the performance of a COPD detection system. The results obtained show that speech—especially prosodic features—is affected by physical effort and inhaled medication in both groups, though in opposite ways; and that the recording condition has a relevant role when designing an automatic COPD detection system. The current work takes a step forward in the understanding of speech in patients with COPD, and in turn, in the research on its automatic detection to help professionals supervising patient status. Full article
(This article belongs to the Special Issue Applications of Speech and Language Technologies in Healthcare)
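A paired statistical comparison of a prosodic measure before and after physical effort, in the spirit of the analysis described above, could look like the sketch below; the F0 feature, file names and sample size are assumptions.

```python
# Hedged sketch: mean F0 at rest vs. after physical effort, paired non-parametric test.
import numpy as np
import librosa
from scipy.stats import wilcoxon

def mean_f0(wav_path):
    y, sr = librosa.load(wav_path, sr=16000)
    f0, voiced_flag, voiced_prob = librosa.pyin(y, fmin=60, fmax=400, sr=sr)
    return np.nanmean(f0)                       # average F0 over voiced frames

rest = np.array([mean_f0(f"copd_{i:02d}_rest.wav") for i in range(20)])      # placeholders
effort = np.array([mean_f0(f"copd_{i:02d}_effort.wav") for i in range(20)])
stat, p = wilcoxon(rest, effort)
print(f"median F0 change: {np.median(effort - rest):.1f} Hz, p = {p:.3f}")
```
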
14 pages, 675 KiB  
Article
Enrichment of Oesophageal Speech: Voice Conversion with Duration–Matched Synthetic Speech as Target
by Sneha Raman, Xabier Sarasola, Eva Navas and Inma Hernaez
Appl. Sci. 2021, 11(13), 5940; https://doi.org/10.3390/app11135940 - 26 Jun 2021
Cited by 5 | Viewed by 1687
Abstract
Pathological speech such as Oesophageal Speech (OS) is difficult to understand due to the presence of undesired artefacts and lack of normal healthy speech characteristics. Modern speech technologies and machine learning enable us to transform pathological speech to improve intelligibility and quality. We have used a neural network based voice conversion method with the aim of improving the intelligibility and reducing the listening effort (LE) of four OS speakers of varying speaking proficiency. The novelty of this method is the use of synthetic speech matched in duration with the source OS as the target, instead of parallel aligned healthy speech. We evaluated the converted samples from this system using a collection of Automatic Speech Recognition systems (ASR), an objective intelligibility metric (STOI) and a subjective test. ASR evaluation shows that the proposed system had significantly better word recognition accuracy compared to unprocessed OS, and baseline systems which used aligned healthy speech as the target. There was an improvement of at least 15% on STOI scores indicating a higher intelligibility for the proposed system compared to unprocessed OS, and a higher target similarity in the proposed system compared to baseline systems. The subjective test reveals a significant preference for the proposed system compared to unprocessed OS for all OS speakers, except one who was the least proficient OS speaker in the data set. Full article
(This article belongs to the Special Issue Applications of Speech and Language Technologies in Healthcare)
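The ASR-based intelligibility evaluation mentioned in this abstract boils down to scoring recognizer transcripts against the reference prompts; a minimal word error rate comparison with the jiwer package is sketched below, with made-up transcripts.

```python
# Minimal WER comparison for unprocessed vs. converted oesophageal speech (made-up transcripts).
import jiwer   # pip install jiwer

prompts = ["the quick brown fox", "she sells sea shells"]
asr_unprocessed = ["the quit down fox", "he sell sea shell"]      # ASR output on raw OS
asr_converted = ["the quick brown fox", "she sells sea shell"]    # ASR output on converted speech

print("WER, unprocessed OS:", jiwer.wer(prompts, asr_unprocessed))
print("WER, converted     :", jiwer.wer(prompts, asr_converted))
```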

Review

Jump to: Editorial, Research

17 pages, 469 KiB  
Review
A Situational Analysis of Current Speech-Synthesis Systems for Child Voices: A Scoping Review of Qualitative and Quantitative Evidence
by Camryn Terblanche, Michal Harty, Michelle Pascoe and Benjamin V. Tucker
Appl. Sci. 2022, 12(11), 5623; https://doi.org/10.3390/app12115623 - 01 Jun 2022
Cited by 3 | Viewed by 2265
Abstract
(1) Background: Speech synthesis has customarily focused on adult speech, but with the rapid development of speech-synthesis technology, it is now possible to create child voices with a limited amount of child-speech data. This scoping review summarises the evidence base related to developing synthesised speech for children. (2) Method: The included studies were those that were (1) published between 2006 and 2021 and (2) included child participants or voices of children aged 2–16 years. (3) Results: 58 studies were identified. They were discussed based on the languages used, the speech-synthesis systems and/or methods used, the speech data used, the intelligibility of the speech and the ages of the voices. Based on the reviewed studies, developing child-speech synthesis is notably more challenging than adult-speech synthesis. Child speech often presents with acoustic variability and articulatory errors. To account for this, researchers have most often attempted to adapt adult-speech models, using a variety of different adaptation techniques. (4) Conclusions: Adapting adult speech has proven successful in child-speech synthesis. It appears that the resulting quality can be improved by training on a large amount of pre-selected speech data, aided by a neural-network classifier, to better match the children’s speech. We encourage future research surrounding individualised synthetic speech for children with CCN, with special attention to children who make use of low-resource languages. Full article
(This article belongs to the Special Issue Applications of Speech and Language Technologies in Healthcare)
