Dysarthria is a motor speech disorder caused by a neurological injury [1] affecting the brain areas responsible for speech. This damage manifests itself in different ways in different subjects, leading to several types of dysarthria. Speech disorders can affect the production, rhythm, pitch, rate, loudness, quality, and duration of speech. Dysarthric people comprise individuals with a primary speech disorder and those who experience a speech disorder as a result of a disease such as amyotrophic lateral sclerosis (ALS) or Parkinson's Disease (PD). Dysarthria reduces speech intelligibility, thus affecting social interaction and quality of life.
Despite the fact that Automatic Speech Recognition (ASR) systems have recorded outstanding performance improvements over the last several years, current commercial technology cannot offer a good solution to help people affected by speech disabilities [2]. This is a crucial point because people affected by speech impairments usually also have other kinds of disabilities. For instance, people affected by Parkinson's often also have mobility impairments, so a Virtual Assistant (VA) with a hands-free interface that understands their vocal commands can make a real difference in their daily lives.
For this reason, the most important challenge is to build an ASR system able to understand dysarthric utterances. The main implementations of such systems are based on the Hidden Markov Model (HMM) combined with the Gaussian Mixture Model (GMM) [3] or on a Deep Neural Network (DNN) [4].
Nowadays, ASR systems and hands-free interfaces are not designed to process commands from a person with dysarthria, because they are mainly trained on unimpaired speech. Indeed, the pronunciation of speakers with dysarthria deviates from that of non-disabled speakers in many respects. Thus, state-of-the-art ASR approaches show very poor performance when applied to dysarthric speech [5].
In order to design an ASR system optimised for dysarthric speech, a large amount of dysarthric data is needed. In recent years, the interest of the scientific community in this topic has grown. As a result, several dysarthric speech databases have been created, especially in English [6]. Moreover, preliminary studies on ASR systems for dysarthric speech have been presented. An early study on acoustic and lexical model adaptation was performed in [9]. This study showed an average Word Error Rate (WER) reduction of 36.99% for an ASR system trained over a large-vocabulary dysarthric speech database.
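Since all the results below are reported in terms of WER, it may help to recall how it is computed: the hypothesis is aligned to the reference transcription at the word level, and the number of substitutions, deletions, and insertions is divided by the reference length. A minimal sketch of this standard definition (not the specific scoring tool used in the cited studies) is:

```python
# Minimal word-level WER via Levenshtein alignment:
# WER = (substitutions + deletions + insertions) / reference word count.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("accendi la luce", "accendi luce"))  # one deletion out of 3 words
```

Note that WER can exceed 100% when the hypothesis contains many insertions, which is why very high baseline values (such as the 80.67% reported later) still carry meaningful information.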
Another interesting approach comes from the idea of tuning the GMM-HMM parameters of ASR systems developed for unimpaired speech. In [10], the TORGO database [6] was exploited to perform a dysarthric speech recognition task. An acoustic model using GMM-HMMs and DNN-HMMs was adopted, focusing on the tuning of speaker-specific parameters. A relative WER reduction of 17.62% with respect to a baseline system trained on a more complex model was reported. A comparative study among different architectures was performed in [11]. The results show that hybrid DNN-HMM models outperform classical GMM-HMM ones in terms of WER. The database used was again the TORGO database [6], and a 13% WER improvement was achieved with respect to the classical architectures.
ASR systems are based on the estimation of speech features using a moving-window approach. Specifically, for each utterance, a feature matrix is created. This matrix comprises different speech features, such as Mel-Frequency Cepstral Coefficients (MFCC) or Perceptual Linear Prediction (PLP) coefficients [12], that are estimated over different time intervals, i.e. time windows, spanning the word or sentence to be recognised.
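The moving-window procedure above can be sketched as follows: the window and shift sizes in milliseconds are converted to samples, the signal is sliced into overlapping frames, and one feature vector is computed per frame. In this illustrative sketch the per-frame log-energy stands in for the MFCC/PLP coefficients, whose full computation is beyond the scope of the example:

```python
import numpy as np

# Slice a signal into overlapping frames given window/shift sizes in ms.
# Log-energy per frame stands in here for MFCC/PLP feature vectors.
def frame_signal(x, sr, win_ms=25.0, shift_ms=10.0):
    win = int(round(sr * win_ms / 1000.0))    # window size in samples
    hop = int(round(sr * shift_ms / 1000.0))  # shift size in samples
    n_frames = 1 + max(0, (len(x) - win) // hop)
    return np.stack([x[i * hop:i * hop + win] for i in range(n_frames)])

sr = 16000
x = np.random.randn(sr)            # 1 s of dummy audio
frames = frame_signal(x, sr)       # 25 ms window -> 400 samples per frame
log_energy = np.log(np.sum(frames ** 2, axis=1) + 1e-12)  # one value per frame
print(frames.shape)                # (98, 400)
```

With the standard 25 ms window and 10 ms shift, one second of 16 kHz audio yields 98 frames of 400 samples each; changing the window or shift size changes both the frame length and the number of feature vectors per utterance.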
Most ASR systems use a fixed frameshift and window size in speech processing. This is based on the assumption that the non-stationary speech signal can be approximated by a piecewise quasi-stationary process. The common choice of a 25 ms window and a 10 ms shift is a compromise between data rate and resolution, determined empirically to give reasonable performance on average. In ASR systems for dysarthric speech, these values might not represent the optimal choice. All the mentioned studies on ASR systems for dysarthric speech [9] share the same values for these parameters. Specifically, they adopt a value of 15 ms as the time step in the moving-window procedure instead of the 10 ms of the standard approach, while the window size is 25 ms, the value usually adopted in ASR for unimpaired speech.
Starting from the above-mentioned findings, in [13] window and shift sizes were optimised at the single-subject level, thus adopting a speaker-dependent (SD) approach. Specifically, the goal was to estimate subject-specific values of these parameters able to minimise the WER. The ASR system was developed using the Kaldi toolkit [14] and was trained over an Italian dysarthric database composed of 5 speakers. The results showed that there exists an Optimal Region (OR) in the window and shift parameter space where the ASR performance is optimised with respect to the standard values. The observed WER improvements ranged from 31% to 81% depending on the speaker.
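Conceptually, locating the OR amounts to a grid search over candidate (window, shift) pairs, training and decoding the SD system at each point and keeping the region that minimises WER. The sketch below illustrates only the search logic; `evaluate_wer` is a hypothetical stand-in for a full Kaldi train/decode cycle, which is far too heavy to inline here, and the candidate grids are illustrative, not the ones used in [13]:

```python
import itertools

# Grid search over (window, shift) candidates; the point (or region)
# with the lowest WER approximates the speaker's Optimal Region.
def evaluate_wer(window_ms, shift_ms):
    # Placeholder: a smooth bowl whose minimum sits away from the 25/10
    # baseline, mimicking a speaker whose optimum deviates from the standard.
    return 0.30 + 0.001 * (window_ms - 35) ** 2 + 0.004 * (shift_ms - 6) ** 2

windows = range(15, 51, 5)   # candidate window sizes in ms
shifts = range(2, 16, 2)     # candidate shift sizes in ms
grid = {(w, s): evaluate_wer(w, s) for w, s in itertools.product(windows, shifts)}
best = min(grid, key=grid.get)
print(best, round(grid[best], 3))   # (35, 6) 0.3
```

In practice each grid point requires a complete feature-extraction and training pass, which is why predicting the OR from a few voice features, as explored later in this work, would be valuable.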
Although promising, these results were obtained using a small database of Italian speakers with dysarthria, containing only five speakers (three males and two females). Furthermore, the performance of the developed ASR system was not compared against unimpaired speech samples.
This work represents an extension of the paper presented in the proceedings of the 2020 International Conference on Applications in Electronics Pervading Industry, Environment and Society. Here, we aim to significantly extend the work in [13] by using a larger Italian dysarthric database as well as an Italian database of unimpaired speech. The main purpose of this work is to confirm the preliminary findings described in [13] and to further explore the results by examining the possible role of subject-specific speech features in determining the time-window parameters. To achieve this goal, a relationship between the speaker's OR and speech features will be introduced.
The findings of this work could be useful to fine-tune the ASR system according to each speaker's characteristics, thus optimising ASR performance without recording many hours of the speaker's voice.
We describe the materials and methods in Section 2 before discussing the experimental setup in Section 3. In Section 4 we report the experimental results, and in Section 5 we discuss them. Finally, conclusions are drawn in Section 6.
From a technological point of view, in this work we exploit a GMM-HMM architecture for the ASR design. As stated in the introduction, several works highlight the advantages of the DNN-HMM architecture over the GMM-HMM one. We have to stress that this is the case when a large amount of data is available to train the DNN model. With small training sample sizes, the GMM model has been shown to perform better [16].
Table 2 reports the results of the second experiment over the IDEA database. As said in Section 4.1
, the first eight subjects did not show any improvement when using the OR. Despite this, their WER values are already quite low, so we can assume that the performances are comparable. For speakers 401, 311, 206, and 323 the WERs assume similar values; therefore, for these speakers, using the OR instead of the baseline is not relevant. For the other speakers, there is an improvement in ASR performance when using the OR parameters. These improvements become much more significant for the speakers in whom the baseline performs worse. For instance, for speaker 314 the baseline WER is 80.67%, which means the ASR system is unusable. Using the OR instead, the WER drops to 43.15%, making the system much more usable by the dysarthric speaker. The same reasoning applies to speakers 405, 402, 307, 321, and 306.
From Table 3
, as in the case of the speakers with dysarthria, we can infer that the window size is not crucial for ASR performance, since the WER values do not show relevant differences with respect to the state-of-the-art/baseline values. This is also reflected in the large standard deviation of the window size parameter, indicating that this parameter can be chosen freely within a large interval without affecting optimal performance. This can also be inferred from Figure 4
. For the shift size parameter, the mean value tends to be similar to the baseline one and the standard deviation is very small. This result confirms that the baseline shift size is a good choice for unimpaired speech. In terms of WER, using OR parameters instead of the baseline for unimpaired speech is therefore not worthwhile.
From the second experiment, regarding window size, the correlation analysis showed high correlation coefficients with some speech features related to spectral content. Specifically, the window size increases at increasing levels of SCG, spectrum standard deviation, and BED. An increasing BED value means that the energy of the low frequencies (from 0 Hz to 500 Hz) becomes more prominent than that of the high frequencies (above 500 Hz). On the other hand, the SCG value varies between 213.37 Hz and 757.29 Hz with an average of 460.99 Hz. So it is reasonable to say that the high correlation of BED and SCG with window size might indicate the great importance of the voiced signal, typical of frequencies around 500 Hz, for the window size parameter.
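The two spectral features discussed above can be sketched under assumed standard definitions: SCG as the power-weighted mean frequency of the spectrum, and BED as the ratio of spectral energy below 500 Hz to the energy above it. The exact formulas used in the study may differ, so this is an illustration, not a reimplementation:

```python
import numpy as np

# Assumed definitions: SCG = power-weighted mean frequency;
# BED = (energy below cutoff) / (energy above cutoff).
def scg_and_bed(x, sr, cutoff_hz=500.0):
    spectrum = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
    scg = np.sum(freqs * spectrum) / np.sum(spectrum)
    low = spectrum[freqs <= cutoff_hz].sum()
    high = spectrum[freqs > cutoff_hz].sum()
    return scg, low / (high + 1e-12)

sr = 16000
t = np.arange(sr) / sr
# Two tones: a strong 300 Hz component and a weaker 1200 Hz component.
x = np.sin(2 * np.pi * 300 * t) + 0.5 * np.sin(2 * np.pi * 1200 * t)
scg, bed = scg_and_bed(x, sr)
print(round(scg), round(bed, 1))   # SCG at 480 Hz, between the two tones; BED = 4.0
```

The toy signal shows the intuition: energy concentrated in the voiced low-frequency band pulls SCG down towards the values reported above and pushes BED above 1.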
Regarding window shift, more significant correlations were highlighted. In particular, jitter-related measures seem to be the best candidate features to predict the window shift: at increasing levels of jitter, a decreasing window shift is required. Jitter is related to cycle-to-cycle variations of the fundamental period, thus representing the deviation from periodicity of the speech signal, so it is interesting that the window shift is negatively correlated with such a measure. We have to point out that this measure is estimated from words, so it is not straightforward to relate it to its original meaning, which is more correctly obtained from sustained vowels. However, even when applied to words, this measure still provides interesting information about voice quality. In this scenario, a higher level of perturbation of the fundamental frequency implies a shorter time shift and thus finer temporal information. The obtained results also show a significant correlation with the spectral information contained in the LTAS. Specifically, an increase in speech power between 20 Hz and 5000 Hz is correlated with an increased window time shift. This might be related to the fact that an increase in speech power reflects better speech production quality, from a sound intensity point of view; in this case, coarser temporal information is sufficient. This seems to be confirmed by the correlation values obtained for the RMS energy and Mean Intensity features, which show significant or close-to-significant p-values.
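The analysis described above boils down to computing, across speakers, the Pearson correlation between a per-speaker voice feature and the speaker's optimal shift size. A minimal sketch follows, using synthetic data for illustration only (the jitter values and the negative trend are assumptions, not values from the IDEA database):

```python
import numpy as np

# Pearson correlation between a per-speaker feature (here: a hypothetical
# jitter value) and the speaker's optimal shift size. Data are synthetic,
# built with a negative trend to mirror the relationship reported above.
rng = np.random.default_rng(0)
jitter = rng.uniform(0.5, 3.0, size=30)                    # hypothetical jitter (%)
optimal_shift = 14 - 3 * jitter + rng.normal(0, 0.5, 30)   # ms, trend + noise
r = np.corrcoef(jitter, optimal_shift)[0, 1]
print(round(r, 2))   # strongly negative, as in the reported jitter/shift trend
```

In the actual study a significance test on each coefficient (a p-value) accompanies the correlation, which is why features are described as significant or close-to-significant rather than by the coefficient alone.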
In this work, we conducted two experiments regarding speech analysis of Italian speakers with and without dysarthria.
The first experiment extends the work done in [13] by adding more data. Specifically, we used the voices of 30 speakers with dysarthria taken from the IDEA database [15], and the voices of 10 unimpaired speakers taken from the CLIPS database.
The aim of the first experiment was to validate the existence of an Optimal Region in the window and shift parameter space, where the performance (measured in WER) of a speaker-dependent ASR system is optimised for a specific speaker. Specifically, we were interested in analysing how much the OR deviates from the baseline values for speakers with dysarthria. The results of the first experiment confirm the findings of [13]: an OR exists for each speaker. In general, the WER is sensitive to the shift size, while the window size is less important for optimising ASR performance. For some speakers, especially those with a high WER, using the OR parameters instead of the baseline ones can improve performance by up to 58%. The ORs of unimpaired speakers match the baseline shift value, while for speakers with dysarthria the shift value may differ, even considerably, from the baseline.
The aim of the second experiment was to find out whether there is a correlation between some of the speaker's voice features and their OR. This could be useful to locate the OR starting from a few speaker recordings, avoiding the need to record a large number of voice samples. We selected 24 voice features, which were analysed in terms of their correlation coefficients with the OR mean values (window and shift). Spectral voice features showed a high correlation with the window size parameter, in particular those that emphasise the typical frequency components of the voiced signal. On the other hand, jitter measures and LTAS information seem to be very significant for estimating the best speaker-specific window shift.