Dysarthria Speech Detection Using Convolutional Neural Networks with Gated Recurrent Unit

In recent years, due to population growth and aging, the prevalence of neurological diseases has been increasing year by year. Dysarthria often appears among patients with Parkinson’s disease, stroke, cerebral palsy, and other neurological conditions. If these dysarthria patients are not quickly detected and treated, it is easy to cause difficulties in disease course management. When the symptoms worsen, they can also affect the patient’s psychology and physiology. Most past studies on dysarthria detection used machine learning or deep learning models as classification models. This study proposes an integrated CNN-GRU model combining convolutional neural networks and gated recurrent units to detect dysarthria. The experimental results show that the proposed CNN-GRU model achieves the highest accuracy of 98.38%, which is superior to the models of other studies.


Introduction
Speech is an essential medium of communication between people. Once this medium is abnormal, the difficulty of communication increases. Many people with neurological diseases have this condition, which is called dysarthria. Dysarthria is mainly a symptom caused by neuromuscular control disorders that affect breathing, vocalization, resonance, articulation, and prosody [1]. Due to damage to the central and peripheral nervous systems, as caused by stroke, Parkinson's disease, brain trauma, brain tumors, cerebral palsy, amyotrophic lateral sclerosis, multiple sclerosis, muscular dystrophy, and other neurological diseases, sounds can be too loud or too soft, and the voice can appear hoarse and lack tonal changes. Hence, dysarthria patients are more likely to exhibit abnormal speech characteristics [2].
Dysarthria may lead to social difficulties, a sense of isolation, and even world-weariness, depression, and other psychological problems [3]. Therefore, without timely intervention and early rehabilitation training, disease course management easily becomes difficult, and the disease will continue to worsen. Doctors can diagnose dysarthria subjectively, but this is generally considered an expensive, laborious, and time-consuming test [4]. Therefore, an objective and immediate automatic detection method is extremely important.
Deep learning has recently become popular and widely used in medicine. In order to diagnose patients with dysarthria objectively and accurately, more and more researchers are using deep learning to develop automatic detection of dysarthria. Many researchers use words for speech detection and apply different feature extraction methods to the speech signals. For example, Vashkevich et al. [5] use pitch period entropy (PPE) based on acoustic features. Muhammad et al. [6] use glottal-to-noise excitation (GNE) and formant frequency, or use the spectrum and cepstrum, for feature extraction. Other examples are mel-frequency cepstral coefficients (MFCC) [7] and perceptual linear predictive coefficients (PLP) [8]. Deep learning methods are then used to detect dysarthria, such as convolutional neural networks (CNN), CNN-LSTM (long short-term memory), and other models [9,10].
In previous studies on the detection of dysarthria using the UA-Speech database, Narendra [10] selected the CNN-LSTM hybrid model as the classification model, but the accuracy of this model was only 77.57%. In order to improve the accuracy of the dysarthria detection model, this study took speech signals recorded from dysarthria patients and healthy people, applied a short-time Fourier transform (STFT) to convert the signals into spectrograms, and then used mel-frequency cepstral coefficients (MFCC) to select the features. Finally, the accuracy in detecting dysarthria of the proposed CNN-GRU (gated recurrent unit) deep learning model was compared with that of three other models (CNN, LSTM, and CNN-LSTM).

Data Collection
Schlauch et al. [11] pointed out in their study that words spoken by patients with dysarthria can be judged with high recognition and low error rates. Therefore, this study chose words as the input audio samples for the subsequent experiments. Our dataset was collected from the UA-Speech database [12] (http://www.isle.illinois.edu/sst/data/UASpeech/, accessed on 18 February 2022). This database mainly contains the voice recordings of 15 dysarthria patients (4 women and 11 men) and 13 healthy subjects (4 women and 9 men), all recorded by microphone and processed by noise removal. The subjects ranged in age from 18 to 58. A total of 455 words were recorded for each subject in the database, consisting of the numbers 1 to 10, the 26 letters, 19 computer command words, the 100 most common words from the Brown Corpus, and 300 words selected from Project Gutenberg novels.

Method
The method proposed in this study consists of three stages, as shown in Figure 1. In the first stage, the original speech signal is transformed from the time domain to the frequency domain by a short-time Fourier transform. Second, the frequency domain data are extracted by mel-frequency cepstral coefficients. In the third stage, the features extracted from the mel spectrogram are used to detect and classify dysarthria patients and healthy people using the CNN-GRU model used in this study. In order to verify the excellence of the CNN-GRU deep learning model, this study also used the CNN model, LSTM model, and CNN-LSTM model to detect dysarthria and compare their results.

Data Preprocessing
Waveform images can reveal differences in amplitude between the audio of patients with dysarthria and that of healthy people, because people with dysarthria pronounce words more slowly and with a less steady pitch than healthy people. In general, the waveforms of the dysarthria patient (ID: dysarthria01) in Figure 2a are more irregular than those of the healthy subject (ID: healthy01) in Figure 2b. However, audio waveforms can only show the relationship between amplitude and time. This study therefore used the Python Librosa package to perform a short-time Fourier transform (STFT) of the audio. The short-time Fourier spectrograms of a dysarthria patient and a healthy subject are shown in Figure 3. From the spectrograms in Figure 3, it can be observed that the spectrum of subject dysarthria01 (Figure 3a) has more irregular frequencies and sudden higher decibels than the spectrum of subject healthy01 (Figure 3b).

The STFT was used to transform the speech signals from the time domain to the frequency domain. The frame length of the speech in this study was between 10 and 30 ms, the sampling frequency was set to 8 kHz, and the window length was set to 128 to improve the resolution. The transformation of the voice signal x(t) to the frequency domain by the STFT is shown in Equation (1).

Feature Selection
Mel-frequency cepstral coefficients (MFCC) are widely used in speech recognition. Mel is the scale of tone frequencies picked up by the human ear. The relationship between the mel spectrum (M) and frequency (Hz) is shown in Equations (2) and (3).
The power spectrum, P(k), can be obtained from Equation (4).
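The mel mapping of Equations (2) and (3) is conventionally the logarithmic formula below; this sketch assumes the standard 2595/700 form, which may differ in constants from the paper's exact equations.

```python
import numpy as np

def hz_to_mel(f):
    # Standard mel mapping (assumed form of Equation (2)):
    # M = 2595 * log10(1 + f / 700)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    # Inverse mapping (assumed form of Equation (3))
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
```

By construction, 1000 Hz maps to approximately 1000 mel, and the two functions invert each other.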

The power spectrum, P(k), is passed through a series of mel-scale triangular filter windows to obtain the mel spectrum. The frequency response, H_m(k), of the triangular filter is calculated as shown in Equation (5).
f(m) is the central frequency of the mel triangle filter. The logarithmic energy spectrum of each frame is S(m), which is obtained using a logarithmic process, as shown in Equation (6).
P(k) is the power spectrum, H_m(k) is the filter window, and M is the number of filter windows.
This study used the Librosa package in Python to extract the mel-frequency cepstral coefficient features. Figure 4a shows the voice signal. After extracting the speech signal samples and features, the resulting mel spectrum is shown in Figure 4b.

CNN Model
The CNN model can be used to detect the critical features of the audio in the audio message [13]. The output and input architecture of the CNN model is shown in Figure 5, and the core of the CNN model is explained as follows. The CNN model uses the convolution layer to retain the original feature arrangement of the image and obtain some essential features from it. Then, the max pooling layer is used to select the stronger feature values from the essential features and discard the weak ones. Between the convolution layer and the max pooling layer, this study adopted a rectified linear unit (ReLU) to remove feature values less than 0 and speed up model training. Then, the feature values are converted into one-dimensional data through the flatten layer to facilitate the subsequent use of the fully connected layer. Finally, the softmax activation function is connected to the classification output. Table 1 shows the parameter settings of the CNN model in this study.
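The layer sequence described above (convolution, ReLU, max pooling, flatten, dense, softmax) can be sketched in Keras as follows; the input size (40 mel bands × 128 frames) and the layer widths are illustrative assumptions, not the exact settings of Table 1.

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(40, 128, 1)),               # mel-spectrogram "image"
    layers.Conv2D(32, (3, 3), activation="relu"),   # convolution + ReLU
    layers.MaxPooling2D((2, 2)),                    # keep the stronger features
    layers.Flatten(),                               # to 1-D for the dense layer
    layers.Dense(64, activation="relu"),            # fully connected layer
    layers.Dense(2, activation="softmax"),          # dysarthria vs. healthy
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```

The two-way softmax output gives class probabilities for the dysarthria and healthy classes.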

LSTM Model
Speech is a typical temporal signal, and the LSTM (long short-term memory) model has a solid ability to model temporal sequences [14]. The output and input architecture of the LSTM model is shown in Figure 6. In this study, a four-layer LSTM was used as the input, and a dropout layer was added after each LSTM layer to prevent over-fitting of the model during training. A dense layer was used for dimensional transformation, and softmax was used for the classification output. Table 2 shows the parameter settings of the LSTM model in this study.

CNN-LSTM Model
The CNN-LSTM model combines the two preceding architectures [15]. It uses a CNN convolution layer to retain the original feature arrangement of an image and obtain some essential features from it. Then, the max pooling layer is used to select the stronger feature values from the essential features and discard the weak ones. Between the convolution layer and the max pooling layer, a rectified linear unit (ReLU) is applied to remove feature values less than 0 and speed up model training. Then, the LSTM is connected to capture the temporal dynamics of the sequence, and the flatten layer is connected to convert the feature values into one-dimensional data. Finally, the softmax activation function is connected for the classification output. The output and input architecture of the CNN-LSTM model is shown in Figure 7. Table 3 shows the parameter settings of the CNN-LSTM model in this study.
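A hybrid of this kind can be sketched in Keras as below; the input size, the reshape that turns feature-map rows into time steps, and the layer widths are illustrative assumptions, not the exact settings of Table 3.

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(40, 128, 1)),               # mel-spectrogram "image"
    layers.Conv2D(32, (3, 3), activation="relu"),   # convolution + ReLU -> (38, 126, 32)
    layers.MaxPooling2D((2, 2)),                    # -> (19, 63, 32)
    layers.Reshape((19, 63 * 32)),                  # rows become time steps
    layers.LSTM(64),                                # capture temporal dynamics
    layers.Dropout(0.3),                            # guard against over-fitting
    layers.Dense(2, activation="softmax"),          # classification output
])
```

The reshape is one common way to hand CNN feature maps to a recurrent layer; other works use `TimeDistributed` wrappers instead.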

CNN-GRU Model
CNN combined with GRU has been used as a classifier in studies of speech enhancement [16] and Android botnet detection [17]. The GRU architecture requires fewer parameters and is simpler than the LSTM architecture [18]. Therefore, it is natural to use a GRU to optimize the CNN model. However, the combination of CNN and GRU is not always the same. This study combined the approaches of Hasannezhad et al. [16] and Yerima et al. [17] into a different CNN-GRU model and implemented it with Python's Keras package for the experiments. The output and input architecture of the CNN-GRU model proposed in this study is shown in Figure 8.
The CNN-GRU model proposed in this study uses a CNN convolutional layer to retain the original feature arrangement of an image and obtain some essential features from it. In addition, the max pooling layer is used to select the stronger feature values from the important features and discard the weak ones, which can prevent over-fitting of the model. Between the convolutional layer and the max pooling layer, this study also used a rectified linear unit to remove feature values less than 0 and accelerate model training. Then, the feature values are passed through the update gate and reset gate of the gated recurrent unit (GRU) to increase the calculation speed of the model so that the model can be more accurate. Then, the flatten layer is connected to convert the feature values into one-dimensional data, which is convenient for the subsequent use of the fully connected layer. Finally, the softmax activation function is connected as the output to determine whether the speech audio indicates dysarthria. Table 4 shows the parameter settings of the CNN-GRU model in this study.
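The proposed pipeline can be sketched in Keras by replacing the recurrent layer with a GRU; as before, the input size, reshape, and layer widths are illustrative assumptions rather than the exact settings of Table 4.

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(40, 128, 1)),               # mel-spectrogram "image"
    layers.Conv2D(32, (3, 3), activation="relu"),   # convolution + ReLU -> (38, 126, 32)
    layers.MaxPooling2D((2, 2)),                    # -> (19, 63, 32)
    layers.Reshape((19, 63 * 32)),                  # rows become time steps
    layers.GRU(64),                                 # update/reset gates; fewer
                                                    # parameters than an LSTM
    layers.Dense(2, activation="softmax"),          # dysarthria vs. healthy
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```

Because the GRU has two gates instead of the LSTM's three, this variant trains with fewer parameters per recurrent unit, which is consistent with the faster per-epoch time reported in Section 3.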

Experimental Design
In this study, the word audio of dysarthria patients and healthy subjects was converted into the frequency domain by the short-time Fourier transform, and the mel spectrum image was extracted by the mel-frequency cepstral coefficients as the input of the four models compared in this study: CNN, LSTM, CNN-LSTM, and CNN-GRU. The pros and cons of each model for dysarthria detection were compared through the training, validation, and test results. The dataset was divided into a training set, validation set, and test set in a ratio of 0.7:0.15:0.15, which were used for the training and testing of the four deep learning models. In this study, after comprehensive testing, different batch sizes and learning rates were evaluated to obtain the ideal solution: batch sizes of 32, 64, and 128; learning rates of 0.01, 0.001, and 0.0001; and epoch = 10. The experimental results are described in detail in Section 3.
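The 0.7:0.15:0.15 partition can be reproduced with two successive splits; the use of `train_test_split` and the fixed random seed here are illustrative assumptions, since the paper does not state how the split was implemented.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for the feature matrix and labels
X = np.arange(100).reshape(-1, 1)
y = np.zeros(100)

# First carve off 70% for training, then split the remaining 30% in half
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.30, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 70 15 15
```

In a real run, `stratify=y` would keep the dysarthria/healthy ratio equal across the three subsets.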

Model Evaluation
In this study, the effectiveness of the deep learning models was evaluated by the following evaluation indicators, whose outcomes are generally divided into four types: (1) true positive (TP); (2) true negative (TN); (3) false positive (FP); and (4) false negative (FN). The following evaluation metrics of the models can be calculated based on these four results: accuracy, precision, recall, F1-score, and the ROC curve [19,20].
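These metrics are available directly in Scikit-Learn, which the study uses for its ROC curves; the label vectors below are toy values for illustration only.

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, roc_auc_score)

# Toy labels: 1 = dysarthria, 0 = healthy
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

acc = accuracy_score(y_true, y_pred)     # (TP + TN) / (TP + TN + FP + FN)
prec = precision_score(y_true, y_pred)   # TP / (TP + FP)
rec = recall_score(y_true, y_pred)       # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)            # harmonic mean of precision and recall
auc = roc_auc_score(y_true, y_pred)      # area under the ROC curve
print(acc, prec, rec, f1, auc)
```

With probability scores instead of hard labels, `roc_curve` from the same module yields the points plotted in Figures 9-12.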

Experimental Results of CNN Model
The CNN model in this study adopted Keras in Python software for model training [21], and the classification results of the CNN model are shown in Table 5. If the batch size of the CNN model was set to 128 and the learning rate was set to 0.01, the highest accuracy of the CNN model, 94.36%, could be obtained. In this study, Scikit-Learn [22] in Python software was used to draw the ROC curve of the CNN model in Figure 9, from which it can be seen that the AUC of the CNN model was 0.871 and the classification result of the model was good.

In this study, the epoch parameter value of the CNN model was set to 10, and the execution time, loss function, and accuracy of the CNN model can be observed from the training process in Table 6. It only took about 3 ms/epoch to train the CNN model. The accuracy of the final training set was 97.88%, with a loss of 0.0638, and the accuracy of the test set was 94.36%.

Experimental Results of LSTM Model
The LSTM model in this study adopted Keras in Python software for model training [21]. The classification results of the LSTM model are shown in Table 7. In the LSTM model, if the batch size was set to 64 and the learning rate was set to 0.001, the LSTM model could achieve the highest accuracy of 56.61%. In this study, Scikit-Learn [22] in Python software was used to draw the ROC curve of the LSTM model in Figure 10.


Experimental Results of CNN-LSTM
The CNN-LSTM model in this study adopted Keras in Python software for model training [21]. The classification results of the CNN-LSTM model are shown in Table 9. In the CNN-LSTM model, if the batch size was set to 128 and the learning rate was set to 0.01, the CNN-LSTM model could obtain the highest accuracy of 78.57%. In this study, Scikit-Learn [22] in Python software was used to draw the ROC curve of the CNN-LSTM model in Figure 11. It can be seen from Figure 11 that the AUC of the CNN-LSTM model was 0.758, and the classification result of the model was moderately good.

Experimental Results of CNN-GRU Model
The CNN-GRU model in this study adopted Keras in Python software for model training [21]. The classification results of the CNN-GRU model are shown in Table 11. In the CNN-GRU model, if the batch size was set to 128 and the learning rate was set to 0.001, the highest accuracy of the CNN-GRU model, 98.38%, could be obtained. In this study, Scikit-Learn [22] in Python software was used to draw the ROC curve of the CNN-GRU model in Figure 12. It can be seen that the AUC of the CNN-GRU model was 0.916, and the model classification results were excellent. In this study, the epoch parameter value of the CNN-GRU model was set to 10, and the execution time, loss function, and accuracy of the CNN-GRU model can be observed from the training process data in Table 12. It only took about 2 ms/epoch to train the CNN-GRU model, and the accuracy of the final training set was 98.14%. The loss function was 0.1621, and the test set accuracy was 98.38%.

Discussion of Results
According to the experimental results in Section 3, the accuracy values of the CNN model training set and test set were 97.88% and 94.36%, respectively (Table 6). The accuracy of both the LSTM model training set and test set was 56.61% (Table 8). The accuracy of the CNN-LSTM model training set was 84.21%, and the accuracy of its test set was 78.57% (Table 10). Finally, the accuracy values of the proposed CNN-GRU model training set and test set were 98.14% and 98.38%, respectively (Table 12). From the perspective of either the training set or the test set, the CNN-GRU model had the highest accuracy. In terms of the AUC value, the CNN-GRU model's AUC of 0.916 was also the highest, better than those of the other three models. These evaluation metrics show that the proposed CNN-GRU model can obtain more accurate judgment results in dysarthria detection.
The results of this study are compared with the methods used in previous studies and summarized in Table 13. Hernandez et al. [23] used a method based on fricative sounds in audio messages and machine learning to detect dysarthria. The average spectral peak in the spectral moments was used to extract the fricatives in the audio as the input features of an SVM model, and the final SVM accuracy was 72%. Narendra et al. [24] trained an SVM with acoustic and glottal features extracted from coded speech utterances and their corresponding dysarthria/healthy labels, finally achieving an accuracy of 96.38%. Narendra et al. [25] developed an end-to-end system that used raw speech signals and raw glottal flow waveforms to detect dysarthria with two deep learning architectures: CNN-MLP and CNN-LSTM. The results showed that the raw glottal flow waveform is more suitable for model training than the raw speech signal, and the accuracies of CNN-MLP and CNN-LSTM were 87.93% and 77.57%, respectively. Rajeswari et al. [26] enhanced the speech by variational mode decomposition and fed the reconstructed signal to a CNN for model training, achieving a final accuracy of 95.95%. The accuracy of the CNN-GRU model proposed in this study was 98.38%, the highest among all of these studies. However, our approach may take a longer time to execute. A survey showed that previous studies did not report execution times in their articles. Therefore, the execution time of about 2 ms/epoch for our approach is appended in Table 13 for further investigation and comparison in the future.

Conclusions
Although dysarthria testing can be based on the subjective judgment of doctors, it is regarded as a costly and time-consuming test, which can easily cause a medical burden. Therefore, if dysarthria testing can be conducted objectively, it can assist doctors in making an immediate judgment. This study used a CNN-GRU classification model for dysarthria detection. The results showed that the proposed CNN-GRU model can achieve the highest accuracy of 98.38%, which is better than the CNN, LSTM, and CNN-LSTM models and the models of other scholars.
The results can be used as an auxiliary diagnostic procedure for detecting dysarthria in the future. In future studies, it may be possible to take more eigenvalues from the audio to analyze the severity level of dysarthria symptoms so that dysarthria detection can be studied further. In addition, future researchers can also apply the CNN-GRU model to the speech symptoms of other pathologies, such as Parkinson's disease and amyotrophic lateral sclerosis (ALS). The proposed architecture can also be used for image identification, just as Priyanka and Ganesan [26] used different data preprocessing methods combined with machine learning to classify the severity of dementia. Better prediction results may be achieved if such research is conducted through a deep learning architecture.
In addition, most of the existing freely available dysarthric speech databases, including [12], contain speech data recorded from a small number of patients [24]. The dataset used in this study contains a large number of speech samples per subject; however, the number of subjects included is small, and no samples have been added continuously, which makes it challenging to ensure that the results of this study can be adequately transferred to other clinical trials of dysarthria.