Performance Evaluation of RNN with Hyperbolic Secant in Gate Structure through Application of Parkinson’s Disease Detection

Abstract: This paper studies a novel recurrent neural network (RNN) with the hyperbolic secant (sech) function in its gate for a specific medical application task: the detection of Parkinson's disease (PD). In detail, it focuses on the fact that patients with PD have motor speech disorders, converts the voice data into black-and-white images of a recurrence plot (RP) at specific time intervals, and constructs a detection model that combines the RNN with a convolutional neural network (CNN); the study evaluates the performance of the RNN with the sech gate compared with long short-term memory (LSTM) and the gated recurrent unit (GRU) with conventional gates. As a result, the proposed model obtained results similar to those of LSTM and GRU (an average accuracy of about 70%) with fewer parameters, resulting in faster learning. In addition, within the framework of the RNN with sech in the gate, the accuracy obtained by using tanh as the output activation function is higher than that obtained using the relu function. The proposed method should see further improvement as more data become available. More analysis of the input sound type, the RP image size, and the deep learning structures will be included in our future work to further improve the performance of PD detection from voice.


Introduction
In recent years, the recurrent neural network (RNN) has been frequently used in time series data processing, such as medical information processing. The RNN has a recursive structure inside and can handle input data of variable length.
Generally, since it is difficult for a Simple RNN (Vanilla RNN) [1] with a simple structure to learn time series data with long-term dependencies, two types of RNNs with complex gated structures that control the required information have been proposed: long short-term memory (LSTM) [2,3] and the gated recurrent unit (GRU) [4]. However, while the performance of RNNs with gated structures is improved, since backpropagation through time (BPTT) used for learning works by unrolling all input time steps, the more parameters there are in the RNN, the more memory is required and the higher the calculation cost. In order to solve this problem, many papers have attempted to simplify LSTM and GRU [5][6][7][8]. In our previous research, we proposed SGR (simple gated RNN), which applies parts of the gate structure of LSTM and GRU to the Simple RNN to reduce parameters and realize faster learning [9]. In addition, we also proposed an RNN that introduces a new activation function, the hyperbolic secant (sech), into the gate [10], which is even simpler than the Simple RNN (Vanilla RNN).
The objective of this paper is to evaluate this RNN model with sech in the gate through a specific medical application task. In detail, the task here is to detect Parkinson's disease (PD) from the sound information of subjects. PD is the second most common neurodegenerative disease after Alzheimer's disease [11]. The detection of PD is very important for medical treatment as well as for improving the patient's quality of life (QOL). Focusing on the fact that patients with PD have motor speech disorders, we ask subjects to sustain the vowel /a/ for about 4 to 10 s to acquire data. In order to check the periodicity of the voice, the sound data are then converted into recurrence plots (RPs) at specific time intervals. An RP can visualize periodicity and chaos in a time series. The generated RPs are set as input to a neural network model that combines a convolutional neural network (CNN) and an RNN for classification.
In the experiments, in order to evaluate the performance of the RNN with the sech gate structure, we compared it with LSTM and GRU with conventional gate structures. We also compared the performance of this RNN when the output activation function was changed between tanh and relu. For performance evaluation, accuracy, F1-score, and the Matthews correlation coefficient (MCC) were used. As a result, the proposed model obtained results similar to those of LSTM and GRU (an average accuracy of about 70%) with fewer parameters, contributing to faster learning. In addition, within the framework of the RNN with sech in the gate, the accuracy obtained by using tanh as the output activation function is higher than that obtained using the relu function.
The outline of the paper is as follows: Section 2 discusses related works. Section 3 explains the framework of PD detection models and data preprocessing. The details of the experiments, the results, and discussion are described in Section 4. Section 5 is the conclusions.

Related Work
In this section, we review important points regarding related works on RNNs and CNNs and the recurrence plot used in our study.
As mentioned in Section 1, recent RNNs generally refer to RNNs with weighted gate structures such as LSTM and GRU, rather than the Simple RNN (or Vanilla RNN) with its simple structure. This is because the RNN with a gate structure succeeded in partially solving the vanishing gradient problem. When the vanishing gradient occurs, the error shrinks rapidly as it propagates back through the layers, and learning does not progress well. The gate structure deals with this problem by controlling the vanishing gradient with weight parameters. However, the weight parameters increase the calculation cost and make analysis difficult. To address these problems, many papers have attempted to reduce the parameters of the RNN [5][6][7][8]. In our previous study, we proposed SGR, which reduced the weighted gates while maintaining performance. However, because it still had weighted gates, the calculation cost was not significantly reduced, and the analysis remained difficult [9]. Therefore, we proposed a new RNN model that reduces parameters by removing the conventional weighted gate and using a new gate structure with a scalar value controlling the vanishing gradient and the sech function as the activation function [10]. Regarding this RNN using the sech function, in [10], the task performance was lower than that of conventional gated RNNs (such as LSTM and GRU) due to the characteristics of the specific task and the tanh activation function. However, in a binary classification task in natural language processing (NLP), it was found that, despite having about 1/6 or fewer parameters, its performance was comparable to that of conventional gated RNNs without using normalization methods such as batch normalization. It is therefore important to quantify its performance on more tasks, such as time series data processing and image processing.
In this paper, we confirm and evaluate the performance difference between this RNN with the sech function and conventional gated RNNs on a practical medical application task. For this purpose, we set the detection of Parkinson's disease as the practical medical application task. In time-frequency analysis of time series data, it is common to use the short-time Fourier transform (STFT). The STFT divides a time signal into short segments of equal length and computes the Fourier transform separately on each segment. However, the Fourier transform has the drawback of lower resolution for non-stationary signals. Since the voice of PD patients is non-stationary due to a faint or unstable voice caused by dysarthria, in our study, we attempt voice analysis using a recurrence plot (RP), which can process even non-stationary signals. As a way to represent chaotic time series and visualize periodicity, the recurrence plot was introduced by Eckmann et al. [12]. There are various methods for generating recurrence plots, such as plotting points whose distance is smaller than an arbitrary threshold or generating recurrence plots using a percentile. Compared to the Fourier transform, which is not suitable for describing systems in which independent basis functions cannot be properly selected, the RP can handle both non-linear and unsteady states [13]. Additionally, the RP can detect even faint modulations of animal voice signals that cannot be captured by conventional time-frequency analysis and is a very powerful tool [14].
Recently, there has been increasing research in which recurrence plot images are classified using a CNN [15,16], which is used for the image recognition tasks [17][18][19]. Furthermore, there is a Parkinson's disease identification study which used a CNN and recurrence plots of handwriting dynamics data, too [20]. Therefore, in this paper, we propose a Parkinson's disease voice detection model that combines a CNN and RNN and use it to evaluate RNNs.
Related to other studies on PD detection, in the machine learning approach, there is a study that used embeddings extracted from a deep neural network, named x-vectors, and classified them using cosine distance, cosine distance preceded by Latent Dirichlet Allocation (LDA), and Polylingual Latent Dirichlet Allocation (PLDA) [21]. Additionally, in [22], four machine learning methods (k-nearest neighbor, multi-layer perceptron, optimum-path forest, and support vector machine) were used with 18 feature extraction techniques for the detection of PD. In the deep learning approach, in [23], multiple artificial neural networks (ANNs) were used with 26 speech features for PD detection, and principal component analysis (PCA) and a self-organizing map (SOM) were applied for feature selection. In [24], a deep neural network (DNN) was applied for PD severity prediction using 16 biomedical voice measures. In these studies, the calculation cost of voice preprocessing is relatively high, and multiple voice features are required. In comparison, in our study, we propose to use an RP, which only calculates the percentile and absolute distances, so as to reduce the calculation cost and ease implementation. By using an RP generated from a simple vowel voice, our approach uniquely detects PD with a model combining a CNN and an RNN. To the best of our knowledge, detecting PD with a model combining a CNN and an RNN using an RP generated from a simple vowel voice has not been reported before.

The Model Description and Data Preprocessing
This section describes neural network models used in this paper in detail.

Long Short-Term Memory (LSTM)
As is shown in Figure 1, a layer of a typical LSTM is defined by Equations (1)-(6) as follows:

z_t = tanh(W_z x_t + R_z h_{t−1} + b_z), (1)
i_t = σ(W_i x_t + R_i h_{t−1} + b_i), (2)
o_t = σ(W_o x_t + R_o h_{t−1} + b_o), (3)
f_t = σ(W_f x_t + R_f h_{t−1} + b_f), (4)
c_t = i_t ∘ z_t + f_t ∘ c_{t−1}, (5)
h_t = o_t ∘ tanh(c_t), (6)

where z_t is the input to be added to the cell state, c_t is the internal cell state, and h_t is the hidden state at the next time step. Additionally, i_t, o_t, and f_t represent the outputs from the input gate, output gate, and forget gate, respectively. From here, unless otherwise specified, the operator symbol ∘ represents the Hadamard product, W and R are weighting matrices, and b is a bias vector, regardless of the subscript; these explanations are omitted hereafter. In addition, x_t is the input at time step t, h_{t−1} is the hidden state at time step t − 1, and h_t is the hidden state at time step t in this paper. The input gate controls the input to the cell state, using the value in [0, 1] output from Equation (2), so that information can be selectively stored from the input. Additionally, the output gate controls the output from the cell state using the value in [0, 1] output from Equation (3). Finally, the forget gate controls the internal cell state directly by the value in [0, 1] output from Equation (4) [2,3].

Gated Recurrent Unit (GRU)
As is shown in Figure 2, a layer of the GRU is defined by Equations (7)-(10) as follows:

r_t = σ(W_r x_t + R_r h_{t−1} + b_r), (7)
u_t = σ(W_u x_t + R_u h_{t−1} + b_u), (8)
h̃_t = tanh(W_h x_t + R_h (r_t ∘ h_{t−1}) + b_h), (9)
h_t = (1 − u_t) ∘ h_{t−1} + u_t ∘ h̃_t, (10)

where r_t and u_t are the outputs from the reset gate and update gate. The reset gate controls how much the previous hidden state h_{t−1} is considered when creating the new hidden state h̃_t, using the gate output in [0, 1] from Equation (7). Similarly, the update gate decides how much of the new hidden state h̃_t is mixed in to generate the next hidden state h_t, using the gate output in [0, 1] from Equation (8) [4]. The performance of the GRU relative to that of LSTM depends on the learning task, while the required number of weight parameters for the same number of units is 3/4 that of LSTM.
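The 3/4 ratio can be checked with a quick parameter count. The sketch below is illustrative (the function names are ours); n_in = 1600 matches the flattened CNN output used later in this paper, though the ratio is the same for any sizes:

```python
def lstm_params(n_in, n_h):
    # LSTM: 4 blocks (input gate, output gate, forget gate, candidate),
    # each with input weights, recurrent weights, and a bias vector
    return 4 * (n_in * n_h + n_h * n_h + n_h)

def gru_params(n_in, n_h):
    # GRU: 3 blocks (reset gate, update gate, candidate)
    return 3 * (n_in * n_h + n_h * n_h + n_h)

# For the same number of units, the GRU needs 3/4 of the LSTM's parameters
print(gru_params(1600, 256) / lstm_params(1600, 256))  # 0.75
```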

Our Proposed RNN Model
Conventional methods have various problems, such as an increased amount of calculation, difficulty in analyzing the gate structure, and dependence on the data due to batch normalization. Therefore, in order to solve these problems of conventional RNNs, we constructed a new structure using the sech function as the gate [10]. In a conventional gate structure, if the gate has weight parameters, the calculation cost is very high; but if it does not have weight parameters, scaling for negative and positive values does not work properly, because the sigmoid function is not an even function. Hence, instead of the sigmoid function, we used the sech function, which is an even function with an output range from 0 to 1, for the gate structure [10].
The sech function is defined by Equation (11) as follows:

sech(x) = 2 / (e^x + e^{−x}), (11)

and its derivative is given by Equation (12) as follows:

d/dx sech(x) = −sech(x) tanh(x). (12)

Figure 3 shows a graph of the sech function and its derivative. Figure 3. A graph of the sech function and its derivative (cited from [10]). The sech function is an even function and has an output range from 0 to 1.
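As a quick numerical check, Equations (11) and (12) can be written directly (a minimal stdlib-only sketch; the function names are ours):

```python
import math

def sech(x: float) -> float:
    # Equation (11): sech(x) = 2 / (e^x + e^(-x)) = 1 / cosh(x)
    return 2.0 / (math.exp(x) + math.exp(-x))

def sech_prime(x: float) -> float:
    # Equation (12): d/dx sech(x) = -sech(x) * tanh(x)
    return -sech(x) * math.tanh(x)

print(sech(0.0))                 # 1.0 (peak of the gate output)
print(sech(2.0) == sech(-2.0))   # True: sech is an even function
```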
As is shown in Figure 4, a layer of the RNN with the sech gate structure with the highest performance in [10] is defined by Equations (13)-(15) as follows:

g_t = sech(α h_{t−1}), (13)
h_t = act(W x_t + g_t ∘ h_{t−1} + b), (14)
y_t = h_t, (15)

where α is a scalar value, introduced as a parameter for controlling the degree of the vanishing gradient. Additionally, the output to the lower layer is represented by y_t, and act is the activation function. Hereinafter, for the sake of simplicity, the RNN with the sech gate structure is referred to as "RNN-SH". Figure 4. A layer of RNN-SH (cited from [10]). The smaller the scalar value α is, the larger the hidden state and the gradient are after passing through the gate, and the larger α is, the smaller they are. This controls the amount of retained information and the degree of the vanishing gradient. The biases are omitted here.
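A single time step of RNN-SH can be sketched as follows. This is our reading of the structure described above, not code from [10]: the gate sech(α · h_{t−1}) is applied elementwise to the previous hidden state, and only the input carries a weight matrix (matching the later remark that RNN-SH has weight parameters only for the input). All names are illustrative:

```python
import math

def sech(x):
    return 2.0 / (math.exp(x) + math.exp(-x))

def rnnsh_step(x, h_prev, W, b, alpha, act=math.tanh):
    """One time step of the sech-gated RNN (sketch, our reconstruction).

    Gate: g_i = sech(alpha * h_prev[i]), applied elementwise to h_prev;
    there is no recurrent weight matrix, only the input weights W.
    """
    h_new = []
    for i in range(len(h_prev)):
        gated = sech(alpha * h_prev[i]) * h_prev[i]          # sech gate
        pre = sum(W[i][j] * x[j] for j in range(len(x))) + gated + b[i]
        h_new.append(act(pre))                                # tanh or relu
    return h_new

# One step from a zero hidden state: the gate is sech(0) = 1, so the
# update reduces to act(W x + b)
h1 = rnnsh_step([1.0], [0.0, 0.0], [[0.5], [-0.5]], [0.0, 0.0], alpha=1.0)
```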

Voice Data Preprocessing and Recurrence Plot Creation
Here, we explain how to preprocess the voice data and create recurrence plots. The voice data used in our experiments are recordings of the vowel /a/ from both men and women, each lasting about 4 to 10 s. Because the voice data are recorded in various environments, all data are first converted to monaural at 16,000 Hz before the processing to create RPs is started. Additionally, all voice data are quantized at 16 bits.
The preprocessing of the voice data before creating RPs is as follows:
1. Delete the silent sections before and after the voice data: in order to avoid the influence of voice volume, all voice data are normalized so that the maximum amplitude is the maximum expressible value, and sections of −40 dB or less are deleted only from the beginning and end of the sound.
2. Furthermore, delete the first and last 0.1 s of the above voice data: this is because the first and last sounds after deleting the silent sections were often unstable.
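Steps 1 and 2 can be sketched as follows (a stdlib-only illustration working on normalized float samples; the paper uses Pydub for the actual audio handling, and the helper name is ours):

```python
def trim_silence(samples, rate, threshold_db=-40.0, guard_sec=0.1):
    """Normalize to peak amplitude, strip leading/trailing samples below
    the -40 dB threshold, then drop a further 0.1 s from each end."""
    peak = max(abs(v) for v in samples) or 1.0
    norm = [v / peak for v in samples]
    floor = 10 ** (threshold_db / 20.0)     # -40 dB -> amplitude 0.01
    first = next(i for i, v in enumerate(norm) if abs(v) >= floor)
    last = len(norm) - next(i for i, v in enumerate(reversed(norm))
                            if abs(v) >= floor)
    guard = int(rate * guard_sec)           # extra 0.1 s at each end
    return norm[first + guard:last - guard]
```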
Next, the procedure for creating the recurrence plots is as follows:
1. Divide the above preprocessed voice data into 0.01 s sections.
2. Immediately before creating the recurrence plot, normalize the sound of each 0.01 s section to [−1, 1], and then plot the points whose distance is smaller than the 35th percentile within that section. The 35th percentile is used because it gave the highest accuracy in the experiments, as described later.
3. Compress the generated black-and-white RP from its 160 × 160 image size to 20 × 20 using bilinear interpolation. This may degrade the accuracy, but it is carried out in consideration of the Video Random Access Memory (VRAM) efficiency when inputting to the neural network.
4. Repeat steps 1-3 until all divided voice data have become recurrence plots. If a fraction shorter than 0.01 s remains in the last section, it is discarded.

Figure 5 shows the structure of the Parkinson's disease detection model. The CNN model, RNN model, and output layer are processed in this order for detection. Relu is used as the activation function in the CNN model, but for each RNN, a suitable activation function is used (tanh or relu). The input image size is 20 × 20, and relu is used as the activation function after each output of the convolutional layers. In order to prevent overfitting, dropout is applied immediately after max pooling and before the output layer. If the RNN has two layers, dropout is also applied before the second RNN layer. The size of the input to the RNN is 64 × 5 × 5 (= 1600) because the output from the CNN is flattened. The model output is binary (2 classes), and the output is generated after all RPs have been input.
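The RP-generation steps above can be sketched as follows (a stdlib-only illustration; the function names and the simple percentile rule are ours, and the bilinear compression to 20 × 20 is omitted). At 16,000 Hz, a 0.01 s window is 160 samples, giving the 160 × 160 RP described in step 3:

```python
import math

def recurrence_plot(x, q=35):
    """Binary RP: (i, j) is black (1) when |x_i - x_j| is below the
    q-th percentile of all pairwise distances (q=35 in this paper)."""
    n = len(x)
    dists = sorted(abs(x[i] - x[j]) for i in range(n) for j in range(n))
    eps = dists[min(n * n - 1, int(n * n * q / 100))]  # simple percentile
    return [[1 if abs(x[i] - x[j]) < eps else 0 for j in range(n)]
            for i in range(n)]

def segment_rps(signal, rate=16000, win_sec=0.01, q=35):
    """Split a mono signal into 0.01 s windows, normalize each window to
    [-1, 1], and build one RP per window; a trailing fraction shorter
    than the window is dropped, as in step 4."""
    win = int(rate * win_sec)
    rps = []
    for start in range(0, len(signal) - win + 1, win):
        seg = signal[start:start + win]
        peak = max(abs(v) for v in seg) or 1.0
        seg = [v / peak for v in seg]          # normalize to [-1, 1]
        rps.append(recurrence_plot(seg, q))
    return rps
```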

Outline of Experiments
In order to evaluate the performance of each RNN using the recurrence plots and the model constructed in Section 3, experiments were performed on a computer. PyTorch 1.7.1, a machine learning library, with CUDA 11 and Python 3 (version 3.8) was used. The execution environment was Ubuntu 18.04.4 (64 bit), and we used an i7-8700 CPU (RAM 16 GB) and a GTX 1080 GPU (VRAM 8 GB) for single-precision calculations. For the processing of the voice data, a Python library named Pydub [25] was used; some Pydub functions depend on FFmpeg [26]. In addition, when using the GPU with PyTorch, it was necessary to set CUBLAS_WORKSPACE_CONFIG=:16:8 as an environment variable for reproducibility due to the use of CUDA 11. In the experiments, all random seeds were initialized to 10.

Input Voices and Preprocessing
As mentioned in Section 3, the voice dataset used in the experiments consists of recordings of the vowel /a/ lasting about 4 to 10 s. This dataset contains data for 22 healthy people (HP) and 30 PD patients, with each person recorded three times; therefore, the total number of recordings is 156 ((22 + 30) × 3). The dataset contains 43 subjects (13 HP and 30 PD cases) who were hired as volunteers by the GYENNO SCIENCE Parkinson Disease Research Center (1), and nine subjects (nine HP) who were recruited by the authors in order to alleviate the data imbalance. The breakdown by gender was 27 females and 25 males. The PD patients had HY (Hoehn and Yahr) stages 1-5: a total of 13 PD patients were in stages 3-5, and 17 PD patients were below stage 3. (1) Ethical Approval: all procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and the "Law of the People's Republic of China on Medical Practitioners" (1998) declaration and its later amendments or comparable ethical standards.

Figure 6 shows examples of the preprocessed voice data and the generated RPs following the procedure of Section 3. If ε is the value of the 35th percentile of the pairwise distances for the time series {x_i} = {x_1, x_2, ⋯, x_N}, the recurrence plot matrix R is plotted by Equation (16) as follows:

R_{i,j} = 1 if |x_i − x_j| < ε, and R_{i,j} = 0 otherwise, (16)

where i, j ∈ {1, 2, ⋯, N} are the indices of the components. R_{i,j} = 1 means that a black point is plotted, and R_{i,j} = 0 means that a white point is plotted. We divided the dataset into train: 28 × 3 (HP: 10 × 3, PD: 18 × 3, about 54%), validation: 12 × 3 (HP: 6 × 3, PD: 6 × 3, about 23%), and test: 12 × 3 (HP: 6 × 3, PD: 6 × 3, about 23%). The three recordings of the same person were always assigned to the same group (train, validation, or test) in the experiments. Additionally, at training time, the healthy-person voices shifted by 2 s were used as oversampling data so that the total number of data did not become imbalanced.
Furthermore, in order to avoid the influence of dispersion due to data division, the data were shuffled, experiments were performed five times consecutively for each RNN, and the performance was evaluated based on the average.

RNN Model Configuration and Hyperparameters
The RNNs used in the experiments were LSTM, GRU, and RNN-SH. The number of hidden units was set to 64, 128, or 256. RNN-SH used tanh or relu as the output activation function. We also ran experiments with the RNNs stacked in two layers. We used RAdam [27], a variant of stochastic gradient descent, for learning, and set the RAdam parameters to the values recommended for Adam [28]. The dropout ratio and the weight-decay parameter were set to 0.5 and 0.01, respectively, to prevent overfitting. The number of epochs was 50, and the mini-batch size was 27. The weight parameters were initialized from a Gaussian distribution with a mean of 0 and a variance of 1/n (n: the number of input units) [29], and all biases were initialized to 0. Additionally, the CNN weights were initialized from a uniform distribution [−a, a] where a = √(3/n) [29]. The scalar value α of the gate structure of RNN-SH was initialized to 1.0. Finally, softmax cross entropy was used as the loss function. This loss function E is expressed by the following Equation (17):

E = −(1/B) Σ_{i=1}^{B} Σ_{j=1}^{K} t_{ij} log y_{ij}, (17)

where B is the batch size, K is the output size, y is the output from the output layer, and t is the correct answer label. In addition, i is the data number, j is the dimension, and t_{ij} is the correct answer label of the j-th dimension in the i-th data.
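Equation (17) can be sketched as follows (a minimal stdlib implementation for illustration; the experiments themselves would use PyTorch's built-in loss):

```python
import math

def softmax(z):
    m = max(z)                               # shift for numerical stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def softmax_cross_entropy(logits, labels):
    """Equation (17): E = -(1/B) * sum_i sum_j t_ij * log(y_ij),
    with B the batch size and t the one-hot correct labels."""
    B = len(logits)
    total = 0.0
    for z, t in zip(logits, labels):
        y = softmax(z)
        total += -sum(tj * math.log(yj) for tj, yj in zip(t, y))
    return total / B
```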

Experimental Results
The results of the experiments are shown in Table 1, and Figure 7 shows graphs of the average values in Table 1 for each model. Table 1 shows the results on the validation set and test set at the epoch where the loss on the validation set was lowest within the 50 epochs. Since the experiments were performed five times each, the mean and standard deviation for the validation set and the test set are shown, together with the overall mean and standard deviation. "Average (validation/test)" is the average over the five runs for each validation set or test set, respectively. "Total Average" is the overall average of the validation and test sets. We used accuracy, F-score, and the Matthews correlation coefficient (MCC) as performance evaluation indicators. Each is defined by Equations (18)-(20) as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN), (18)
F-score = 2TP / (2TP + FP + FN), (19)
MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)), (20)

where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives, respectively. Accuracy and F-score are values between 0 and 1; the closer they are to 1, the better the performance. For unbalanced data, the F-score gives a more accurate performance evaluation than accuracy. Note that the F1-score shown in the experiments is the macro-average. The MCC is a correlation coefficient between −1 and +1: +1 indicates a perfect prediction, 0 an average random prediction, and −1 an inverse prediction. The MCC was calculated for a more accurate evaluation, including inverse predictions. In Table 1, accuracy and F-score are rounded at the third decimal place, and MCC is rounded at the fourth decimal place.

Figure 7. Graphs of the average values in Table 1: the left is the validation average, the middle is the test average, and the right is the total average. The standard deviation is omitted to make the graphs easier to read.
As a result, from Table 1, there was no significant difference in performance between the RNNs. No statistically significant difference was found between the top three models (GRU: two layers and 256 units; LSTM: two layers and 64 units; RNN-SH (tanh): one layer and 256 units) by the Friedman test. Figure 8 shows the average ROC curves with AUC values for RNN-SH and GRU, and Figure 9 shows the relationship between the number of units and the number of parameters of the PD detection model. From Figure 8, it is clear that the AUC value of GRU is higher than that of RNN-SH (tanh); however, from Figure 9, the parameters of RNN-SH (tanh) with one layer and 256 units number less than a third of those of GRU with two layers and 256 units, although the difference in performance is small. In addition, the parameter count of LSTM with two layers and 64 units is not much different from that of RNN-SH with one layer and 256 units. However, in the case of LSTM, though the total average is high, there is a gap between the validation average and the test average, confirming that overfitting occurred. On the other hand, for RNN-SH, the validation average and test average are about the same. Figure 10 shows examples of the confusion matrices in the first trial of LSTM and RNN-SH (tanh); LSTM performs better than RNN-SH on the validation set, but not by much on the test set. From Table 1 and Figure 7, for RNN-SH, the performance when the output activation function was tanh was higher than that with relu. Figure 11 shows the execution time of each RNN in the case of one layer and 256 units in the first run. Since we used the LSTM and GRU pre-implemented in PyTorch, an accurate speed comparison with RNN-SH, which was implemented by combining PyTorch functions, cannot be performed directly.
Therefore, in Figure 11, we used LSTM and GRU implementations built individually from the same PyTorch functions and compared the speeds. In order to make the speed comparison fair, we considered the parallelism of LSTM and GRU and implemented them so that they could be calculated at high speed, for example, by splitting the result of a parallel calculation for each gate weight. From Figure 11, since RNN-SH has fewer parameters and a lower calculation cost than LSTM and GRU, its learning was faster. Figure 12 shows the learning curves of the loss for each RNN with 256 units and one or two layers. From Figure 12, it can be seen that the losses oscillate due to the small dataset, but the learning of RNN-SH progresses relatively gently compared to that of LSTM and GRU. However, RNN-SH (tanh) with two layers oscillated violently and could not learn well.

Discussion
No statistically significant difference was found between the RNN models, but the total number of parameters of the PD detection model using RNN-SH is about 1/4 of that of the model using LSTM and about 1/3 of that of the model using GRU when the number of units and layers is the same. This gap widens further as the number of units increases or multiple layers are used. Hence, RNN-SH is very effective in terms of memory efficiency and faster learning. When using RNN-SH, it is also easy to take measures against overfitting, because it has few parameters, learning progresses slowly, and there are weight parameters only for the input.
Regarding the comparison between tanh and relu in RNN-SH, taking the case of one layer and 256 units as an example, the scalar gate parameter α is about 1.007 for tanh and about 1.012 for relu, which are very close to each other. This was also the case under the other conditions in these experiments. Therefore, the difference in performance between tanh and relu is considered to be caused solely by the difference in activation function. From Figure 12, RNN-SH (tanh) with 256 units and two layers oscillated violently, and the reason it could not learn well is attributed to the vanishing gradient at the output due to tanh. On the other hand, RNN-SH (relu) with 256 units and two layers learned smoothly; however, its accuracy was lower than that of tanh. Figure 13 shows examples of the confusion matrices in the first trial of the tanh and relu RNN-SH. From Figure 13, the FP rate of relu was higher than that of tanh, and the TN rate of relu was lower than that of tanh, indicating that the relu model could not identify PD patients properly and could not be trained well. Relu has attracted attention as an activation function that contributes to the multi-layering of neural networks. However, relu is not symmetric with respect to the origin and has problems such as the dying relu problem [30]. In our experiments, we did not use parameter initialization or normalization methods optimized for relu. Thus, dying relu may have occurred and caused the accuracy to deteriorate. Since relu is an important function for multi-layering, it is necessary to analyze in the future what is needed to improve the performance when using relu. In the first trial, although the model learned relatively well when tanh was used, it could not learn well when relu was used, and the PD detection ability was low.
In terms of PD detection using an RP, the accuracy was around 70%. This is because some of the voice data of healthy people had hoarse voices and uneven volume. In our experiments, the vowel /a/ was used; however, it was difficult even for healthy people to utter the vowel /a/ at a constant volume for a long time. Therefore, from the results of our experiments, the vowel /a/ alone was not sufficient for PD detection. In order to improve the accuracy of PD detection, it is necessary to consider in the future which pronunciations are more suitable for detection. Additionally, in our experiments, the RP images were resized by bilinear interpolation in consideration of memory usage. However, when an RP image is compressed, features different from the original ones may appear, or important features may disappear. Since the accuracy may have deteriorated due to resizing, it is necessary to further study the image-resizing method and the architecture of the CNN in the future.

Conclusions
This paper evaluated the effectiveness of our RNN-SH model, which has fewer parameters than other gated RNNs such as LSTM and GRU, in a practical medical application of PD detection using RPs. RNN-SH can greatly reduce the number of parameters compared with other RNNs while maintaining comparable accuracy for time series data processing, although the difference in accuracy was small and no statistically significant difference was found between the RNN models.
Since the gate parameter of RNN-SH is scalar, it is easy to analyze the result. In addition, another advantage is that the activation function can be easily changed according to the tasks. However, it turned out that it is difficult to improve the accuracy by simply replacing tanh with relu as the activation function in RNN-SH. From our experiment, the PD detection ability is lower using relu in comparison to using tanh, and we consider that dying relu might be the cause. It is necessary to analyze and make further improvements by considering parameter initialization and normalization methods, etc.
Since the dataset size in our experiments was relatively small, the proposed method should see further improvement as more data become available. More analysis of the input sound type, the RP image size, and the deep learning structures will be included in our future work to further improve the performance of PD detection from voice. With regard to RP image compression, we will investigate in more detail an image size and a CNN structure that do not deteriorate the accuracy of PD detection, in order to improve the current model.

Institutional Review Board Statement: All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and the "Law of the People's Republic of China on Medical Practitioners" (1998) declaration and its later amendments or comparable ethical standards.
Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.
Data Availability Statement: Due to the nature of this research, participants of this study did not agree for their data to be shared publicly at present, so supporting data is not available.