An Optimized Brain-Based Algorithm for Classifying Parkinson’s Disease

Abstract: In recent years, highly recognized computational intelligence techniques have been proposed to treat classification problems. These automatic learning approaches lead the most recent research because they exhibit outstanding results. Nevertheless, to achieve this performance, artificial learning methods first require fine tuning of their parameters, and then they need to work with the best generated model. This process usually needs an expert user to supervise the algorithm's performance. In this paper, we propose an Extreme Learning Machine optimized with the Bat Algorithm, which boosts the training phase of the machine learning method to increase the accuracy while decreasing, or at least keeping, the loss in the learning phase. To evaluate our proposal, we use the Parkinson's Disease audio dataset taken from the UCI Machine Learning Repository. Parkinson's Disease is a neurodegenerative disorder that affects over 10 million people. Although it is diagnosed through motor symptoms, it is possible to evidence the disorder through variations in speech using machine learning techniques. Results suggest that using the bio-inspired optimization algorithm to adjust the parameters of the Extreme Learning Machine is a real alternative for improving its performance. During the validation phase, the classification process for Parkinson's Disease achieves a maximum accuracy of 96.74% and a minimum loss of 3.27%.


Background
In the last decade, interesting works have been proposed on artificial neural networks for classification problems. For instance, in [23] a classification system to identify voice dysphonia via the Wavelet Packet Transform and the Best Basis Algorithm is designed. Outstanding results were reported, reaching from 87.5% to 96.8% accuracy. Recent manuscripts study deep learning networks for classifying environmental sounds. In [24], a convolutional neural network is applied to sound classification datasets. Results are encouraging because, during the learning phase, an accuracy greater than 77% is achieved. In [25], the authors provide a review of state-of-the-art deep learning techniques for audio signal processing. The analyzed works range from variants of the long short-term memory architecture to audio-specific neural network models, and also include convolutional neural networks. Other works detail traditional algorithms for pathological voices, using the auto-correlation method [26], the cepstrum technique [26], and the data reduction procedure [27]. For all of them, the accuracy is between 72.72% and 84.09%. In [28], Mel Frequency Cepstral Coefficients are applied to identify features in voice signals, which are then used to train an ANN with 10 neurons in the hidden layer. Results are presented in terms of the generated loss. The most precise result is met with nine neurons, a mean squared loss of 1.05E-3, and a loss percentage equal to 12.8%. Now, if we study the integration of bio-inspired metaheuristics into machine learning techniques, an interesting set of works emerges. For example, hybrid methods were proposed in [29,30]. Here, the parameters of a Support Vector Machine are optimized by the Genetic Algorithm and Particle Swarm Optimization (PSO), respectively. The first one is developed to classify voice disorders. This approach shows an increase in accuracy from 79.16% to 87.5%.
The second one was designed to predict the use of electrical energy. The improved algorithm works in iterations; in each iteration, the metaheuristic based on the movement of particles finds the best configuration for the classification method. PSO has also been integrated with an ELM. In [31], a mixed procedure based on the machine learning approach is proposed. This technique combines the successful self-regulated learning capability of the particle swarm optimization algorithm with an ELM classifier. The enhanced metaheuristic was applied to define the optimum parameter setting for the ELM in order to reduce the number of hidden layer neurons. The hybrid method was evaluated on five medical dataset classification problems: Wisconsin Breast Cancer, Pima Indians Diabetes, Heart-Statlog, Hepatitis, and Cleveland Heart Disease. Results illustrate that the integration of self-regulated learning PSO into ELM works better than the native ELM, and it also operates better than the original PSO on ELM.
Finally, if we analyze multi-objective optimization algorithms combined with the extreme learning machine, we can find manuscripts such as [32][33][34]. The first one describes a physical programming approach that allows selecting the number of hidden nodes and the activation function for an extreme learning machine by optimizing multiple performance objectives of the networks. The second one simultaneously considers the ELM model loss and the sparsity constraint of the hidden layer in the optimization. The last work details a new model selection method for ELM based on multi-objective optimization to obtain dense networks with good generalization ability.

Methodology
In this section, we present the methodology that guides the global process of this work. We begin with a brief description of Parkinson's Disease and show how voice signals are transformed into spectrograms. Finally, we describe the experimental design used.

Parkinson's Disease
Parkinson's Disease is a degenerative disorder of the central nervous system characterized by tremors (shaking), rigidity, and slow intentional movement, as well as changes in memory and cognition [35]. The diagnosis of Parkinson's Disease is frequently difficult, especially in early disease stages, because tremors may occur not only at rest, but also at posture and/or during action [36]. Vocal impairment is considered one of the earliest indications of the onset of this disease. In this line, recent works have endeavored to handle voice signals to support the diagnosis [22,37]. In this work, we move another step forward in the study of automatic classification of complex diseases by improving the learning machine.
We use the Parkinson's Disease Classification Dataset taken from the UCI Machine Learning Repository [22,37,38] to test our proposal. Data were generated from 188 patients with Parkinson's Disease (107 men and 81 women), with ages ranging from 33 to 87 (65.1 ± 10.9), at the Department of Neurology of Istanbul University. The control group consists of 64 healthy individuals (23 men and 41 women), with ages varying between 41 and 82 (61.1 ± 8.9). In [22], the authors report that during the data collection process, the microphone was set to 44.1 kHz. After the physician's examination, three repetitions were recorded for each subject. The signal contains the sustained phonation of the vowel a.

Signal's Transformation
In [35,36], speech features have been successfully employed for Parkinson's Disease diagnosis, such as jitter, shimmer, fundamental frequency parameters, harmonicity parameters, recurrence period density entropy, detrended fluctuation analysis, and pitch period entropy. In [22], some features are referred to as "baseline features", and they are employed for comparing the performance of the different feature extraction methods.
The dataset was created from the raw voice signals. These signals were then transformed into spectrograms using the Mel-Frequency Cepstral Coefficients technique [22]. In the literature, Mel-Frequency Cepstral Coefficients are described as a method that emulates the effective filtering properties of the human ear [39]. This approach has been used as a robust feature extraction method in the context of speaker identification, automatic speech recognition, and Parkinson's Disease diagnosis [40,41]. After applying this method for feature extraction, 23 features were calculated, as detailed in [22]. In our study, we use a feature vector with 23 elements. This vector is employed in the input layer (variables) of the ELM to classify whether or not a patient presents Parkinson's Disease.
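The dataset's own features come from the full pipeline described in [22]. As a rough illustration only, a simplified cepstral-coefficient extraction can be sketched with NumPy and SciPy; note that this sketch omits the mel filter bank of a full MFCC pipeline, and the function name and parameters are illustrative, not those used in [22]:

```python
import numpy as np
from scipy.fft import dct

def mfcc_like_features(signal, n_fft=1024, hop=512, n_coeffs=23):
    """Simplified cepstral coefficients: frame the signal, take
    log-power spectra, then a DCT (mel filtering omitted)."""
    # Split the signal into overlapping frames
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop: i * hop + n_fft]
                       for i in range(n_frames)])
    # Hann window reduces spectral leakage
    frames = frames * np.hanning(n_fft)
    # Log-power spectrum of each frame
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    log_power = np.log(power + 1e-10)
    # The DCT decorrelates the log spectrum; keep the first n_coeffs terms
    coeffs = dct(log_power, type=2, axis=1, norm='ortho')[:, :n_coeffs]
    # Average over frames to obtain one 23-element vector per recording
    return coeffs.mean(axis=0)

vec = mfcc_like_features(np.random.randn(44100))  # one second at 44.1 kHz
print(vec.shape)  # (23,)
```

Averaging over frames yields one fixed-length vector per recording, matching the 23-element input expected by the ELM.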

Experimental Design
The optimized ELM is evaluated via a quantitative approach. Results will be compared using statistical metrics and a hypothesis contrast.
Firstly, we will separate the dataset into two groups: 80% for training and 20% for validation. Next, we will run the native ELM to know its performance. Its results will be considered the gold standard against which future BA-ELM results are compared. Afterwards, we will test BA-ELM to identify whether the training phase is under- or over-fitting. These concepts refer to the failures that arise when the generated models try to generalize the knowledge that machine learning intends to acquire. When a machine learning model is trained with a set of input data, it generates computational models able to generalize a concept. In our case, trained models generalize whether or not a patient presents Parkinson's Disease. Thus, when the machine learning method receives a new, unknown dataset, it should be able to synthesize it, understand it, and give us a reliable result.
Nevertheless, there is a key issue that should always be considered. If our training data are too few, our learning algorithm will not be able to generalize the knowledge and will incur underfitting.
On the other hand, if we train the machine learning method with an overly homogeneous dataset, the learning approach will not be able to generalize the knowledge either, and will incur overfitting. To identify whether our generalized models suffer from under- or over-fitting, we propose a simple metric: the sum of differences between the learning and validation curves of the training phase should tend to zero. If under- or over-fitting exists, we must check the minimum input sample, the validation dataset, and the feature selection, among other factors, in order to minimize these issues [42].
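The proposed check can be sketched in a few lines; the curves below are made up for illustration, not taken from the experiments:

```python
import numpy as np

def curve_divergence(train_curve, val_curve):
    """Sum of absolute differences between the training and validation
    curves; values near zero suggest neither under- nor over-fitting."""
    return float(np.sum(np.abs(np.asarray(train_curve)
                               - np.asarray(val_curve))))

# Hypothetical accuracy curves over 5 checkpoints
train_acc = [0.80, 0.85, 0.90, 0.93, 0.95]
val_acc   = [0.79, 0.84, 0.89, 0.92, 0.94]
print(round(curve_divergence(train_acc, val_acc), 2))  # 0.05
```

A large divergence would prompt the checks listed above (sample size, validation split, feature selection).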
Next, to show the robustness of the proposal and to establish a significant difference between the ELM and the optimized neural network, we will perform a statistical contrast via the Kolmogorov-Smirnov-Lilliefors test to confirm the distribution of the samples [43] and the Wilcoxon signed-rank test [44] to compare the results statistically. For both tests, a hypothesis evaluation is considered, analyzed assuming a p-value of 0.05 (uncertainty), i.e., values smaller than 0.05 determine that the corresponding hypothesis cannot be assumed. Both tests were conducted using GNU Octave. The first test allows us to analyze the distribution of the samples by determining whether the best values (maximum accuracy and minimum loss) achieved over the 31 executions follow a normal distribution. To proceed, the hypothesis h_0 states that the best value given by an algorithm (original or improved) draws a normal distribution. Finally, if the samples do not follow a normal distribution and they are independent, we can assume a non-parametric evaluation for appraising their heterogeneity. We will use the Wilcoxon signed-rank test. We will propose the hypothesis h_0^a, where the median accuracy achieved by ELM is greater than or equal to the median value reached by the optimized ELM, and in parallel we will define h_0^e, where the median loss generated by ELM is less than or equal to the median value given by the optimized ELM. This method will allow us to guarantee that BA-ELM effectively works better than the native ELM.
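The paper runs these tests in GNU Octave; a minimal SciPy sketch of the same workflow is shown below on invented accuracy samples (not the paper's measurements), with Shapiro-Wilk standing in for the Lilliefors normality test, which SciPy does not ship:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Hypothetical per-run accuracies over 31 executions (illustrative only)
elm_acc = rng.normal(90.7, 0.5, 31)
baelm_acc = rng.normal(96.4, 0.3, 31)

# Normality check: the paper uses Kolmogorov-Smirnov-Lilliefors;
# Shapiro-Wilk is a readily available SciPy alternative for this purpose
_, p_norm = stats.shapiro(baelm_acc)

# Wilcoxon signed-rank, one-sided: H0 states the ELM median accuracy is
# greater than or equal to the BA-ELM median accuracy
_, p_wil = stats.wilcoxon(elm_acc, baelm_acc, alternative='less')
print(p_wil < 0.05)  # rejecting H0 favors the optimized BA-ELM
```

With a gap of several accuracy points across 31 paired runs, the one-sided test rejects H0 at the 0.05 level.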

Extreme Learning Machine
In artificial intelligence, machine learning approaches have gained popularity in recent years, mainly for their contributions in different disciplines. The extreme learning machine is a particular case of machine learning methods [17]. ELM can be classified as a supervised learning algorithm able to solve linear and non-linear classification problems. Traditional artificial neural network architectures describe the ELM as a Single Layer Feedforward Neural Network (SLFN) [45], where input weights and hidden layer biases do not need to be computed iteratively. One of the conventional ELM implementations randomly generates the nodes of the hidden layer independently of the training data [46]. ELM has also been designed to work with linear algebra operators, and thus to achieve the optimal weights in the output layer.
A formal description of an ELM starts from a set of N independent and identically distributed training samples (x_j, t_j) ∈ R^n × R^m. Standard SLFNs with Ñ hidden nodes and activation function g(x) = Sig(x) = 1 / (1 + e^(-x)) are mathematically modeled by:

sum_{i=1}^{Ñ} β_i g(w_i · x_j + b_i) = o_j, j = 1, . . . , N, (1)

where w_i = [w_i1, w_i2, . . . , w_in]^T is the weight vector connecting the ith hidden node and the input nodes, β_i = [β_i1, β_i2, . . . , β_im]^T is the weight vector connecting the ith hidden node and the output nodes, and b_i is the bias of the ith hidden node. The inner product between w_i and x_j is denoted by w_i · x_j (see Figure 1). Huang et al. have rigorously proved in [17] that, for N arbitrary distinct samples and any (w_i, b_i) randomly chosen from R^n × R according to any continuous probability distribution, the hidden layer output matrix H of a standard SLFN with N hidden nodes is invertible, and ||Hβ − T|| = 0 with probability one, provided that the activation function g : R → R is infinitely differentiable in any interval.
Then, assuming Hβ = T, the N equations above can be written compactly as:

Hβ = T, (2)

where H = [g(w_i · x_j + b_i)] is the N × Ñ hidden layer output matrix, β = [β_1, . . . , β_Ñ]^T is the output weight matrix, and T = [t_1, . . . , t_N]^T is the target matrix. The minimum-norm least-squares solution is given by:

β̂ = H†T, (3)

where H† is the Moore-Penrose generalized inverse of the matrix H.
The ELM algorithm requires a training sample (x_i, t_i) ∈ R^n × R^m for i ∈ {1, 2, . . . , N}, an activation function g, and a number Ñ of hidden nodes in order to produce the output weight vector. These weights connect the hidden nodes to the output layer. The procedure can be summarized in three stages:
1. Randomly generate the input weights w_i and biases b_i.
2. Calculate the hidden layer output matrix H.
3. Compute the output weights β̂ = H†T.
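The three stages can be sketched in NumPy; this is a minimal illustration of the generic ELM procedure (class name, sizes, and the toy data are ours, not the paper's implementation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class ELM:
    """Minimal single-hidden-layer ELM: random input weights/biases
    (stage 1), hidden output matrix H (stage 2), output weights via the
    Moore-Penrose pseudoinverse (stage 3)."""
    def __init__(self, n_inputs, n_hidden, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((n_hidden, n_inputs))  # weights w_i
        self.b = rng.standard_normal(n_hidden)              # biases b_i

    def _hidden(self, X):
        # Hidden layer output matrix H
        return sigmoid(X @ self.W.T + self.b)

    def fit(self, X, T):
        # beta = H† T, a single non-iterative "epoch"
        self.beta = np.linalg.pinv(self._hidden(X)) @ T
        return self

    def predict(self, X):
        return self._hidden(X) @ self.beta

# Toy binary problem with 23 features and 12 hidden nodes, as in the paper
rng = np.random.default_rng(1)
X = rng.random((100, 23))
T = (X[:, 0] > 0.5).astype(float).reshape(-1, 1)
model = ELM(n_inputs=23, n_hidden=12).fit(X, T)
acc = float(np.mean((model.predict(X) > 0.5) == T.astype(bool)))
print(round(acc, 2))
```

Because β is obtained in closed form, training amounts to one pseudoinverse computation rather than an iterative descent.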
The original version of ELM works properly; however, we firmly believe that its performance can be even better [47]. For this reason, we focus on stage one and propose an online parameter control approach for ELM in order to find the best configuration of (w_i, b_i) and to generate the best classification model.

Proposed Approach
In optimization, parameter setting is known as a strategy that provides greater flexibility and robustness to solvers, but it requires an extremely careful initialization [48,49]. Indeed, the parameters of an algorithm influence the efficiency of the solving process. It is not obvious to define a priori which parameter setting should be used. The optimal values for the parameters depend mainly on the problem, and even on the instance to deal with, as well as on the search time that the user wants to spend in solving the problem. A universally optimal parameter value set for a given computational intelligence algorithm does not exist [49,50].
In evolutionary computation, automatic parameter tuning is mainly divided into two key approaches: offline parameter tuning and online parameter control. In the first one, the values of the different parameters are fixed before the run of the solver. In this situation, no interaction between parameters is studied, and this sequential optimization strategy does not guarantee finding the optimal setting, even if an exact optimization is performed. In the second one, the parameters are handled and updated during the run of the algorithm. This approach has been widely studied because the search for the best tuning of parameters can be considered an optimization problem in itself [50]. It is a complex and extensive task, and the responsibility falls on the capacity and experience of an expert user. Under the paradigm of online parameter control, we use a bio-inspired solver to finely determine the most successful set of weights and hidden layer biases for an ELM.
The main idea is to define the best configuration (w_i, b_i) of the ELM by using the optimization solver to generate solutions that properly train the ELM. This process operates as a loop: the feedback or fitness (accuracy and loss) given by the ELM is transferred to the optimizer, which tries to improve the generated results. Finally, the best configuration of the ELM yields the best classification model (see Figure 2).
As a bio-inspired solver, we propose the bat algorithm. Bat optimization, also known as the bat algorithm, is a swarm intelligence technique for global numerical optimization proposed by Yang in 2010 [19,51]. It is inspired by the echolocation behavior of bats, which allows them to avoid obstacles while flying and to locate food or shelter. In nature, only the species of micro-bats exhibit this characteristic, which limits the technique [52]. To overcome this problem, the concept of a virtual bat is proposed as an artificial bat indifferent to any species. The bat algorithm has been developed following three simple rules:

1. It is assumed that all bats use echolocation to determine distances, and all of them are able to distinguish food, prey, and background barriers.

2. A bat b_i searches for prey from a position x_i that is initially random. Bats change their frequency depending on the proximity of their target, which in turn affects their velocity. Thus, to change their position, all bats use a frequency f_i calculated by Equation (4) and a velocity v_i computed by Equation (5):

f_i = f_min + (f_max − f_min)β, β ∈ [0, 1], (4)

v_i^t = v_i^{t−1} + (x_i^{t−1} − x_best) f_i. (5)

The new position is defined by Equation (6):

x_i^t = x_i^{t−1} + v_i^t. (6)

The bat algorithm is considered a frequency-tuning algorithm that provides a balanced combination of exploration and exploitation: the greater the (positive) velocity, the more exploration; the smaller the (positive) velocity, the more exploitation.

3.
Finally, the variability of solutions is given by the loudness A_i and the rate of pulse emission r_i ∈ (0, 1), determined by Equations (7) and (8), respectively. Although the loudness can vary in many ways, it is assumed that it decreases from a large (positive) value A_0 to a minimum constant value A_min:

A_i^{t+1} = αA_i^t, 0 < α < 1, (7)

r_i^{t+1} = r_i^0 [1 − exp(−γt)], γ > 0. (8)

Algorithm 1 illustrates the pseudo-code for bat optimization. At the beginning, a population of m bats is initialized with positions x_i and velocities v_i. The position of a bat b_i is a vector composed of the set of weights and biases; this vector is used to tune and train the ELM. This training takes only one epoch and returns the achieved accuracy and generated loss. Afterwards, the frequency f_i at position x_i is set, followed by the pulse rates and loudness, both drawn from a uniform distribution between zero and one. Finally, the accuracy and the loss are both taken by the objective function to be evaluated as fitness. In this study, the objective function (Equation (9)) is a weighted combination of the two objectives, where k is a constant that weights accuracy against loss. For the computational experiments, k ∈ {0, 10, 20, . . . , 100}; therefore, each evaluation of the objective function is calculated eleven times. The best (greatest) value is returned as fitness and stored in the ith position of the fitness vector fit. Next, a while loop encloses a set of actions to be performed t times until the fixed number T of iterations is reached. In lines 7-11, the loudness of each bat (solution) is compared with a random value; if the loudness of the ith bat surpasses it, then the loudness decreases via Equation (7) and the rate of pulse emission increases via Equation (8). This condition handles the variability of the potential solutions throughout the exploration and exploitation processes.
Small loudness values and large pulse emission rates favor intensified solutions. Afterwards, bestfit and bestindex are obtained by evaluating the objective function. If bestfit is better than globalfit, then bestfit is stored in globalfit.
Next, the loop statement between lines 16 and 25 implements the movement of bats. A solution is first selected among the current best solutions, and a new solution is generated via a random walk (Equation (10)):

x_new = x_old + εĀ, (10)

where ε ∈ [−1, 1] is a random number and Ā represents the average loudness of all bats. Then, bats move according to Equations (4)-(6). Equation (4) controls the pace and range of the bats' movements, where β provides variability to the frequencies and is randomly generated from a uniform distribution within the interval (0, 1). Equation (5) defines the velocity held by the ith bat at time t, where x_best represents the current global best position encountered among the m bats. Finally, Equation (6) determines the new position held by the ith bat. At the end, the bats are ranked in order to find x_best, i.e., the best configuration (w_i, b_i) for the ELM, which thus generates the best classification model. The effectiveness of the bat algorithm has been proved on a large set of combinatorial and continuous optimization problems. For example, in [53] an adaptive version of the metaheuristic was used as an optimization strategy for the observation matrix. In this line, a compact bat algorithm has been proposed for the class of optimization problems involving devices with limited hardware resources [54]. Finally, if we focus on machine learning techniques enhanced by the bat algorithm, we can find [55,56]. In these works, an improved bat algorithm is proposed to optimize artificial neural networks.

(Algorithm 1: pseudo-code of the bat algorithm. It initializes the bat population with velocities v_i, positions x_i, loudness A_i, and pulse emission rates r_i, assigns globalfit the worst-case value, iterates the movement and update rules described above, and returns the best solution x_best.)
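A minimal sketch of the bat-algorithm loop described above is shown below; for clarity it minimizes a stand-in sphere function rather than training an ELM (in BA-ELM the position vector would hold the ELM weights and biases, and the objective would combine accuracy and loss):

```python
import numpy as np

def bat_algorithm(objective, dim, m=20, T=100, fmin=0.0, fmax=1.0,
                  alpha=0.9, gamma=0.9, seed=0):
    """Minimal bat algorithm (after Yang, 2010) minimizing `objective`."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1, 1, (m, dim))      # positions x_i
    v = np.zeros((m, dim))                # velocities v_i
    A = np.full(m, 1.0)                   # loudness A_i
    r0 = rng.uniform(0, 1, m)             # initial pulse rates r_i
    r = r0.copy()
    fit = np.array([objective(xi) for xi in x])
    best = x[np.argmin(fit)].copy()
    for t in range(1, T + 1):
        for i in range(m):
            beta = rng.uniform(0, 1)
            f = fmin + (fmax - fmin) * beta          # Eq. (4)
            v[i] = v[i] + (x[i] - best) * f          # Eq. (5)
            cand = x[i] + v[i]                       # Eq. (6)
            if rng.uniform(0, 1) > r[i]:
                # Local random walk around the best solution, Eq. (10)
                cand = best + rng.uniform(-1, 1, dim) * A.mean()
            fc = objective(cand)
            if fc < fit[i] and rng.uniform(0, 1) < A[i]:
                x[i], fit[i] = cand, fc
                A[i] = alpha * A[i]                        # Eq. (7)
                r[i] = r0[i] * (1 - np.exp(-gamma * t))    # Eq. (8)
            if fc < objective(best):
                best = cand.copy()
    return best, objective(best)

# Sanity check on the sphere function, whose minimum is at the origin
best, val = bat_algorithm(lambda z: float(np.sum(z ** 2)), dim=5)
print(val < 1.0)
```

In BA-ELM, each call to `objective` would correspond to one one-epoch ELM training run returning the weighted accuracy/loss fitness.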

Computational Experiments
In this section, the proposed hybrid method is compared with the native version of the Extreme Learning Machine. We use the Python programming language to implement the artificial neural network. Experiments were launched on a 2.5 GHz Intel Core i5 7300HQ machine with 8 GB RAM running Windows 10. The ELM was mainly managed by Anaconda and its packages: Tensorflow, Numpy, and Scipy. We employ 23 neurons in the input layer, related directly to the 23 features, i.e., N = 23, with each variable associated to one and only one neuron. In the hidden layer, we use 12 nodes, i.e., Ñ = 12. We define the number of neurons a priori, i.e., N and Ñ both remain unchanged during the computational experiments. All the inputs (attributes) have been normalized into the range [0, 1]; similarly, as we work on a binary classification problem, the output (target) has been normalized into [0, 1]. ELM only requires one epoch for training. At this moment, ELM computes the Moore-Penrose generalized inverse H† of the hidden layer output matrix H. With respect to the bat algorithm, it was coded in native Python version 3, and we employ the initial configuration suggested in [57]: m = 20, f_min = 0.75, f_max = 1.25, t_max = 100, α = γ = 0.9, and ε = 1.
In the training phase of the ELM, a set of models is generated and improved during the run of the bat algorithm. We perform multiple simulations with different discrete time intervals in order to demonstrate the robust behavior of the proposed approach. The main idea is to know how the neural network works in its training phase. In this line, two key issues appear: underfitting and overfitting. Both can be considered problems because they prevent the machine learning method from properly generalizing the knowledge; in that case, it will not give a good classification. To illustrate the performance of BA-ELM, we evaluate scenarios using:

sum_{t=1}^{T} |Acc_x^t − Acc_y^t| → 0, (11)

sum_{t=1}^{T} |Loss_x^t − Loss_y^t| → 0, (12)

where T represents the maximum iteration, Acc_x^t and Acc_y^t describe the given accuracy, and Loss_x^t and Loss_y^t depict the generated loss, in the training phase and the validation phase, respectively. Moreover, we report four charts of accuracy and loss during the training phase.
Figures 3-6 show how the accuracy and loss both vary during the training phase. For instance, the curves generated by the accuracy data allow us to assume that there is neither overfitting nor underfitting, because the difference computed by Equation (11) tends to 0 (see Figures 3 and 4). On the other hand, if we analyze the curves drawn from the loss data, we note that for the first few iterations BA-ELM produces over- and under-fitting, as calculated by Equation (12) (see Figure 5). However, as the iterations pass, overfitting and underfitting begin to decrease (see Figure 6). This performance can be explained by the convergence of the parameter setting given by the bat algorithm, because a proper tuning provides models that better generalize knowledge [58]. In the classification phase, we run the ELM and the optimized ELM 31 times each. Table 1 shows the generated results. We can observe that the non-optimized ELM shows an excellent performance in the classification phase, reaching a maximum accuracy of 91.61%, generating an average accuracy close to 90.7%, and a median value equal to 90.87%. These results show that the ELM is able to work properly when patients present Parkinson's Disease. Nevertheless, when we analyze the results generated by the optimized neural network, we can note an outstanding performance. For instance, the computed mean value and the calculated median value are both similar and close to the maximum value (96.74%). This distribution can be seen in Figure 7, where we can observe that the achieved values are homogeneously dispersed, so much so that the difference between the best value and the worst value is evidently small (0.93%).
Regarding the results for the generated loss, again the optimized technique behaves better than the static version, as shown in Figure 8. If we study the loss produced by the ELM, we can note that these values are located between 5.36% and 7.29%, while the loss given by BA-ELM is distributed between 3.27% and 4.25%. If we analyze the 2-IQR for the generated loss, we can observe that BA-ELM produces a lower loss than the simple ELM when it classifies Parkinson's Disease. Similarly, when the mean value is evaluated, the difference between BA-ELM and ELM is substantial: ELM exceeds 6.35%, while BA-ELM does not surpass 3.73%. To deepen the analysis of the generated results, we employ the Kolmogorov-Smirnov-Lilliefors [43] and Wilcoxon signed-rank tests, both described in the methodology section. For the first test, we found p-values equal to 0.001278 (the accuracy sample) and 0.000728 (the loss sample); both were smaller than 0.05, and therefore do not allow us to assume that the samples follow a normal distribution. The non-parametric test computed p-values equal to 6.665 × 10^−17 and 1.644 × 10^−18 for h_0^a and h_0^e, respectively. Again, both were smaller than 0.05. These results indicate that h_0^a and h_0^e cannot be assumed. By contradiction, we can assume that BA-ELM works better than ELM, in accuracy as well as in loss.
In summary, if we consider the two measures (accuracy and loss), the homogeneity of the results, and the hypothesis contrast, we can conclude that the optimized artificial method presents a robust yield and overcomes its static version. We therefore establish that the bat algorithm is a suitable bio-inspired solver for optimizing the ELM parameters, and it shows outstanding results in classifying voice signals from patients with Parkinson's Disease.

Conclusions
Parkinson's Disease is a degenerative disorder of the central nervous system and one of the most common movement disorders. Its diagnosis is frequently difficult, especially at an early stage. An early indication that a patient suffers from Parkinson's Disease is vocal impairment. Recent works have proposed mechanisms to support the diagnosis task. In this paper, we propose an ELM optimized through a bio-inspired algorithm to properly classify patients with Parkinson's Disease. The approximate method defines an optimal vector with input weights and bias values at the same time. The idea is to optimize the training phase of the ELM to generate the best classification model. To solve this problem, we first propose the bat algorithm to compute the parameter values in order to find the best attainable configuration for the ELM, and then we test this proposal on voice signals from patients with Parkinson's Disease. Results suggest that the bat solver is an efficient optimizer for improving the yield of the ELM. For this reason, we can state that if the ELM is fine-tuned, efficient models can be developed for solving classification problems, and thus BA-ELM becomes a real alternative to help potential expert users.
As future work, we plan to update this architecture in order to classify ultrasound medical images using different ANNs. Moreover, we will study the learnheuristic approach to take advantage of the machine learning world in optimization algorithms, mainly during the resolution processes, and thus to improve local and global searches. Finally, we will consider using a hyper-metaheuristic approach to first find the best feature vector using a bio-inspired optimization algorithm and then train the already-optimized ELM.