An Optimisation-Driven Prediction Method for Automated Diagnosis and Prognosis

This article presents a novel hybrid classification paradigm for the prediction of medical diagnoses and prognoses. The core mechanism of the proposed method relies on a centroid classification algorithm whose logic is exploited to formulate the classification task as a real-valued optimisation problem. A novel metaheuristic combining the algorithmic structure of Swarm Intelligence optimisers with the probabilistic search models of Estimation of Distribution Algorithms is designed to optimise such a problem, thus leading to high-accuracy predictions. This method is tested over 11 medical datasets and compared against 14 selected classification algorithms. Results show that the proposed approach is competitive and superior to the state-of-the-art on several occasions.


Introduction
Artificial Intelligence (AI) plays a major role in modern medicine and is key in automating tasks such as image registration [1][2][3], diagnosis [4,5] and prognosis [6,7], thus allowing doctors and practitioners to make faster and more accurate decisions [8]. Indeed, recent advances in AI have made it possible for expert systems [9] and machine learning techniques [1,10] to outperform human decision-making over some repetitive procedures, and to be more accurate at spotting hidden patterns in medical images or signals. This does not mean that human intervention can be completely removed, in particular in the medical diagnosis domain, but the latter can significantly benefit from the informative feedback returned by AI-based diagnosis and prognosis systems. For instance, convolutional neural networks can be fed with EEG signals or magnetic resonance imaging (MRI) images to predict diagnoses of neurodegenerative diseases, for example Alzheimer's [11,12], and accurately classify brain lesions [13][14][15]. In this light, applied AI displays a valuable societal impact and carries a promising potential for improving decision-making in medicine. However, the use of computer-aided diagnosis is sometimes criticised: despite the medical community being aware of the importance of modernising its methods, some researchers are hesitant due to the possibility of false-positive or false-negative cases, and others point out moral implications [16]. This stimulates and motivates computer scientists, who continually search for new and more accurate techniques. Room for improvement is easily identified, for example, in the context of historical data analysis. Large databases are nowadays available and can be used without the need for ethical approval. Hence, exploiting this knowledge optimally for training AI techniques would help generate more reliable automatic diagnosis and prognosis predictions.
To achieve this goal, supervised machine learning is currently preferred in medicine [10], as data are labelled based on the kind of tissue, prognosis, patient age and several other useful features, to perform prediction. Successful studies using supervised classification strategies in, for example, cardiovascular medicine [17] and genetics [18] are present in the literature. However, better performances can be obtained by optimising the classification process. For this reason, this piece of research proposes a novel classification method, referred to as "Optimisation-Driven Prediction" (ODP), to be used for automated medical diagnosis and prognosis. This classifier is built around a newly designed variant of a metaheuristic for real-valued optimisation, referred to as Particle Swarm Estimation of Distribution Algorithm (PSEDA), originally proposed in Reference [19].
The remainder of this article is structured as follows:
• Section 2 gives a brief overview of the metaheuristic optimisation approaches employed in this study and highlights differences between previous classification schemes based on optimisation algorithms and the proposed method;
• Section 3 clarifies objectives and methods for this piece of research;
• Section 4 describes the proposed ODP method, including an explanation of the working principle behind the novel PSEDA variant, the definition of the classification process in the fashion of an optimisation problem and the formulation of five different objective functions, aka fitness functions, to strengthen the validity of the proposed approach;
• Section 5 gives details of the experimental phase to make the presented results reproducible;
• Section 6 presents and discusses the numerical results;
• Section 7 concludes this work by summarising the key research outputs and drawing some considerations for possible future developments.
Similar to common population-based Evolutionary Algorithms (EAs) [24], it employs a set of candidate solutions to first explore the search space and then converge to the most promising basin of attraction. On the other hand, the PSO lacks selection mechanisms, which are usually replaced with the 1-to-1 spawning mechanisms in SI algorithms, and features a unique perturbation strategy mimicking starling flocks' behaviour. Hence, solutions move and interact, rather than being evolved as in EAs, according to the dynamics outlined in Reference [25]. Several PSO variants have been designed to deal with a wide range of problems [26], including large-scale ones [27,28], as well as challenging engineering applications [29]. Hybrid versions were also designed, thus generating effective PSO-based multi-strategy approaches [30] and Estimation of Distribution Algorithms (EDAs) [31,32]. The EDA framework has proven to be successful over different fields such as Robotics [33] and combinatorial domains [21]. These algorithms have a probabilistic representation of the population and thus are not required to store individuals, since they are sampled on demand from a probability distribution. The latter is continually adapted to the problem at hand so that sampled candidate solutions have a higher probability of falling in a neighbourhood of the optimal solution. Some examples of EDAs, referred to as "compact" algorithms, can be found in Reference [32]. It is also worth mentioning the so-called BOA-PSO algorithm, proposed in Reference [34], where a multivariate distribution is used instead of a set of univariate distributions (i.e., one per design variable), the EDPSO algorithm designed in Reference [35], which uses a mixture of univariate Gaussian distributions, as well as the more recent variant in Reference [36], where an EDA is used to model the "historical memory" of successful solutions in a PSO framework. Similarly, the Particle Swarm Estimation of Distribution Algorithm (PSEDA) introduced in Reference [19] is based on adjusting a probability distribution to simulate the behaviour of a classic PSO. PSEDA has proven to be efficient and to outperform classic PSO algorithms over several optimisation problems. In this light, the PSEDA framework was further improved in this piece of research and then used in the proposed ODP approach.
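The idea of replacing a stored population with a probabilistic model that is sampled on demand, as EDAs do, can be illustrated with a minimal sketch. This is a generic univariate Gaussian EDA and not PSEDA or any of the cited algorithms; the function names, the toy `sphere` objective and all parameter values are assumptions chosen only for illustration.

```python
import random
import statistics


def sphere(x):
    """Toy objective (an assumption): sum of squares, minimised at the origin."""
    return sum(v * v for v in x)


def gaussian_eda(objective, dim, iterations=60, samples=40, elite=10, seed=0):
    """Generic univariate Gaussian EDA.

    The 'population' is only a mean and a standard deviation per design
    variable; candidate solutions are sampled on demand and then discarded.
    """
    rng = random.Random(seed)
    mean = [rng.uniform(-5.0, 5.0) for _ in range(dim)]
    std = [5.0] * dim
    best = None
    for _ in range(iterations):
        # Sample candidate solutions from the current probabilistic model.
        candidates = [[rng.gauss(mean[k], std[k]) for k in range(dim)]
                      for _ in range(samples)]
        candidates.sort(key=objective)
        if best is None or objective(candidates[0]) < objective(best):
            best = candidates[0]
        # Adapt the model using only the most promising samples.
        top = candidates[:elite]
        for k in range(dim):
            column = [c[k] for c in top]
            mean[k] = statistics.fmean(column)
            std[k] = max(statistics.pstdev(column), 1e-12)
    return best


best = gaussian_eda(sphere, dim=3)
```

As the model is re-estimated from the best samples, new candidates become increasingly likely to fall near the optimum, which is the behaviour described above for EDAs in general.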
It must be said that the literature already presents classification methods using metaheuristics for numerical optimisation, but these differ from ODP. As an example, a genetic algorithm is used in Reference [37] for finding optimal coefficients of a wavelet kernel extreme learning classifier, while a memetic algorithm is used in Reference [38] to optimise a model for gene selection in microarray data. Moreover, past studies made use of a standard PSO [39] as well as the Artificial Bee Colony (ABC) paradigm [40] for data classification and clustering, respectively. Conversely, in the ODP method optimisation is used at the training level. The PSEDA variant described in Section 4.1, and schematically shown in Algorithm 1, is embedded in ODP and used to automatically tailor a classifier to the dataset.

Objectives and Methods
This research intends to design a novel prediction method for medical diagnosis and prognosis capable of classifying multivariate healthcare datasets with high accuracy and, therefore, a low rate of false negatives and positives.
In light of what was described in Section 2, the PSEDA algorithm was chosen for its outstanding performances [19] and for its algorithmic features, which make it behave simultaneously like SI algorithms, such as the PSO successfully used for the classification problem in Reference [39], and like EDAs (which are a kind of EA). Technical details on the working mechanisms of PSEDA are given in Section 4.1. To further improve upon the classification process, the PSEDA algorithm was not simply employed but was improved by introducing a self-adaptive mechanism to adjust the exploratory step-size on-the-fly. By using this novel variant, referred to as self-adaptive PSEDA (sa-PSEDA), the proposed method is designed to return optimal, or near-optimal, predictions. Indications on how to use sa-PSEDA for classification purposes are in Section 4.2.
In a nutshell, the proposed methodology consists of defining the classification process in the form of an optimisation problem, which requires the formulation of an objective function, known as a fitness function within the metaheuristic optimisation community. To perform a thorough comparative analysis and give more validity to our results, five different fitness functions are considered in this study. Three of them are taken from the literature [39], while two more are originally proposed for this specific piece of research. A rigorous analytical formulation of the employed fitness functions is given in Section 4.2.
Finally, the methodology for evaluating the performances of sa-PSEDA consists of using 11 different multivariate medical datasets from the "UCI machine learning repository" [41] and generating predictions with each fitness function. To perform a fair comparison, classification results are also produced with the previously mentioned PSO and ABC methods from References [39,40]. Their implementation is made available in an online repository reachable from https://bit.ly/2VK9CS3. Moreover, a comparison against popular classifiers, that is, a naive nearest centroid classifier and 11 more state-of-the-art methods in their Weka [42] implementation, is also carried out. The statistical tests in Reference [42] are used to validate the comparative analysis and draw rigorous conclusions.

The Optimisation-Driven Prediction Method
The ODP method is a supervised classification approach requiring training data to generate a prediction function. To perform the training process optimally, a metaheuristic for continuous optimisation is used. This means that the classification task must be formulated as an optimisation problem. This process requires the definition of at least one fitness function, a formal description of the search space and an appropriate encoding strategy to represent class representatives as candidate solutions. These details are clarified in the following sections.

The sa-PSEDA Algorithm
The original PSEDA in Reference [19] simulates the internal dynamics of a PSO algorithm with a probabilistic model. Unlike other EDAs, this is achieved by associating each "particle" with its own probabilistic model. The latter are updated and moved in the search space according to the PSO logic, which requires attractors such as the local (i.e., a neighbourhood of a particle is considered) and the global (i.e., the entire swarm is considered) best solutions, to then sample a new particle in a different location of the search space. Hence, these distributions are used instead of the so-called "velocity" vector, which is the PSO perturbation vector added to the particles to move them towards the most promising regions. The advantages of having a stochastic perturbation are evident over multimodal problems, and it reduces the rise of deleterious structural biases [25,43,44].
Without loss of generality, let us consider a generic minimisation problem in a $d$-dimensional domain bounded, along each $k$th dimension, by $[l_k, u_k]$. During an iteration cycle of PSEDA, the joint probability distribution $P_{i,t}$ of each $x_{i,t}$ vector is updated to then produce $x_{i,t+1} \sim P_{i,t}$. This is done on the assumption that the $d$ components of each position vector can be modelled with independent and identically distributed (iid) random variables. Under this condition, $P_{i,t}$ can be factorised in $d$ unidimensional probability density functions, which facilitates the implementation of the algorithm and leads to a modest algorithmic time overhead, since each $k$th component can be efficiently sampled from $x_{i,t+1,k} \sim P_{i,t,k}$. Each iid variable is modelled via a weighted finite mixture [45] of probability distributions, which includes truncated (within $[l_k, u_k]$) normal distributions such as the one with mean value $x_{i,t,k}$ and standard deviation $\sigma$. It must be noted that the use of truncated distributions prevents the algorithm from generating infeasible solutions and thus removes the need for correction strategies [44]. The distributions are weighted with $w_x$, $w_p$, $w_g$, $w_u$ and $w_m$ as described in Reference [19]. Then, the mechanism in Reference [45] is employed to sample a candidate solution. This requires the roulette wheel tournament scheme (see e.g., Reference [24]) as also described in Reference [19].
It is worth observing that the truncated normal distributions are centred on the particle, its personal best particle and the global best particle, in order to mimic the PSO working logic. Since points surrounding the mean value of a normal distribution have a higher probability of being selected, by narrowing the standard deviation PSEDA becomes almost identical to a PSO. On the other hand, higher standard deviation values result in different algorithmic behaviours. In light of this, convergence speed also depends on the standard deviation values of the normal distributions. To make sure that these values are self-adapted to the problem at hand during the optimisation process, the proposed sa-PSEDA variant is equipped with a tuning mechanism that computes $\sigma$ for every individual $i$ and every dimension $k$ at each iteration, before the new position is sampled.
Finally, it is also worth noting that the distribution P was designed to reproduce the effect of the velocity vector, while the uniform distribution $U_k$ was introduced in order to regulate the exploration/exploitation balance of the evolutionary search [46].
For the sake of clarity, the sa-PSEDA algorithmic framework is shown in the pseudo-code in Algorithm 1.

Algorithm 1. sa-PSEDA (closing steps of the main loop)
    for all individuals i do
        for all dimensions k do
            Compute σ
            Save the previous position x_{i,t−1}
            Sample x_{i,t+1,k} from P_{i,t,k} with the method in [45]
        end for
    end for
end while
end procedure
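The per-dimension sampling step can be sketched as follows. This is an illustrative reading of the description in this section, not the reference implementation of [19]: the function names, the rejection-based truncated normal, the use of a single $\sigma$ for all normal components, and the weight values are assumptions, and the component weighted by $w_m$ is omitted because its form is not specified here.

```python
import random


def truncated_normal(rng, mean, std, low, high):
    """Sample N(mean, std) restricted to [low, high] by rejection.

    Assumes mean lies inside [low, high], so acceptance is reasonably likely.
    """
    while True:
        value = rng.gauss(mean, std)
        if low <= value <= high:
            return value


def sample_component(rng, x, p, g, sigma, low, high, weights):
    """Draw one coordinate from a weighted mixture of truncated normals
    centred on x (current position), p (personal best) and g (global best),
    plus a uniform distribution over [low, high]."""
    w_x, w_p, w_g, w_u = weights
    # Roulette wheel: pick a component with probability proportional to its weight.
    r = rng.random() * (w_x + w_p + w_g + w_u)
    if r < w_x:
        return truncated_normal(rng, x, sigma, low, high)
    if r < w_x + w_p:
        return truncated_normal(rng, p, sigma, low, high)
    if r < w_x + w_p + w_g:
        return truncated_normal(rng, g, sigma, low, high)
    return rng.uniform(low, high)


def sample_particle(rng, x, p, g, sigma, bounds, weights):
    """Sample every dimension independently, as the factorised model allows."""
    return [sample_component(rng, x[k], p[k], g[k], sigma[k],
                             bounds[k][0], bounds[k][1], weights)
            for k in range(len(x))]
```

Because every component is either truncated to, or uniform over, $[l_k, u_k]$, the sampled particle is always feasible and no correction strategy is needed, in line with the remark made earlier in this section.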

The Classification Strategy
Classification is a classic application of machine learning [47] consisting of generating a prediction from labelled observations. In this work, such observations are numerical values. This format is quite common in health care data, as the vast majority of medical tests return values that are then labelled by finding the range they belong to.
Formally, this format can be expressed by defining instances as real-valued vectors in $A \subseteq \mathbb{R}^m$ and the class space for the classification problem as an ordered set $B = \{1, \dots, k\}$ of integer values. The classification process takes place via two main sequential phases: (1) the training of a prediction function $h$; (2) the use of the fine-tuned prediction function over new datasets.
Typical methods for implementing the prediction function $h$ are decision trees [48], aggregation or association rules [49,50], Bayesian or neural networks [51] or hybrid approaches such as, for example, fuzzy neural systems [52]. However, in this work, we preferred the centroid algorithm from Reference [53]. This was already used in SI classifiers in the past [39,40] and requires the computation of the centroid for each class, which is not a computationally expensive procedure (i.e., it does not introduce overheads in the ODP method) and is suitable for real-valued datasets.
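A formal definition of the adopted prediction function is

$$h_{x^{(1)},\dots,x^{(k)}}(a) = \arg\min_{i \in B} \lVert a - x^{(i)} \rVert,$$

where $a \in A$ is the instance to classify, $x^{(i)}$ is the representative of the $i$th class and $\lVert \cdot \rVert$ measures the distance between the instance and a class representative. Since $A \subseteq \mathbb{R}^m$ and $|B| = k$, the use of $h_{x^{(1)},\dots,x^{(k)}}$ comes with an acceptable computational complexity (i.e., $O(k \cdot m)$).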
The proposed sa-PSEDA can now be used to "train" $h$ by finding its optimal parameters, that is, the class representatives $x^{(1)}, \dots, x^{(k)}$. This means that the prediction function is actually "evolved" and tailored to the specific classification task, which is now stated as an optimisation problem. The sa-PSEDA is then a key component of the classifier, operating in a search space $\Theta \subseteq \mathbb{R}^{k \cdot m}$ where candidate solutions $x \in \Theta$ are represented in the form $x = (x^{(1)}, \dots, x^{(k)})$, with $x^{(i)} \in \mathbb{R}^m$ and $i \in \{1, \dots, k\}$, since each one of the $k$ classes has $m$ numerical features.
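The encoding of a candidate solution and the centroid-based prediction can be sketched as follows. The function names `decode` and `predict` are hypothetical, and the Euclidean distance is an assumption made for the example.

```python
import math


def decode(solution, k, m):
    """Split a flat candidate solution of length k*m into k class
    representatives with m features each."""
    return [solution[i * m:(i + 1) * m] for i in range(k)]


def predict(representatives, instance):
    """Return the label (1..k) of the closest class representative.

    One distance per class, each over m features: O(k * m) overall.
    """
    distances = [math.dist(rep, instance) for rep in representatives]
    return distances.index(min(distances)) + 1


# Two classes (k = 2) described by two features (m = 2).
representatives = decode([0.0, 0.0, 10.0, 10.0], k=2, m=2)
label_a = predict(representatives, [1.0, 2.0])
label_b = predict(representatives, [9.0, 8.0])
```

With this encoding, the optimiser only ever manipulates flat real-valued vectors of length $k \cdot m$, while the classifier sees them as $k$ class representatives.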
In order to evaluate the quality of a generic solution $x$ on the training set $T$, five different fitness functions, $\psi_1, \dots, \psi_5$, have been taken into account. By minimising the proposed fitness functions, the same optimisation algorithm can be used to perform five different training processes. To distinguish between them, the fitness function is added to the algorithm's name as follows: sa-PSEDA-$\psi_1$, sa-PSEDA-$\psi_2$, sa-PSEDA-$\psi_3$, sa-PSEDA-$\psi_4$, and sa-PSEDA-$\psi_5$.
It must be remarked that the first three fitness functions have been proposed in previous studies [39,40], while the remaining two are originally designed to perform a wider comparative analysis. The $\psi_1$ function represents the error rate on the training set and has a complexity of $O(|T| \cdot k \cdot m)$, while $\psi_2$ computes the average distance of training instances from the class representative of their known class. The latter comes with a lower complexity of $O(|T| \cdot m)$ but not necessarily with an inferior quality of the results. $\psi_3$, $\psi_4$ and $\psi_5$ are convex combinations of the first two fitness functions and only differ by the adopted weights (i.e., $\psi_3$ is balanced with equal weights, while $\psi_4$ and $\psi_5$ are unbalanced toward $\psi_1$ and $\psi_2$ respectively). As they are obtained by composing $\psi_1$ and $\psi_2$, the last three fitness functions have the same complexity as $\psi_1$.
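The verbal descriptions above can be turned into a minimal sketch. The exact formulas are not reproduced here, so the normalisation, the Euclidean distance, the function names and the weight `alpha` are assumptions; in particular, only the equal weighting of $\psi_3$ is stated in the text, and the values of `alpha` used for $\psi_4$ and $\psi_5$ would have to be taken from the original formulation.

```python
import math


def predict(reps, instance):
    """Label (1..k) of the closest class representative."""
    distances = [math.dist(rep, instance) for rep in reps]
    return distances.index(min(distances)) + 1


def psi1(reps, training_set):
    """Error rate on the training set: O(|T| * k * m)."""
    errors = sum(1 for instance, label in training_set
                 if predict(reps, instance) != label)
    return errors / len(training_set)


def psi2(reps, training_set):
    """Average distance from the representative of the known class: O(|T| * m)."""
    total = sum(math.dist(reps[label - 1], instance)
                for instance, label in training_set)
    return total / len(training_set)


def psi_weighted(reps, training_set, alpha):
    """Convex combination of psi1 and psi2; alpha = 0.5 gives the balanced case."""
    return alpha * psi1(reps, training_set) + (1.0 - alpha) * psi2(reps, training_set)


reps = [[0.0, 0.0], [10.0, 10.0]]
training_set = [([1.0, 0.0], 1), ([9.0, 10.0], 2), ([6.0, 6.0], 1)]
error_rate = psi1(reps, training_set)
mean_distance = psi2(reps, training_set)
balanced = psi_weighted(reps, training_set, alpha=0.5)
```

In this toy example the third instance is closer to the representative of class 2 than to that of its known class 1, so it counts as an error for `psi1` while contributing a large distance to `psi2`.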

Experimental Setup
The ODP method was run over 11 well-known medical datasets for diagnosis and prognosis, taken from the UCI machine learning repository [41], and evaluated with 10 repetitions of the stratified 10-fold cross-validation technique. During each repetition, the dataset is split into 10 batches, each one following the same class distribution of the entire dataset. Then, instances from 9 batches are used to train the supervised classifier while the 10th batch is used for testing the learned model. In total, 100 test rounds are performed for every dataset, thus the performances of every algorithm are averaged over the 100 tests and the corresponding results are provided in Section 6.
A detailed description of the 11 employed datasets is given in Section 5.1. Moreover, it must be stressed that the presented results are obtained by using sa-PSEDA with the parameter setting reported in Section 5.2, where comparison algorithms are also listed. These values for the parameters are the outcomes of a fine-tuning empirical process carried out to train the classifier optimally.
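The evaluation protocol described above can be sketched as follows. This is a minimal illustration and not the code used in the experiments; the function names and the round-robin dealing of instances to folds are assumptions.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=10, seed=0):
    """Split instance indices into folds that follow the class distribution."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(n_folds)]
    next_fold = 0
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the instances of each class to the folds in turn.
        for index in indices:
            folds[next_fold % n_folds].append(index)
            next_fold += 1
    return folds


def repeated_cv_rounds(labels, n_folds=10, repetitions=10):
    """Yield (train_indices, test_indices) for every test round."""
    for repetition in range(repetitions):
        folds = stratified_folds(labels, n_folds, seed=repetition)
        for test_fold in folds:
            held_out = set(test_fold)
            train = [i for i in range(len(labels)) if i not in held_out]
            yield train, test_fold


labels = [0] * 60 + [1] * 40
rounds = list(repeated_cv_rounds(labels))
```

With 10 repetitions of 10 folds, `rounds` contains the 100 train/test splits over which the accuracy of a classifier is averaged.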

The Datasets
The 11 classification datasets are described in Table 1, which schematically shows the number of instances, class labels, attributes, the attributes' structure (i.e., the number of real-valued, integer, and logical attributes) and a warning to flag missing information. Indeed, it is quite common to find datasets with missing information in medicine. However, this problem can be easily dealt with, as data are numerical and missing fields can be replaced with the average value of the present values [40]. From Table 1 it can be noticed that the chosen problems are diverse and adequate to test the versatility of the proposed method. Moreover, each dataset has a multivariate nature and presents strong dependencies amongst the attributes. For the sake of completeness, a concise description of each dataset is given below.
Breast-Real and Breast-Integer are based on the "Breast Cancer Wisconsin (Diagnostic)" and "Breast Cancer Wisconsin (Original)" datasets from UCI, respectively. In this article, instances are classified into two diagnosis classes, that is, benign and malignant breast cancer. Both datasets come with an ID field that was removed to perform our experiments. It must be noted that, while the first dataset is real-valued, the second one has integer attributes in the range [1,10]. However, this is not an issue, as the ODP method can work on real-valued sets and their subsets alike.
Dermatology contains diagnoses of erythematous-squamous diseases and is the largest dataset considered in this study with its 33 integer attributes, in the range [0, 3], and 1 logical attribute (i.e., Boolean) per instance. The six possible diagnoses are: Psoriasis, Seborrheic Dermatitis, Lichen Planus, Pityriasis Rosea, Chronic Dermatitis, and Pityriasis Rubra Pilaris.
Diabetes is based on the "Pima Indians Diabetes" dataset for diabetes diagnoses of female patients. It has six integer attributes, each with a large cardinality, and two real-valued attributes.
Haberman is a database of long-term survivors who have undergone breast surgery and provides information on survival, for example, whether or not the patient died within 5 years of the surgery. It is the smallest dataset considered in this study in terms of the number of attributes, which are three integer values with a large cardinality.
Heart-2C and Heart-5C contain data from heart disease diagnoses from the "Cleveland Processed Heart Disease" dataset in UCI. The difference between the two datasets is in the number of classes. Heart-2C can be used as a binary classification problem (i.e., the patient is diseased or not), while Heart-5C has four possible classes defined by specific vascular parameters. These two datasets have one real-valued attribute, alongside logical and integer attributes.
Liver is a database of liver disorders, mainly due to excessive alcohol consumption, in male patients. Attributes have both real-valued and integer formats and refer to blood test results. One must be aware that this dataset contains duplicated instances that need to be removed before being used. This issue is pointed out on the UCI website.
Parkinsons consists of a range of biomedical voice measurements from healthy individuals and patients affected by Parkinson's disease. This dataset was downloaded from the UCI repository and the field containing the patient's name was removed from each instance as not relevant to this study.
Thyroid is based on the "new-thyroid" dataset from UCI and contains diagnoses of thyroid gland diseases. It has three classes: normal, hyperthyroidism and hypothyroidism.
Vertebral, aka "Vertebral Column" dataset, contains values for six biomechanical features used to classify orthopaedic patients into the three "normal," "disk hernia" and "spondylolisthesis" classes. Also in this case, the dataset is multivariate.
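The replacement of missing fields with the average of the present values, mentioned at the beginning of this section, can be sketched as follows. The function name and the use of `None` as the marker for a missing field are assumptions made for the example.

```python
def impute_with_column_means(rows, missing=None):
    """Replace every missing field with the mean of the values present
    in the same attribute (column).

    Assumes each attribute has at least one present value.
    """
    n_columns = len(rows[0])
    means = []
    for c in range(n_columns):
        present = [row[c] for row in rows if row[c] is not missing]
        means.append(sum(present) / len(present))
    return [[means[c] if row[c] is missing else row[c]
             for c in range(n_columns)]
            for row in rows]


raw = [[1.0, None],
       [3.0, 4.0],
       [None, 8.0]]
clean = impute_with_column_means(raw)
```

The original rows are left untouched and a new, fully numerical table is returned, which is the form required by a real-valued optimiser.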

Parameter Settings and Comparison Algorithms
The suggested parameter configuration for the sa-PSEDA predecessor in Reference [19] indicates a swarm size of 50 particles and variation operator parameters equal to $\omega = 0.7298$ and $\phi_1 = \phi_2 = 1.49618$. This leads to the calculation of the $w_x$, $w_p$ and $w_g$ weights, discussed in Section 4.1, each of which is normalised by $\omega + \phi_1 + \phi_2$, while $w_u$ was set to 0.05.
These values were tested with the sa-PSEDA variant and compared against several other parameter combinations. We concluded that the same setting can be used for sa-PSEDA apart from the value of the $w_u$ weight, which has to be lowered to 0.01 to obtain optimal performances. This conclusion was obtained by running sa-PSEDA with a computational budget fixed to 300,000 fitness evaluations over four common testbed problems displaying different fitness landscapes:
• a first problem, of which only the term "x j 2" is legible, where $-100 \le x_i \le 100$, $x^* = (0, 0, \dots, 0)$ and $f_{\min}(x^*) = 0$;
• Rosenbrock: $f(x) = \sum_{i=1}^{d-1} \left[ 100 (x_{i+1} - x_i^2)^2 + (1 - x_i)^2 \right]$, where $-10 \le x_i \le 10$, $x^* = (1, 1, \dots, 1)$ and $f_{\min}(x^*) = 0$;
• Rastrigin: $f(x) = 10 d + \sum_{i=1}^{d} \left[ x_i^2 - 10 \cos(2 \pi x_i) \right]$, where $-5.12 \le x_i \le 5.12$, $x^* = (0, 0, \dots, 0)$ and $f_{\min}(x^*) = 0$;
• Ackley: $f(x) = -20 \exp\left(-0.2 \sqrt{\tfrac{1}{d} \sum_{i=1}^{d} x_i^2}\right) - \exp\left(\tfrac{1}{d} \sum_{i=1}^{d} \cos(2 \pi x_i)\right) + 20 + e$, where $-32.768 \le x_i \le 32.768$, $x^* = (0, 0, \dots, 0)$ and $f_{\min}(x^*) = 0$.
Further details and source code (implemented in Java) for these functions can be found in [54]. This experiment was repeated 50 times per problem to observe the average performance and its standard deviation. Each problem was considered at d = 30 and d = 10 dimension values. Table 2 shows a comparison obtained by executing sa-PSEDA with $w_u$ equal to 0.05 (as proposed in Reference [19]) and 0.01. The best results, displayed in boldface, show that 0.01 is preferable. This is also confirmed by the t-test outcome reported, with a compact yes/no notation, in the last column of Table 2. In this column, a "yes" indicates that the variant with $w_u = 0.01$ statistically outperforms the other variant, while a "no" means that the two variants are statistically equivalent. It can be noted that sa-PSEDA seems to be very robust and resilient to variations of parameters, as results are very similar. However, since improvements are registered with $w_u = 0.01$, this value was used instead of the one proposed for PSEDA in Reference [19].
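For readers who wish to reproduce this kind of tuning experiment, three of the testbed problems can be sketched as below. The code assumes the standard textbook definitions of the Rosenbrock, Rastrigin and Ackley functions; the Java implementations in [54] may differ in details such as shifts or scaling, and the function names are chosen only for this example.

```python
import math


def rosenbrock(x):
    """Standard Rosenbrock function; global minimum 0 at x = (1, ..., 1)."""
    return sum(100.0 * (x[i + 1] - x[i] ** 2) ** 2 + (1.0 - x[i]) ** 2
               for i in range(len(x) - 1))


def rastrigin(x):
    """Standard Rastrigin function; global minimum 0 at the origin."""
    return 10.0 * len(x) + sum(v * v - 10.0 * math.cos(2.0 * math.pi * v)
                               for v in x)


def ackley(x):
    """Standard Ackley function; global minimum 0 at the origin."""
    d = len(x)
    mean_square = sum(v * v for v in x) / d
    mean_cosine = sum(math.cos(2.0 * math.pi * v) for v in x) / d
    return (-20.0 * math.exp(-0.2 * math.sqrt(mean_square))
            - math.exp(mean_cosine) + 20.0 + math.e)


d = 10
at_optimum = (rosenbrock([1.0] * d), rastrigin([0.0] * d), ackley([0.0] * d))
```

Evaluating each function at its known optimum is a quick sanity check before spending a large evaluation budget on a parameter comparison.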
To show the quality of the proposed ODP method, when equipped with the fine-tuned sa-PSEDA optimiser, a set of comparison algorithms was chosen from the literature, as explained in Section 3.
Thus, the classification approach using a PSO algorithm in Reference [39] was implemented and run over the 11 datasets used in this study with the parameter settings proposed by its authors, namely $n = 50$, $T_{\max} = 1000$, $v_{\max} = 0.05$, $v_{\min} = -0.05$, $c_1 = 2.0$, $c_2 = 2.0$, $w_{\max} = 0.9$ and $w_{\min} = 0.4$. Moreover, the classification approach using the ABC algorithm in Reference [40] was also implemented and run with the parameter settings proposed by its authors, namely a colony size of 20, a maximum generations number of 2500 and a limit value of 1000. To perform a fair comparison, the proposed number of iterations for the PSO was increased from 1000 to 2500 in order to obtain the same number of fitness evaluations (i.e., 50,000) used for sa-PSEDA and PSO over the 11 classification problems.
The source codes used for sa-PSEDA, PSO and ABC are made available in an online repository reachable from https://bit.ly/2VK9CS3.
Finally, 12 popular classification methods were selected. The first one is the Nearest-Centroid (NC) classifier used, for example, in References [53,55]. This classifier was taken into consideration as it is similar to the classification mechanism of the proposed ODP method, even though it does not come with an effective training logic. Usually, this kind of method is employed when a very fast classification process is required. The other 11 classification strategies are instead state-of-the-art schemes taken from the Weka software suite (release 3.8) [42]. Weka contains a large number of classifiers, including Bayesian, function-based, lazy (or instance-based), meta, rule-based and tree-based schemes. We have chosen some methods from each available group to have a wider range of classification techniques. Unless differently specified, all these methods were run with the parameter settings indicated in the original articles.
Considering that five instances of sa-PSEDA, PSO and ABC have been executed, that is, one per fitness function, a total of 27 classification schemes were run in this investigation. All the classifiers based on optimisation algorithms are referred to with the same notation as ODP, that is, sa-PSEDA-$\psi_i$, PSO-$\psi_i$ and ABC-$\psi_i$, where $i = 1, 2, 3, 4, 5$ is the number of the fitness function listed in Section 4.2.

Experimental Results
Due to the large number of results obtained with the 27 classifiers and the 10 repetitions of the 10-fold cross-validation method, which means that 2700 classification tasks were run, results are arranged in two groups. Section 6.1 contains the comparison amongst the three classifiers using SI metaheuristics for optimisation plus the NC method. In Section 6.2 the best amongst the five sa-PSEDA instances (identified in Section 6.1) is compared against the 11 state-of-the-art algorithms from Weka.
Numerical results are displayed in tables in terms of average accuracy (± standard deviation). The best value is highlighted in boldface. The outcome of the test of significance returned by the Weka platform [42] is also reported (i.e., a paired t-test with significance level set to 0.05) to validate results statistically. A compact notation is used. The equals sign = indicates the comparisons whose results are statistically equivalent, that is, the distribution of the results of the reference algorithm does not significantly differ from the distribution of the comparison algorithm, while circles are used to point out significant statistical differences. In particular, a black circle • refers to the case where the reference algorithm outperforms the comparison algorithm. Conversely, a white one ◦ indicates that the reference algorithm is outperformed by the comparison algorithm.

Comparison against SI Classifiers
In total, 15 SI classifiers are evaluated. Table 3 compares their classification performances by arranging the three main methods into five groups according to the employed fitness function, that is, $\psi_1$, $\psi_2$, $\psi_3$, $\psi_4$ and $\psi_5$. The reference algorithm of each $i$th group ($i = 1, 2, 3, 4, 5$) is sa-PSEDA-$\psi_i$. It must be remarked that the NC classifier, which does not use the SI logic, is also added here, as the ODP method contains a similar classifier (trained with sa-PSEDA). Thus, it is appropriate to include it in this comparison.

Table 3. Average accuracy ± standard deviation and statistical analysis [42] for the 15 SI classifiers and the NC classifier over the 11 classification problems.

It can be noted that sa-PSEDA displays the overall best performance. However, it is interesting to observe that when $\psi_2$ is employed, the ABC-based classifier obtains the best performance and statistically outperforms sa-PSEDA in one case (it is equivalent to sa-PSEDA otherwise). Conversely, if the other fitness functions are used, sa-PSEDA always displays the best average result and it also outperforms the other methods statistically. This suggests that the second fitness function is not suitable for modelling the problems at hand, as it is the only one deteriorating the performances of the proposed method. The best accuracy is obtained by employing $\psi_5$, as highlighted in the box plot in Figure 1 and the statistical analysis in Table 4.
The comparison between the NC method and sa-PSEDA-$\psi_5$ shows that ODP always outperforms its predecessor (i.e., NC). This motivates the proposed ODP method, which is basically an optimised version of the NC classifier. In this light, this work shows that extremely complex classification strategies are not necessary, as a simple one can be enhanced by optimising the training process (as done in ODP). This is even more evident in Section 6.2, where state-of-the-art classifiers are compared against the ODP method.
It is also worth spending a few words on the key role played by the fitness function. Recently, the research community has been producing novel bio-inspired metaheuristics for optimisation in an attempt to obtain better results. However, most of the novel algorithms share very similar structures, as, for example, firefly algorithms [66] and differential evolution [44,67] are ruled by similar internal dynamics [25] and are designed without taking into consideration the problem at hand. This process does not always lead to optimal results, as it is known that universal optimisers do not exist, and the best performance is always returned by algorithms tailored to the problem [68]. In this light, more attention should be paid to the formulation of the problem by defining an informative fitness function.
For these reasons, the five functions used in this paper were defined to provide a variety of fitness landscapes and different results were expected depending on their use.

Table 4. Average accuracy ± standard deviation and statistical analysis [42] for sa-PSEDA-$\psi_5$ against sa-PSEDA-$\psi_1$, sa-PSEDA-$\psi_2$, sa-PSEDA-$\psi_3$ and sa-PSEDA-$\psi_4$ over the 11 classification problems.

Algorithm            Avg Accuracy     Statistical Analysis
                                      •     =     ◦
sa-PSEDA-$\psi_1$    81.45 ± 13.61    7     4     0
sa-PSEDA-$\psi_2$    77.40 ± 16.59    10    1     0
sa-PSEDA-$\psi_3$    82.90 ± 13.70    1     10    0
sa-PSEDA-$\psi_4$    82.23 ± 14.16    6     4     1
sa-PSEDA-$\psi_5$    82.92 ± 13.88    −

To summarise, the classifiers returning the first, second and third best accuracy value over each problem are listed in Table 5. This table gives a better overview and further confirms the superiority of sa-PSEDA-$\psi_5$ and sa-PSEDA in general. Indeed, the second-best classification approach is sa-PSEDA-$\psi_3$ and the only two datasets where sa-PSEDA is not listed amongst the first three most effective classifiers are Dermatology and Heart-2C. The poor performance obtained over the Dermatology dataset, in comparison with that obtained with the simple NC method, may be related to an inadequate computational budget of the sa-PSEDA algorithm. Indeed, this dataset requires 204 design variables (6 classes × 34 attributes) and might require more fitness function calls, while the NC method is not affected by this problem. This should be investigated, as PSO and ABC perform better despite using the same computational budget, even though sa-PSEDA-$\psi_5$ has the overall (i.e., average) fastest convergence speed, as shown in Figure 2.
Finally, it can be observed that using a combination of ψ₁ and ψ₂ leads to better results only if higher importance is given to ψ₁. This is what is done in function ψ₅. Conversely, unbalancing the weights towards ψ₁, as done in ψ₄, is not beneficial, as the resulting classifier has a lower accuracy than the one obtained with ψ₃ (which is balanced).
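The idea of balanced and unbalanced combinations of two objectives can be sketched as follows. This is only an illustration under the assumption that the combination is a convex weighted sum. The helper `weighted_fitness`, the weight values and the two toy objectives are hypothetical and do not reproduce the fitness functions defined in the paper.

```python
from typing import Callable, Sequence

Fitness = Callable[[Sequence[float]], float]

def weighted_fitness(f_a: Fitness, f_b: Fitness, w_a: float) -> Fitness:
    """Return x -> w_a * f_a(x) + (1 - w_a) * f_b(x), with w_a in [0, 1]."""
    if not 0.0 <= w_a <= 1.0:
        raise ValueError("w_a must lie in [0, 1]")
    return lambda x: w_a * f_a(x) + (1.0 - w_a) * f_b(x)

# Toy stand-ins for two base objectives (NOT the paper's fitness functions).
f_a = lambda x: sum(v * v for v in x)   # sum of squares
f_b = lambda x: sum(abs(v) for v in x)  # sum of absolute values

balanced = weighted_fitness(f_a, f_b, 0.5)   # equal importance
skewed = weighted_fitness(f_a, f_b, 0.75)    # hypothetical weight favouring f_a

point = [1.0, -2.0]
print(balanced(point), skewed(point))
```

Changing the weight reshapes the fitness landscape seen by the optimiser, which is why different weightings of the same two objectives can lead to different classification accuracies.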

Comparison against State-of-the-Art Classifiers
A thorough comparative analysis is reported in Table 6, where the most accurate SI classifier, that is, sa-PSEDA-ψ₅, is used as a reference algorithm against the 11 state-of-the-art methods listed in Section 5.2.
In terms of average accuracy, the proposed method is second only to the MLP scheme and significantly outperforms 10 more complex state-of-the-art classification strategies.
It must be noted that sa-PSEDA-ψ₅ is very competitive when compared against MLP on each individual dataset. Indeed, it outperforms MLP on six datasets, while it is significantly outperformed on only three other datasets. This leads to a lower average result, which does not necessarily mean that one classifier is better than the other. One can conclude that MLP shows very high performance in three specific application domains but is less flexible than sa-PSEDA-ψ₅. Interestingly, Dermatology is one of these three datasets; a loss of accuracy over this specific dataset was already pointed out in Section 6.1.

Table 6. Average accuracy ± standard deviation and statistical analysis [42] for sa-PSEDA-ψ₅ against state-of-the-art algorithms from Reference [42]

It is worth noticing, with reference to Figure 3, that MLP displays longer whiskers than sa-PSEDA-ψ₅, which also has fewer outlier points and a more compact box (i.e., a narrower range between the lower and upper quartiles). This highlights some instability of the MLP method. This is confirmed by the statistical analyses shown in Table 6, according to which the proposed approach displays the best performance.
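For readers less familiar with box plots, the quantities mentioned above (box, whiskers and outliers) can be computed as in the following sketch. It assumes the common Tukey convention, in which whiskers extend to the most extreme points within 1.5 interquartile ranges of the box; the convention used for the figures in this article is not stated, and the accuracy values below are toy numbers, not results from the paper.

```python
import numpy as np

def box_plot_stats(values, k=1.5):
    """Quartiles, whisker ends and outliers under the Tukey k * IQR rule."""
    v = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(v, [25, 50, 75])
    iqr = q3 - q1
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
    inside = v[(v >= low_fence) & (v <= high_fence)]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,                       # height of the box
        "whisker_low": inside.min(),      # lowest point that is not an outlier
        "whisker_high": inside.max(),     # highest point that is not an outlier
        "outliers": v[(v < low_fence) | (v > high_fence)],
    }

# Toy accuracy values (illustrative only).
toy_accuracies = [80.0, 81.0, 82.0, 83.0, 84.0, 60.0]
stats = box_plot_stats(toy_accuracies)
print(stats["iqr"], stats["whisker_low"], stats["whisker_high"], stats["outliers"])
```

A narrower box and shorter whiskers indicate that the accuracy values are less spread out, which is the sense in which one classifier can be called more stable than another.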
To further understand the potential of the proposed method for medical diagnosis and prognosis, the algorithms returning the best, second-best and third-best accuracy values are listed next to each corresponding classification problem in Table 7. Without considering the sa-PSEDA instances equipped with the other fitness functions, sa-PSEDA-ψ₅ on its own appears in the table nine times and is missing for only two datasets out of eleven.
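The construction of such a top-three listing, and the count of how often each classifier appears in it, can be sketched as follows. The classifier labels and accuracy values are toy placeholders chosen for illustration and do not correspond to any result reported in this article.

```python
from collections import Counter

def top_three(accuracy_by_classifier):
    """Names of the three classifiers with the highest accuracy on one dataset."""
    ranked = sorted(accuracy_by_classifier.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ranked[:3]]

def podium_counts(results):
    """Count how many datasets each classifier reaches the top three on."""
    counts = Counter()
    for per_dataset in results.values():
        counts.update(top_three(per_dataset))
    return counts

# Toy accuracies (illustrative only): dataset -> classifier -> accuracy.
toy_results = {
    "dataset_1": {"A": 0.90, "B": 0.85, "C": 0.80, "D": 0.70},
    "dataset_2": {"A": 0.60, "B": 0.75, "C": 0.65, "D": 0.70},
}
counts = podium_counts(toy_results)
print(counts)
```

Because `sorted` is stable, classifiers with identical accuracy keep their insertion order; a real analysis would need an explicit tie-breaking rule.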

Conclusions and Future Work
From the algorithmic point of view, the proposed sa-PSEDA algorithm appeared to be robust and more resilient to parameter variations than its predecessor PSEDA. This is interesting, as it shows that the algorithmic design phase plays a major role in the metaheuristic optimisation field and that even a small variation in the algorithmic structure can lead to significantly better algorithms. Moreover, this observation also shows that self-adaptation is key to good performance. This conclusion is consistent with the No Free Lunch Theorems in optimisation [68] and lends them further empirical support.
As for the classification results, the proposed ODP method turned out to be extremely competitive against the chosen competitors, including the state-of-the-art classification algorithms from Weka, which are outperformed on several occasions. This is particularly evident when sa-PSEDA is used with the ψ₅ fitness function, which has proven to be the most adequate for the medical domain. This fitness function is not taken from the literature but is proposed in this article. Given the accuracy obtained, we recommend its use when designing similar classification approaches.
Since a deterioration of the accuracy is noted in datasets containing a large number of outliers, future investigations will be carried out to deal with this issue. In addition to designing a pre-processing phase to filter out outliers, we will focus on the algorithmic design to obtain a novel optimisation algorithm displaying a more robust structure. This should be achievable by means of the EDA paradigm, as distributions can be updated so as to move their mean value away from outlier points. Alternatively, the use of multivariate Gaussian distributions could also help avoid this deterioration of the accuracy. Another future line of research is to handle the different types of search spaces (numerical, integer, binary) by means of the algebraic framework for evolutionary computations proposed in References [46,69,70].

n individuals
(g) i,t,k are small random numbers uniformly distributed in [0, 0.01]. The starting value for each standard deviation is the smallest distance between the peak of the truncated normal distribution and the peaks of the remaining truncated normal distributions.
while t < M do (M is the number of allowed generations)

Figure 1. Box plots for sa-PSEDA equipped with the five fitness functions under study.

Figure 2. Average fitness trends for the SI classifiers (equipped with ψ₅) and the NC classifier.

Table 2. Tuning results on numerical benchmarks
over the 11 classification problems.