Machine Learning Framework for the Prediction of Alzheimer's Disease Using Gene Expression Data Based on Efficient Gene Selection



Introduction
Alzheimer's disease (AD) is the most common cause of dementia and memory loss, in addition to being a major cause of death. It is a chronic neurodegenerative disease that starts silently and worsens gradually over time [1]. In 2015, 47 million people worldwide were suffering from AD, at a cost of more than USD 818 billion. Both figures are expected to rise as time goes by [2]. It is also anticipated that 1 out of every 85 people will have AD by 2050 [3].
There are many known symptoms of AD, the most common being difficulty remembering recent events. As the disease advances, symptoms can include problems with orientation, language, self-motivation, mood, memory, self-care, and behavior [4]. As the condition of AD patients deteriorates, they begin to withdraw from family and society. Gradually, body functions are lost, eventually leading to death. Although the speed of progression can vary, the typical life expectancy after AD becomes apparent is 3 to 9 years [5]. Thus, early diagnosis of AD can even save lives, and that is where the present work comes in.
Traditionally, AD diagnosis has been carried out primarily via brain magnetic resonance imaging (MRI) and neuropsychological testing [6]. Understanding of AD at the molecular level is still lacking, owing to the difficulty of sampling postmortem brains of normal and AD patients. Thankfully, recent trials have produced large-scale omics data for various brain areas. Using these data, it is now feasible to develop prediction methods, such as those in this article, whereby machine learning (ML) models are leveraged to diagnose AD as early as possible [7]. Such methods are also advantageous to the patient in that they are convenient and inexpensive. It has even been shown that they can predict AD better than clinicians in certain circumstances [8]. This fact has led to much research focusing on the application of ML to AD diagnosis using medical data in different forms, such as MRI.
MRI scans can be used with a support vector machine (SVM), or variants thereof, to detect AD, as in [9], where an approach that leverages recursive feature elimination to select the features is introduced. The approach demonstrates high accuracy in classifying mild cognitive impairment (MCI), control normal (CN), and AD cases (subjects or instances). A related SVM work is given in [10], where the authors incorporate universum data to develop a twin SVM model. First, a universum hyperplane is constructed; then, the classifying hyperplane is constructed by minimizing the angle with the universum hyperplane. The model is applied to AD detection and high accuracy is reported. Another related SVM work appears in [11], where three variants of SVM classifiers are employed to detect AD using 30 features, selected from an original total of 420 features.
MRI scans can also be used with other ML models, especially neural networks and their derivatives, to detect AD, as in [12], where a convolutional neural network (CNN) is used for feature extraction and k-means clustering is used to classify AD, MCI, and normal cognition (NC) cases. The proposed method is reported to achieve high accuracy. A CNN is also used in [13] to classify MCI and AD cases from normal (N) cases. The authors study the impact of incorporating data from MRI and diffusion tensor imaging (DTI), and their techniques achieve high accuracy, specificity, and AUC values. A CNN is used as well in [2] to recognize the patterns that identify each AD stage; to this end, a time series is processed for each patient. In [14], partial least squares is used for dimensionality reduction, ANOVA is used for feature selection, and a random forest (RF) classifier is used for the classification of AD. The authors in [15] use resting-state functional MRI and a deep learning (DL) technique to classify AD; they use an expanded network architecture to apply transfer learning with and without fine-tuning. In [16], the authors also propose a DL model for feature extraction at all levels and a fuzzy-hyperplane-based least squares twin support vector machine (FLS-TWSVM) for the detection of AD. Furthermore, in [17], the authors propose an ensemble of deep neural networks for the classification of AD; the proposed ensemble leverages the diversity introduced by the many different locally optimal solutions reached by individual networks through the randomization of hyper-parameters. In [18], the authors propose an improved twin support vector regression model for brain age estimation, which can be helpful for mental health issues in general.
The diagnosis of various diseases is nowadays possible thanks to gene expression data, which are the basis of the present work. Such data are obtained through the powerful technology of DNA microarrays [19], which provide the expression levels of thousands of genes [20]. The expression level of a gene indicates the abundance of the corresponding messenger ribonucleic acid (mRNA) molecules in the cell. Using this level, it is possible not only to detect diseases, but also to select the best treatment and to discover mutations in other processes [21].
For example, the authors in [22] use blood-derived gene expression biomarkers to distinguish AD cases from other sick and healthy cases. They use XGBoost classification models and succeed in detecting AD in a heterogeneous aging population by adding related mental and elderly health disorders. Nevertheless, improving the sensitivity of the model is still required to define a blood signature more specific to AD. In [23], three independent datasets, AddNeuroMed1 (ANM1), ANM2, and the Alzheimer's Disease Neuroimaging Initiative (ADNI), are used to distinguish AD from CN. Different gene selection (GS) methods, such as variational autoencoders, transcription factors, hub genes, and convergent functional genomics (CFG), are used to select the most informative genes. Five models, SVM, RF, logistic regression (LR), L1-regularized LR (L1-LR), and DNN, are employed for classification. The AUC values obtained are 87.4%, 80.4%, and 65.7% for ANM1, ANM2, and ADNI, respectively. The authors also analyze the biological functions of the blood genes related to AD and compare the blood bio-signature with the brain bio-signature. They employ 1291 brain genes extracted from a gene expression dataset together with 2021 blood genes extracted from the other three datasets, given that there are 140 common genes between the two. In [24], a study is presented to identify expressed genes from a blood dataset and to explore the correlation between the blood and brain genes of an AD patient. The authors identify 789 differentially expressed genes common to both blood and the brain. Least absolute shrinkage and selection operator (LASSO) regression is used as a GS method. Logistic ridge regression (RR), SVM, and RF models are used for classification. They succeed in discriminating AD cases from control cases with 78.1% accuracy. In [25], multiple brain regions are used to identify prospective diagnostic biomarkers of AD. Gene expression data from six brain regions are employed to determine AD biomarkers. A t-test is used to select
the most informative genes. Significance tests are used to check those biomarkers and evaluate their potential for clinical diagnosis. The authors of [26] integrate gene expression and DNA methylation datasets, forming a multi-omics dataset, to predict AD using a deep neural network (DNN). Principal component analysis (PCA) and t-stochastic nearest neighbor techniques are used to select the most informative features. In [27], the authors use blood gene expression data obtained from the ANM and Dementia Case Registry (DCR) cohorts. They employ recursive feature elimination for GS and RF for classifying AD cases. They use ANM1 for training the classifier and integrate both ANM2 and DCR for testing, obtaining an AUC of 72.4% and an ACC of 65.7%.
In Table 1, we present a summary of eminent studies that diagnose AD and identify genes that qualify as its biomarkers. The table shows the original number of genes in each research work and the number of genes used after the GS step. We deduce that the number of selected genes has no obvious pattern or rule and is largely dataset and model dependent. In other words, each diagnosis experiment can select a different subset of relevant genes and end up with a different accuracy value based on the ML model used. The table also testifies to the main obstacle facing the analysis of gene expression data: the small number of cases and the large number of genes.
In the present article, we propose a symmetric framework to predict AD, consisting of a number of steps. First, we use a number of statistical metrics to evaluate the relevance of the genes of a dataset to AD prediction. We apply each metric individually, then average, for each gene, the values obtained from all applied metrics. Next, we select the genes that have the highest such averages, where "highest" is assessed with respect to some user-defined threshold. Finally, we feed these genes into a number of ML models and monitor the classification performance. The model with the highest performance is considered for future use in the AD prediction system, which is our ultimate outcome.
To validate the strategy, we use the chi-squared (χ²), analysis of variance (ANOVA), and mutual information (MI) metrics for gene evaluation. Together, they select the genes that are most relevant for AD detection. For classification, we use four different ML models: SVM, RF, LR, and AdaBoost. We vary the number of informative genes to test which are essential for AD prediction. The AD prediction system obtained this way shows excellent results on four omics datasets. The article is organized as follows. In Section 2, we introduce the foundations and elements of the strategy used in our study. In Section 3, we validate the strategy by applying four specific ML models to classify AD cases in the datasets and display the results. In Section 4, we present our concluding remarks.

Materials and Methods
In this section, we introduce the proposed approach for GS and for AD classification, which is illustrated in Figure 1. The approach consists of four stages: integration of the datasets, preprocessing, GS, and classification. The details of the proposed approach are presented below.

Integration of Datasets
A typical problem with gene expression data in general, including that of AD, is that the number of genes is huge (usually in the thousands) whereas the number of cases is small (usually in the tens). This imbalance makes classification a difficult problem. A possible solution is to concatenate two or more datasets, provided that they have the same set of genes, which we have done in the present study.

Preprocessing
We start this step by first normalizing the gene values in order to avoid heavy variations among the different genes. We use for normalization the min-max method, which re-scales the range of values of each gene to the interval [0, 1]. In particular, given a set C of cases described by a set G of genes, the normalized gene value v'_{c_i,g_j}, c_i ∈ C and g_j ∈ G, of some gene value v_{c_i,g_j} is given by

v'_{c_i,g_j} = (v_{c_i,g_j} − min_{c_k ∈ C} v_{c_k,g_j}) / (max_{c_k ∈ C} v_{c_k,g_j} − min_{c_k ∈ C} v_{c_k,g_j}),

where min_{c_k ∈ C} v_{c_k,g_j} and max_{c_k ∈ C} v_{c_k,g_j} are the minimum and maximum values, respectively, of gene g_j ∈ G across all cases c_k ∈ C.
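The normalization above can be sketched in a few lines of NumPy; the toy expression matrix below is hypothetical, not the actual AD data:

```python
import numpy as np

# Hypothetical toy expression matrix: rows are cases, columns are genes.
X = np.array([[2.0, 10.0],
              [4.0, 30.0],
              [6.0, 20.0]])

# Min-max normalization per gene (column), rescaling each gene to [0, 1].
X_min = X.min(axis=0)
X_max = X.max(axis=0)
X_norm = (X - X_min) / (X_max - X_min)
```

Each column now has minimum 0 and maximum 1; scikit-learn's MinMaxScaler performs the same transformation in practice.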
Figure 1. Proposed symmetrical AD prediction framework. It takes as input a GE dataset, possibly formed by integrating a number of smaller datasets, as well as a set of classifiers. In a multistage operation, it selects a minimal set of genes to represent the data, identifies the best classifier, and finally gives as output the correct AD classification of an unseen case: positive/negative.
After normalization, we handle the problem of missing values, which is quite persistent in almost all experimental datasets. There are many approaches in the literature to handle missing values, and we have selected imputation, due to its simplicity and efficiency [28]. In particular, we compute the mean of the existing values of each gene to fill in the missing values of that gene. Let C_{g_j} ⊆ C be the set of cases that have values for gene g_j, and let \bar{C}_{g_j} ⊆ C be the complementary set of cases without a value for that gene. Then, for each gene g_j ∈ G, we assign to each case c_i ∈ \bar{C}_{g_j} that does not have a value for gene g_j the value

v_{c_i,g_j} = (1 / |C_{g_j}|) Σ_{c_k ∈ C_{g_j}} v_{c_k,g_j},

where |x| denotes the cardinality of set x.
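A minimal sketch of this mean imputation, with NaN marking missing values (toy data, not the AD dataset):

```python
import numpy as np

# Hypothetical matrix with missing values (NaN) for two genes.
X = np.array([[1.0, 5.0],
              [np.nan, 7.0],
              [3.0, np.nan]])

# Per-gene mean over the observed values only (the set C_{g_j} in the text),
# then fill each missing entry with its gene's mean.
col_means = np.nanmean(X, axis=0)
X_filled = np.where(np.isnan(X), col_means, X)
```

scikit-learn's SimpleImputer(strategy="mean") implements the same idea.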

Gene Selection (GS)
Selecting the genes relevant to AD prediction from the raw gene expression dataset is crucial. Simply put, relevant genes are class, model, and dataset dependent. Meanwhile, the inclusion of inconsequential and redundant genes can significantly degrade the classification accuracy. Thus, in our work, we pay special attention to gene selection. We first inspect the significance of each gene with respect to AD prediction, using three filter-based metrics. Then, we evaluate the significance of each gene with respect to each of the four ML models that we have used. At the end of these two stages, we can identify the most relevant genes and the most accurate model for predicting AD.
GS is particularly challenging because the number of genes is typically very large and the number of cases is typically very small. This imbalance can be noticed in Table 2.
To overcome this problem, we introduce in the present work a novel scheme for GS. The scheme uses three symmetric filter-based techniques, χ², ANOVA (F statistic), and MI, to rank the genes with respect to their ability to predict AD. Filter-based gene evaluation techniques are preferred due to their computational feasibility.

• Chi-squared (χ²): The higher the χ² value, the more dependent the two variables and, hence, the more important the gene under consideration for predicting AD. Conversely, the smaller the χ² value, the more independent the two variables and, hence, the less relevant the gene for predicting AD.
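As a sketch of how the χ² ranking singles out class-dependent genes, consider scikit-learn's chi2 scorer on a synthetic stand-in (the "genes" and labels below are fabricated for illustration; χ² requires non-negative values):

```python
import numpy as np
from sklearn.feature_selection import chi2

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=100)            # AD (1) vs. normal (0) labels
informative = y + 0.1 * rng.random(100)     # one "gene" tracking the class
noise = rng.random((100, 4))                # four irrelevant "genes"
X = np.column_stack([informative, noise])   # all values non-negative

scores, p_values = chi2(X, y)
# The class-dependent gene (column 0) receives the highest chi2 score.
```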

• Analysis of variance (ANOVA F statistic): ANOVA is a powerful family of techniques to test the significance of the difference between the means of two random variables. In our situation, the two variables are a gene and the target output, which is the case diagnosis, AD or N. The F statistic is one metric of the ANOVA family. For a given dataset with two classes, 1 and 2, the F statistic of a certain gene and the class variable is calculated, after first determining the sums of squares and degrees of freedom, as [14]

F = [n_1 (\bar{x}_1 − \bar{x})² + n_2 (\bar{x}_2 − \bar{x})²] / { [Σ_{k=1}^{n_1} (x_{1,k} − \bar{x}_1)² + Σ_{k=1}^{n_2} (x_{2,k} − \bar{x}_2)²] / (n_1 + n_2 − 2) },

where n_1 and n_2 are the numbers of cases in classes 1 and 2, \bar{x} is the mean of all values of the gene, \bar{x}_1 and \bar{x}_2 are the class-wise means, and x_{1,k} and x_{2,k} are the kth values of the gene in classes 1 and 2, respectively. A larger F statistic means that the gene is important for determining the class, AD or N, and vice versa.
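The F-statistic ranking can be sketched with scikit-learn's f_classif, again on fabricated data in which only the first "gene" shifts its mean with the class:

```python
import numpy as np
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(1)
y = np.repeat([0, 1], 50)                 # 50 normal, 50 AD cases
X = rng.normal(size=(100, 4))             # four "genes" of pure noise ...
X[:, 0] += 3.0 * y                        # ... except gene 0: mean shifts by 3

F, p = f_classif(X, y)                    # one F statistic per gene
```

Gene 0 ends up with by far the largest F value (and the smallest p-value), mirroring the formula above.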

• Mutual information (MI): Let us first introduce entropy, which is a well-known metric in information theory, used as a measure of the uncertainty in random variables. In particular, given a discrete random variable X, let p(x) = Pr[X = x], x ∈ A, be the probability that X = x, where A is the domain set of X. The entropy of X, denoted by H(X), is given by

H(X) = − Σ_{x ∈ A} p(x) log p(x).

Having introduced entropy, we are in a position to introduce the mutual information I(X, Y), which measures the shared information between two random variables X and Y. In our situation, the two variables are a gene and the target output, which is the diagnosis, AD or N. The MI is given by

I(X, Y) = H(Y) − H(Y|X),

where

H(Y|X) = − Σ_{x ∈ A} Σ_{y ∈ B} p(x, y) log p(y|x)

is the conditional entropy of Y given X, with p(x, y) the joint distribution of X and Y and B the domain of Y.
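Correspondingly, the mutual information between each gene and the diagnosis can be estimated with scikit-learn's mutual_info_classif (synthetic data again; only gene 0 carries information about the class):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(2)
y = rng.integers(0, 2, size=200)
X = rng.normal(size=(200, 4))
X[:, 0] = y + 0.1 * rng.normal(size=200)  # gene 0 nearly determines the class

mi = mutual_info_classif(X, y, random_state=0)  # one MI estimate per gene
```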

Classification
After identifying the genes most relevant to AD prediction in the GS stage, the next stage in our framework is classification. In general, we can use any ML model for AD prediction, but we focus in our experiments below on the four that proved most powerful for the task, as per the recent studies surveyed in Section 1, namely SVM, RF, LR, and AdaBoost [29]. The classification in the present work is binary, using the one-versus-all concept. That is, even if we provide our framework with a dataset of multiple classes, AD being one of them, we consider AD as one class and everything else as the other class.
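The one-versus-all relabeling amounts to a single vectorized comparison; the label set below is hypothetical:

```python
import numpy as np

# Hypothetical multi-class labels; everything that is not "AD" is folded
# into a single negative class, per the one-versus-all scheme.
labels = np.array(["AD", "MCI", "Normal", "AD", "Normal"])
y_binary = np.where(labels == "AD", 1, 0)
```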

1. SVM: SVM is a famous supervised ML model that classifies data by first mapping the data, in a nonlinear way, to a high-dimensional gene space. Then, it finds a linear optimal hyperplane, a decision boundary, to separate the points of one class from those of the other. SVM aims to maximize the distance (called the functional margin) between the hyperplane and the closest training data points of either class. The hyperplane, which is basically the SVM classifier, is expressed as

f(x) = w^T Ψ(x) + b,

where w is a weight vector, b some bias, and Ψ(x) a nonlinear mapping. The optimal hyperplane is defined by the w and b that minimize the function

(1/2) ||w||² + A Σ_{i=1}^{n} φ_i,

where the φ_i > 0 are some slack variables, n is the number of cases, and A is some factor.
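A minimal scikit-learn sketch of such an SVM classifier on synthetic data (an RBF kernel plays the role of the nonlinear mapping Ψ, and the C parameter corresponds to the factor A above):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for an expression matrix, not the AD data itself.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = SVC(kernel="rbf", C=1.0).fit(X_tr, y_tr)
accuracy = clf.score(X_te, y_te)  # held-out accuracy
```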

2. RF: RF is a popular ensemble ML model, which means that it combines the predictions of multiple ML algorithms to improve accuracy. In particular, it is a collection of decision trees, comprising a forest, trained with the bagging method. A prediction is made for a new case by majority vote, according to the following steps. First, given a set X = {x_1, x_2, ..., x_n} of training cases with labels Y = {y_1, y_2, ..., y_n}, each tree is trained on a random sample of the cases, and each node considers a random subset of g genes. Second, the best split point over these g genes is used to create the next node D. Third, splitting continues until the leaf nodes are reached and the tree is complete; at this point, each tree has been trained on its own sample. Finally, the predictions of the n trained trees are collected by voting, and the majority vote becomes the RF decision.
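A corresponding scikit-learn sketch of RF, where max_features controls the number g of randomly considered genes per split (synthetic data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# 100 bagged trees; "sqrt" considers about sqrt(10) ~ 3 random features
# at each split, as in the node-level gene subsampling described above.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                random_state=0).fit(X, y)
proba = forest.predict_proba(X[:1])  # fraction of trees voting for each class
```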

3. LR: LR is usually used to estimate or predict the probability of categorical variables, especially in binary classification. The logistic (sigmoid) activation is defined as

σ(z) = 1 / (1 + e^{−z}).

The probability h_θ(x) of the categorical dependent variable then equals

h_θ(x) = σ(θ^T x) = 1 / (1 + e^{−θ^T x}),

where θ is the vector of regression coefficients, determined by minimizing the cost function of logistic regression.
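The sigmoid and the resulting probability can be written out directly (θ and x below are arbitrary illustrative values):

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) activation: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([0.5, -1.0])   # illustrative regression coefficients
x = np.array([2.0, 1.0])        # illustrative case (two gene values)
p = sigmoid(theta @ x)          # h_theta(x); here theta.x = 0, so p = 0.5
```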

4. AdaBoost: With AdaBoost, predictions are made iteratively by computing the weighted average of weak classifiers. The whole process can be summarized as follows. First, all cases in the training set are given the same weight. Second, a weak classifier h_t is used to classify the cases, and the classification error rate ε_t is calculated and used to update the weight of each case and to calculate the weight α_t of the weak classifier h_t for the next iteration. The classification error rate of the weak classifier for the training set is given by

ε_t = Σ_{i=1}^{n} w_i^t I(h_t(x_i) ≠ y_i),

where x_i, i = 1, 2, ..., n, denotes input case i, y_i ∈ {1, −1} denotes the class label, t is the current iteration number, h_t(x_i) is the prediction result of the weak classifier, I is an indicator function that returns 1 for a misclassified case and 0 for a correctly classified case, and w_i^t is the weight of case i at iteration t. The weights of the weak classifiers are

α_t = (1/2) ln((1 − ε_t) / ε_t).

By combining the weak classifiers and optimizing their weights, the following strong classifier is obtained:

H(x) = sign(Σ_{t=1}^{T} α_t h_t(x)),

where T is the total number of iterations and h_t(x) is the prediction result of weak classifier h_t.
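This iterative reweighting is implemented by scikit-learn's AdaBoostClassifier, whose default weak classifiers h_t are decision stumps; a sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# T = 50 boosting iterations; each round reweights the misclassified cases.
booster = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
train_accuracy = booster.score(X, y)
```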

Experimental Work
In this section, we report the findings of the extensive experimentation we carried out to validate the proposed framework and its GS Algorithm 1. The experimental work was carried out using a composite dataset made up of four multi-tissue GE profiles of the human brain from DNA microarray data. The profiles come from three different parts of the brains of AD patients: the prefrontal cortex (PFC), visual cortex (VC), and cerebellum (CR). Downloaded from the National Center for Biotechnology Information Gene Expression Omnibus (NCBI-GEO) database [30], the four datasets have the accession numbers GSE33000, GSE44770, GSE44771, and GSE44768 [30], with GSE33000 and GSE44770 [26] focusing exclusively on the PFC, GSE44771 on the VC, and GSE44768 on the CR [31]. We integrated the four gene-expression datasets carefully, as such integration is known to be error prone [32]. Specifically, we integrated those datasets that were generated from the same platform (GPL4372), with normal (non-demented, healthy) cases present for control. The integrated dataset, summarized in Table 2, consists of 1157 cases, 697 AD and 460 normal, each described by 39,280 genes.
At the outset, preprocessing and GS were performed on the integrated dataset as per Algorithm 1, which was coded in Python version 3.7.3 with the Scikit-learn packages. For reproducibility and the common good, we have uploaded the code to the GitHub repository at the URL provided at the end of the article. The code was run on an Intel (R) Core (TM) i7-8550U CPU, 8 GB RAM, and 64-bit Windows 10 configuration. The algorithm was used principally to select the genes most relevant and informative for AD and to remove the remaining genes, which would produce poor results if they remained. It was also used to identify the best classifier, out of the four classifiers used, to work with those genes.
Once the dataset was pre-processed, the genes were evaluated individually for their relevance in predicting AD, using the three filter metrics mentioned above. Part of the result of this evaluation is shown in Figure 2, which shows the 30 genes with the highest average of the three metrics. One can regard these 30 genes as the most relevant for predicting AD.
The crux of the present work is its unique GS methodology. The methodology starts from the three gene subsets G_{χ²}, G_{MI}, and G_F. From these three subsets, we proceed as follows.
• Construct from the above three sets the following four intersection sets:

G_{χ²∩F} = G_{χ²} ∩ G_F,  G_{χ²∩MI} = G_{χ²} ∩ G_{MI},  G_{F∩MI} = G_F ∩ G_{MI},  G_∩ = G_{χ²} ∩ G_F ∩ G_{MI}.

• Construct from the three pairwise intersection sets the following union set:

G_{∪∩} = G_{χ²∩F} ∪ G_{χ²∩MI} ∪ G_{F∩MI}.

We then train and test, using a repeated stratified k-fold cross-validation approach, every classifier on each of the above 8 sets (the three original sets, the four intersection sets, and the union set), calculating the six performance metrics (sensitivity (recall), specificity, precision, kappa index, AUC, and accuracy) in the process. Simply, the best classifier and the best gene subset (out of these 8 subsets) are the ones that produce the highest values for the metrics (or for the majority of them).
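The eight subsets reduce to plain set operations; with toy stand-ins for the three top-ranked gene sets:

```python
# Toy stand-ins for the top-ranked gene sets from chi2, ANOVA-F, and MI.
G_chi2 = {"g1", "g2", "g3", "g4"}
G_F = {"g2", "g3", "g5"}
G_MI = {"g3", "g4", "g5", "g6"}

# The four intersection sets ...
G_chi2_F = G_chi2 & G_F
G_chi2_MI = G_chi2 & G_MI
G_F_MI = G_F & G_MI
G_all = G_chi2 & G_F & G_MI

# ... and the union of the pairwise intersections (G_{union-inter} in the text).
G_union_inter = G_chi2_F | G_chi2_MI | G_F_MI
```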
Proceeding with Algorithm 1, the final step was to apply the four ML models to the integrated dataset, with increasing numbers of genes, at an increment size of α = 100, and calculate the accuracy. For this exercise, we first partitioned the input dataset into two sets of cases: 1000 cases for training/testing and 157 cases for final validation. The 1000 cases were used for the selection of the best classifier, and the 157 cases were isolated for testing that classifier. The objective of this isolation is to ensure the credibility of the reported performance, since the classifier would classify cases it had never seen before. For further credibility, we used repeated stratified k-fold cross-validation, with k = 10 and 30 repetitions. That is, in each fold, 90% of the cases were used for training and 10% for testing, and this was repeated 30 times, for a total of 300 tests. In other words, for each model, the six performance metrics were evaluated on each fold, over 30 repetitions. The results of the 10 folds were averaged, providing at the end of the test 30 values for each metric, one value per repetition.
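The evaluation protocol can be sketched with scikit-learn's RepeatedStratifiedKFold; for brevity, the sketch uses 3 repetitions (the paper uses 30) with LR on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(n_samples=200, random_state=0)

# k = 10 folds, repeated 3 times -> 30 accuracy values in total.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
```

Averaging the 10 fold scores within each repetition then yields one value per repetition, as described above.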
Train each classifier i on each of the 8 constructed gene subsets and calculate the metrics sensitivity (recall), specificity, precision, kappa index, AUC, and accuracy to identify the best gene subset. Validate the best classifier with the best gene subset using the validation dataset V, reporting the validation results: sensitivity (recall), specificity, precision, kappa index, AUC, and accuracy.
We found that the highest performance was consistently that of the SVM model when used with the 700 genes with the highest χ² values, the 1000 genes with the highest ANOVA (F statistic) values, and the 1700 genes with the highest MI values. Having identified these three sets of genes, we computed their pairwise intersections, as per the proposed algorithm, which are depicted visually in Figure 2. As can be seen, the intersection between the χ² and ANOVA sets contains 457 genes, that between ANOVA and MI contains 916 genes, and that between χ² and MI contains 533 genes. Having obtained the pairwise intersections, we then obtained their union, which contains 1058 genes. As can be seen from the bar charts, the 1058 genes of the G_{∪∩} subset are the most relevant (producing the highest performance) for predicting AD with the SVM model. Incidentally, we compared the genes selected by Algorithm 1 with the list of genes reported in the well-known database AlzGene [33], which comprises 695 influential AD genes obtained from 1395 studies, and found that the 30 genes of Table 3 are actually in that database.
Figures 3-8 show the results of the training/test phase for the six metrics, sensitivity (recall), specificity, precision, kappa index, AUC, and accuracy, whose equations can be found in any ML reference; see, e.g., [29]. Each figure displays the values of one metric for the four classifiers, SVM, RF, LR, and AdaBoost, and the eight gene subsets. As mentioned earlier, the training/test phase was carried out on only 1000 cases of the original dataset of 1157 cases. Further, in this phase, the testing was done using a repeated stratified k-fold approach, with k = 10 and the number of repetitions equal to 30, to ensure credible results. It is evident from the figures that the best performing classifier, the one with the highest values of the metrics, is the SVM classifier. It is also evident that the gene subset associated with this high performance is the G_{∪∩} subset, which contains 1058 genes out of an original number of 39,280 genes. For the box plot of Figure 8, we plotted the 30 accuracy values obtained from the 30 repetitions; the 10 results of the 10 folds in each repetition were first averaged, producing one value, which was then considered the value of that repetition.
After finishing the training/test phase, we moved on to the validation phase, where the best classifier, SVM, was evaluated on the remaining 157 cases, using the minimal gene subset, G_{∪∩}. As in the training/testing phase, the validation was done using a repeated stratified k-fold approach, with k = 10 and the number of repetitions equal to 30, to ensure credible results. Table 4 shows the confusion matrix resulting from the validation phase. This matrix was used to calculate the six performance metrics of the SVM when the G_{∪∩} gene subset is used. Table 5 shows the impressive values of 0.97, 0.97, 0.98, 0.945, 0.972, and 0.975 for the sensitivity (recall), specificity, precision, kappa index, AUC, and accuracy, respectively. Compared with the state-of-the-art results shown in Table 1, these values are much better. In fact, the proposed framework could achieve the same results as those of Table 1 using far fewer genes, demonstrating how powerful our framework is and how selective our GS algorithm can be. The code of the experimental work is available at: https://github.com/aliaa2007/AD_Classification (accessed on 22 February 2022).

Conclusions
In this article, we have presented a framework for the prediction, from GE data, of a disease that has not received enough attention in the literature: Alzheimer's disease (AD). The framework has been shown to predict AD from GE data accurately and with a minimal number of genes, compared with recently published competitive frameworks. We have developed an efficient algorithm for GS and used it to identify the genes most relevant to AD prediction. The algorithm produces eight sets of genes, which are then explored by a number of ML models. The best model is the one that achieves the highest performance with the smallest number of genes. In our experiments on an integrated dataset of 39,280 genes, this model turned out to be SVM. It reached an accuracy of 97.5% using the 1058 selected genes, obtained through intersections and unions of the genes with high filter values.
The framework is characterized by its openness and symmetry. It can deal with any number of ML models, any number of filter metrics, and any number of genes. It can be generalized to other diseases as well. We demonstrated experimentally that it outperforms the state of the art in that it either achieves the same performance with fewer genes or higher performance with the same number of genes.
The present work, though reliable and meticulous, has one limitation as far as we can see. Specifically, it predicts an input case as being only AD or normal, meaning it is based on a binary classification scheme. As such, it cannot predict, for example, the various stages of AD, which would require a multi-classification scheme. We intend to address this limitation in future work, producing a more powerful framework that is capable of predicting two or more stages of AD.
In the F statistic above, n_1 represents the number of cases with class 1, n_2 the number of cases with class 2, \bar{x} the mean of all values of the gene, \bar{x}_1 the mean of the values of the gene with class 1, \bar{x}_2 the mean of the values of the gene with class 2, x_{1,k} the kth value, with class 1, of the gene, and x_{2,k} the kth value, with class 2, of the gene. A larger F statistic value means that the gene is important for determining the class, AD or N, and vice versa.

Algorithm 1 (core gene selection loop):

for i = 1 to M do  // train every classifier i
    for each metric x ∈ {χ², F, MI} do
        j = 0; Acc1 = 0
        do
            j = j + 1
            Construct an array G_i^x from the top jα genes of the x-sorted array
            Train classifier i on array G_i^x and test to find the corresponding accuracy Acc_i^x, using repeated 10-fold cross-validation
            δ = Acc_i^x − Acc1; Acc1 = Acc_i^x
        while (δ > 0 ∧ (j + 1)α < |G|)
Identify the classifier i that produced the highest accuracy Acc_i^x across all three metrics x; hence, construct the three corresponding gene sets.

Figure 2 .
Figure 2. Venn diagram of the three gene subsets obtained from χ², ANOVA, and MI. Five other subsets are generated (using the union and intersection operations) from these three, for a total of eight subsets, which are then explored by the different classifiers. The exploration yields both the best subset to represent the data and the best classifier to predict the disease.

Figure 4 .
Figure 4. Sensitivity (recall) of all four ML models for 8 gene subsets. The SVM model achieves the highest sensitivity (0.9777) when used with the 1058 genes of the G_{∪∩} subset.

Table 1 .
Summary of some recent studies on the prediction of AD using gene expression data, employing different GS methods and ML models.

(Columns: No. of Cases, No. of Genes, GS Method, No. of Selected Genes, ML Model, Performance Metrics.)
The contribution of this article is multifaceted, as follows:
• A comprehensive framework to diagnose AD from GE data;
• A novel GS methodology based on hybrid filter/wrapper selection methods;
• The use of six different performance metrics to evaluate the proposed framework;
• High performance exceeding, as demonstrated by the experimental results, that of state-of-the-art GE-based AD prediction frameworks;
• An enrichment of the literature on AD prediction based on GE data, which is admittedly sparse compared to the literature on other diseases.
Chi-squared (χ²) is a well-known statistical metric used to examine the dependence between two random variables, in our situation a gene and the target output, which is the case diagnosis, AD or N. In order to calculate χ², we first build a contingency table having r rows, where r denotes the number of distinct gene values, and c columns, where c denotes the number of distinct classes of the target output, in our situation 2. At the (i, j) entry of the table, we place both the observed value O_ij and the expected value E_ij for gene value i of class value j. The observed value O_ij is the number of times value i appears associated with class j, whereas the expected value E_ij is the fraction of times value i appears as a value for the gene, multiplied by the number of cases having class j. With this table at hand, then

χ² = Σ_{i=1}^{r} Σ_{j=1}^{c} (O_ij − E_ij)² / E_ij.

Table 2 .
Summary of the four datasets integrated in the present study into one dataset of 1157 cases, each described by 39,280 genes.

Table 3 .
Listing of the 30 genes having the highest averages of the three metrics, χ², ANOVA, and MI, and their overlap with the AlzGene database of the most influential AD genes.

Precision of all four ML models for 8 gene subsets. The SVM model achieves the highest precision (0.9739) when used with the 1058 genes of the G_{∪∩} subset.

Table 4 .
Confusion matrix of the final validation experiment on the best classifier, SVM. The experiment was carried out on 157 cases never seen by the classifier in any training/testing, using the 1058 genes selected by the proposed algorithm out of an original total of 39,280 genes.

Specificity of all four ML models for 8 gene subsets. The SVM model achieves the highest specificity (0.9766) when used with the 1058 genes of the G_{∪∩} subset.

Table 5 .
Validation results of the SVM classifier obtained from the 157 cases that were kept for final testing of the best classifier, using the 1058 genes of the G_{∪∩} subset.

AUC of all four ML models for 8 gene subsets. The SVM model achieves the highest AUC (0.9692) when used with the 1058 genes of the G_{∪∩} subset.

Figure 6. Kappa index of all four ML models for 8 gene subsets. The SVM model achieves the highest kappa (0.9393) when used with the 1058 genes of the G_{∪∩} subset.

Figure 8. A box plot of the accuracy metric of the four ML models for 8 gene subsets. Once again, the SVM model achieves the highest accuracy when used with the 1058 genes of the G_{∪∩} subset.