Robust Sampling of Defective Pathways in Alzheimer’s Disease. Implications in Drug Repositioning

We present the analysis of the defective genetic pathways of the Late-Onset Alzheimer’s Disease (LOAD) compared to the Mild Cognitive Impairment (MCI) and Healthy Controls (HC) using different sampling methodologies. These algorithms sample the uncertainty space that is intrinsic to any kind of highly underdetermined phenotype prediction problem, by looking for the minimum-scale signatures (header genes) corresponding to different random holdouts. The biological pathways can be identified performing posterior analysis of these signatures established via cross-validation holdouts and plugging the set of most frequently sampled genes into different ontological platforms. That way, the effect of helper genes, whose presence might be due to the high degree of under determinacy of these experiments and data noise, is reduced. Our results suggest that common pathways for Alzheimer’s disease and MCI are mainly related to viral mRNA translation, influenza viral RNA transcription and replication, gene expression, mitochondrial translation, and metabolism, with these results being highly consistent regardless of the comparative methods. The cross-validated predictive accuracies achieved for the LOAD and MCI discriminations were 84% and 81.5%, respectively. The difference between LOAD and MCI could not be clearly established (74% accuracy). The most discriminatory genes of the LOAD-MCI discrimination are associated with proteasome mediated degradation and G-protein signaling. Based on these findings we have also performed drug repositioning using Dr. Insight package, proposing the following different typologies of drugs: isoquinoline alkaloids, antitumor antibiotics, phosphoinositide 3-kinase PI3K, autophagy inhibitors, antagonists of the muscarinic acetylcholine receptor and histone deacetylase inhibitors. We believe that the potential clinical relevance of these findings should be further investigated and confirmed with other independent studies.


Introduction
Alzheimer's disease (AD) is the most common cause of dementia associated with aging, causing the loss of intellectual and social skills. The causes of Alzheimer's are not yet fully understood. The Early-Onset form of Alzheimer's Disease (EOAD) is due to mutations in chromosome 21, causing the formation of abnormal amyloid protein (APP) [1], and mutations on chromosomes 14 and 1 leading to abnormal presenilins 1 and 2 which undergo cleavage of APP [2,3]. However, the causes of Late-Onset Alzheimer's Disease (LOAD) are not yet completely understood. The prevalent common view is that they likely include a combination of genetic, environmental, and lifestyle factors that determine the risk for developing the disease.
The main milestones in the genetic analysis of LOAD go from the early descriptions of apolipoprotein E (APOE) gene polymorphisms in LOAD and APP gene mutations in EOAD in 1991 to the presenilin 1 (PS1) and presenilin 2 (PS2) gene mutations in EOAD in 1995. Since 2009 until now new polymorphisms influencing the risk of LOAD are being reported every year. Ricciarelli et al. (2004) investigated gene expression in Alzheimer's disease and aging, reporting 314 genes that were differentially expressed in LOAD cerebral cortex, and confirming via reverse transcription polymerase chain reaction (RT-PCR) the increased expression of the interferon-induced protein 3 in LOAD brains [4]. Kong et al. (2009) performed independent component analysis (ICA) of Alzheimer's DNA microarray gene expression data [5]. They identified more than 50 significant genes with high expression levels in severe LOAD, representing immunity-related proteins, metal binding proteins, membrane proteins, lipoproteins, neuropeptides, cytoskeleton proteins, cellular binding proteins, and ribosomal proteins. Our understanding of the etiology and pathogenesis of LOAD was predominantly influenced by recent developments of genetics [6]. Particularly, genome-wide association studies (GWAS) identified over 20 genetic loci associated with LOAD.
Recent genetic data continue to support the amyloid hypothesis of LOAD with protective variants being found in the amyloid gene, and both common low-risk and rare high-risk variants being discovered in genes that are part of the amyloid response pathways. These data support the view that genetic variability in how the brain responds to amyloid deposition is a potential therapeutic target for the disease, and are consistent with the notion that anti-amyloid therapies should be initiated early in the disease process (Hardy et al. 2014) [7]. Genome-wide association studies involved genes related to three main ontological pathways: [1] Endosomal vesicle recycling (phosphatidylinositol binding clathrin assembly protein (PICALM) coding gene, bridging integrator-1 protein (BIN1) coding gene). (2) Cholesterol and lipid metabolism (apolipoprotein E (APOE) gene, clusterin protein (CLU) coding gene, and binding cassette subfamily A member 7 protein (ABCA7) coding gene) [3]. Innate immune system (Clusterin complement C3b/C4b receptor 1 (CR1) gene, membrane-spanning 4-domains subfamily A (MS4A) gene cluster, triggering receptor expressed on myeloid cells 2 (TREM2) gene) [8]. Giri et al. (2016) overviewed the genes associated with LOAD, noting that the majority of genes associated with LOAD cluster roughly within three main pathways: lipid metabolism, inflammatory response and endocytosis [9]. Additionally, several signaling pathways associated with LOAD might modulate various processes, such as the reduction of amyloid-β aggregation and inflammation, regulation of mitochondrial dynamics, and increased neuronal activity [10].
The difficulty to define some common mechanisms that could be responsible for development of LOAD is partly due to the high degree of underdeterminacy that is present in all genetic experiments. Besides, no single gene can describe complex diseases such as Alzheimer's, caused by a combination of genetic, environmental, and lifestyle factors. Complex diseases do not obey the standard Mendelian patterns of inheritance, and it is well known that genes involved in complex diseases work in synergy (see for instance [11]).
Pathway and network analysis is a good alternative to understand Omics data, and allows finding distinct cellular processes and signaling pathways that are associated with the set of differentially expressed genes. Pathway analysis needs databases with pathway collections and interaction networks, and programming packages to analyze the data. The most popular freely available public collections of pathways and interaction networks are Kyoto Encyclopedia of Genes and Genomes (KEGG) [12] and REACTOME [13]. Pathway and network analysis of cancer genomes is currently used for better understanding of various types of tumors [14]. Dimitrakopoulos and Beerenwinkel (2017) reviewed several computational methods of the identification of cancer genes and the analysis of pathways [15]. For AD, Mizuno et al. (2012) developed a publicly available pathway map called AlzPathway (http://alzpathway.org/) that comprehensively catalogs signaling pathways in AD using CellDesigner [16]. AlzPathway is currently composed of 1347 molecules and 1070 reactions in neuron, brain blood barrier, presynaptic, postsynaptic, astrocyte, and microglial cells and their cellular localizations. There are still some outstanding challenges concerning both annotations and methodologies [17]. The annotation challenges are due to low-resolution of available databases; while the methodological challenges concern mainly finding the set of genes that are indeed related to the disease and understanding the dynamical nature of biological systems and the effect of external stimuli.
In this paper, we try to address the first methodological challenge related to the phenotype prediction problem, i.e. the development of robust computational methods of linking the cause (genotype) and the effect (phenotype). Researchers typically use sets of differentially expressed genes, but fold change is sensible to the presence of noise in genetic data and in the wrong class assignment of the samples [18].
The holdout sampler [19] looks for different equivalent high discriminatory genetic networks that are related to the uncertainty space of the classifier that is used to predict the phenotype. The holdout sampler generates different random 75/25 data bags (or holdouts): 75% of the data in each bag is used for learning and 25% for blind validation. For each of these bags the small-scale genetic signatures (header genes) are determined. The posterior analysis consists of finding the most frequently sampled genes taking into account all the highly predictive networks, that is, the small-scale genetic signatures with high validation accuracy. The biological pathways can be identified performing posterior analysis of these signatures established during the cross-validation holdouts and plugging the set of most frequently sampled genes into ontological platforms. That way, the effect of helper genes whose presence might be due to noise or to the high degree of underdeterminacy of these experiments is damped. As we briefly explain in the next section, this algorithm is inspired by the sampling of the equivalence region of a regression problem using bootstrapping (random data sampling with replacement) to find different sets of equivalent predicting parameters.
We show the application of this algorithm to the analysis of the genetic pathways involved in LOAD and mild cognitive impairment (MCI), obtaining an unexpected association with influenza viral RNA transcription and replication as the main mechanisms in LOAD and MCI development. Neurodegenerative diseases could be induced by chronic and viral infections that may lead to a loss of neural tissue in the central nervous system. It has published rare instances in which acute severe encephalitic viral diseases directly cause transient symptomatic Parkinson Disease [20]. Besides, in the comparison of the LOAD patients vs. healthy controls (HC) we have also compared the altered genetic pathways derived by using various sampling algorithms to probe the hypothesis of biological invariance [21], that is, the genetic pathways that are involved in the disease development should be independent of numerical algorithm (classifier and sampling algorithm) that is used to unravel them. The discrimination between LOAD and MCI could not be clearly achieved and their joint (LOAD + MCI) difference from healthy controls can be achieved with a common list of genes found in their individual comparisons (LOAD vs. HC and MCI vs. HC). This fact is somehow expected as MCI may represent an early stage of LOAD and the genetic mechanism may be the same.
Finally, based on the results of these analyses we show some implications for drug repositioning for LOAD-MCI. Table 1 shows the list of most discriminatory genes of the phenotypes of the LOAD patients compared with HC. In this case the predictive accuracy estimation is based on Leave-One-Out-Cross Validation (LOOCV) by averaging the LOOCV predictive accuracy over all samples of the validation dataset in each bag and involves a simple k-Nearest-Neighbor classifier in the reduced set of high discriminatory genes (small-scale signature). Table 1. Late-onset Alzheimer's Disease (LOAD) vs. healthy controls (HC). List of most discriminatory genes with a Fisher's ratio higher than 0.8. All these genes are underexpressed in LOAD.

Gene
Mean  Table 2 shows the most frequently sampled genes (sampling frequency higher than 0.7%) detected by the holdout algorithm. We also provide the mean of the expression in each group, the fold change, the Fisher's ratio, and the sampling frequency. Table 3 shows the same analysis by using the Fisher's ratio sampler (with a sampling frequency higher than 0.35) and Table 4 shows the results obtained with Random Forest (with a sampling frequency higher than 0.19). The sampling frequencies are varying and depend on the sampling algorithm.  Most discriminatory genes sampled in different networks for the control vs. LOAD phenotype (sampling frequency higher than 0.7%). The sampling frequency is the number of times that a gene appears in the whole set of sampled genes. In this case a sampling frequency of 2.31% in a set of 43202 sampled genes in 1000 holdouts implies that this gene has been sampled 997 times. The sampling frequency could be also defined with respect to the number of holdouts and the first gene would have a sampling frequency of 99.7%. In any case this figure is a way of ranking the relative importance of each gene. We also provide the mean of the expression in each group, the fold change, the Fisher's ratio, and the sampling frequency. All these genes are underexpressed in LOAD.   Table 5 shows the pathway analysis using the list of most highly sampled genes together with the summary of the results obtained for all the comparisons performed in this paper. The scores relative to these pathways are given in Table S1 as supplementary material. We have used the sampled genes with a frequency higher than 0.2, since this set contains enough discriminatory genes to perform the pathway enrichment. This sampling frequency has been judged to be optimum in the enrichment analysis. Figure 1 shows the correlation network between the most discriminatory genes of the LOAD vs. healthy control phenotype and serves to explain how the most discriminatory genes are interrelated and control gene expression.   Table 6 shows the list of most discriminatory genes for the MCI vs. control phenotype discrimination. Table 7 shows the most discriminatory genes sampled in different networks for the Control vs. MCI phenotype found. In this case we have considered a sampling frequency higher than 0.38%. Table S2 (given in Supplementary Materials) shows the scores relative to the main genetic pathways involved in MCI vs. HC. Figure 2 shows the correlation network between the most discriminatory genes of the MCI phenotype.  Most discriminatory genes sampled in different networks for the control vs. MCI phenotype (sampling frequency higher than 0.38%). We also provide the mean of the expression in each group, the fold change, the Fisher's ratio, and the sampling frequency. Overexpressed genes are shown in bold.  Table 8 shows the list of most discriminatory genes for the MCI vs. LOAD phenotype discrimination. Table 9 shows the most discriminatory genes sampled in different networks for the LOAD vs. MCI phenotype with a sampling frequency higher than 0.5%. Table S3 (given in Supplementary Materials) shows the scores relative to the main genetic pathways in the discrimination between MCI and LOAD. Figure 3 shows the correlation network between the most discriminatory genes of the LOAD/MCI phenotype.    Table 10 shows the list of most discriminatory genes for the MCI vs. LOAD phenotype discrimination. Table 11 shows the most discriminatory genes sampled in different networks for the LOAD + MCI vs. HC phenotype with a sampling frequency higher than 0.5%. Table S4 (given in Supplementary Materials) shows the scores relative to the main genetic pathways involved in the MCI + LOAD vs. HC discrimination. This last classification has been performed since the difference between MCI and LOAD cannot be easily established.    Table 12 summarizes the most important findings found in the main comparisons.

LOAD vs. Healthy Controls Classification
In the LOAD vs. HC comparison, the maximum Fisher's ratio obtained was 1.29 and corresponds to MRPL51. Only six genes have a Fisher's ratio higher than 1. This fact gives an idea about the discriminatory power of gene expression data. The highest accuracy (79.52%) was obtained just with the two first genes: MRPL51 and CETN. This accuracy was increased up to 84% by using a genetic signature of 21 genes that have been sampled within the set of most discriminatory genes. This signature is shown in Table 12, that serves to summarize the results. The most frequently sampled genes were RPL36AL, MRPL51, LOC4011206, and RPS27A (Table 2). These five genes are underexpressed in LOAD.
RPL36AL is a protein-coding gene involved in metabolism and viral mRNA translation. It has been found that the overexpression of this gene is associated with cellular proliferation in hepatocellular carcinoma [22]. MPRL51 is also a protein-coding gene related to mitochondrial translation. Recently it has been found that mitochondrial genes are altered early in blood in LOAD and MCI, showing reduced expression of OXPHOS (oxidative phosphorylation) genes. RPS27A is also a protein-coding gene involved in interferon gamma signaling pathway and activated TLR4 signaling involved in immune responses. CETN (Centrin-1) is a protein-coding gene that plays a fundamental role in microtubule organization. LOC4011206 is a non-characterized gene.
The main pathways with high scoring matches found in the LOAD vs. HC discrimination are viral mRNA translation, influenza viral RNA transcription and replication, gene expression, mitochondrial translation, rRNA processing, and metabolism (Table S1). Other pathways involved are organelle biogenesis, HIV life cycle, antigen presentation, and TCR signaling. Besides, Table 5 shows the comparison of these pathways to those found by the Fisher's ratio sampler and the Random Forest sampler. It can be observed that all the samplers depicted similar pathways. Besides the most important biological processes involved (see Table 12) also pointed in the same direction.
The correlation network ( Figure 1) shows one unique link relating the header gene MRPL51 to NDUFA1, which is involved in the mitochondrial respiratory chain pathway for the generation of cellular energy. The correlation between both genes is very high and positive (0.94). The main sub-tree develops under LOC646200 (RPL22-60S ribosomal protein L22-heparin binding protein HBp15). This protein can bind specifically to Epstein-Barr virus-encoded small RNAs.

MCI vs. Healthy Controls Classification
In the MCI vs. HC comparison, the maximum Fisher's ratio was 1.47, and corresponds to TAX1BP1. The highest accuracy (77.17%) was obtained with the first 59 most discriminatory genes. Table 6 shows only genes of this signature with Fisher's ratio greater than 1. This accuracy was increased up to 81.5% by using a genetic signature of 13 genes (given in the summary Table 12) that have been sampled within the set of most discriminatory genes.
The most important sampled genes (Table 4) are underexpressed in MCI samples with respect to healthy controls. Only DENND1C, SULT1A3 are overexpressed in MCI. DENND1C is a protein-coding gene involved in clathrin-mediated endocytosis that seems to play an important role in AD [23]. SULT1A3 is a protein-coding gene related to sulfotransferase enzymes that catalyze the sulfate conjugation of many hormones and neurotransmitters. It has been found that Alzheimer's subjects have a significant lower SULTA13 copy number compared to healthy controls [24]. Induction of SULTA13 significantly protects cells from dopamine neurotoxicity. The effect of dopamine in AD has been discussed in the literature [25]. The number of most-frequently sampled genes with respect to the LOAD vs. HC discrimination has increased, but the sampling frequencies of these genes have decreased. This result could be interpreted in the sense that the MCI discrimination vs. HC is more ambiguous than the LOAD vs. HC discrimination, and might correspond to the fact that MCI is an earlier stage than LOAD.
The pathways with higher scores found by the Holdout sampler are given in Table S2 and involve similar pathways to the LOAD vs. HC comparison (viral mRNA translation, gene expression, influenza viral RNA transcription and replication, metabolism of proteins, rRNA processing in the nucleus and cytosol, ubiquitin-proteasome proteolysis, antigen processing, cell cycle checkpoints, metabolism, HIV life cycle, mitotic metaphase and anaphase, CLEC7A (Dectin1 signaling), cellular senescence, TCR signaling, etc.). The main biological processes involved are also similar to the LOAD vs. HC comparison. Although not shown in the paper, the other samplers also depicted similar pathways.
The header gene in the correlation network (Figure 2), TAX1BP1, is positively correlated to TMC01 and VAMP7, which is only correlated to DENND1. TAX1BP1 encodes a HTLV-1 tax1 binding protein that interacts with TNFAIP3 (tumor necrosis factor (TNF) alpha induced protein 3) and inhibits TNF-induced apoptosis. Among its related pathways are apoptosis, autophagy, and innate immune system. The degradation of this protein by caspase-3-like family proteins is associated with apoptosis induced by TNF. This protein might also have a role in the inhibition of inflammatory signaling pathways. Interestingly, the caspases have been also found to be involved in ALS.
The most important sub-tree in the correlation network concerns TMC01, that plays a key role in calcium homeostasis. Dysregulation of calcium homeostasis in Alzheimer's disease has been pointed by Kawahara (2004), Small (2009), Brawek and Garaschuk (2014) [26][27][28]. TMCO1 is positively correlated to RPL17 (Ribosomal Protein L17). Among its related pathways are viral mRNA translation and MAPK signaling. VAMP7 is a protein-coding gene involved in the targeting and/or fusion of transport vesicles to their target membrane during transport of proteins. This gene has multiple interesting functions according to the Human Gene Database.

MCI vs. LOAD Classification
MCI represents an early stage of AD, and therefore the genetic background of LOAD might be shared by an important subgroup of MCI subjects. However, not all MCI subjects convert to LOAD since they may also convert to other diseases or remain stable over time. Therefore, this comparison should be interpreted as a temporal difference between both conditions, and the analysis was done even though it was not expected to find highly discriminatory genes.
The highest accuracy (68%) was obtained with the first 118 genes, showing that, as expected, the difference between LOAD and MCI could not be clearly discriminated. Besides, to support this fact, only the three first genes in Table 5 have a Fisher's ratio greater than 0.5: RPS4Y1, HLA-DRB1, and JARID1D. This accuracy was increased up to 74% by using a genetic signature of 8 genes that have been sampled within the set of most discriminatory genes: RPS4Y1, FAM96A, HS.460758, CLEC2A, SNORA25, SMCHD1, GIMAP2, MAP2K2. Some of these genes are related to the RANK (Receptor activator of nuclear factor-kappa B)-signaling pathway. The most important pathway involves the regulation of apoptosis by protein ubiquitination and degradation, which is one of the major mechanisms to regulate apoptotic cell death (Yang and Yu, 2003) [29]. A regulated balance between cell survival and apoptosis is essential for normal development and homeostasis of multicellular organisms. Defects in control of this balance may contribute to autoimmune disease, neurodegeneration, and cancer. A second important pathway concerns regulation of stress-activated MAP kinases by G protein signaling, that has an important role in the immune system [30] and has been a target in LOAD [31]. Besides, it is interesting to observe that within the set of most frequently sampled genes shown in Table 9 some genes are underexpressed and other overexpressed. This is different than in previous comparisons (MCI and LOAD vs. HC) where most of the discriminatory genes were underexpressed. The main pathway related to the underexpressed genes is chromatin regulation/acetylation. The epigenetic alterations in AD have been outlined by Sanchez-Mutt and Gräff (2015) [32]. The overexpressed genes are related to the regulation of activated PAK-2p34 by proteasome mediated degradation as main pathway. In both cases, pathways related to the immune system response are involved. The correlation network (Figure 3) has one main sub-tree relating RPS4Y1 and XIST, and two other final nodes (JARID1D and HS.546019). Xist (X-inactive specific transcript) is an RNA gene on the X chromosome of the placental mammals that acts as a major effect of the X inactivation process. The X-chromosome instability phenotype in LOAD has been studied by Bajić et al. (2009) [33]. Additionally, Barati and Mansour (2015) [34] pointed that Xist has the highest level of overexpression in LOAD using microarrays expression techniques.

MCI+LOAD vs. Healthy Controls Classification
The maximum Fisher's ratio (1.22) in this comparison (Table 8) corresponds to LOC401206 (RPS25P6) that according to Aceview (NCBI) is an intronless pseudogene derived from the RPS25 gene. RPS25 (Ribosomal Protein S25) is a protein-coding gene. Among its related pathways are viral mRNA translation and activation of the mRNA. Within the set of most discriminatory genes we found MRPL1 and TAX1BP1 that also appeared in the LOAD and MCI vs. HC individual comparisons. Only SULT1A3 is overexpressed in MCI and LOAD in the set of high discriminatory genes with FR greater than 0.8. The major pathways (shown in Table S4 provided as supplementary material) and biological processes involved are similar to those of the LOAD and MCI vs. HC comparisons, reinforcing the viral hypothesis found in previous comparisons.

Implications in Drug Repositioning
Regarding therapeutics GeneAnalytics identified bortezomib and carfilzomib as potential compounds acting on some of the genes that are involved in the deregulated LOAD and MCI pathways. Bortezomib and carfilzomib are proteasome inhibitors used for multiple myeloma and mantle cell lymphoma. Nevertheless, GeneAnalytics does not provide information if these compounds act in the right direction to regulate genes involved in LOAD and MCI. Therefore, this finding indicates only the genetic processes and the genes that are involved.
A more detailed drug repositioning was performed via Dr Insights (Chan et al., 2019) that uses the Connectivity Map (CMAP) transcriptomic experiments in different types of cell lines [35,36]. Table 13 shows the results of drug repositioning via Dr. Insights, using the list of most discriminatory genes identified by the holdout sampler. This analysis highlighted the importance of different compounds analyzed in breast cancer cell lines (MCF7) and one compound in prostate cancer cell line (PC3). In this table for each drug we also show the pathways that are affected by the genes whose expressions are increased or decreased as deduced from CMAP.
Emetine and its desmethyl analog cephaeline are isoquinoline alkaloids, which is a large class of nitrogen-rich natural compounds. Isoquinoline alkaloids are not a structurally homogeneous group, and their properties depend on the different degrees of oxygenation and intramolecular arrangements. Emetine is used for the treatment of amebiasis and is a component of ipecac syrup. Emetine is both myotoxic and cardiotoxic. Some isoquinoline alkaloids, such as galantamine, have been approved to treat MCI and other memory impairments. Its mechanism of action consists in cholinesterase inhibition that prevents the breakdown of the neurotransmitter acetylcholine, increasing its level in the synaptic cleft and in brain areas lacking cholinergic neurons. This treatment does not cure the condition but slows the rate of cognitive decline. Recently, different isoquinoline alkaloids have been studied as potential compounds for the treatment of LOAD [37,38].
This drug affects the tumor necrosis factor (TNF)-related apoptosis-inducing ligand (TRAIL) which has been shown to be unregulated in HIV-1-infected and immune-activated macrophages. TRAIL is also induced on neuron by beta-amyloid protein, an important pathogen for Alzheimer's disease [39]. It has been shown that neutralization of TNFSF10 ameliorates functional outcome in a murine model of Alzheimer's disease [40,41]. Besides, it is evidenced that necroptosis is a major driver on neuron cell death in neurodegenerative diseases [42]. Additionally, among the pathways associated with genes whose expression has been decreased, the most important are related to cellular senescence and cellular response to stress.
Tanespimycin is antitumor antibiotic used for the treatment of leukemia and different types of solid cancer, whose mechanism of action consists of inhibiting Hsp90 (heat shock protein 90), which is a chaperone protein that assists other proteins to fold properly, stabilizes proteins against heat stress, and aids in protein degradation. Oxidative stress may cause Hsp60 structure modifications leading to loss of Hsp60 functions with the consequences of protein misfolding, aggregation and deposition. Besides, Hsp90 downregulation may induce the reduction of Tau hyperphosphorylation and aggregation and may trigger the so-called stress response. That is, in the presence of cellular stress and Hsp90 inhibitors, Heat Shock Factor 1 (HSF-1) protein dissociates from the chaperone, reaches the nucleus, inducing the activation of heat shock genes and of the stress response via the production of Hsp90, Hsp70, and Hsp40, restoring protein homeostasis [43].
The pathways associated with the genes whose expression has been increased are linked to the regulation of HSF1 heat shock response and unfolded protein response. The genes with decreased expression control apoptosis and signaling by TGF-beta receptor complex through a phosphorylated receptor SMAD (R-SMAD).
Wortmannin is a potent PI3K inhibitor, that serve to inhibit different phosphoinositide 3-kinase enzymes, which are part of the PI3K/AKT/mTOR pathway, an important signaling pathway for many cellular functions such as growth control, metabolism and translation initiation. Wortmannin and LY-294002 are autophagy inhibitors [44]. The main pathways related to the genes with increased expression are: toxicity of botulinum toxin type D, Insulin Receptor Signaling (IRS) activation, estrogen biosynthesis, collagen biosynthesis and growth hormone receptor signaling. Botulinum toxin is a neurotoxic protein produced by the Clostridium botulinum that prevents the release of the neurotransmitter acetylcholine from axon endings.
The main pathways related to the genes whose expression are decreased are related to RNA modification in the nucleus, regulation of TP53 activity through methylation, fatty acyl-CoA biosynthesis, and galactose catabolism.
Biperiden is a muscarinic antagonist that blocks the activity of the muscarinic acetylcholine receptor, and has effects in the central and peripheral nervous systems. It has been used in the treatment of Parkinsonism [45]. The increased expression genes pathways are related to the Nuclear Pore Complex (NPC), which is the largest protein complex in the cell and also to the HIV life cycle; while the genes whose expression has been decreased are related to the secretin family receptors, that are involved in numerous key neurotransmitter systems in the brain and seem to be disrupted in Alzheimer's disease (AD) hemostasis, and also in the regulation of the adaptive immune system [46].
Trichostatin A is a histone deacetylase inhibitor (HDACi). Acetylated histones and DNA methylation play important role in Multiple Sclerosis (MS) [47,48]. The overexpressed genes are related to glutamate neurotransmitter release cycle, antigen activation of B-cell receptor and Collapsin Response Mediator Proteins (CRMPs) in Sema3A signaling that modulate the immune system in neurological disorders with inflammatory components [49]. Glutamate neurotransmission is critical for synaptic plasticity and survival of neurons. However, excessive activity causes excitotoxicity and promotes cell death. This constitutes a potential mechanism of neurodegeneration occurred in Alzheimer's disease [50]. The glutamate pathway has been found to be crucial in the development of Chronic Fatigue Syndrome (CFS) in cancer patients treated with radiotherapy [51].
The decreased expression genes are related to SMAD transcription factors, which play the key in the most versatile cytokine signaling pathways via the Transforming Growth Factor-β (TGFβ) pathway. The role of these factors in AD has been investigated [52]. Cyclin D, synthesized in G1 phase, is involved in regulating cell cycle progression, and is activated through phosphorylation. Overexpression of cell cycle proteins of peripheral lymphocytes (CDK2, CDK4, CDK6, cyclin B, and cyclin D) was observed in AD patients [53]. Finally, the last pathway concerns the transcriptional activation of mitochondrial biogenesis that controls the energy generating functions of mitochondria in accordance with the metabolic demands. The last drug that has been repositioned with high score is LY-294002, which is a PI3K inhibitor, like wortmannin. The main pathways are associated with RNA polymerase promoter, and adaptive immune system response, among others.
Other drugs that appear to be involved with lower scores are: saquinavir-MCF7 (5.2 × 10 −6 ), sirolimus-MCF7 (1 × 10 −7 ), and valproic acid-PC3 (5 × 10 −5 , 2 × 10 −4 ). Saquinavir is an antiretroviral drug used to treat or prevent HIV/AIDS. Saquinavir is a protease inhibitor that cleaves protein molecules into smaller fragments. Sirolimus (also known as rapamycin) is an immunosuppressant, that inhibits activation of T-cells and B-cells by reducing their sensitivity to interleukin-2 (IL-2) through mTOR inhibition. Finally, valproic acid is a medication primarily used to treat epilepsy and bipolar disorder. Proposed mechanisms for the valproic acid include increasing brain levels of gamma-aminobutyric acid (GABA), blocking of voltage-gated sodium channels, and inhibiting histone deacetylases.

Materials and Methods
We performed the pathway analysis of a cohort of patients with Late Onset Alzheimer Disease (LOAD), mild cognitive impairment (MCI), and healthy subjects (controls) (HC). The source of data were Alzheimer case-control samples originated from the EU funded AddNeuroMed Cohort, which is a large cross-European AD biomarker study relying on human blood as the source of RNA (Sood et al., 2015) [54]. The dataset contains 38323 probes and 329 samples (145 LOAD, 80 MCI, and 104 healthy controls). These authors identified a set of 150 probe sets that could be used in a diagnosis of Alzheimer disease (AD) based on gene expression. This multi-tissue RNA signature, extracted from peripheral blood samples, could be used as a diagnostic tool. Nevertheless, its predictive value has been recently discussed and received some criticism [55,56].
We performed a retrospective cohort study using a novel machine learning and uncertainty methodology to sample different combinations of highly predictive genes for four different phenotype prediction problems, identifying the discriminatory genetic pathways in each case: LOAD vs. HC; MCI vs. HC, MCI vs. LOAD and MCI + LOAD vs. HC. Identification of the deregulated (or defective) genetic pathways is critical to establish personalized therapies. The potential relevance of these findings could be further investigated in clinical studies.

Sampling Defective Pathways in Phenotype Prediction Problems
To understand the uncertainty in phenotype prediction problems and the need of robust methods of sampling, we first present the existing inherent uncertainty in a simple linear regression.
The least squares fitting of a linear model y * (x) = a 0 + a 1 x to a set of experimental data (x 1 , y 1 ), (x 2 , y 2 ), . . . , (x s , y s ) consists of finding the parameters m = (a 0 , a 1 ) so that the distance between the observed data y obs = and the corresponding predictions y pre (m) = is minimum according to the Euclidean distance in R s . The regression problem is equivalent to finding the least squares solution of the linear system of equations Fm = y obs , where the matrix depends on the abscissas of the data points x = The uncertainty analysis of the least squares solution consists in sampling the family of equivalent models m = (a 0 , a 1 ) that fit the observed data y obs within the same error bounds: Fernández-Martínez et al. (2012,2013) demonstrated that the topography of the data error cost function corresponds to a straight flat elongated valley if the inverse problem is linear, whereas in the nonlinear case, the cost function topography consists of one or more curvilinear valleys (or basins) of low misfits eventually connected by saddle points. In this simple linear regression problem, the equivalent models belong to an ellipse whose axes and orientations are related to the eigenvalues and eigenvectors of the matrix F T F [57][58][59].
This mathematical result might sound a little bit technical, but its importance lies in the fact that the uncertainty space of any decision problem has a deterministic structure. Furthermore, it can be shown that a simple way of sampling these equivalent model parameters in a linear regression problem consists of finding the least-squares solution of these different data bags. That is, solving a set of different regression problems with partial information, for example, with 75% of the observed data in fitting and the rest (25%) in validation (evaluating the predictive accuracy of this set of parameters). This procedure is named as 75/25 holdout (bootstrap) technique. Figure 4 shows a numerical example of this procedure performed in simple linear regression problem. The figure shows the ellipse of model uncertainty for a relative misfit of 20%, and the different sets of parameters in the least squares fitting of different bagging experiments. It can be observed that these sets belong to the region of uncertainty and the sampling is denser along the axis of maximum uncertainty of this ellipse. Figure 4. Linear regression model. Ellipse of uncertainty for a relative misfit of 15% and different sets of model parameters (a 0 , a 1 ) found in the different bagging experiment. It can be observed that these models sample the region of uncertainty within the ellipse of 15% relative misfit. This example is very important to understand that the list of genes that equally explain a phenotype is not unique, and one simple method to sample these high discriminatory genetic networks is by performing random holdout, looking for the minimum-scale signatures that better explain the phenotype in each holdout, and finding the most-frequently sampled genes in these signatures, that are similar in phenotype prediction problems to the points (model parameters) located within the ellipse of this simple regression problem.
Similarly, phenotype prediction problem can be viewed as a generalized regression between the discriminatory sets of genes that characterize the given phenotype and sets of sample classes that form the training data set [11]. As in the linear regression problem, one of the main obstacles in the analysis of genetic data is the absence of a conceptual model that relates the different genes/probes to the class prediction (phenotype). For this reason, a classifier L * (g) has to be constructed, as an algorithm that maps the set of genetic signatures g to the set of classes into which the phenotype is divided: In this paper we have performed four different comparisons: LOAD vs. HC, MCI vs. HC, LOAD vs. MCI, and LOAD + MCI vs. HC, to better understand the molecular mechanisms involved in LOAD and the main differences between LOAD and MCI.
Finding the discriminatory genetic signatures corresponding to L * (g) can be interpreted as a generalized regression problem of the observed sample class vector c obs with respect to the genetic signatures. For that purpose, the modeling is divided into 2 steps: learning and validation. The learning process consists of giving a subset of samples T (training data set) whose class vector c obs is known, and finding the subset of genetic signatures g, of minimum size that maximizes the learning accuracy (the percentage of samples with correctly predicted class): Here L * (g) − c obs 1 stands for the prediction error (in percentage). In practice, the predictive accuracy of a genetic signature is established via cross-validation. Typically leave-one-out-cross validation (LOOCV) is used to use all the genetic samples that we have at disposal. The genetic signature with the highest accuracy and having the least number of discriminatory genes is named the smallest-scale signature. The validation consists in predicting the class of a new sample (whose class is unknown) using the genetic signatures that have been found during the learning process. The stability of the smallest-scale signature found at the learning step can be established via a holdout sampler procedure, where different samples for the training and validation set are randomly selected.
It is important to remark that the phenotype prediction problems have always a high degree of underdeterminacy, since the number of monitored probes (genes) is much greater than the number of samples from human subjects enrolled in clinical trials. Therefore, the learning step involves several lists of genes with similar predictive accuracy. These genes might be involved in the genetic pathways associated with the disease. For a given classifier the smallest-scale signature is the one that has the least number of discriminatory genes with the highest predictive accuracy. This knowledge is very important for early diagnosis and treatment optimization. Due to noise in the genetic data and class assignment, some of these highly discriminatory signatures might be false, containing genes uninvolved in the genetic pathways [18].
To deal with this uncertainty problem we use bootstrapping methodology to sample the genes that are discriminatory in different holdouts and perform posterior analysis of these discriminatory networks to unravel the biological pathways that are involved in the disease development. This algorithm has been named holdout sampler [19] and has been used to sample the uncertainty space in various inverse problems [60,61]. It has been also successfully applied in phenotype prediction and drug design [62][63][64][65]. Besides, the predicted pathways are compared with those obtained using other sampling algorithms such as the Fisher's ratio sampler [66] and Random Forest [67]. Random Forest (RF) are random decision trees for classification via ensemble learning. RF have been used for phenotype prediction and uncertainty analysis by Pang et al. [68]. These algorithms clearly outperform Bayesian Networks (BN) [69] that sample the posterior distribution of the genetic signatures related to the phenotype prediction, P g/c obs , according to Bayes rule: P g/c obs ∼ P(g).P c obs /g , Here P(g) is the prior distribution of the genetic signatures, which is uniform in the set of discriminatory genes, or proportional to their discriminatory power, and P c obs /g is the likelihood of the genetic signature g, that depends on its predictive accuracy Acc(g) as follows: P c obs /g = ke Acc(g) .
BNs have been used for the analysis of defective pathways by Jiang et al. [70] to discover gene interactions and by Su et al. [71] to analyze epigenetic modifications that affect the diseases. BNs have not used in this paper because they are very ineffective for pathways sampling and also very time consuming, due to optimization procedure of finding the optimum probability factorization. This procedure is inadequate for pathways sampling since the set of possible probabilistic factorizations of the uncertainty space is not unique. In fact, all the samplers follow implicitly equation [4] with different prior probabilities defined for different sets of discriminatory genes. Additionally, the likelihood P c obs /g , depends on the type of classifier and on the cost function that is used to define the predictive accuracy Acc(g).
For all the algorithms used in this paper the flowchart is shown in Figure 5 and described as follows: Data bagging: different random 75/25 data bags holdouts are generated, where 75% of the data is used for learning and 25% for validation. In the present study, we have used 1000 different bags.
Gene selection: For each of these bags the genes are selected and the classifier is built. The set of genes that are used by different classifiers is first restricted to the most discriminatory ones to avoid the impact of genes that do not contribute mechanistically to the phenotype discrimination. The genes with Fisher's ratio higher than a minimum cut-off (0.5 in this case) are selected. The accuracy is calculated on the validation set. This way of proceeding is based on the fact that variables with high discriminatory power serve to span the main features of the classification, while variables with lowest discriminatory ratios account for the details in the discrimination. This method determines the minimum amount of high-frequency details (helper genes) that are needed to optimally discriminate between classes promoting the header genes, which are those that explain the phenotype in a robust way. This methodology [11,18] has been successfully applied to the bioinformatics modeling of high-dimensional Omics data [72,73], and in the analysis of medical decision problems using hospital data [74,75].
Posterior analysis: once the data bagging simulation and analysis have been finished, the posterior analysis consists of finding all the minimum size signatures that have predictive accuracy of validation higher than a given tolerance. In this case, we have considered all the holdouts with predictive accuracy higher than 80%. Finally, we performed frequency analysis of these lists to find the most frequently sampled genes to establish the links with defective genetic pathways in Reactome Pathway Database.

Conclusions
In this paper, we present a novel algorithm to sample defective pathways involved in phenotype prediction problems that are highly underdetermined. The methodology consists of sampling different equivalent high-discriminatory genetic networks that are associated with the uncertainty space of the k-NN classifier that is used to separate the LOAD, MCI, and HC classes. To perform this task, the algorithm looks for the minimum scale signatures (header genes) corresponding to different 75/25 random holdouts. This methodology is related to bootstrapping techniques [76]. The biological pathways can be identified by performing posterior analysis, finding the most frequently sampled genes in these minimum scale signatures, and plugging these sets into ontological platforms. That way, the effect of helper genes whose presence might be due to noise or to the high degree of underdeterminacy of these experiments is damped. The proposed algorithm is very fast and simple to implement. However, a major problem is that predicted pathways may depend on the number of genes that are provided. Our experience recommends first using a sufficient number of representative genes. The second recommendation is to perform alternative analyses using variable numbers of genes to establish the cutoff frequency.
We show the application of this methodology to the analysis of the defective pathways in AD and MCI compared to healthy individuals. The results show common pathways for AD and MCI patients that are related to viral mRNA translation, influenza viral RNA transcription and replication, gene expression, mitochondrial translation, rRNA processing and metabolism. These pathways seem to be very consistent both for LOAD and MCI. The viral pathways have never been cited before among key pathogenic pathways involved in AD [9].
The predictive accuracies to discriminate LOAD and MCI from healthy controls were 84% and 81.5%, respectively. In addition, LOAD and MCI could not be clearly discriminated (74% accuracy). The most discriminatory genes of the LOAD-MCI discrimination are associated to regulation of apoptosis by protein ubiquitination and degradation and regulation of stress-activated MAP kinases by G protein signaling. This fact implies that MCI and LOAD can be diagnosed using similar pathways. We have also provided the list of genes that optimally discriminate LOAD and MCI from HC, which include the discriminatory genes that were found in individual discriminations of LOAD and MCI vs. HC.
In conclusion, the retrospective analysis of this LOAD-MCI dataset using novel sampling approaches provides a new working hypothesis for the main genetic mechanisms involved in LOAD-MCI. Based on these findings we propose a set of repositioned drugs by using CMAP methodologies. The main drugs belong to the category of isoquinoline alkaloids, antitumor antibiotics, PI3K and autophagy inhibitors, antagonists of the muscarinic acetylcholine receptor, and histone deacetylase inhibitors. The mechanisms of action of these drugs are also detailed and correlated with specific pathways.
We believe that the potential clinical relevance of these findings should be further investigated and confirmed with clinical studies of other independent cohorts using similar platforms to avoid possible biases. Funding: This research was funded by NSF grant DBI 1661391, and NIH grants R01 GM127701 and R01 GM127701-01S1.