Gene Network Analysis of Alzheimer’s Disease Based on Network and Statistical Methods

Gene network associated with Alzheimer’s disease (AD) is constructed from multiple data sources by considering gene co-expression and other factors. The AD gene network is divided into modules by Cluster one, Markov Clustering (MCL), Community Clustering (Glay) and Molecular Complex Detection (MCODE). Then these division methods are evaluated by network structure entropy, and optimal division method, MCODE. Through functional enrichment analysis, the functional module is identified. Furthermore, we use network topology properties to predict essential genes. In addition, the logical regression algorithm under Bayesian framework is used to predict essential genes of AD. Based on network pharmacology, four kinds of AD’s herb-active compounds-active compound targets network and AD common core network are visualized, then the better herbs and herb compounds of AD are selected through enrichment analysis.


Introduction
Alzheimer's disease (AD) is a chronic age-associated neurodegenerative disorder, and there are no definitive treatments or prophylactic agents. Its pathological features include senile plaque, nerve fiber tangles, and massive loss of neurons [1]. As its pathogenesis is not clear, clinical drugs used commonly can only relieve symptoms within a certain period of time but cannot improve the disease fundamentally.
Network pharmacology is associated with drug targets and human disease genes. On the basis of understanding the "drug-target gene-disease gene" network, the effects of different drugs on different target proteins are evaluated by using network analysis methods [2,3].
Many different computational methods have been employed for the different application fields. Gianni D'Angelo and Francesco Palmieri proposed a novel autoencoder-based deep neural network architecture, where multiple autoencoders are embedded with convolutional and recurrent neural networks to elicit relevant knowledge about the relations existing among the basic features (spatial-features) and their evolution over time [4]. Gianni D'Angelo and Francesco Palmieri described the use of Genetic Programming for the diagnosis and modeling of aerospace structural defects. The resulting approach aims at extracting such knowledge by providing a mathematical model of the considered defects, which can be used for recognizing other similar ones [5]. Zhang et al. proposed a Bayesian regression approach to explain similarities of disease phenotypes by using diffusion kernels of one or several protein-protein interaction (PPI) networks [6]. Chen et al. proposed two improved Markov random field (MRF) algorithms, which can automatically assign weights to different data sources, using Gibbs sampling processes [7,8]. Chen et al. proposed a fast and high-performance multiple data integration algorithm [9] for identifying human disease genes, the logistic regression based algorithm is extended to the multiple data integration case, where the parameters (weights) of different data sources can be tuned automatically.

Prediction of Functional Gene Modules
The correlation between AD original network and divided module network is discussed based on gene function enrichment analysis and association indices [29,30]. Jaccard association index is often used to evaluate the functional correlation between each module and the original network [29]. In addition, Fuxman Bass Juan et al. survey many association indices, such as Simpson, Geometric, Cosine, PCC (Pearson Correlation Coefficient) [31]. Zhu and Qiao et al. further extend the PCC association index to measure the correlation between each module and the function of the original network [32], as shown in Table 1. Table 1. Correlation index.

Correlation Index Formula Meaning
Jaccard The range of values is [0, 1], and the closer it is to 1, the stronger the correlation. Simpson Where, O represents the set of the original network pathways; C i represents the set of the i-th module pathways after partition.

Screening of Essential Genes
Research on the essential genes can help us to understand the biology of the disease. Various tools have been developed to predict and judge the essential genes in the network [33]. In this paper, the network topology attributes of functional modules are analyzed by 11 indexes of Cyto-Hubba [33], such as degree centrality (DC), betweenness centrality (BC), closeness centrality (CC), density of maximum neighborhood component (DMNC), maximum neighborhood component (MNC), bottleneck (BN), edge percolated component (EPC), maximum clique centrality (MCC), edge clustering coefficient (ECC), radiality and clustering coefficient.

Integrated Algorithm for Predicting Essential Genes
Chen et al. proposed a fast and high-performance multiple data integration algorithm for identifying human disease genes [9]. The disease gene identification problem was first expressed as a two-classification problem, and the feature vectors of each gene were extracted from the integrated network. Combined with the binary logistic regression model, maximum likelihood estimation and Bayesian idea, the model parameters are estimated, and the posterior probability of each gene was calculated. The final decision score was obtained by calculating the percentage of individual posterior probability.

Acquisition of Priori Probability of Genes
Suppose the integrated network contains genes g 1 . . . g n+m , in which g 1 . . . g n are the unknown ones and g n+1 . . . g n+m are the known ones in the OMIM database. Similar to the method used in references [8,9], for i = 1 . . . n, if g i belongs to the protein complex, then let its prior probability be: where A denotes the number of AD genes in the complex and B denotes the number of all disease genes in the complex. If g i does not belong to the protein complex, then let its prior probability be: where C is the number of all known genes of AD and D is the total number of human genes. Then, generate a random number following the standard uniform distribution U(0, 1). If the numerical value of the random number is larger than P i , then assign 0 as the prior label for g i . Otherwise, assign 1 as the prior label for g i . The prior probability of AD genes in the OMIM database areP n+i = 1, i = 1 . . . m.
Obtain Feature Vectors according to the Integrated Network and Binary Labels Only considering direct neighbors to construct feature vectors limits the capability of the method to use other topological attributes in a biological network. Therefore, the number of second order neighbors (indirectly connected) are employed to construct the feature vector [9] as: where, ϕ i1 and ϕ i0 are the number of direct neighbors of g i that are connected to vertices with labels 1 and 0, ϕ i1 and ϕ i0 are the numbers of the second-order neighbors of g i that are connected to vertices with labels 1 and 0. All feature vectors of individual genes together form a feature matrix as: Estimate Parameters and Calculate the Posterior Probability Given a prior configurationX for all vertices, a maximum likelihood estimation (MLE) method can be used to estimate the parameter vector ω.
Parameter vector can be written as: The likelihood function can be written as: The logistic sigmoid function can be written as: Among them, the linear function The log likelihood function of (7) can be written as: From (8) and (10), we get ln L(ω; x 1 , Then, a unique global optimal solution can be found by solving a convex optimization problem. The parameter vector ω is obtained by calculating the maximum value of (11). Then calculate the posterior probability of each gene from (8) and (9).

Get Decision Score
Considering that a gene has a higher decision score than most genes, it is more likely to be associated with the disease. Therefore, the final decision score is obtained by using the percentage value of the posterior probability [9]. The decision score is calculated as follows: where P i is the posterior probabilities of each gene and q i is the top percentage value of P i among all those posterior probabilities.

Herb-Active Compound-Target Network
In Supplementary Table S2, 14 kinds of herb compound targets are described. Figure 1 shows the network of four herb-active compounds-target genes. In each sub-image, from the inside to the outside, there are herbs, active compounds, ingredients of the active compound and associated target genes. These active compounds and their ingredients are represented by the same color. In Figure 1a, the blue circle stands for herb KXS. The red triangle, green triangle, and yellow triangle stand for KXS' active compounds Poria Cocos(Schw.) Wolf. (PCW), Panax Ginseng C. A. Mey. (PGCAM), Acoritataninowii Rhizoma (AR), respectively. The red hexagon, green hexagon, and yellow hexagon stand for ingredients of PCW, ingredients of PGCAM, ingredients of AR, respectively. Blue diamond stands for target genes associated with these ingredients. In Figure 1b, blue circle stands for herb DGSYS. Red triangle, purple triangle, navy blue triangle, wathet blue triangle, green triangle, yellow triangle stand for DGSYS' active compounds Chuanxiong Rhizoma (CXR), Paeoniae Radix Alba (PRA), Angelicae Sinensis Radix (ASR), PCW, Alisma Orientale(Sam.) Juz. (AOJ), Atractylodes Macrocephala Koidz. (AMK), respectively. Red hexagon, purple hexagon, navy blue hexagon, wathet blue hexagon, green hexagon, yellow hexagon stand for ingredients of CXR, ingredients of PRA, ingredients of ASR, ingredients of PCW, ingredients of AOJ, and ingredients of AMK, respectively. Blue diamond stands for target genes. In Figure 1C, blue circle stands for herb YGS. Red triangle, purple triangle, navy blue triangle, wathet blue triangle, green triangle, and yellow triangle stands for YGS' active compounds AMK, CXR, ASR, PCW, Radix Bupleuri (RB), Uncariae Ramulus Cumuncis (URC), respectively. Red hexagon, purple hexagon, navy blue hexagon, wathet blue hexagon, green hexagon, and yellow hexagon stand for ingredients of AMK, ingredients of CXR, ingredients of ASR, ingredients of PCW, ingredients of RB, ingredients of URC, respectively. Blue diamond stand for target genes. In Figure 1d, blue circle stands for herb YQTYT. Red triangle, purple triangle, navy blue triangle, wathet blue triangle, green triangle, yellow triangle, and orange triangle stands for YQTYT' active compounds Hedysarum Multijugum Maxim. (HMM), CXR, ASR, PGCAM, Radix Salviae (RS), Radix Paeoniae Rubra (RPR), Codonopsis Radix (CR), respectively. Red hexagon, purple hexagon, navy blue hexagon, wathet blue hexagon, green hexagon, yellow hexagon, and orange hexagon stand for ingredients of HMM, ingredients of CXR, ingredients of ASR, ingredients of PGCAM, ingredients of RS, ingredients of RPR, ingredients of CR, respectively. Blue diamond stands for target genes. green triangle, yellow triangle, and orange triangle stands for YQTYT' active compounds Hedysarum Multijugum Maxim. (HMM), CXR, ASR, PGCAM., Radix Salviae (RS), Radix Paeoniae Rubra (RPR), Codonopsis Radix (CR), respectively. Red hexagon, purple hexagon, navy blue hexagon, wathet blue hexagon, green hexagon, yellow hexagon, and orange hexagon stand for ingredients of HMM, ingredients of CXR, ingredients of ASR, ingredients of PGCAM, ingredients of RS, ingredients of RPR, ingredients of CR, respectively. Blue diamond stands for target genes.

AD Gene Network Construction
First, we collect AD-associated genes from NCBI database, OMIM database, and TTD database, and eliminated data duplications. Then 859 AD-associated genes are obtained. A disease gene network was constructed using the STRING database (input the above genes and select Homo sapiens, Figure 2a), which consists of 746 genes and 10920 edges. In addition, another PPI network is obtained from the IntAct database. Then, an initial integrated network, which includes 4210 genes and 21664 edges, is generated by merging the above interaction networks.

AD Gene Network Construction
First, we collect AD-associated genes from NCBI database, OMIM database, and TTD database, and eliminated data duplications. Then 859 AD-associated genes are obtained. A disease gene network was constructed using the STRING database (input the above genes and select Homo sapiens, Figure 2a), which consists of 746 genes and 10,920 edges. In addition, another PPI network is obtained from the IntAct database. Then, an initial integrated network, which includes 4210 genes and 21,664 edges, is generated by merging the above interaction networks. Similar to the method used in reference [9], considering the expression status of 13,416 human gene products and containing 79 human tissues in the GEO database (GSE1133), the PCC value between genes is calculated. A pair of genes are linked by an edge if the PCC value is larger than 0.5. Therefore, the gene co-expression network is constructed. Then, we select those genes and edges that appeared in two biological networks (an initial integrated network and a gene co-expression network). The information of AD pathway is added to the integrated network. Pathway datasets are obtained from the database of KEGG and another AD network generated based on three mini metabolic networks [34]. A pair of genes are linked by an edge if they co-exist in any pathway or network. Finally, a multi-database integrated network includes 2017 genes, and 85,152 edges is obtained (Figure 2b). In Figure 2, nodes stand for AD-associated genes from multiple databases, edge of a pair of nodes stand for interaction between nodes.

Module Partition
The integrated network is divided into modules by Cluster one, MCL, Glay and MCODE. The gene network modules under different division methods are obtained, the corresponding network entropy is calculated ( Table 2). The AD network is divided into 18 modules by MCODE method, its network entropy is 6.05, which is the lowest. Therefore, MCODE is the optional division method. The score of each module based on MCODE method is defined as the product of the density of the subgraph and the number of vertices (genes) in the sub-graph (DC × |V|), which reflects the density of each node in the modules [12]. The number of genes and score of each module are shown in Table 3 (ignoring a single gene).

Calculation of Association Indices
In order to explore the correlation between the original network and the divided module network in biological function, KEGG enrichment analysis is carried out on the original AD network and module networks. The final results show that there are 146 pathways involved in the original AD network. These modules, divided by MCODE method, cover 136 pathways with a coverage rate of 93.16%. It shows that these divided modules can express most of the functions of the original AD network.
We count pathway numbers of each module by enrichment analysis, intersection numbers of pathway numbers between original network and each module, union numbers of pathway numbers between original network and each module, the gene proportion of each module in original network, shown in Table 4. Some modules (12, 14 and 18) are enriched to 0 pathways, so they are ignored. Module 1 contains 400 genes, accounting for 20.04% of the total number of genes, and it can be enriched to 132 pathways, 128 of them are consistent with the original network pathways. Some association indices (Jaccard, Simpson, Geometry, Cosine and PCC), are calculated shown in Figure 3. Module 1 is key module in the AD gene network.

Prediction of Essential Genes
Essential genes can perform their function to a greater extent than other genes in the disease gene network. Module 1 is most representative in AD division modules by MCODE. We use 11 network ranking indexes in Cyto-Hubba to sort the genes in module 1 and select the top 100 genes in each index. Those genes that appear more than six times in the top 100 genes are selected as essential genes of AD. Table 5 shows the repetition times of the genes in module 1 by 11 algorithms.

Prediction of Essential Genes
Essential genes can perform their function to a greater extent than other genes in the disease gene network. Module 1 is most representative in AD division modules by MCODE. We use 11 network ranking indexes in Cyto-Hubba to sort the genes in module 1 and select the top 100 genes in each index. Those genes that appear more than six times in the top 100 genes are selected as essential genes of AD. Table 5 shows the repetition times of the genes in module 1 by 11 algorithms. These genes contain many known AD disease genes: APP, ACHE, ADAM10, APOE, CHRM1, CHRM3, PSEN1, PSEN2 and so on. In addition, CHAT, DR6, NFKB, BACE1, IDE, PP2A, GSK3B appear in the metabolic network of AD [34] (Table 5), which shows that module 1 is a key module and can be used to predict essential genes instead of the original network.

Integrated Algorithm for Predicting Essential Genes
The posterior probability of candidate genes in AD disease network are calculated by an integrated algorithm. Table 6 shows the relevant information of the top 30 candidate genes (2017 candidate genes in total).  Table 6 shows that the known AD genes APP, ADAM10, ACHE and APOE are in the prediction results. Further, the receiver operating characteristic (ROC) analysis is employed as the evaluation criteria to confirm the performance advantage of Integrated algorithm by varying a threshold for determining positives. The first positive control genes are those known AD disease genes from the pathway of KEGG (hsa05010: Alzheimer's disease), the negative control genes are selected from leukemia genes and diabetes genes that do not associate with AD genes in an integrated network. The relationship between the true positive rate (TPR) and the false positive rate (FPR) is shown with a blue line in Figure 4; the area under the ROC curve (AUC) is 0.984. The second positive control genes are those disease genes from the AD network generated based on three mini metabolic networks [34], the negative control genes are selected from leukemia genes and diabetes genes that do not associate with AD genes in an integrated network, the relationship between TPR and FPR is shown with a green line in Figure 4, the AUC is 0.916. These results demonstrate that Integrated algorithm can identify essential genes of AD.

Screening of Essential Genes of AD
The essential genes are obtained by using the modular network algorithm and in grated algorithm. The common genes between the above two algorithms as final essent genes of AD (Table 7). Table 7. Predicted essential genes for AD.

Screening of Essential Genes of AD
The essential genes are obtained by using the modular network algorithm and integrated algorithm. The common genes between the above two algorithms as final essential genes of AD (Table 7). Table 7. Predicted essential genes for AD.

Enrichment Analysis of Herb Compound Target
The Gene Ontology (GO) enrichment analysis (including Biological Process (BP), Cell Component (CC), Molecular Function (MF)) and KEGG pathway enrichment analysis are described in Supplementary Tables S3-S6, these similar genes can be enriched into ADassociated pathways, which indicates that these similar genes are significantly associated with a response to AD.
We count the number of similar genes between the target genes of the herb compound and essential genes of AD, the number of GO enrichment analysis items and the number of KEGG pathway enrichment analysis items, as shown in Figure 6.

Enrichment Analysis of Herb Compound Target
The Gene Ontology (GO) enrichment analysis (including Biological Process (BP), Cell Component (CC), Molecular Function (MF)) and KEGG pathway enrichment analysis are described in Supplementary Table S3-S6, these similar genes can be enriched into ADassociated pathways, which indicates that these similar genes are significantly associated with a response to AD. We count the number of similar genes between the target genes of the herb compound and essential genes of AD, the number of GO enrichment analysis items and the number of KEGG pathway enrichment analysis items, as shown in Figure 6. We can see from Figure 6 that YQTYT achieves the best performance in GO enrichment analysis items and KEGG pathway enrichment analysis items, while the number of similar genes between YQTYT and essential genes of AD is 17, which indicates that YQTYT is the best herb in four kinds of AD herbs, and YQTYT may have a better therapeutic effect on AD. We can see from Figure 6 that YQTYT achieves the best performance in GO enrichment analysis items and KEGG pathway enrichment analysis items, while the number of similar genes between YQTYT and essential genes of AD is 17, which indicates that YQTYT is the best herb in four kinds of AD herbs, and YQTYT may have a better therapeutic effect on AD.
Furthermore, we count the number of similar genes between the target genes of YQTYT compound and genes of AD pathway in the KEGG (hsa05010: Alzheimer's disease), compound HMM has the largest number of similar genes, followed by compound RS (Figure 7), so HMM and RS are both contributive to the treatment of AD. We can see from Figure 6 that YQTYT achieves the best performance in GO enrichment analysis items and KEGG pathway enrichment analysis items, while the number of similar genes between YQTYT and essential genes of AD is 17, which indicates that YQTYT is the best herb in four kinds of AD herbs, and YQTYT may have a better therapeutic effect on AD.
Furthermore, we count the number of similar genes between the target genes of YQTYT compound and genes of AD pathway in the KEGG (hsa05010: Alzheimer's disease), compound HMM has the largest number of similar genes, followed by compound RS (Figure 7), so HMM and RS are both contributive to the treatment of AD.

Conclusions
Currently, herbs have an effect on some diseases such as AD, nephropathy. Herbs are more systematic and holistic. However, some studies are still applying the traditional research idea, "one drug-one target-one illness", which ignores the multi-target and multicomponent characteristic of herbs. In this paper, the gene network of AD is constructed by considering some factors such as gene co-expression and metabolic relationship. The modular network algorithm, the logical regression algorithm under Bayesian framework and maximum likelihood estimation, which simplify the gene network and find essential genes highly associated with the AD. By using the idea of network pharmacology, YQTYT is the best herb in four kinds of AD herbs, and YQTYT may have a better therapeutic effect on AD. In addition, HMM and RS are selected as the better herb compounds for AD based on gene function enrichment analysis. Which means the herb compounds may play a major role in the treatment of AD.
Therefore, network pharmacology, network science, machine learning and statistical strategy are expected to find multi-target herb and herb components for the treatment of AD. Theoretical knowledge is provided for the follow-up study of herbs in the treatment of AD, and a feasible scheme is provided for the study of "drug-target-disease".  Table S4-2: GO Enrichment analysis (BP) of similar genes between target genes of DGSYS compound and essential genes of AD, Table S4-3: GO Enrichment analysis (CC) of similar genes between target genes of DGSYS compound and essential genes of AD, Table S4-4: GO Enrichment analysis (MF) of similar genes between target genes of DGSYS compound and essential genes of AD; Table S5-1: KEGG Enrichment analysis of similar genes between target genes of YGS compound and essential genes of AD, Table S5-2: GO Enrichment analysis (BP) of similar genes between target genes of YGS compound and essential genes of AD, Table S5-3: GO Enrichment analysis (CC) of similar genes between target genes of YGS compound and essential genes of AD, Table S5-4: GO Enrichment analysis (MF) of similar genes between target genes of YGS compound and essential genes of AD; Table S6-1: KEGG Enrichment analysis of similar genes between target genes of YQTYT compound and essential genes of AD, Table S6-2: GO Enrichment analysis (BP) of similar genes between target genes of YQTYT compound and essential genes of AD, Table S6-3: GO Enrichment analysis (CC) of similar genes between target genes of YQTYT compound and essential genes of AD, Table S6

Data Availability Statement:
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest:
The authors declare no conflict of interest.