Multi-Data Aspects of Protein Similarity with a Learning Technique to Identify Drug-Disease Associations

Abstract: Drug repositioning has been proposed to develop drugs for diseases. However, the similarity in a single aspect may not be sufficient to reveal hidden information. Therefore, we established protein-protein similarity vectors (PPSVs) based on potential similarities in various types of biological information associated with proteins, including their network topology, proteomic data, functional analysis, and druggable property. Based on the proposed PPSVs, a separate drug-disease matrix was constructed for each individual disease to prevent characteristics from being obscured between diseases. A classification technique was employed for prediction. The results showed that more than half of the tested disease models exhibited high performance, with overall F1 scores of more than 80%. Furthermore, when all diseases were compared with traditional methods in one run, we obtained an area under the curve (AUC) of 98.9%. The candidate drugs were then validated against clinical trial data ($p$-value < $2.2 \times 10^{-16}$) and against known drugs based on their functions ($p$-value < 0.05). An analysis revealed that, in the functional aspect, the confidence value of an interaction in the protein-protein interaction network and the functional pathway score were the best descriptors for prediction. Based on the learning processes of PPSVs with an isolated disease, the classifier exhibited high performance in predicting and identifying new potential drugs for that disease.


Introduction
On average, it takes at least 10 years from the discovery and development of a candidate drug to commercialize it for the treatment of a disease. Drug discovery requires significant investment, and the probability of success remains low at less than 10% [1]. The process of drug discovery consists of (i) discovery and preclinical screening of compounds that affect a disease in the laboratory, (ii) safety review to confirm the safe usage, (iii) clinical research phase I, (iv) clinical research phase II, (v) clinical research phase III, in which the drug is tested on human subjects, (vi) Food and Drug Administration (FDA) review of the drug and subsequent approval, and (vii) FDA post-market safety monitoring of the drug [2]. Given this long process, alternative routes for the discovery of new drugs for a disease are required.
Drug repositioning, also known as drug repurposing, is a well-known strategy for finding new indications for an existing drug [2,3]. The process of drug repositioning consists of (i) compound identification to screen the candidate drug for use, (ii) compound acquisition to optimize candidate compounds, (iii) compound development to ensure the [...]
Most computational methods for predicting drug repositioning are based on guilt-by-association, using observed similarities between drug-related proteins or genes to infer indications for similar diseases. In particular, the type of relationship among genes, i.e., whether it is positive (e.g., activation) or negative (e.g., inhibition), is important information that has been proposed for use in modeling drug repositioning. These relationships are calculated from gene expression profiles, particularly the relationship between target genes and disease genes for drug repositioning [15]. However, gene expression data for the interaction between target genes and disease genes are not always available, and these missing data can hamper model construction for drug repositioning.
Lee and Yoon [16] integrated existing gene networks from several databases, including BioCarta [17], Reactome [18], the Pathway Interaction Database (PID) [19], and the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway [20], to calculate the shortest path score between drug-related genes (or target genes) and disease genes for both positive and negative interaction types. Because several target genes can be influenced by a drug, the interaction types between drugs and target genes were determined in order to combine the shortest path scores for target genes that were activated by a disease gene for each drug. If the interaction type was positive (e.g., an activator, stimulator, or inducer), the shortest path score was multiplied by +1; otherwise, it was multiplied by −1. The sum of all of the shortest path scores expressed the value of a drug when associated with a disease gene. This technique provided a means of assessing the performance of a drug in treating a disease.
The objective of our research was to develop a drug repurposing approach that could be used to predict novel drugs for the treatment of a target disease using multiple biological characteristics summarized in protein-protein similarity vectors (PPSVs), which identify similarities between drug-related proteins and disease-related proteins. The PPSVs were based on four types of biological information: the topological network, proteomic data, functional analysis, and druggable property. The hypothesis of this research was that if a target protein is similar to a disease-related protein, then the drug that relates to the target protein may interact with the disease-related protein. In other words, the similarities between target proteins and disease-related proteins in several biological characteristics can be used to indicate whether a drug has the potential to treat a specific disease. Moreover, every disease affects the structure or function of specific characteristics, such as causes and symptoms; thus, designing a model specific to the disease may be a more effective approach. This research thus employed a classification technique based on a random forest algorithm to predict candidate drugs for specific diseases.
The rest of the paper is structured as follows. Section 2 describes the proposed method for predicting drugs used to treat individual diseases based on four biological characteristics included in the PPSVs. Section 3 summarizes the performance score of the model and the results from the validation of the candidate drug-disease pairs using information from clinical trials and in terms of functional similarity and compares its performance with other existing methods. Gold features, i.e., those that are most important for predicting whether a drug can be used to treat a specific disease, and novel drug-disease pairs are also discussed in this section. Section 4 describes the results and the limitations of this research. A conclusion is provided in Section 5.

Materials and Methods
In the present study, the workflow for the prediction of promising drugs for the treatment of a disease consisted of seven steps (Figure 1). First, three types of association, namely (1) the drug and target protein, (2) the disease and disease-related protein, and (3) the drug and disease, were collected from curated databases. Second, PPSVs were generated to determine similarities between target proteins and disease-related proteins. Third, a drug-disease matrix was constructed for each disease based on the PPSV features. Fourth, a classification model was generated to predict candidate drugs for each disease. Fifth, the candidate drugs were validated based on experimental knowledge from clinical trials and previous literature and an analysis of their functional similarities. Sixth, the performance of the proposed model was compared to existing methods. Finally, the gold features, i.e., the most important descriptors for the prediction, were determined to indicate the characteristics that are most relevant to the drug repositioning approach.

Dataset
The human protein-protein interaction (PPI) network was constructed using interaction information from the Search Tool for the Retrieval of Interacting Genes/Proteins (STRING) v.11 database [21]. Only interactions with confidence scores higher than 0.8 were selected to construct the network. The associations between approved drugs and diseases were taken from the Comparative Toxicogenomics Database (CTD) [22], which collects the chemical and phenotypic characteristics of drugs. Only drug-disease pairs supported by therapeutic evidence were used for our analysis. The DrugBank database was used to map approved drugs, their target proteins, and their genes [23]. The groups of genes or proteins associated with specific diseases were extracted from a database of gene-disease associations (DisGeNET) [24]. Overall, we obtained 14,264 known drug-disease associations involving 1317 approved drugs and 478 diseases and employed these associations as positive samples. The remaining 615,262 drug-disease associations from the combination of the approved drugs and diseases were classified as negative samples.

Protein-Protein Similarity Vectors (PPSVs)
A PPSV is a feature vector of similarities between target proteins and disease-related proteins. The PPSVs were generated based on four types of biological information: the topological network, proteomic data (protein sequencing), functional analysis, and druggable property. An overview of a PPSV in four information types is summarized in Figure 2.

Topological Network Information
We investigated the human PPI networks in two ways:

1. Neighboring similarity score, $\mathrm{Nei}(P_j, P_k)$: This score represents the cosine similarity between two proteins in terms of their common neighboring proteins, or partners. It is calculated as the dot product of the neighbor vectors of the two observed proteins, divided by the product of the magnitudes of the two vectors, as shown in Equation (1):

$$\mathrm{Nei}(P_j, P_k) = \frac{N(P_j) \cdot N(P_k)}{\lVert N(P_j) \rVert \, \lVert N(P_k) \rVert} \qquad (1)$$

where $P_j$ and $P_k$ are proteins $j$ and $k$, respectively, and $N(P_j)$ and $N(P_k)$ are the vectors of neighboring proteins of proteins $j$ and $k$, respectively. For every protein $P_u$ in the network with $P_u \neq P_j$, the corresponding element of $N(P_j)$ is 1 if $P_u$ is a neighboring protein of $P_j$ and 0 otherwise. The $\mathrm{Nei}(P_j, P_k)$ score ranges between 0 and 1. A score of 0 indicates that the two proteins have no common partners, while a score of 1 indicates that the two proteins share all of their partners. A high neighboring similarity score indicates that the two proteins have many common neighbors in the human PPI network. This score is a good indication that the two proteins may cooperate within the same module to regulate the same functional task(s).

2. Closeness score, $\mathrm{Closer}(P_j, P_k)$: This score represents the closeness between two proteins based on the length of the shortest path between them in the PPI network. It is calculated as shown in Equation (2), where $D(P_j, P_k)$ is the length of the shortest path between proteins $j$ and $k$ in the human PPI network. The closeness score also ranges between 0 and 1. A protein compared with itself is assigned a score of 1, and any two disjoint proteins are assigned a score of 0.
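The neighboring similarity score in item 1 can be illustrated with a minimal sketch. It assumes each neighbor vector is binary, so the dot product reduces to the number of shared partners and each magnitude to the square root of the number of partners. The function names and the toy network below are hypothetical and are not taken from the authors' R code.

```python
import math
from collections import defaultdict


def build_neighbors(edges):
    """Map each protein to the set of its direct partners (self-loops ignored)."""
    neighbors = defaultdict(set)
    for a, b in edges:
        if a != b:
            neighbors[a].add(b)
            neighbors[b].add(a)
    return neighbors


def neighbor_cosine(neighbors, p_j, p_k):
    """Cosine similarity between the binary neighbor vectors of two proteins."""
    n_j = neighbors.get(p_j, set())
    n_k = neighbors.get(p_k, set())
    if not n_j or not n_k:
        return 0.0  # a protein without partners shares nothing
    # For 0/1 vectors: dot product = |shared partners|, norm = sqrt(|partners|).
    return len(n_j & n_k) / math.sqrt(len(n_j) * len(n_k))


# Toy PPI network (illustrative protein names only).
edges = [("A", "C"), ("A", "D"), ("B", "C"), ("B", "D"), ("B", "E")]
ppi = build_neighbors(edges)
score_ab = neighbor_cosine(ppi, "A", "B")  # A and B share C and D
score_cd = neighbor_cosine(ppi, "C", "D")  # identical partner sets
score_ae = neighbor_cosine(ppi, "A", "E")  # no shared partners
```

In this toy network, proteins C and D have identical partner sets and therefore reach the upper bound of the score, while A and E share no partner and score 0.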

Protein Sequencing
We conducted both local and global alignment to compare any two protein sequences.

1. Local alignment score, $\mathrm{Loc}(P_j, P_k)$: This score is obtained from the best-scoring alignment of regions within two protein sequences, i.e., the most similar regions of the sequences for proteins $j$ and $k$ are aligned. The local alignment score represents the similarity in the structure, function, and evolution of two protein sequences in some regions. A drug that can bind to a certain region of a protein could also bind to another protein if that protein has a similar region [25].

2. Global alignment score, $\mathrm{Glo}(P_j, P_k)$: This score is obtained from the alignment of the entire protein sequences for proteins $j$ and $k$. The global alignment score may also reflect similarities in the protein structure, function, or evolution of the two protein sequences.

Both local and global alignments were performed using the Biostrings package in the R language based on the BLOSUM62 substitution matrix with a gap opening penalty of 10 and a gap extension penalty of 0.5. These parameters are the same as the default values in the EMBOSS water tool from the European Bioinformatics Institute.

Pathway and Functional Analysis
We first counted the number of pathways shared by any two proteins as the pathway score, $\mathrm{PW}(P_j, P_k)$, using the pathway annotations of the two proteins in the KEGG database [20,26,27]. The pathway score is calculated using Equation (3):

$$\mathrm{PW}(P_j, P_k) = \lvert \mathrm{PW}(P_j) \cap \mathrm{PW}(P_k) \rvert \qquad (3)$$

where $\mathrm{PW}(P_j)$ and $\mathrm{PW}(P_k)$ are the sets of pathways found for proteins $j$ and $k$, respectively. A high pathway score indicates that two proteins operate in the same functional modules. In the same manner, a certain drug might disturb the pathway of proteins within the same functional modules during the treatment of a disease.
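As a small illustration of the pathway score, the sketch below counts the pathways shared by two proteins, following the description above. The protein names and their pathway memberships are invented for illustration; only the counting logic matters here.

```python
def pathway_score(pathways_j, pathways_k):
    """Number of pathways in which both proteins participate."""
    return len(set(pathways_j) & set(pathways_k))


# Hypothetical pathway annotations (memberships are made up for this example).
protein_pathways = {
    "P1": {"hsa04010", "hsa04151", "hsa05200"},
    "P2": {"hsa04151", "hsa05200"},
    "P3": {"hsa00010"},
}

pw_12 = pathway_score(protein_pathways["P1"], protein_pathways["P2"])
pw_13 = pathway_score(protein_pathways["P1"], protein_pathways["P3"])
```

Because the score is a raw count, it is unbounded above and needs the rescaling step applied later to the drug-disease matrix.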
To determine the co-functions of any two proteins, we directly used the confidence score $\mathrm{Conf}(P_j, P_k)$, which represents the approximate probability that there exists an interaction between two proteins if both proteins are in the same metabolic module within the KEGG database [26]. The $\mathrm{Conf}(P_j, P_k)$ score between proteins $j$ and $k$ was retrieved from the STRING v.11 database [21]. High confidence scores represent a high possibility of an association between two proteins. Thus, a drug that can bind to one target protein may also disturb the functional modules of the other target protein related to other diseases. This provides evidence for the potential of repurposing a certain drug to treat other diseases. The confidence score is multiplied by 1000, giving a range of 0 to 1000. A protein compared with itself is assigned a score of 1000, and any two disjoint proteins are assigned a score of 0.

Druggable Property
We investigated the drugs associated with two target proteins by recording the number of common drugs that bind to both proteins. This score is defined as the shared drug score, $\mathrm{ShareDr}(P_j, P_k)$, where $P_j$ and $P_k$ are any two proteins. A high score indicates that the two proteins share a high number of approved drugs that bind to them.
A PPSV feature between proteins $j$ and $k$ is denoted as $\mathrm{PPSV}(P_j, P_k)$. This feature is a vector containing the seven similarity scores: $\mathrm{Nei}(P_j, P_k)$, $\mathrm{Closer}(P_j, P_k)$, $\mathrm{Loc}(P_j, P_k)$, $\mathrm{Glo}(P_j, P_k)$, $\mathrm{PW}(P_j, P_k)$, $\mathrm{Conf}(P_j, P_k)$, and $\mathrm{ShareDr}(P_j, P_k)$. Subsequently, these features were used to represent the similarities between a target protein and a disease-related protein. The R source code used to calculate these PPSV features is available at https://github.com/ksatanat/PPSV (accessed on 13 March 2021). All of these features were used to construct a drug-disease matrix in which a comparison of the PPSV values for all corresponding proteins was conducted, and the final values for each feature were then rescaled to a relative standard value between 0 and 1.

Constructing Drug-Disease Matrix
The three steps used to construct the drug-disease matrix in this study are summarized in Figure 3. First, a PPSV table containing the PPSV features for a drug target protein and a disease-related protein was constructed. For a drug-disease pair, the features for all combinations of the drug target proteins and disease-related proteins were calculated. An example of a PPSV table for a drug is presented in Figure 3a. Second, the maximum PPSV score over all target proteins of a drug was identified (Figure 3b). The maximum PPSV score was selected as the similarity vector score for a drug and a disease-related protein. This vector for the drug and disease-related protein represents the similarities between all target proteins and a disease-related protein based on their topological PPI networks, proteomic data for sequence alignments, functional analysis, and druggable properties of proteins. Third, as shown in Figure 3c, the same process was employed for all combinations of drugs and disease-related proteins. In this way, a drug-disease matrix was constructed for each single disease. All of the values in the column vectors of the matrix were rescaled to a range between 0 and 1. Eventually, we obtained a total of 478 matrices (for 478 diseases), each with 1315 rows (one per drug) and a number of columns equal to the size of its PPSV, which depends on the number of disease proteins.
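The three construction steps can be sketched as follows. This is a minimal illustration under two assumptions that the text does not state explicitly: the maximum is taken feature by feature over the target proteins of a drug, and the rescaling is a min-max scaling per column. The function names and the two-feature toy vectors (a real PPSV has seven features) are hypothetical.

```python
def drug_row(targets, disease_proteins, ppsv):
    """Build one matrix row: per disease protein, the feature-wise maximum
    PPSV over all target proteins of the drug, concatenated in order."""
    row = []
    for d in disease_proteins:
        vectors = [ppsv[(t, d)] for t in targets]
        row.extend(max(feature) for feature in zip(*vectors))
    return row


def rescale_columns(matrix):
    """Min-max rescale every column to [0, 1]; constant columns become 0."""
    scaled_columns = []
    for column in zip(*matrix):
        lo, hi = min(column), max(column)
        span = hi - lo
        scaled_columns.append(
            [0.0 if span == 0 else (v - lo) / span for v in column]
        )
    return [list(row) for row in zip(*scaled_columns)]


# Toy PPSV table: (target protein, disease protein) -> 2 features.
ppsv = {
    ("T1", "D1"): (0.2, 5),
    ("T2", "D1"): (0.6, 1),
    ("T3", "D1"): (0.4, 3),
}
drug_targets = {"drugA": ["T1", "T2"], "drugB": ["T3"]}
disease_proteins = ["D1"]

raw_matrix = [
    drug_row(targets, disease_proteins, ppsv)
    for targets in drug_targets.values()
]
matrix = rescale_columns(raw_matrix)
```

Here drugA takes 0.6 from T2 and 5 from T1, which shows that the maxima of different features may come from different target proteins under the feature-wise assumption.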

Predicting Candidate Drugs for a Disease
For a drug-disease matrix, all of the values in each column vector were rescaled to a range between 0 and 1. Known drug-disease associations were given a positive label; all other pairs were given a negative label. A random forest classifier was employed as the predictive model to identify candidate drugs for the treatment of a disease. In total, we obtained 478 models, one for each investigated disease. The Python source code for building the prediction model is provided at https://github.com/ksatanat/PPSV (accessed on 13 March 2021).
Each drug-disease matrix was split by randomly selecting 20% of all drug-disease pairs as a test set, with the remaining 80% assigned to a training set. The split was stratified so that both sets had the same proportion of each label. To obtain a balanced data set for each machine, a bootstrap randomization technique was used to draw five sets of negative samples from the training data, one for each of five machines. We performed 5-fold cross-validation to avoid overfitting of each machine. For each fold with the same number of trees, a grid search cross-validation technique [28] with 5-fold cross-validation was employed to identify the best hyperparameters of the forest. The parameters ranged from 50 to 300 in increments of 50. For each disease, each drug-disease pair from the test set was applied to the five machines to obtain five prediction scores, and the average of these prediction scores was calculated for the pair. Subsequently, the area under the curve (AUC) of the receiver operating characteristic (ROC) was employed to evaluate the performance of the disease model. The F1 score and accuracy score (ACC) were also employed to assess the performance. We repeated this process five times to prevent any bias. The average of the performance scores from these five replications was calculated to represent the performance of the model.
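The sampling scheme can be sketched as below; the classifier itself is omitted. The sketch assumes that each machine receives all positive training pairs plus an equally sized set of negatives drawn with replacement, which is one plausible reading of the bootstrap step rather than a confirmed detail. All function names are hypothetical.

```python
import random


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split indices so train and test keep the same label proportions."""
    rng = random.Random(seed)
    train, test = [], []
    for label in (0, 1):
        idx = [i for i, y in enumerate(labels) if y == label]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return sorted(train), sorted(test)


def balanced_training_sets(train_idx, labels, n_machines=5, seed=0):
    """One balanced index set per machine: all positives plus bootstrapped
    negatives (sampled with replacement; an assumption of this sketch)."""
    rng = random.Random(seed)
    positives = [i for i in train_idx if labels[i] == 1]
    negatives = [i for i in train_idx if labels[i] == 0]
    return [
        positives + rng.choices(negatives, k=len(positives))
        for _ in range(n_machines)
    ]


# Toy disease with 10 known drugs among 100 drugs.
labels = [1] * 10 + [0] * 90
train_idx, test_idx = stratified_split(labels)
machine_sets = balanced_training_sets(train_idx, labels)
```

Each element of `machine_sets` would then be used to train one random forest, so that every machine sees the same positives but a different sample of negatives.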
Finally, the average prediction score was calculated based on the prediction scores from five machines and five experiments for a drug-disease matrix. If the average prediction score for a drug-disease pair was greater than or equal to 0.5, the pair was considered a potential candidate drug for treatment of the disease.
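The final decision rule amounts to averaging and thresholding, as in the following minimal sketch. The drug names and scores are invented, and each list holds one score per machine for a single experiment to keep the example short.

```python
from statistics import mean


def candidate_drugs(scores_by_drug, threshold=0.5):
    """Average the prediction scores per drug and keep drugs whose
    average is greater than or equal to the threshold."""
    averaged = {drug: mean(scores) for drug, scores in scores_by_drug.items()}
    return {drug: avg for drug, avg in averaged.items() if avg >= threshold}


# Hypothetical prediction scores from five machines for one disease.
scores_by_drug = {
    "drugA": [0.9, 0.8, 0.7, 0.6, 0.5],
    "drugB": [0.1, 0.2, 0.3, 0.2, 0.1],
    "drugC": [0.5, 0.5, 0.5, 0.5, 0.5],
}
candidates = candidate_drugs(scores_by_drug)
```

Because the comparison is inclusive, a drug whose average is exactly 0.5 (drugC here) is still reported as a candidate.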

Evaluating Performance of Model
The performance metrics employed in this study were the AUC, F1, and ACC scores. The AUC value represents the area under the ROC curve, which plots the true positive rate (TPR) against the false positive rate (FPR) at various thresholds. The TPR and FPR are calculated as follows:

$$\mathrm{TPR} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}, \qquad \mathrm{FPR} = \frac{\mathrm{FP}}{\mathrm{FP} + \mathrm{TN}}$$

where true positive (TP) outcomes correctly predict a positive label, true negative (TN) outcomes correctly predict a negative label, false negative (FN) outcomes incorrectly predict a negative label, and false positive (FP) outcomes incorrectly predict a positive label. The F1 score is the harmonic mean of precision (PRE) and recall (REC), described as follows:

$$\mathrm{F1} = \frac{2 \times \mathrm{PRE} \times \mathrm{REC}}{\mathrm{PRE} + \mathrm{REC}}, \qquad \mathrm{PRE} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}, \qquad \mathrm{REC} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}$$

The accuracy (ACC) is the overall proportion of correct predictions for both positive and negative labels. It is calculated as follows:

$$\mathrm{ACC} = \frac{\mathrm{TP} + \mathrm{TN}}{\mathrm{TP} + \mathrm{TN} + \mathrm{FP} + \mathrm{FN}}$$

The F1 and ACC scores are determined at the threshold that maximizes the F1 score. All performance scores range between 0 and 1, and scores closer to 1 indicate higher performance. These metrics have been employed in previous studies and are described in more detail in [29,30].
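The sketch below works through these metrics on a toy set of six scored pairs. It assumes that a pair is predicted positive when its score is at least the threshold, that candidate thresholds are the observed scores, and that the AUC is computed by the equivalent pairwise-ranking formulation; these choices and all names are illustrative, not the authors' implementation.

```python
def confusion(labels, scores, threshold):
    """Return (TP, FP, TN, FN) when score >= threshold means positive."""
    tp = fp = tn = fn = 0
    for y, s in zip(labels, scores):
        if s >= threshold:
            if y == 1:
                tp += 1
            else:
                fp += 1
        elif y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def f1_and_acc(tp, fp, tn, fn):
    """F1 (harmonic mean of precision and recall) and accuracy."""
    pre = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * pre * rec / (pre + rec) if pre + rec else 0.0
    return f1, (tp + tn) / (tp + fp + tn + fn)


def best_f1_threshold(labels, scores):
    """Observed score that maximizes F1 when used as the threshold."""
    return max(
        sorted(set(scores)),
        key=lambda t: f1_and_acc(*confusion(labels, scores, t))[0],
    )


def auc(labels, scores):
    """ROC AUC as the fraction of positive-negative pairs ranked correctly
    (ties count as half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
threshold = best_f1_threshold(labels, scores)
f1, acc = f1_and_acc(*confusion(labels, scores, threshold))
auc_value = auc(labels, scores)
```

On this toy data the F1-maximizing threshold is 0.6, where all three positives are recovered at the cost of one false positive.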

Investigating Performance of Predictive Models
For each disease, the model generated using the random forest classifier was used to recognize known drugs and to predict possible new drugs from among existing drugs. The average prediction score for each drug was computed from the prediction scores of the five trained machines. The AUC, F1, and ACC metrics were then calculated from the average prediction scores in each of five iterations, and the performance of the model for predicting candidate drugs was assessed based on the averages of these metrics over the five iterations. Table 1 lists the number of diseases whose predictive models exceeded specific thresholds in terms of their AUC, F1, and ACC scores. At a threshold of 0.6, 420, 478, and 462 diseases (87.9%, 100%, and 96.7%) exceeded the threshold for the AUC, F1, and ACC scores, respectively. At a threshold of 0.9, a total of 50, 63, and 62 diseases, respectively, exceeded the threshold. Overall, more than half of the tested diseases yielded high performance, with AUC, F1, and ACC scores of more than 70%.

Validation of Known Drugs from Clinical Trial Data and Past Literature
Drugs whose average prediction score was greater than or equal to 0.5 for a certain disease were identified as candidate drugs. To analyze the effectiveness of our predictive model, current experimental knowledge, such as known drugs from clinical trial data and those studied in previous literature, was used to verify the predictions.

Validation Using Known Drugs from Clinical Trial Data
Using current experimental knowledge based on clinical trial data from the AACT database [31], accessed through R, we applied Fisher's exact test [32] to a 2 × 2 confusion table whose rows were candidate and non-candidate drug-disease pairs and whose columns were pairs found and not found in clinical trials. The confusion table of the proposed model is presented in Table 2. The null hypothesis was that there was no association between being a candidate drug-disease pair and the likelihood of being observed in clinical trials. The alternative hypothesis was that candidate drug-disease pairs were more likely to be observed in clinical trials than non-candidate pairs. Table 2 shows that, of the 20,875 predicted candidate drug-disease pairs, 5900 were found in the database. Of the 608,651 pairs predicted as non-candidates, 586,917 were not found in the database. Fisher's exact test produced a $p$-value of less than $2.2 \times 10^{-16}$, indicating that the candidate drug-disease pairs were significantly more likely to be found in the AACT database.
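For readers who want to see what the test computes, the following sketch evaluates the one-sided Fisher's exact test for a 2 × 2 table by summing hypergeometric probabilities in log space. It is a from-scratch illustration with hypothetical function names, not the R routine used in the study, and the table shown is a small textbook-style example rather than Table 2.

```python
import math


def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def fisher_exact_greater(a, b, c, d):
    """One-sided p-value P(X >= a) for the table [[a, b], [c, d]],
    where X is the top-left cell under fixed row and column totals."""
    row1, col1, total = a + b, a + c, a + b + c + d
    log_denominator = log_comb(total, col1)
    log_terms = [
        log_comb(row1, x) + log_comb(total - row1, col1 - x) - log_denominator
        for x in range(a, min(row1, col1) + 1)
    ]
    peak = max(log_terms)  # factor out the largest term for stability
    return math.exp(peak) * sum(math.exp(t - peak) for t in log_terms)


# Rows: candidate / non-candidate; columns: in trials / not in trials.
p_value = fisher_exact_greater(3, 1, 1, 3)
```

For tables as large as the one reported here, the result can fall below the smallest representable floating-point number and evaluate to 0.0, which is why statistical software such as R reports very small values as "< 2.2e-16".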

Validation Using Previous Literature on Candidate Drugs
The candidate drug-disease pairs from the proposed model were also validated by counting the candidate drug-disease pairs found in the PubMed database. We employed the "easyPubMed" package in the R language to retrieve scientific publication records from the PubMed database [33]. In total, 19,030 of the 20,875 candidate drug-disease pairs (91.2%) predicted by our model were found in PubMed. We also searched the clinical trial data in the AACT database for the remaining 1845 candidate drug-disease pairs and found 15 of them (Supplementary Table S2).

Verification of Functional Similarities
The functional similarity of both candidate and non-candidate drug-disease pairs was investigated by comparing them with the functional modules of all known drugs for each disease. The steps for the verification of functional similarity are illustrated in Figure 4. First, enrichment analysis was conducted for the candidate drugs, non-candidate drugs, and known drugs based on functional modules or pathways for each disease using the KEGG database [26] (Figure 4a). Second, we compared the functional modules of each candidate and non-candidate drug with those of a known drug for each disease (Figure 4b). The similarity score for the functional modules was computed using the Jaccard similarity as follows:

$$J(M_d, M_{kd}) = \frac{\lvert M_d \cap M_{kd} \rvert}{\lvert M_d \cup M_{kd} \rvert}$$

where $M_d$ and $M_{kd}$ are the sets of functional modules for a drug and a known drug, respectively. We then computed the Jaccard similarity score for the functional modules of each candidate and non-candidate drug in comparison to all known drugs. The maximum Jaccard score was employed to represent the functional similarity score for each candidate and non-candidate drug in comparison with all known drugs for a disease (Figure 4c). Subsequently, the functional similarity scores for all candidate and non-candidate drugs were computed. Finally, we compared the functional similarity scores of the candidate and non-candidate drugs using the Wilcoxon rank-sum test.
The null hypothesis was that there was no difference in the median functional similarity scores between the candidate drugs and the non-candidate drugs when compared with the known drugs. The alternative hypothesis was that the median functional similarity score for the candidate drugs was higher than that for the non-candidate drugs when compared with known drugs. This test was applied to all diseases. The results showed that all 478 diseases had corresponding p-values lower than 0.05 in the Wilcoxon rank-sum test. Hence, the functional similarity scores of the candidate drugs were higher than the scores for non-candidate drugs when comparing them with known drugs for all 478 diseases. This indicates that the candidate drugs associated with each disease exhibited stronger functional similarity with known drugs than the non-candidate drugs did.
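The functional similarity score described above (the maximum Jaccard similarity between a drug's functional modules and those of each known drug) can be sketched in a few lines. The module identifiers below are placeholders, and the function names are hypothetical.

```python
def jaccard(modules_a, modules_b):
    """Jaccard similarity of two module sets; 0.0 if both are empty."""
    a, b = set(modules_a), set(modules_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def functional_similarity(drug_modules, known_drug_modules):
    """Maximum Jaccard similarity between a drug and any known drug."""
    return max(
        (jaccard(drug_modules, known) for known in known_drug_modules),
        default=0.0,
    )


# Placeholder enriched modules for one drug and two known drugs of a disease.
drug_modules = {"m1", "m2", "m3"}
known_drug_modules = [{"m1", "m2"}, {"m3", "m4", "m5"}]
similarity = functional_similarity(drug_modules, known_drug_modules)
```

Taking the maximum means a drug only needs to resemble one known drug closely to obtain a high score, which is worth keeping in mind when interpreting scores near 1.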

Comparison of Our Method with Other Existing Methods
We compared the performance of our proposed method with other existing methods. In 2018, Lee and Yoon [16] described a method based on a directed gene network using the random forest classifier. They generated models corresponding to each disease with the weight of the out-degree in the directed gene network, including the positive or negative associations between genes and between the drugs and their target genes. An assessment of Lee's method in predicting drug-disease associations revealed that the random forest classifier produced excellent prediction performance. To compare the performance of our proposed approach with Lee's method, the same set of drugs, diseases, and known drug-disease pairs was employed as input, with the total number of diseases set at 460.
Table 3 lists the percentage of diseases for which the proposed method and Lee's method exceeded specific thresholds in terms of their AUC, F1, and ACC scores. For thresholds of 0.5 or more, our proposed model consistently produced acceptable AUC, F1, and ACC scores for 440 or more of the 460 diseases (95.9%), whereas Lee's method produced acceptable scores for fewer than 440 diseases. Additionally, the results show that higher AUC, F1, and ACC thresholds lead to a higher number of efficient models for the prediction of new drugs for a disease.
Wu et al. [6] proposed a drug-disease association model in 2019. They generated five meta-paths, which are drug-disease matrices based on drug-disease, drug-protein, and disease-protein interaction data, and included reliable negatives, i.e., a set of drugs that cannot be used to treat diseases. They applied the singular value decomposition (SVD) technique to extract the latent features for the drugs and diseases and then combined these latent features to represent drug-disease pairs. Random forest classification was employed to generate five models from the five meta-paths, and an ensemble technique was subsequently applied to combine all five random forest models. Wu's approach outputs all candidate drug-disease pairs, whereas our method proposes candidate drugs for each disease. To compare the two approaches, we combined all candidate drugs for all diseases and then randomly selected, for each disease, as many non-candidate drugs as there were candidate drugs for that disease. This produced a set of 1317 drugs, their targets, and 478 diseases for comparison. Table 4 demonstrates that the performance values for the 478 common diseases were higher for our proposed model than for Wu's method. Thus, our model outperformed Wu's model.
We also analyzed the performance of Wu's method, Lee's method, and our proposed method together by combining the candidate drugs for all of the diseases in Lee's method, so that all three methods were evaluated on the same set of drugs, diseases, and known drug-disease pairs. Table 5 shows the performance of the three models for the 460 diseases and 1315 drugs common to all methods. Lee's method, which is based on a random forest classifier, and our method produced a higher performance, while the area under the precision-recall curve was greater for Wu's method than for Lee's method. The performance scores for our method were also higher than those of Lee's method. Overall, our method outperformed the other models on the common set of diseases and drugs. The ROC curves for the three approaches are presented in Figure 5.

Gold Features Analysis
This research proposed a method to investigate the relationship between target proteins and disease-related proteins based on several types of biological information, namely the topological network, proteomic data, functional analysis, and druggable property, using PPSVs. The PPSVs consisted of seven features: the neighboring similarity score, the closeness score, the local alignment score, the global alignment score, the pathway score, the confidence score, and the shared drug score. We investigated the gold features of our approach, which represent the best descriptors for the prediction results, using Pearson correlation coefficients (PCC). The PCC measures the linear correlation between two variables. If a PCC is close to 1 or −1, the two variables have a highly positive or negative linear correlation, respectively. We calculated the PCC between each of the seven features of each disease-related protein and the labels of the drug-disease pairs in the drug-disease matrix for each disease. The absolute PCC scores were then ranked in descending order for each disease to identify the features with the strongest linear correlation for each disease-related protein. We then repeated this process for all diseases. Figure 6 presents, for all 478 diseases, the number of times each feature had the highest or second-highest absolute PCC. The confidence score had the highest and second-highest PCC for 122 and 117 diseases, respectively, while the pathway score had the highest and second-highest PCC for 113 and 85 diseases, respectively. The top two features for both the highest and second-highest PCC were thus the confidence and pathway scores, which are related to functional similarity in the PPSVs; these were therefore identified as the gold features. The third feature most commonly appearing among the highest and second-highest PCC was the neighboring similarity score in the PPSVs. This suggests that two proteins that have a highly similar neighborhood might cooperate in the same module to regulate the same functional task. Figure 6 also reveals that the local alignment feature ranked among the highest and second-highest PCC more often than the global alignment feature. Hence, the local alignment score was more important for the model predictions than the global alignment score. Interestingly, when we ran our model using only the important features identified above, i.e., the confidence, pathway, neighboring similarity, and local alignment scores, it led to better performance at the higher performance thresholds (Table 6). In particular, when only the important features were used to generate the model, the number of diseases that satisfied the performance threshold of ≥0.90 increased. However, the number of diseases with AUC and F1 scores at the threshold of ≥0.70 decreased. Based on this, it can be concluded that these important features play a crucial biological role in identifying drug-disease associations, but some diseases require other biological information in the PPSV to be predicted. Thus, the final model was trained on all features for all diseases.
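The PCC-based ranking described above can be sketched as follows for a single disease-related protein. The feature values and labels are invented toy data, only three of the seven features are shown, and the function names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


def rank_features(feature_columns, labels):
    """Feature names sorted by absolute PCC with the labels, descending."""
    strength = {
        name: abs(pearson(values, labels))
        for name, values in feature_columns.items()
    }
    return sorted(strength, key=strength.get, reverse=True)


# Toy matrix columns for four drugs (two known drugs, two unlabeled).
labels = [1, 1, 0, 0]
feature_columns = {
    "global_alignment": [0.5, 0.5, 0.5, 0.5],
    "pathway": [0.6, 0.4, 0.5, 0.3],
    "confidence": [0.9, 0.8, 0.2, 0.1],
}
ranking = rank_features(feature_columns, labels)
```

Counting how often each feature appears in the first two positions of `ranking` across all diseases would give the kind of tally summarized in Figure 6.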

Investigation of Novel Drug-Disease Pairs
In this section, we examine novel candidate drug-disease pairs, which are the false positives produced by our method for diseases with a high AUC score of more than 90%. The 50 diseases with AUCs of more than 90% are shown in Table 1. The candidate drugs corresponding to these 50 diseases were ranked in descending order based on the functional similarity score. The functional similarity score is the maximum Jaccard similarity score between the functions of a candidate drug and those of all known drugs (see Section 3.3). We investigate these novel candidate drug-disease pairs, i.e., the incorrect positive predictions, in more detail below.
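As a minimal sketch of the functional similarity score described above, the following code takes the maximum Jaccard similarity between the function set of a candidate drug and the function sets of the known drugs for a disease. The function names, the data layout, and the example function terms are assumptions for illustration; the exact definition is given in Section 3.3.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B|; defined as 0.0 for two empty sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def functional_similarity(candidate_functions, known_drug_functions):
    """Maximum Jaccard similarity between a candidate drug and any known drug.

    candidate_functions:  iterable of function terms for the candidate drug
    known_drug_functions: dict mapping known drug name -> iterable of function terms
    """
    return max(
        (jaccard(candidate_functions, funcs) for funcs in known_drug_functions.values()),
        default=0.0,  # no known drugs to compare against
    )


# Hypothetical function terms, for illustration only.
candidate = {"term_a", "term_b", "term_c"}
known = {
    "known_drug_1": {"term_a", "term_b", "term_c", "term_d"},
    "known_drug_2": {"term_x"},
}
score = functional_similarity(candidate, known)
```

Under this definition, a score of 1.0 means the candidate shares exactly the same function set as at least one known drug.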
The top 20 candidate drug-disease pairs were identified by ranking the functional similarity score of the candidate drugs and known drugs in descending order, as shown in Table 7. The average functional similarity score for the top 20 candidate drug-disease pairs was about 0.98. In other words, the 20 candidate drugs for diseases with false positive labels were 98% similar to their known drugs. By comparison, correctly predicted candidate drug-disease pairs (i.e., true positives) were 100% similar to their known drugs. This indicates that these 20 novel candidate drugs, with their high functional similarity score of 98%, could be used to treat the corresponding disease. Novel candidate drug-disease associations are useful for proposing alternative indications of existing drugs, assessing side effects, and determining potential drug resistance. Pergolide (DrugBank ID: DB01186) is a dopamine receptor agonist used for the treatment of Parkinson disease [34]. In our investigation, we found that pergolide has a high functional similarity score with the known drug zuclopenthixol (DrugBank ID: DB01624), which is used to treat bipolar disorder, as shown in the first highlight. Some studies have investigated the possibility of pergolide being indicated for the treatment of bipolar disorder (MESH ID: D001714). Bouckoms and Mangini explored pergolide as a supplement to antidepressant therapy with tricyclic antidepressants and monoamine oxidase inhibitors. Pergolide successfully adjusted the mood, interest, and energy of 11 out of 20 bipolar patients within a week [35]. However, pergolide has been withdrawn from several markets, including the US and Canadian markets, because it was found to increase the risk of cardiac valvulopathy [36].
Further, one of the top 20 drug-disease associations identified in this study suggested that isradipine (DrugBank ID: DB00270) is also related to bipolar disorder (MESH ID: D001714), as shown in the second highlight. Ostacher et al. investigated the potential use of isradipine in the treatment of bipolar depression [37]. Clinical trial information on isradipine in the treatment of bipolar depression can be found at www.clinicaltrials.gov (NCT01784666). Therefore, our identified candidate drug-disease associations might be useful in further pharmaceutical analysis for drug repositioning. The list of all predicted drug-disease associations identified using our method is presented in Supplementary Table S1.

Discussion
Our method presented the topological network, proteomic data, functional analysis, and druggable property as important information for target proteins and disease-related proteins for use in PPSVs. The human PPI network was reconstructed using interactions with a high confidence score of more than 80%. This network provides high confidence for the relationship between proteins in terms of the neighboring similarity score and the closeness score in the PPSVs. Selecting maximum scores from the PPSVs to construct a drug-disease matrix allowed the similarity in various characteristics between proteins to be identified. Thus, a high score for PPSV features represented high confidence in the similarity between all of the target proteins and a disease protein.
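The idea of selecting maximum scores from the PPSVs can be sketched as follows. This is an illustrative reading rather than the exact procedure of this study: it assumes that a drug may have several target proteins and that, for a given disease-related protein, each feature of the drug's row is the maximum of that feature over the drug's targets. The `drug_feature_row` helper, the dictionary layout, and the protein identifiers are hypothetical.

```python
def drug_feature_row(target_proteins, disease_protein, ppsv):
    """Feature-wise maximum of the PPSVs between a drug's targets and one disease protein.

    target_proteins: iterable of target protein identifiers for one drug
    disease_protein: identifier of one disease-related protein
    ppsv:            dict mapping (target, disease_protein) -> list of feature scores
    Returns None when no target has a PPSV with the disease protein.
    """
    vectors = [
        ppsv[(t, disease_protein)]
        for t in target_proteins
        if (t, disease_protein) in ppsv
    ]
    if not vectors:
        return None
    # zip(*vectors) groups the scores feature by feature across targets.
    return [max(column) for column in zip(*vectors)]


# Toy PPSVs with three features instead of seven, to keep the example short.
ppsv = {
    ("T1", "D1"): [0.2, 0.9, 0.1],
    ("T2", "D1"): [0.7, 0.3, 0.4],
}
row = drug_feature_row(["T1", "T2", "T3"], "D1", ppsv)
```

With this aggregation, a high value in the row only requires one strongly similar target protein, which is one way of keeping a drug's best evidence for each feature.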
A limitation of this research is that each investigated disease ideally should have had 10 or more approved drugs. This restriction was needed to obtain enough drug-disease pairs to fit the model with 5-fold cross-validation when predicting candidate drugs for each disease. Our method is significantly different from Wu's and Lee's methods. Wu's method required many known drug-disease pairs to generate suitable meta-paths for the prediction of novel drug-disease associations from among all diseases. Lee's method is based on a directed gene network and positive or negative types of gene interactions to predict drug-disease pairs for each disease. In contrast, our technique is based on various types of biological information within a PPSV to predict candidate drugs for each disease.
A comparison of the performance of our proposed method with Wu's and Lee's methods demonstrated that both Lee's and our proposed methods produced a strong performance because the two models employed a similar classifier and a similar approach to generate a model for each disease when predicting candidate drugs. However, further analysis indicated that our method outperformed both Wu's and Lee's methods. The reason for this may be that our developed PPSVs contain various sources of biological information, including the topological network, proteomic data, functional analysis, and druggable property. Note that side effects were not considered by the methodology. The side effects of existing approved drugs have been thoroughly examined during clinical trials. It is supposed that there are not many opportunities to use this method. The best descriptors for drug-disease associations in our study were the neighboring similarity score, confidence score, local alignment score, and pathway score. They exhibited good performance as indicators in the search for existing drugs for use with other diseases.
Our gold features analysis revealed that the best descriptors for predicting drug-disease associations were the confidence score and pathway score, which were related to functional similarity in the PPSVs. The results also indicated that the local alignment score was more important than the global alignment score for prediction. Hence, our findings suggest that the interaction among proteins in the PPI network is crucial for assessing whether a drug can bind to an alternative protein based on functional interactions.

Conclusions
This research proposed the use of multiple types of biological information, including proteomic data, functional analysis, and druggable property, for target proteins and disease-related proteins to produce PPSVs for an investigation into the similarity between proteins in the human PPI network. The proposed method predicted approved and novel drug-disease associations for individual diseases based on random forest classification. Further, experimental knowledge from clinical trial data and the PubMed database was used to verify the predicted candidate drugs. It was found that the predicted candidate drugs were significantly more functionally similar to known drugs. Our proposed model was also found to outperform other existing techniques, with the confidence and pathway scores proving to be the best descriptors for the prediction process in terms of functional similarity in PPSVs. The novel candidate drug-disease pairs identified in this study can be investigated further using pharmaceutical analysis in the laboratory.
Supplementary Materials: The following are available online at https://www.mdpi.com/2076-3417/11/7/2914/s1, Table S1: List of all predicted drug-disease associations. Table S2: Candidate drug-disease pairs that were found in clinical trials even if not found in the PubMed database.