Computational Prediction of Compound–Protein Interactions for Orphan Targets Using CGBVS

A variety of Artificial Intelligence (AI)-based (Machine Learning) techniques have been developed with regard to in silico prediction of Compound–Protein interactions (CPI)—one of which is a technique we refer to as chemical genomics-based virtual screening (CGBVS). Prediction calculations done via pairwise kernel-based support vector machine (SVM) is the main feature of CGBVS which gives high prediction accuracy, with simple implementation and easy handling. We studied whether the CGBVS technique can identify ligands for targets without ligand information (orphan targets) using data from G protein-coupled receptor (GPCR) families. As the validation method, we tested whether the ligand prediction was correct for a virtual orphan GPCR in which all ligand information for one selected target was omitted from the training data. We have specifically expressed the results of this study as applicability index and developed a method to determine whether CGBVS can be used to predict GPCR ligands. Validation results showed that the prediction accuracy of each GPCR differed greatly, but models using Multiple Sequence Alignment (MSA) as the protein descriptor performed well in terms of overall prediction accuracy. We also discovered that the effect of the type compound descriptors on the prediction accuracy was less significant than that of the type of protein descriptors used. Furthermore, we found that the accuracy of the ligand prediction depends on the amount of ligand information with regard to GPCRs related to the target. Additionally, the prediction accuracy tends to be high if a large amount of ligand information for related proteins is used in the training.


Introduction
Post-genome research has been providing a large amount of omics data on genes and proteins, including genomes, transcriptomes, and proteomes. On the other hand, the development of technologies such as high-throughput screening has led to the accumulation of compound and bioactivity information on a vast number of compounds and drugs. This information is published in public databases such as ChEMBL [1][2][3] and PubChem [4] and are freely available to use. Such bioactivity information between compounds and proteins is also referred to as drug-target interaction (DTI) and in a broad context, simply Compound-Protein interaction (CPI). The research to utilize such data has been been one of the major hot topics in the field of drug discovery. Many drugs affect the human body in the form of drug effects or side effects through interactions with biomolecules such as target proteins. This is why the identification of CPIs is an important issue in drug discovery research. However, accurate and comprehensive identification of CPIs in experiments is almost impossible due to the enormous costs involved. In recent years, various Artificial Intelligence (AI) technologies have been developed to predict CPIs (or DTIs) on a large scale by effectively utilizing the vast amount of bioactivity data that has been accumulated [5][6][7][8][9][10][11][12][13][14].
In the early stages of drug discovery, ligands that act on target proteins are often insufficiently identified or not even identified at all. In addition, we cannot expect to get a lot of information on the three-dimensional structure of target proteins in these cases. The in silico approach does not work well in situations where known active ligands and protein structural information is extremely limited. Nevertheless, in order to move forward with the drug discovery project, we should also consider trying in silico approaches when we want a ligand even if it is less active. In such cases, AI technology for CPI prediction is promising as an in silico approach.
One of the many AI-based methods for predicting CPIs is called the CGBVS technique. This technique is theoretically simpler than other methods, can be implemented without difficulty, and gives sufficient accuracy of prediction. Hamanaka et al. [14] have implemented CGBVS with a deep neural network (CGBVS-DNN) that enabled training of over a million CPIs. Wassermann et al. [7], using a machine learning approach similar to CGBVS, used a limited number of protease targets as an example to test the prediction accuracy for orphan targets. They have concluded that ligand information of nearest neighbors is essential for a good prediction of ligands of orphan targets.
We studied the following two aspects when using the CGBVS technique. The first aspect is how accurate the ligand prediction is for orphan targets. In this study, we focused on the G protein-coupled receptor (GPCR) family, which is an important and data-rich target in drug discovery. Out of the available 243 possible targets, we randomly selected 52 GPCRs. We created 52 machine learning models of CGBVS, omitting all the ligand information for one particular target GPCR per model. That is, we created a virtual orphan GPCR per model and tested whether the model could predict the ligands for that virtual orphan GPCR. We also investigated how the accuracy of ligand prediction for 52 selected virtual orphan GPCRs is affected by the combination of compound and protein descriptors used in the machine learning process.
The second aspect is to examine the conditions and applicability of high prediction accuracy. Here, we first introduced an applicability index which helped us determine whether it is possible to apply CGBVS to true orphan GPCRs.

CGBVS
For the purpose of investigating the relationship between the applicability of the CGBVS method to ligand prediction of orphan GPCRs and the protein kernel, we used SVM instead of Deep Neural Network. The CGBVS technique we used is mostly implemented according to the method studied by Yabuuchi et al. [8], but the machine learning method for Support Vector Machine (SVM) [15] is slightly different from the original CGBVS technique. The reason is that the kernel function part of SVM is clearly divided into a compoundderived part and a protein-derived part for each calculation. The method of calculating this SVM is the same as the method used in the work of Wassermann et al. [7]. Letting c be the compound vector and p be the protein vector, we can then express the Compound-Protein interaction vector (CPI vector) x as their tensor product x = c ⊗ p. The SVM kernel for the CPI vector can then be expressed by [16]: where K C and K P are the compound and protein kernels, respectively. Since the compound and protein kernels can be calculated independently, there is no need to explicitly calculate the matrix representation of the tensor product as a representation of the actual CPI vector. A schematic diagram of the calculation procedure of our CGBVS technique is shown in Figure 1. The first step is to prepare the structural formula data set of the compounds and the amino acid sequence data set of the proteins for machine learning. From a compound structural formula, descriptors such as physicochemical parameters and fingerprints are calculated and converted into a compound vector. From an amino acid sequence, the descriptors associated with the strings are calculated and converted into a protein vector. The two vectors created are combined according to bioactivity data to create the CPI vector. If the activity value of the data is higher than the set threshold, it is a positive CPI vector; otherwise, it is a negative CPI vector. The CGBVS model is created by machine learning via SVM of positive and negative CPI vectors based on aforementioned kernels. This CGBVS model allows us to predict the activity of unknown Compound-Protein combinations. The usual SVM score is the value of the distance from the discriminative surface (decision function), but this value is sigmoidally transformed to perform probability estimation [17].

Virtual Orphan GPCR Model
The CGBVS model developed in this study is based on the GPCR-related activity data from the ChEMBL 25 database [1][2][3]. The total number of GPCRs was 243, and the number of associated compounds was 280,648. The criterion for the presence or absence of activity used in this study was whether there was 50% activity at ≤1 µM or at ≥3 µM, respectively.
In addition, no distinction was made between agonists and antagonists. In addition, data such as the inhibition rate of a single concentration were not used. In this condition, the number of CPIs for positive samples was 165,877 and the number of CPIs for negative samples was 233,272.
To create a virtual orphan GPCR model, we select one GPCR and delete the CPI data for that GPCR from the training data set (see Figure 2) Fifty-two GPCRs having 100 or more active ligand data were randomly selected as virtual orphan GPCRs in this study (Table 1).  The CPI data for only one target are deleted per CGBVS model which leaves CPIs for 242 target out of the available 243. As control models to compare the prediction performance of the virtual orphan GPCR models, we also built models that included only half of the original number of ligand data for each target GPCR as a training set and retain the other half as a test set. We refer to these as half-sampled GPCR models (Tables S1 and S2).
The two types of compound descriptors used in this study are descriptors that can be calculated using alvaDesc [18] and ECFP [19]. Using alvaDesc, a software developed by the company Alvascience, 941 non-fingerprint 2D descriptors were calculated, while 2048-bit Extended Connectivity Fingerprints having a radius of 2 (ECFP4) were calculated using RDKit [20]. For proteins, on the other hand, there are three types of descriptors used: PROFEAT 2016 [21], ProtVec [22], and Multiple Sequence Alignment (MSA). PROFEAT descriptors were generated using the web service [23], and we calculated 1437 descriptors using the default settings. ProtVec descriptors were generated using a free tool called BioVec [24] to calculate 1500 descriptors. For MSA descriptors, the number of descriptors generated are equal to the number of target proteins employed in machine learning. There are some techniques that used pairwise sequence alignment as the SVM kernel [25,26]. In our case, we have developed a technique to create descriptors from multiple sequence alignment. To calculate for MSA descriptors, the GPCR amino acid sequences are first prepared in FASTA format. The identity matrix was then generated after performing Multiple Sequence Alignment using Clustal Omega. Then, the identity matrix S is eigendecomposed as in the equation where Λ is a diagonal matrix and U is a unitary matrix made from eigenvectors. Finally, each column of the matrix X in Equation (2) can be taken as the column feature vector of the corresponding protein. In rare cases, several eigenvalues with small negative numbers may be found. In such cases, the eigenvalues and eigenvectors of the negative numbers are removed, and the matrix X is calculated. The MSA feature vectors computed above can be reconstructed through approximation of the matrix elements of the identity matrix S by choosing a linear kernel as the SVM kernel. Since the negative eigenvalues and eigenvectors have been removed, this kernel matrix is a semi-positive definite symmetric matrix. The compound and protein descriptors described above are high dimensional vectors, thus we used principal component analysis to perform dimensionality reduction. We took the cumulative contribution of the principal components up to 99%.
The kernel function computed in SVM machine learning depends on the feature vector created from each descriptor. Table 2 shows the correspondence between descriptors and kernel functions. In this study, we created virtual orphan GPCR models for all combinations of two compound and three protein descriptors (six combinations) for each of the 52 GPCRs. To avoid overfitting of the SVM model, we have set the appropriate hyperparameters to maximize accuracy via 5-fold cross validation. Machine learning calculations with SVM have been performed using a proprietary tool named CzeekS [27].

Descriptor
Class SVM Kernel Equation

Model Validation
As a method of confirming the prediction performance of virtual orphan GPCR models, a set of compounds for validation was screened against the virtual orphan GPCR. The set of compounds used for validation was composed of 280,648 GPCR-related compounds from the ChEMBL 25 database. These compounds are identical to those used to create the CGBVS models, but since each model is created after deleting the data of the GPCR to be tested, they are not considered to be problematic as validation compounds for prediction performance. Validation of HS GPCR models was performed in the same way as the virtual orphan GPCR models; however, the ligand data of the target GPCR included in the training set were omitted from the test set. The area under receiver operating characteristic curve (AUROC) and Enrichment Factor (EF) were adopted as measures of predictive performance. In these calculations, compounds whose interaction data with the target do not exist in the CHEMBL database were treated as having no activity. AUROCs are calculated using scikit-learn by sorting in order of increasing CGBVS score. On the other hand, EF 1% is calculated as N total is the total number of compounds screened and N subset is the number of compounds selected from the top scores. In addition, A total is the total number of active compounds for the target GPCR, and A found is the number of active compounds for the target GPCR found among the top scoring compounds selected.

Applicability Index
We considered the applicability index A(p i ) of CGBVS to the target GPCR p i to be proportional to the sum of the number of active ligands N j of the neighboring GPCRs p j of the target virtual orphan GPCR. Thus, we defined it as Here, w is a weight function whose argument is the value of the protein kernel function K P , and its functional form is the sigmoid function The two parameters of the sigmoid function, α and r, are determined to maximize the Spearman's correlation coefficient between AUROC and log A. The AUROC is calculated using the procedure described here earlier. Bayesian optimization was used to perform the optimization of the correlation coefficient.

Analysis of Prediction Accuracy
The EF 1% calculated in this study are for the top 1% (2806 compounds) from the highest scores of the screened compounds. In this case, EF 1% takes a value from 0 to 100 and, when EF 1% is equal to 1, it corresponds to random screening. The model is not worthy for screening unless the EF 1% value is at least above 1. The enrichment factors for the 52 GPCRs used as virtual orphan targets are shown in Figure 3. Additionally, a table comparing the values of EF 1% for the virtual orphan and half-sampled GPCR models is provided as Supplementary Data (Table S1). All possible combinations of compound and protein descriptors were tested for each GPCR target and results showed large variations in EF 1% among GPCRs. The same was observed in EF 1% among descriptor combinations for the same GPCR. This may indicate that each GPCR target possibly requires different combination of descriptors that are suitable for accurate prediction. Figure 3 shows that the combination of alvaDesc and MSA (red bars) has good EF 1% values for most GPCRs, indicating that it could possibly be the best descriptor combination. The next best descriptor combination is exhibited by alvaDesc-PROFEAT (blue bars) followed by ECFP-MSA (purple bars).   In order to simplify the EF 1% results for each GPCR and highlight the effect of descriptor combinations, the frequency distribution of EF 1% for each descriptor combination is summarized in Table 3. For all combinations of descriptors, more than half of the GPCRs had EF 1% greater than 1 (more than 26), and some of them had EF 1% greater than 30, which can be considered accurate and better than our expectations. Some of them were comparable to the EF 1% of the half-sampled GPCR models, and, surprisingly, there were six EF 1% values that exceeded those of the half-sampled models. Looking at the variations in EF 1% value and the number of GPCRs for each descriptor, it can be seen that the difference in the protein descriptor has a greater impact on the EF 1% than the difference in the compound descriptor. We have found that, when MSA is used as the protein descriptor, there are many GPCRs having better EF 1% than when other descriptors are used. In particular, for the combination of alvaDesc and MSA, there were nine GPCRs with EF 1% greater than 30, making it the most suitable combination for the construction of the GPCR model. EF 1% is commonly used as a performance indicator for screening measurements. Since the number of active compounds for each protein is different, (A total in the Equation (3)), care must be taken in the simple comparison of prediction performance between different GPCRs. Therefore, in this study, we calculated AUROC (area under the ROC curve) as another predictive performance indicator. The calculation results of AUROC for 52 GPCRs are shown in Figure 4. As in the case of EF 1% , a table comparing the AUROC with the half sampled GPCR models is shown in Table S2. For all combinations of GPCRs and descriptors, the values of AUROC for the half-sampled GPCR models are higher than that for the virtual orphan GPCR model. Figure 5 shows four representative ROC curves from 52 GPCRs. Each one has been chosen for its particular characteristic. CHRM3 is a case where all combinations of descriptors have high predictive performance, while ADRB2 and HCRTR2 are cases where the six curves are scattered. HRH3 is a case where the predictive performance for all combinations of descriptors is low.
For the three GPCRs in Figure 5 that result in high prediction performance, it can be seen that the prediction performance is good in two descriptor combinations: alvaDesc-MSA (red) and ECFP-MSA (purple). The trend is roughly the same for other GPCRs. The number of GPCRs with AUROC greater than 0.8 was the largest in alvaDesc-MSA with 29, followed by ECFP-MSA with 27. The number for other descriptor combinations was 18-24. These results suggest that the best prediction results are obtained when MSA is used as the protein descriptor. It is interesting to note that the number of GPCRs with AUROC of 0.8 or higher in the half sampled GPCR models is 49 for all six combinations of descriptors (see Table S2). This means that the difference in prediction accuracy due to the difference in descriptors is small in the conventional method of accuracy comparison such as cross-validation but is clearly bigger when using virtual orphan GPCR models. Furthermore, in the case of virtual orphan GPCR models, prediction is greatly influenced by the combination of compound and protein descriptor used.

Applicability of CGBVS for Orphan Targets
In actual drug discovery research, when trying to search for active compounds of true orphan GPCRs using CGBVS, it is necessary to have an indicator of whether CGBVS will work or not. According to the results of the performance evaluation of CGBVS models by EF 1% and AUROC, the prediction performance is high for widely studied GPCRs, such as adrenergic and muscarinic receptors, for which there is abundant data on known ligands. This is thought to be because the prediction performance does not deteriorate even if all the ligand data of the target orphan GPCR is deleted, since abundant ligand data of related GPCRs of the target GPCR can be included in the training data. This can be understood from the fact that the applicability domain of a machine learning model is often set to the region around a dense area of training data [28][29][30]. Therefore, we defined the applicability index A(p i ) of CGBVS to be proportional to the sum of the number of active ligands N j of the related GPCRs p j to the target GPCR, as in Equation (4). The results of calculating the applicability index as described above for six different combinations of compound and protein descriptors are shown in Figure 6 as scatter plots. Values of Spearman's correlation between log A and AUROC are summarized in Table 4. Similar to EF 1% , the variation in the values of correlation coefficients are also larger among protein descriptors compared to that among compound descriptors. In addition, the protein descriptor with the highest correlation coefficient is MSA, and both alvaDesc and ECFP are highly correlated with log A and AUROC. As for the other two protein descriptors, PROFEAT showed a weak correlation and ProtVec showed no correlation at all. Therefore, when attempting to find the ligand for a true orphan GPCR using CGBVS, the use of MSA as the protein descriptor can provide some estimate of whether the ligand search will be successful or not. Virtual orphan GPCRs with AUROC greater than 0.8 are positive, and those with AUROC less than 0.8 are negative, and are predicted based on whether they exceed the threshold of log A. The threshold of log A was determined to maximize the accuracy of the prediction. Table 5 summarizes the results of validating the prediction accuracy of virtual orphan GPCRs with AUROC greater than 0.8. The accuracy and positive predictive value (PPV) of MSA was close to 0.9, indicating high prediction accuracy compared to other protein descriptors. In the case of PROFEAT, the accuracy was not bad at over 0.7, but the PPV was a little low at 0.65 when the compound descriptor was alvaDesc.
One of the features of MSA protein descriptor is that the values of the parameters r and α of the applicability index weight function are smaller than those of the other two types of protein descriptors. This means that the weight function is looser in shape than the other protein descriptors, and the active ligands of GPCRs with small protein kernel values (small similarity) also contribute to the applicability index. This can be understood from the fact that the threshold of log A is the largest for MSA. In using MSA, although machine learned target proteins are less similar to the virtual orphan target, we were able to construct a machine learning model in which the ligand information exert influences on each other. This leads to high applicability of CGBVS to identifying ligands for orphan targets. The opposite can be said for PROFEAT which requires high similarity between machine learned and orphan targets in order to have high applicability. On the other hand, the weight function in the case of PROFEAT is a parameter with a shape that changes more rapidly at the position where the protein kernel value is larger than MSA. This means that only the ligand information of GPCRs that are very close to the virtual orphan GPCRs affects the prediction accuracy. In a study by Wassermann et al., the accuracy of ligand prediction for orphan targets was found to be greatly influenced by the ligand information of related targets. Our results are consistent with theirs, and Equation (4) gives a more generalized interpretation.

Conclusions
We tested the prediction accuracy of the CGBVS technique for 52 virtual orphan GPCRs. Machine learning with the CGBVS method was performed for all possible combinations of two types of compound descriptors and three types of protein descriptors. In the prediction of the ligands of virtual orphan GPCRs, it was shown that the protein descriptor had a greater impact on the prediction accuracy than the compound descriptor. Of the three types of protein descriptors validated in this study, MSA had the best accuracy, with the highest number of GPCRs exceeding the reference values (EF 1% > 10, AUCROC > 0.8) for both EF 1% and AUROC indices. On the other hand, for compound descriptors, alvaDesc had slightly more GPCRs with better prediction accuracy than ECFP, but with only small differences between actual values.
We also examined the conditions under which ligand search for virtual orphan GPCRs was possible using CGBVS. The simple applicability index we defined in Equation (4) was shown to correlate well with AUROC when an MSA descriptor was used. There is a weak correlation for PROFEAT and almost no correlation at all for ProtVec. By using an MSA descriptor, we can, therefore, determine whether CGBVS can be applied to an unknown orphan target by the value of log A. In this case, if log A is 8.7 or higher when using alvaDesc as compound descriptor, and if log A is 9.0 or higher when using ECFP, a high success rate can be expected.
Supplementary Materials: The following are available online. Table S1: Enrichment factors for 52 GPCRs which were calculated from the results of screening using virtual orphan (VO) and halfsampled (HS) GPCR models, Table S2: AUROC for 52 GPCRs which were calculated from the results of screening using virtual orphan (VO) and half-sampled (HS) GPCR models.