Exploring the Potential of Spherical Harmonics and PCVM for Compounds Activity Prediction

Biologically active chemical compounds may provide remedies for several diseases. Machine Learning techniques applied to Drug Discovery, being cheaper and faster than wet-lab experiments, can identify molecules with the expected pharmacological activity more effectively. It is therefore essential to develop more representative descriptors and reliable classification methods that accurately predict molecular activity. In this paper, we investigate the potential of a novel representation based on Spherical Harmonics fed into a Probabilistic Classification Vector Machines classifier, namely SHPCVM, applied to the compound activity prediction task. We make use of representation learning to acquire features which describe the molecules as precisely as possible. To verify the performance of SHPCVM, ten-fold cross-validation tests are performed on twenty-one G protein-coupled receptors (GPCRs). Experimental outcomes (accuracy of 0.86), assessed by classification accuracy, precision, recall, Matthews' Correlation Coefficient and Cohen's kappa, reveal that our relatively short Spherical Harmonics-based representation combined with Probabilistic Classification Vector Machines achieves very satisfactory performance on GPCRs.


Introduction
Rational drug discovery aims at the identification of ligands that act on single or multiple drug targets [1][2][3]. The process is usually driven by research focused on developing methods and tools for understanding chemical space. In order to find the desired candidates, several computational approaches are required that enable the prediction of drug-like properties.
Take for instance virtual screening [4], which has its roots in cheminformatics and performs the rapid in silico assessment of large libraries of chemical structures to identify those most likely to bind to a drug target. Recently, one may observe the success of, and new opportunities in, ligand-based virtual screening [5]. In this era of computational advancement, machine learning has been extensively applied to predict the activity of new candidate compounds. Willett et al. proposed a binary kernel discrimination approach [6]. A multidimensional analysis of the classification performance of compounds was performed by Smusz et al. [7]. The Bayesian belief network was adopted by Nidhi et al. [8] and Xia et al. [9]. Many promising prediction results were obtained with Support Vector Machines by Buchwald et al. [10], Bruce et al. [11], Czarnecki et al. [12], Rataj et al. [13], and Zhang et al. [14]. Liu et al. constructed ensembles to identify Piwi-Interacting RNAs [15].
However, the success of applied machine learning methods depends on the molecular structure representation employed, also known as the molecular descriptors [16]. Thus, the main challenge is to devise representations of molecules that are both complete and concise, enabling a reduction in the number

Results And Discussion
In this section we present the evaluation measures employed for performance comparison. Then we analyze and discuss the experimental results and compare our results with other approaches.

Evaluation Measure
We considered compound activity prediction as a binary classification task. Hence, a number of commonly used measures can be employed to evaluate its performance. These include accuracy (ACC), precision (PRE), recall (REC), the Matthews Correlation Coefficient (MCC) and Cohen's kappa (κ). They are listed in Table 1.

Table 1. Evaluation measures for the binary classification problem: TP, true positives (the total number of active compounds that are predicted correctly); TN, true negatives (the total number of inactive compounds that are predicted correctly); FP, false positives (the total number of compounds that have no interaction with the receptor but are predicted as active); FN, false negatives (the total number of compounds that are active but are predicted as inactive); P_A, the observed level of agreement; P_E, the expected level of agreement.

Measure | Computational Formula | Description
Accuracy (ACC) | (TP + TN) / (TP + TN + FP + FN) | The fraction of compounds classified correctly.
Precision (PRE) | TP / (TP + FP) | The fraction of predicted actives that are truly active.
Recall (REC) | TP / (TP + FN) | The fraction of true actives that are recovered.
Matthews Correlation Coefficient (MCC) | (TP·TN − FP·FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)) | It returns a value between −1 and +1, where +1 represents a perfect prediction, −1 total disagreement between prediction and observation, and 0 indicates no better than random prediction.
Cohen's kappa (κ) [48] | (P_A − P_E) / (1 − P_E) | It returns a value between −1 and +1, where +1 represents complete agreement; 0 or lower values mean chance agreement.
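As a concrete illustration of the formulas in Table 1, all five measures can be computed directly from the four confusion-matrix counts. The sketch below uses made-up counts and the standard textbook definitions:

```python
import math

def binary_metrics(tp, tn, fp, fn):
    """Compute the Table 1 evaluation measures from a confusion matrix."""
    total = tp + tn + fp + fn
    acc = (tp + tn) / total
    pre = tp / (tp + fp)
    rec = tp / (tp + fn)
    # Matthews Correlation Coefficient
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom
    # Cohen's kappa: observed agreement P_A vs. expected agreement P_E
    p_a = acc
    p_e = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / total ** 2
    kappa = (p_a - p_e) / (1 - p_e)
    return {"ACC": acc, "PRE": pre, "REC": rec, "MCC": mcc, "kappa": kappa}

m = binary_metrics(tp=80, tn=15, fp=5, fn=10)
print({k: round(v, 3) for k, v in m.items()})
```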

Experimental Design
The flowchart of SHPCVM is shown in Figure 1. More specifically, after obtaining the data, the spherical harmonics-based descriptor is calculated. In order to obtain the optimal number of features, we perform a feature selection process. The final molecular descriptor is then used as input to train the PCVM classifier. We divided the datasets into training (80%) and test (20%) sets to carry out the computer experiments. Since cross-validation is a useful tool for selecting an appropriate model and tuning its parameters, ten-fold cross-validation was used for training. Finally, the performance of each classifier was evaluated on an external test set randomly selected from the original dataset (20%).
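The evaluation protocol described above can be sketched with scikit-learn. The data here are random stand-ins for the real descriptor matrices, and `SVC` serves only as a placeholder classifier (PCVM has no scikit-learn implementation):

```python
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.svm import SVC

# Toy stand-in for a descriptor matrix (n_molecules x n_features) and
# activity labels; in the real pipeline X holds SH-based descriptors.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 49))
y = rng.integers(0, 2, size=200)

# 80/20 split; the 20% external test set is held out until final evaluation.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Ten-fold cross-validation on the training portion for model selection.
scores = cross_val_score(SVC(), X_tr, y_tr, cv=10, scoring="accuracy")
print(len(scores))  # ten per-fold accuracy estimates
```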
We used in-house Python code for feature calculations and the scikit-learn package (http://scikit-learn.org/) for machine learning. 3D coordinates for the molecules were generated using the 2D → 3D structure generation routines included in the RDKit [49] and Open Babel [50] Python packages. Both the Connectivity descriptor and the MOE-type features for each molecule were calculated with the Python ChemoPy package [51].

Descriptor Insights
The main goal of any molecular descriptor is to achieve a mapping from the original space to another designed descriptor space. Since the new space usually has a smaller dimension, some information will inevitably be lost in the reduction. Thus, a good descriptor is supposed to preserve the core information. In our computer experiments we examined whether the spherical harmonics-based descriptor meets this expectation. We performed PCA [52] on the 49-dimensional descriptor and analyzed the quality of the separation between active and inactive molecules. PCA is a well-known and widely used method that projects a dataset onto the directions that account for most of the variance in the dataset. Figure 2 shows the distribution of the active and inactive compounds in the P35372 dataset after applying PCA to the 49-dimensional spherical harmonics-based descriptor, the MOE-type descriptor and the Connectivity descriptor, and choosing the top three principal components. One may notice that the biologically active compounds are gathered together.
On the other hand, the inactive compounds are spread out. Obviously, the active and inactive molecules are not completely separated. However, it is quite easy to notice some patterns and clusters of actives and inactives. Visual inspection suggests that the spherical harmonics-based descriptor preserves enough information to allow classification and can be explored further. Please note that the goal of this computer experiment was to check whether the information preserved by the descriptors is sufficient to search for active compounds. If the descriptor were useless, the data would be randomly scattered and no interesting patterns could be observed. Indeed, Figure 2 indicates that the data described by the spherical harmonics-based descriptor are not linearly separable, but we did not expect them to be; instead, we found that the descriptor is a good tool for analyzing the chemical space. Moreover, Figure 2 shows the distribution of data for only 1 out of the 21 sets included in the datasets; however, we observed a similar tendency in all datasets. The results of PCA applied to the P35372 dataset, i.e., the percentage of the variation explained by each principal component for the three different descriptors, are shown in Figure 3. It can be noticed that for the Spherical Harmonics-based descriptor the top three principal components explain more than 70% of the variation of the samples in the descriptor space. This suggests that the 3D spatial distribution illustrated in Figure 2 may, at least partially, reflect the real spatial distribution in the descriptor space.
Moreover, the PCA results indicate the actives and inactives represented by the three descriptors (MOE, Connectivity and SH-based) are not linearly separable. Nevertheless, such data can still be classified correctly using some non-linear approaches.
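A minimal sketch of this PCA analysis, using a hypothetical random matrix in place of the real 49-dimensional descriptor data:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical descriptor matrix: 100 molecules x 49 SH-based features.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 49))

# Project onto the top three principal components, as in Figure 2,
# and report the fraction of variance they explain, as in Figure 3.
pca = PCA(n_components=3)
Z = pca.fit_transform(X)
ratio = pca.explained_variance_ratio_
print(Z.shape, ratio.sum())
```

On the real descriptors, plotting the three columns of `Z` colored by activity label reproduces the kind of cluster inspection discussed above.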

Performance Evaluation
The purpose of the computer experiments presented in this subsection was three-fold. As the introductory computer experiments described in Section 2.3 have demonstrated, the spherical harmonics-based descriptor is a reliable tool for analyzing molecular space. For this reason, our first goal was to assess the ability of the PCVM classifier with the spherical harmonics-based descriptor to predict biologically active compounds. Secondly, we aimed to compare the PCVM performance with the SVM approach and other classifiers. Finally, we compared the prediction performance of PCVM, as a representative classification method, when different descriptors are used.

PCVM Model with a Spherical Harmonics-Based Descriptor
After the ten-fold cross-validation procedure, a performance estimate was obtained for each test dataset. The outcomes over the evaluation measures for PCVM are shown in Tables 2-6. The results suggest that the proposed approach is valuable: ACC exceeds 0.8 in the vast majority of cases. The minimum values for ACC, PRE, REC, MCC and κ are 0.742, 0.726, 0.752, 0.69, and 0.651, respectively.
The results illustrated in Tables 2-6 indicate that our approach has good discriminative capabilities for the molecular activity recognition. One may notice it is able to outperform representative models. The corresponding outcomes obtained by cross-validation on the training set are available as Supplementary Materials. Based on reported values, SHPCVM is indeed a robust approach. It appears the results can be replicated on unseen data.
A point to consider is that our final representation strictly depends on the precision of the 3D structure model. Consequently, for different conformations we get a different representation of the same molecule, and the quality of the 3D structure matters. Here, we want to stress that although molecular activity is the joint effect of various factors (physico-chemical and biochemical properties, among others), PCVM combined with the new shape-based representation gives good prediction outcomes. Our results again indicate that the choice of a proper set of features describing the molecule affects prediction performance. Furthermore, the choice of the PCVM model as the classifier is meaningful as well; this is explored in the next computer experiments.

SVM Model with a Spherical Harmonics-Based Descriptor
Inspired by the results shown above, we validated the performance of the SVM [53] classifier and compared it with PCVM. The analysis in Tables 2-6 shows that PCVM significantly outperformed SVM. Moreover, Figure 4 presents the maximum values recorded for PCVM and SVM. Both Tables 2-6 and Figure 4 reveal that SHPCVM can be used further. Indeed, the performance of SVM is not competitive with that of PCVM. The major reason PCVM is significantly better than SVM may be that probabilistic decisions are important for such tasks.

Comparison with Other Classification Methods
To further investigate the prediction performance of our approach, we also compared it with several other existing methods on the GPCR datasets. The prediction results for the three additional classifiers and the abovementioned measures are reported in Tables 2-6. One may observe that PCVM with the harmonics-based representation achieves the best results for all datasets. Tables 2-6 suggest that the worst outcomes were provided by the Naïve Bayes classifier; some of its results are close to random, for instance the values for the Q14416 and Q8TDU6 data presented in Table 5. This is probably because NB is a very simple method that makes a strong assumption about the shape of the data distribution, which may not hold for the analyzed datasets. Tables 2-6 also show that the RF and KNN results are poor. Generally, the outcomes for RF, KNN and NB follow a common trend: they are much worse than those for either SVM or PCVM, with specific differences due to the different classification methods.

Comparison with Other Representations
To assess the ability of the PCVM classifier, two existing descriptors, i.e., MOE (60 dimensions) and Connectivity (44 dimensions), found in RDKit [49], a popular cheminformatics package, were applied to represent the GPCR datasets, and the results were compared with those of SH. The comparison in terms of Accuracy (ACC) and Matthews Correlation Coefficient (MCC) is listed in Tables 7 and 8. Additionally, Figure 5 illustrates the maximum values obtained for each descriptor with PCVM when all measures are taken into consideration. Table 7 suggests that the highest accuracy was obtained for the SH-based variant and equals 0.862 for Q9Y5N1. Thus, from the results in Tables 7 and 8 we can conclude that the spherical harmonics-based representation was able to handle all the datasets. Most importantly, the results for the harmonics-based representation (Tables 7 and 8 and Figure 5) show that using the SH-based descriptor influences the prediction of molecular activity. Although the harmonics-based representation has the same length as the MOE-type descriptor, it improved the effectiveness of predicting active molecules. The results for the remaining datasets, which indicate that SHPCVM is very promising for molecular activity prediction, are available in the Supplementary Materials.

Materials And Methods
In this section, we give a brief introduction to the datasets used in the computer experiments. Then we introduce the details of PCVM, SVM, Random Forest, the Bayesian classifier and KNN. We also briefly describe the representation descriptors, including the characteristics of the Spherical Harmonics-based approach.

Datasets
To obtain the data, we partially repeated the steps described in [54]. We downloaded data for 3052 G-protein coupled receptors from the UniProt database [55], which contains 825 human GPCR proteins. For these, we obtained 519,051 GPCR-ligand interactions from the GLASS database [56]. To ensure the effectiveness of the computer experiments, we sorted the GPCRs by the number of interacting ligands, as done in [54]. Since some individual GPCRs have very few ligands or none, a threshold on the minimum number of ligands each target is expected to have was set to 600. Finally, we selected the 21 proteins listed in Table 9. In consequence, there is one individual representing family F (Q99835), two representatives of class C (P41180, Q14416), one target from family B (P47871), and the remaining representatives are associated with class A. All ligands were gathered from the ChEMBL database [57].
Several measures may be employed to quantify the activity of molecules, including IC50, EC50, Ki, Kd, etc. [58]. We followed the approach of Wu et al. [54] and used the p-bioactivity, defined as −log10(val), where val is the raw bioactivity. The raw bioactivities of ligands vary over a large range; taking logs reduces the magnitude of the data in relation to the other variables, and no properties of the model are lost. In the datasets the activity range is extremely diverse: the smallest activity value is −12 and the largest is 4. For ligands with more than one activity value, we take the mean as the final p-bioactivity value. The inactive molecules are those which do not interact with the target GPCR; we selected them randomly from the set of irrelevant GPCR data, as described in [54]. In consequence, the number of inactive compounds for a given GPCR target is about 30% of the actives (see Table 9). Unfortunately, the number of irrelevant compounds considered as inactive is smaller than the number of active compounds, which is why the data are unbalanced.
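The conversion to p-bioactivity and the averaging of repeated measurements can be sketched as follows; the raw values below are hypothetical:

```python
import math

def p_bioactivity(raw_values):
    """Convert raw bioactivity values (e.g., Ki) to the p-bioactivity
    scale -log10(val); repeated measurements for a ligand are averaged."""
    pvals = [-math.log10(v) for v in raw_values]
    return sum(pvals) / len(pvals)

# A ligand measured twice, at 1e-9 and 1e-7 molar:
# the two p-values are about 9 and 7, so the mean is about 8.
print(p_bioactivity([1e-9, 1e-7]))
```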
Please note that to address the imbalanced dataset problem, we also attempted to select the compounds with the lowest activity as inactives; in these experiments we considered values below −10. Adding such extra molecules decreased the results in the range of 0.222 to 0.375. We believe this was caused by the fact that low-activity compounds were labeled as inactive.

Spherical Harmonics-Based Descriptor
To clearly present the Spherical Harmonics-based descriptor, we briefly introduce the concept of Spherical Harmonics and our feature selection idea in the following two subsections.

Spherical Harmonics
Spherical harmonics are a set of solutions to Laplace's equation in spherical coordinates [80,81]. They form a set of basis functions

Y_l^m(θ, φ) = S_l^m P_l^m(cos θ) e^{imφ},

where P_l^m denotes the associated Legendre polynomials, which are real-valued and defined over the range [−1, 1], and S_l^m is a normalization factor.
We introduce the concept of a spherical depth map, a function that provides the distance between two atoms. Thus, one can consider a molecule in a spherical depth map as a spherical function f(θ, φ) that may be expanded into a linear combination of all spherical harmonics scaled by their associated Fourier coefficients c_{lm}:

f(θ, φ) = ∑_{l=0}^{∞} ∑_{m=−l}^{l} c_{lm} Y_l^m(θ, φ).    (5)

For molecular representation we need only real-valued spherical harmonics; the real spherical harmonic basis functions are shown in Figure 6. The spherical harmonic features (coefficients) are given by projecting f onto the basis:

c_{lm} = ∫ f(θ, φ) Y_l^m(θ, φ) dΩ.    (6)

In consequence, the spherical harmonics descriptor V is a k-dimensional vector of per-band energies

v_l = ∑_{m=−l}^{l} |c_{l,m}|²,

where N is the bandwidth, which is important to achieve a certain concentration factor, and dim(V) ≤ N². Furthermore, V is rotation invariant.
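The rotation-invariant energy vector can be sketched as follows. The per-band coefficient arrays here are made up; in the real pipeline they would come from the spherical-harmonics decomposition of the depth map:

```python
import numpy as np

def sh_energy_descriptor(coeffs):
    """Rotation-invariant SH descriptor: for each band l, sum the squared
    magnitudes of the 2l+1 coefficients c_{l,m}, m = -l..l.
    `coeffs` is a list of per-band arrays: coeffs[l] has length 2l+1."""
    return np.array([np.sum(np.abs(c) ** 2) for c in coeffs])

# Bandwidth N = 3: bands l = 0, 1, 2 with 1, 3 and 5 coefficients each.
coeffs = [np.array([2.0]), np.array([0.0, 1.0, 0.0]), np.zeros(5)]
print(sh_energy_descriptor(coeffs))  # [4. 1. 0.]
```

Because each v_l aggregates all orders m within a band, the vector is unchanged when the molecule is rotated, which is what makes the descriptor rotation invariant.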

Feature Selection
Interestingly, spherical harmonics are able to capture a variety of geometric properties of an object. The molecule's model is characterized by the energies at different frequencies of the spherical harmonics: high frequencies capture fine details, whereas low frequencies reveal gross information. In other words, small values of l in Equation (5) correspond to low frequencies, and higher values of l give more details.
Nevertheless, the SH descriptor itself may produce numerous features. It is, of course, one of many descriptors which may be employed for classification. However, the number of features included in well-known descriptors (the SH descriptor, among others) can be high. Such high dimensionality combined with a comparatively small sample size usually degrades a classifier's performance, a phenomenon known as the curse of dimensionality [82]. This shows that a well-defined dimensionality reduction scheme may improve the performance of a prediction model. Feature selection algorithms reduce the dimensionality of the input by selecting only a subset of features.
Feature selection approaches can be divided into filters [83] and wrappers [84]. Filters perform feature selection independently from the learning process. Wrappers combine the learning process and feature selection to select an optimal subset of features. Here, we apply Minimum Redundancy Maximum Relevance feature selection approach (MRMR) [85]. It represents a filter-based methodology.
Generally, it selects highly predictive but uncorrelated features. The features are ranked according to the minimal-redundancy-maximal-relevance criteria.
Let X and Y be two random variables. Their mutual information is defined as

I(X, Y) = ∫∫ p(x, y) log [ p(x, y) / (p(x) p(y)) ] dx dy,

where p(·) is the probability density function and x and y are realizations of X and Y. The MRMR criterion combines

max D(S, y),  D(S, y) = (1 / |S|) ∑_{x_i ∈ S} I(x_i, y)    (max relevance),
min R(S),     R(S) = (1 / |S|²) ∑_{x_i, x_j ∈ S} I(x_i, x_j)    (min redundancy),

where S is the set of n input variables.
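A minimal sketch of greedy MRMR selection with a discrete mutual-information estimate. The toy features below are constructed so that a redundant copy of an already-selected feature is skipped in favor of a weaker but non-redundant one:

```python
import math
from collections import Counter

def mutual_info(xs, ys):
    """I(X;Y) for discrete samples, estimated from empirical frequencies."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(c / n * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def mrmr(features, y, k):
    """Greedy MRMR: repeatedly add the feature maximizing relevance to y
    minus mean redundancy with the already-selected set."""
    selected = []
    while len(selected) < k:
        def score(j):
            rel = mutual_info(features[j], y)
            red = (sum(mutual_info(features[j], features[s]) for s in selected)
                   / len(selected)) if selected else 0.0
            return rel - red
        best = max((j for j in range(len(features)) if j not in selected),
                   key=score)
        selected.append(best)
    return selected

y  = [0, 0, 0, 0, 1, 1, 1, 1]
f1 = [0, 0, 0, 1, 1, 1, 1, 1]   # strongly relevant
f3 = f1[:]                      # redundant copy of f1 -> should be skipped
f2 = [0, 0, 1, 1, 0, 1, 1, 1]   # weakly relevant but non-redundant
print(mrmr([f1, f3, f2], y, k=2))  # [0, 2]
```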

Descriptor Computation
To sum up, the procedure used to calculate our Spherical Harmonics-based descriptor includes the following steps which are also depicted in Figure 7.
1. Reading in each atom's type, coordinates, temperature factor and occupancy.
2. Placing the molecule into a common frame of reference.
3. Scaling so that each molecule fits within the unit ball.
4. Placing an orthogonal grid around each molecule.
5. Building the so-called spherical depth map, which provides the distance to the closest atoms.
6. Using the grid values to perform the decomposition into spherical harmonics.
7. Learning the most informative Spherical Harmonics features by applying the feature selection strategy of Section 3.2.2 to the vector of coefficients given in (5) and (6).

In our approach, feature selection finds the most discriminative features (more precisely, the types of features) before the training phase. Tests are performed on external data that was never used for either feature selection or training. All in all, the SH-based descriptor is shorter than the raw SH descriptor since it contains only the most descriptive types of features. Removing irrelevant features improves prediction and increases the interpretability of the classification model.
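Steps 2 and 3 of the procedure above can be sketched as follows, assuming only atomic coordinates as input:

```python
import numpy as np

def normalize_coords(coords):
    """Steps 2-3 of the descriptor pipeline: move the molecule to a common
    frame of reference (centroid at the origin) and scale it so that every
    atom fits within the unit ball."""
    coords = np.asarray(coords, dtype=float)
    coords = coords - coords.mean(axis=0)          # common frame of reference
    radius = np.linalg.norm(coords, axis=1).max()  # farthest atom from origin
    return coords / radius if radius > 0 else coords

# Hypothetical 3-atom molecule.
atoms = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
unit = normalize_coords(atoms)
print(np.linalg.norm(unit, axis=1).max())  # 1.0
```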
Finally, the dimension of the descriptor presented in the paper is 60; this length was chosen arbitrarily, and we leave the optimal selection of the number of coefficients for further studies. It is worth mentioning that since our SH-based descriptor depends on the 3D structure of the molecule, the molecular conformation influences the prediction ability. Testing different conformations was out of the scope of this paper; nevertheless, our studies suggest that the more faithful the 3D model is, the better the Spherical Harmonics-based representation is expected to be. Discussions of the impact of the 3D structure on the SH-based representation could be a fruitful direction for future work.

Probabilistic Classification Vector Machines (PCVM)
Probabilistic Classification Vector Machines (PCVM) [86] is a probabilistic kernel classifier built on the kernel regression model

∑_{i=1}^{n} w_i φ_{i,θ}(x) + b,

where w_i are the weights of the basis functions φ_{i,θ}(x) and b is a bias term. In this work we adapted the PCVM settings to the molecule classification problem, which is considered a binary classification.
Suppose we have a dataset S = {x_i, y_i}_{i=1}^{n}, where y_i ∈ {−1, +1} (the labels: active and inactive molecules). We employed the probit link function

Ψ(z) = ∫_{−∞}^{z} N(t | 0, 1) dt,

i.e., the cumulative distribution function of the standard normal distribution. An Expectation Maximization approach is used to optimize the parameters. Finally, the model is defined as

l(x; w, b) = Ψ(Φ_θ(x)^T w + b),

where Φ_θ(x) is the vector of basis function evaluations for a molecule x.
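A sketch of the PCVM decision rule under the probit link; the weights, bias and basis-function values below are hypothetical stand-ins for the quantities that EM would produce:

```python
import math

def probit(z):
    """Probit link: cumulative distribution of the standard normal,
    Psi(z) = 0.5 * (1 + erf(z / sqrt(2)))."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def pcvm_predict(weights, bias, basis_values):
    """Probability of the +1 (active) class given precomputed
    basis-function evaluations phi_i(x). The weights and bias are
    assumed to have been trained by EM; here they are made up."""
    z = sum(w * p for w, p in zip(weights, basis_values)) + bias
    return probit(z)

# z = 1.0*2.0 - 0.5*1.0 + 0.0 = 1.5, so P(active) = Psi(1.5) ~ 0.933.
print(round(pcvm_predict([1.0, -0.5], 0.0, [2.0, 1.0]), 3))  # 0.933
```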

Other Approaches
In order to further evaluate the performance of SHPCVM, we separately train the state-of-the-art classifiers described in the following subsections, using the Spherical Harmonics-based representation to encode the molecules.

Support Vector Machines (SVM)
SVM [53] is a state-of-the-art machine learning method that finds a hyperplane to separate data from different classes. SVM has been widely used in chemoinformatics and its generalization performance is significantly better than that of competing methods [87]. The choice of similarity measure is a vital step in increasing the performance of SVM. Typically, a positive semi-definite similarity measure between data points (i.e., a kernel) is applied.
For the class of hyperplanes in a dot product space H, SVM classifies samples using a decision function of the form

f(x) = sign(⟨w, x⟩ + b),

where b ∈ R is the bias weight and w ∈ H are the feature weights.
For a linearly separable set of observations, a unique optimal hyperplane exists, distinguished by the maximal margin of separation between any observation x_i and the hyperplane. The optimal hyperplane is the solution of

min_{w, b} (1/2) ||w||²  subject to  y_i(⟨w, x_i⟩ + b) ≥ 1.

For a nonlinear decision function, the kernel trick is applied and f can be defined as

f(x) = sign( ∑_i α_i y_i k(x, x_i) + b ),

where k : H × H → R, (x, x′) → k(x, x′) is the kernel.
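A scikit-learn sketch of the kernel trick on a toy non-linearly separable problem; the data and labels are made up:

```python
import numpy as np
from sklearn.svm import SVC

# Two concentric clusters: an inner group (label +1) and points on a
# ring of radius 2 (label -1). Not separable by a linear hyperplane.
rng = np.random.default_rng(0)
inner = rng.normal(scale=0.3, size=(50, 2))
outer = rng.normal(size=(50, 2))
outer = outer / np.linalg.norm(outer, axis=1, keepdims=True) * 2.0
X = np.vstack([inner, outer])
y = np.array([1] * 50 + [-1] * 50)

# The RBF kernel k(x, x') = exp(-gamma * ||x - x'||^2) implicitly maps
# the data to a space where a separating hyperplane exists.
clf = SVC(kernel="rbf", gamma=1.0).fit(X, y)
print(clf.score(X, y))  # training accuracy on this toy problem
```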

Random Forests (RF)
A Random Forest is a supervised machine learning methodology that can be used to classify data into activity classes [88]. Formally, we consider a collection of randomized base regression trees m_n(x, Θ_m, S_n), where Θ_1, Θ_2, . . . capture the randomness in the tree construction. The random trees are combined to form the aggregated regression estimate

m̄_n(x, S_n) = E_Θ[m_n(x, Θ, S_n)],

where S_n = {(X_1, Y_1), (X_2, Y_2), . . . , (X_n, Y_n)} ⊂ R^d × R is a training sample of independent and identically distributed random variables and E_Θ denotes the expectation with respect to the random parameter Θ.
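A scikit-learn sketch of a Random Forest on hypothetical descriptor data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical data: 100 molecules, 10 descriptor features; the binary
# activity label is made to depend on the first two features only.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Each tree is grown on a bootstrap sample with randomized splits;
# the forest aggregates the individual tree predictions by voting.
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
print(rf.score(X, y))  # training accuracy
```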

K Nearest Neighbours (KNN)
The K Nearest Neighbours classifier is a relatively simple classification model which uses a known dataset of molecules to classify a new compound by polling the closest molecules in the known dataset. More precisely, the new compound is assigned the class with the majority representation among its k nearest neighbours.
The goal is to classify a new molecule mol against a known set M (made up of molecules m_i, where i = 1, . . . , |M|). Each molecule m_i is described by a set of features, i.e., a vector V_i = (f_1, f_2, f_3, . . . , f_n) (the descriptor). Formally, for each m_i ∈ M the distance between the new molecule mol and m_i is calculated as

d_i = δ(V_mol, V_i),

where δ(·) is a distance metric. The voting strategy may then be defined as

vote(y_j) = ∑_{c=1}^{k} 1(y_j = y_c),

where y_j, y_c ∈ Y (the set of labels: active and inactive) and y_c is the label of the c-th nearest neighbour.
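The distance-and-voting procedure can be sketched as follows; the 2-D descriptors and labels are hypothetical:

```python
from collections import Counter

def knn_predict(train, labels, query, k=3):
    """Classify `query` by majority vote among its k nearest neighbours,
    using Euclidean distance as the metric delta."""
    dist = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5
    ranked = sorted(range(len(train)), key=lambda i: dist(train[i], query))
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy descriptors: actives (+1) near the origin, inactives (-1) far away.
train = [(0.0, 0.1), (0.1, 0.0), (0.2, 0.2), (3.0, 3.0), (3.1, 2.9)]
labels = [+1, +1, +1, -1, -1]
print(knn_predict(train, labels, query=(0.1, 0.1)))  # predicted +1 (active)
```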

Naïve Bayes (NB)
Naïve Bayes classifier [89] is a linear classifier that assumes the features in a descriptor are mutually independent.
Suppose a given molecule m is assigned the activity class

A* = argmax_a p(a | m).

NB uses Bayes' rule,

p(a | m) = p(a) p(m | a) / p(m).

To estimate p(a | m), i.e., the probability of the molecule m being in class a, NB uses

p(a | m) ∝ p(a) ∏_{i=1}^{n} p(x_i | a),

where V = (x_1, x_2, x_3, . . . , x_n) is the feature vector that describes molecule m.
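A minimal sketch of the Naïve Bayes estimate for binary features, with Laplace smoothing; the training data are made up:

```python
import math

def train_nb(samples, labels):
    """Estimate p(a) and per-feature p(x_i | a) for binary features,
    with Laplace smoothing; features are assumed mutually independent."""
    classes = set(labels)
    prior, cond = {}, {}
    n_feat = len(samples[0])
    for a in classes:
        rows = [s for s, lab in zip(samples, labels) if lab == a]
        prior[a] = len(rows) / len(samples)
        cond[a] = [(sum(r[i] for r in rows) + 1) / (len(rows) + 2)
                   for i in range(n_feat)]
    return prior, cond

def predict_nb(model, x):
    """A* = argmax_a [ log p(a) + sum_i log p(x_i | a) ]."""
    prior, cond = model
    def log_post(a):
        return math.log(prior[a]) + sum(
            math.log(p if xi else 1 - p) for xi, p in zip(x, cond[a]))
    return max(prior, key=log_post)

# Toy binary descriptors for two actives and two inactives.
X = [(1, 1, 0), (1, 0, 0), (0, 1, 1), (0, 0, 1)]
y = ["active", "active", "inactive", "inactive"]
model = train_nb(X, y)
print(predict_nb(model, (1, 1, 0)))  # active
```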

Conclusions
In this article, we propose a novel molecular activity prediction method called SHPCVM. More specifically, there are two main contributions of the paper.
• We have introduced the novel Spherical Harmonics-based descriptor. The key principle of our approach is not the use of Spherical Harmonics themselves but the fact that our technique applies a feature selection strategy (Minimum Redundancy Maximum Relevance) to retain only representative features. Such an approach leads to a more interpretable representation. More importantly for us, the resulting vector representation is relatively short, which reduces computational costs. Our approach is therefore especially relevant for molecular activity prediction settings where labeled examples are scarce and a low-dimensional descriptor is required.
• We have tested several machine learning methods, namely Probabilistic Classification Vector Machines (PCVM), Support Vector Machines (SVM), Random Forests (RF), Naïve Bayes (NB) and K Nearest Neighbours (KNN), on molecules described by the proposed Spherical Harmonics-based model. The results show that the combination of PCVM and the Spherical Harmonics-based descriptor is superior to the other approaches when predicting the activity of small compounds, revealing the influence of PCVM.
Experimental results for G protein-coupled receptors (GPCRs) demonstrate that SHPCVM produces the best performance, with Accuracy ranging from 0.742 to 0.862 and Matthews Correlation Coefficient from 0.691 to 0.794. Although the goal was to find a tradeoff between the descriptive capabilities and the computational cost of the descriptor, our approach may also pave the way for more interpretability-oriented research on computational models of molecules.