Using Big Data Analytics to “Back Engineer” Protein Conformational Selection Mechanisms

In the living cells, proteins bind small molecules (or “ligands”) through a “conformational selection” mechanism, where a subset of protein structures are capable of binding the small molecules well while most other protein structures are not capable of such binding. The present work uses machine learning approaches to identify, in a very large amount of protein:ligand complexes, what protein properties are associated with their capacity to bind small molecules. In order to do so, we calculate 40 physicochemical properties on about 1.5 millions of protein conformations: ligand and protein conformations. This work describes a machine learning approach to identify the unique physico-chemical descriptors of a protein that maximize the prediction rate of potential protein molecular conformations for the test case proteins ADORA2A (Adenosine A2a Receptor), ADRB2 (Adrenoceptor Beta 2) and OPRK1 (Opioid Receptor Kappa 1). We find adequate machine learning techniques can increase by an order of magnitude the identification of “binding protein conformations” in an otherwise very large ensemble of protein conformations, compared to random selection of protein conformations. This opens the door to the systematic identification of such “binding conformations” for proteins and provides a big data approach to the conformational selection mechanism.


Introduction
One of the frontier domains of biological research is to determine which small molecules, such as substrates or pharmaceuticals, are more likely than other small molecules to bind on a protein. Among the vast amount of small organic molecules present in living cells, only a small fraction-sometimes only a single chemical species-will normally bind to a specific protein. This specificity of protein:ligand is a central concept in chemical biology and in biochemistry, and the basis of virtually all projects in small molecules drug discovery. Yet, major questions remain on the mechanisms of protein:ligand specificity.
The widely accepted view of protein:ligand interactions is that a protein cycles through multiple conformations, and that a few of these protein conformations will be bound by the small molecules. Modern biophysical tools such as virtual docking [1], that predict the probability of a given small chemical to bind to a given protein, are used routinely in fundamental and industrial research, but they are limited because they usually consider only one protein conformation in an "induced fit" mechanism rather than an ensemble of conformations.
In our previous work we have performed massive systematic computational characterization of protein:ligands interactions for a very large number of protein conformations and small molecules. We have shown that we can identify the small number of protein conformations that will be "selected" for binding by their ligands. In principle these Molecules 2022, 27, 2509 2 of 14 "binding" conformations will lead to free energy minima in the (protein + ligand) complex free energy hypersurface. In our research we are aiming to understand why these rare apo-conformations possess this capacity to bind their ligands, while the vast majority of the other protein conformations do not. Being able to successfully identify and predict the properties of binding conformations that render them able to bind their ligands would be a transformational change in our understanding of conformational selection, in that it will open the door to the prediction of which protein conformations should be used in the drug discovery pipeline that relies on computational docking. That will massively reduce the enormous cost of docking, by focusing on these binding confirmations alone.
We also will answer several very important fundamental questions, beyond the nature of the protein properties responsible for conformational selection: are these properties similar for all proteins? Are they similar for classes of structurally close proteins? Or are they different from all proteins? Are these properties related also to the chemical properties of the ligands, as hypothesized behind most docking approaches? This work contributes to such characterization of what properties of an apo-protein conformation leads to conformational selection.
The fundamental approach is relatively simple: from our previous work we already [2] know which few protein conformations, among thousands, will lead to conformational selection, and we know which small molecules will be ligands of the proteins in these binding conformations. Calculating physico-chemical descriptors of the proteins, and building simple regression models could separate the protein conformations that bind ligands, from the protein conformations that do not bind ligands. However, the sheer amount of data renders "simple regression" calculations all but impossible. The data set we use here corresponds to about 1.5 millions of protein conformation and protein:ligand complex, and their associated interaction energies. A small fraction of this dataset contains a few hundreds of protein binding conformations that lead to conformational selection, and most of the rest of the data corresponds to protein conformations that are not statistically binding well to ligands. Calculating even "only" 40 physiochemical descriptors properties of the protein conformations, leads very quickly to a complexity that only Big Data Analytics can handle.
In addition, there are fundamental issues that are specific to the field of Big Data analytics that we aim at addressing in this work. Most of the real-world biomedical datasets suffer from statistical ill-conditioning issues such as class imbalance problem [3], which arises due to imbalanced groups or sub-categories present in the data. In this scenario, typically, the majority class or larger group of data that consists of non-binding protein conformations overshadows the minority class or smaller data group that comprises the data-of-interest, i.e., the binding protein conformations. In such a case, any machine learning (ML) technique applied for data-learning on a biased population dataset could result in higher misclassification of the smaller population of binding conformations, since the decision-making process is more biased towards the larger population of non-binding conformations. Conventional ML algorithms are not directly equipped to handle the class imbalance problem during the data-learning phase and therefore a new two-stage sampling-based classifier framework was proposed in our previous work [4,5] which aimed at tackling the class imbalance problem and maximizing the detection of potential binding conformations. This paper extends the use of the two-stage sampling-based classification approach with additional feature scoring and feature selection methods in conjunction with an Enrichment ratio framework, in order to validate and gain deeper understanding about the unique physico-chemical attributes of the selected protein conformation predictions from the ML framework. The feature scoring method uses three different feature selection methods to select the physio-chemical features or descriptors of potential drugbinding molecular conformations for target proteins ADORA2A, ADRB2 and OPRK1. The two-stage sampling-based classifier to maximize the prediction rate of potential binding conformations for target proteins ADORA2A, ADRB2 and OPRK1 and uses the proposed Enrichment ratio framework for the validation of the prediction results obtained from the ML framework. The goal of our ML-based feature analysis process is to advance our understanding of the significance of specific physico-chemical attributes or descriptors than make certain protein conformations more conducive to the protein:ligand binding process.

Analysis of Variance (ANOVA)
Analysis of Variance (ANOVA) is a statistical technique to calculate the linear dependency between each feature and the target variable and select the features with the highest F-values [6]. The ANOVA technique is performed between each feature and the target vector and the obtained F-value is assigned to that feature, wherein the features are ranked based on their information content. The higher the F-value, the more important is the feature. Here top 'K' features with the highest F-values were retained, where the K features to be selected is experimentally determined by the user. Therefore, ANOVA technique provides selection of primary physio-chemical protein features that play an important role in protein:ligand binding and conformation selection process.

Mutual Information (MI)
Mutual Information (MI) is defined as a measure between two random variables X and Y, that determines the amount of information obtained about one variable, through the other random variable [7]. In information theory, entropy is an important basis used for uncertainty or known information measurement. Given a variable Y and the conditional entropy H(X|Y) of X with respect to Y is defined as: p(x,y) is the joint probability density function. • p(x|y) is the posterior probabilities of X given Y. • p(x) is the probability density function From Equations (1) and (2), mutual information I(X;Y) [8] can be defined as below: Using Equation (3), if X and Y are independent of each other, i.e., if they are unrelated, then their MI value is 0. Similarly, if X and Y are dependent on each other, then their MI value is 1. Here independent implies that no information of Y can be obtained using X and dependent implies that we can determine X from Y or vice-versa. MI measure for the physio-chemical features can be determined as follows:

•
Compute the MI to measure the dependency between the physio-chemical features vector and the target variable for all features.

•
The features are then ranked based on their MI values.

•
The top K features with highest MI values are retained where K is defined by the user.
Thus, MI explains the unique information present in the physio-chemical protein features that assists in protein conformation selection process.

Recurrence Quantification Analysis
Recurrence Quantification Analysis (RQA) is a numerical data analysis and statistical technique that is used for the study of non-linear dynamical systems [9]. The technique quantifies the number and duration of occurrences of a system presented by its state space trajectory. Recurrence analysis starts by quantifying the repeating patterns of the plot. Entropy of the distribution of the diagonal lines (ENTR), one of the measurements retrieved from the plots is the probability distribution p(l) of the diagonal line on the RQA plot is defined as: where, -N is the number of points on the state space trajectory l is the length of the diagonal line in the RQA plot We evaluate the RQA-based entropy measure to understand the entropy with respect to time-space evolution of protein conformations and its relationship to probability of discovering potential binding conformation.

Logistic Regression
Logistic regression (LR) is a probabilistic statistical ML technique used to perform predictive analysis on data that is categorical or has binary classes. The main objective of LR is to determine the best fitting model that describes the relationship between a dependent binary variable against a group of independent variables [10]. Let y be the binary outcome indicating failure/success with values labeled as 0 or 1, respectively, and p be the probability of y = 1 as p = 1. Conversely, y = 0 can be expressed as 1 − p. LR then models the outcome y based on the linear combination of the independent data variables b 1 , b 2 , . . . , b n and their respective parameter/weight values A 1 , A 2 , . . . , A n via maximum likelihood method as given by: LR technique is used in this work for supervised classification and prediction of potential binding vs. non-binding protein conformations selection.

SMOTE Algorithm
Synthetic minority Over-sampling Technique (SMOTE) is a well-liked method which helps in solving the class imbalance problem by creating new artificial data points rather than by over-sampling with replacement of the minority class [11]. The algorithm is elaborated below [12][13][14]: Oversampling of the minority class is carried out by taking the difference between the feature vector and its nearest neighbors.

•
Multiply the difference obtained in the last step by a random weight between 0 and 1, and then add it to the feature vector under consideration. This causes the selection of an arbitrary point along the line segment between the two specific features.
This approach of the algorithm coerces the decision region of the minority class to be more general and therefore results in a desired well-balanced dataset.

Gaussian Naive Bayes Classifier
Gaussian Naive Bayes Classifier (GB) is a conditional probabilistic ML technique that is widely used for supervised data classification. According to Bayes theorem, the posterior probability can be expressed as the product of prior probability and likelihood as described by: P(y|x) = (P(x|y) × P(y))/P(y) where, • P(y|x) is called the posterior probability. It gives the probability of y given that the data point x is true. • P(x|y) is called the likelihood. It gives the probability of point x given that the data point y is true.
• P(y) is called the prior probability of y. It gives the probability of y being true across all data points. • P(x) is the probability of data point x averaged over all of the value of y.
Using Equation (6), posterior probability can be calculated for different features in the dataset. In case, when the features in the dataset has continuous values then the likelihood of the features is assumed to be Gaussian [15] and, hence, conditional probability is given by: where, • µ y denotes the mean of class y • σ 2 y denotes the variance of class y Since GB technique is probabilistic in nature, it is used in this work for supervised classification and prediction that is biased towards identification of potential binding protein conformations.

K-Nearest Neighbor Classifier
The K-Nearest Neighbors (KNN) algorithm is a non-parametric ML technique that is widely used for its easy interpretation and low calculation time for classification and regression problems. Given a testing example, the KNN algorithm directly searches through all the training examples by calculating the distances between the testing examples and the training dataset in order to identify its nearest neighbors and give a classification output [16]. The Euclidean distance, as defined in Equation (8), is used as the distance metric which assigns the testing example to the class among its k nearest neighbors (where k is an integer). KNN technique is another supervised classification method used in this work for prediction of potential binding vs. non-binding protein conformations selection. where, • x is the new data pint • y i is the data point in the training set

Confusion Matrix
In the ML supervised learning domain, confusion matrix is often used to represent the summary of classification or prediction results. The confusion matrix terminology for a binary classification has four cases that are used to interpret the classification/prediction results as follows: Where class 0 refers to the non-binding conformations and class 1 denotes the potential binding conformations.

ML-Based Feature Selection and Two Stage Sampling-Based Classification Framework
The proposed methodology aims to identify the recurrent unique physio-chemical protein features observed in the target proteins ADORA2A, ADRB2 and OPRK1 considered in our study. For this, we combine the feature selection techniques discussed in Section 2 with the two stage sampling-based classifier framework from our previous work [4,5] as outlined in steps below and illustrated in Figure 1:

Dataset Description
In our work, proteins ADORA2A (Adenosine A2a Receptor), ADRB2 (Adrenoceptor Beta 2) and OPRK1 (Opioid Receptor Kappa 1) were used for experimental validation of our proposed methods. The conformations of the three proteins have been thoroughly characterized, and the protein conformations that i) will bind to ligands (binding conformations) and ii) will not bind to ligands (non-binding conformations), are known and have been documented and published [2]. ADORA2A: The dataset has 50 attributes and consists of 2998 molecular conformations among which 851 molecular conformations are "binding" and 2147 molecular conformations that are "non-binding".
ADRB2: The dataset has 51 attributes and consists of 2565 molecular conformations among which 156 are binding and 2411 molecular conformations are non-binding.
OPRK1: The dataset has 50 attributes and consists of 2998 molecular conformations among which 138 molecular conformations are binding and 2862 molecular conformations are non-binding. Table 1 below describes the protein descriptors for ADORA2A, ADRB2 and OPRK1 datasets that are used in this work.

Dataset Description
In our work, proteins ADORA2A (Adenosine A2a Receptor), ADRB2 (Adrenoceptor Beta 2) and OPRK1 (Opioid Receptor Kappa 1) were used for experimental validation of our proposed methods. The conformations of the three proteins have been thoroughly characterized, and the protein conformations that i) will bind to ligands (binding conformations) and ii) will not bind to ligands (non-binding conformations), are known and have been documented and published [2]. ADORA2A: The dataset has 50 attributes and consists of 2998 molecular conformations among which 851 molecular conformations are "binding" and 2147 molecular conformations that are "non-binding".
ADRB2: The dataset has 51 attributes and consists of 2565 molecular conformations among which 156 are binding and 2411 molecular conformations are non-binding.
OPRK1: The dataset has 50 attributes and consists of 2998 molecular conformations among which 138 molecular conformations are binding and 2862 molecular conformations are non-binding. Table 1 below describes the protein descriptors for ADORA2A, ADRB2 and OPRK1 datasets that are used in this work.

Enrichment Ratios
In order to validate the ML protein conformation selection/prediction framework, the TP and FN predictions from the ML prediction framework described in Section 3 were used to calculate the enrichment ratio. A base enrichment ratio is calculated to gauge the generic prediction performance efficacy in the absence of the ML protein conformation selection framework. The base enrichment ratio was calculated by taking the number of binding conformations and dividing by the total number of conformations, from [2]. Equation (9) is used to calculate a baseline enrichment ratio that would have been found in testing stages if ML algorithms had not been implemented. Equation (10) was used to calculate an enrichment ratio from the predictions returned by the ML prediction framework. The values returned from both Equations (9) and (10) were then used to calculate the final enrichment ratios returned by each of the four filters as defined in Equation (11). ML enrichment ratio baseline enrichment ratio = final enrichment ratio (11) All "Filters" in the tables and text below refer to the compound selection methods described in Section 4.2.

Enrichment Ratio Framework
The ML approaches discussed predict a binding/non-binding property for protein conformations without a specific "score" or a way to otherwise select rational subsets of the true positive (TP) and false negative (FN) predictions. However, in order to calculate the enrichment ratios as described in Equation (9) through (11), we have to select different subsets of the TP and FN results. In order to do this, we have chosen to base the subset data selection on the predicted protein:ligand interactions energies that were calculated and published previously [2]. The rationale is that, assuming that the calculated protein:ligand interaction energies are quantitatively correct, a "preferred" binding confirmation would be a conformations where the protein binds the ligand stronger (i.e., with better interaction energies), than in other conformations. Four different filters were used to calculate final enrichment ratios for proteins OPRK1, ADORA2A, and ADRB2 and are described below and are illustrated in Figure 2. used to calculate final enrichment ratios for proteins OPRK1, ADORA2A, and ADRB2 and are described below and are illustrated in Figure 2.
• Filter A: Takes all TP and FN conformations predicted by ML algorithms, and select a subset of a X% of the lowest energy for each individual conformation • Filter B: Takes all TP and FN conformations predicted by ML algorithm and takes a random Y% of those conformations and returns back to filter A. • Filter C: Takes all TP and FN conformations predicted by ML algorithm, sorts conformations by binding energy and takes a random X% of conformations with the lowest protein:ligand binding energy • Filter D: Takes all TP and FN conformations predicted by ML algorithm and uses the maximum binding energy calculated from filter C and selects all binding energies inferior to that energy.  Table 2 gives the overview of Enrichment ratios that were calculated using the predicted binding conformations from the ML framework for ADORA2A. The ML framework was trained on 30% of the dataset and tested on the remaining 70% of the dataset, as shown in Table S1 through Table S3. It can be seen that the LR + SMOTE − KNN classifier gave the maximum enrichment ratio of 11.0, using data selection filter C (and about the same values when using LR + SMOTE − GB classifier data selection filter B).  Table 2 gives the overview of Enrichment ratios that were calculated using the predicted binding conformations from the ML framework for ADORA2A. The ML framework was trained on 30% of the dataset and tested on the remaining 70% of the dataset, as shown in Table S1 through Table S3. It can be seen that the LR + SMOTE − KNN classifier gave the maximum enrichment ratio of 11.0, using data selection filter C (and about the same values when using LR + SMOTE − GB classifier data selection filter B).  Table 3 gives the overview of Enrichment ratios that were calculated using the predicted binding conformations from the ML framework for ADRB2. The ML framework was trained on 30% of the dataset and tested on the remaining 70% of the dataset. As in the ADORA2A case, the LR + SMOTE − KNN classifier gave the maximum enrichment ratio of 21.7, with data filter selection B.  Table 4 gives the overview of Enrichment ratios that were calculated using the predicted binding conformations from the ML framework for OPRK1. The ML framework was trained on 30% of the dataset and tested on the remaining 70% of the dataset. The LR + SMOTE − KNN classifier gave the maximum enrichment ratio of 20.1, using data filter selection A.  Tables 5-7 list out the physicochemical features common to different proteins. It can be observed from Table 5, that 5 features were common between ADORA2A and OPRK1. Table 6 lists the elements that were common between ADORA2A and ADRB2, it can be seen that only 1 feature is common between the two lists of Tables 5 and 6. Table 7 lists the elements that are common between ADRB2 and OPRK1, it can be observed that 3 features are common between the two proteins. No features were found in common between the three proteins. Table 5. Common selected features between ADORA2A and OPRK1 having a feature score of 3.

Computational Evaluation on OPRK1 Dataset
pro_asa_vdw pro_hyd_moment pro_asa_hyd pro_patch_neg_n pro_zquadrapole Table 6. Common selected features between ADORA2A and ADRB2 having a feature score of 3. pro_hyd_moment Table 7. Common selected features between ADRB2 and OPRK1 having a feature score of 3.

Conclusions
In this paper, we extended the use of the two-stage sampling-based classifier framework with additional feature scoring and feature selection methods in conjunction with Enrichment ratio framework, to identify the unique physico-chemical attributes of the binding conformation and based on those attributes understand what makes a protein conformation more conducive to protein:ligand binding process. The enrichments shown in Table 2 through Table 4 suggest that it is indeed possible to identify binding conformations in order of magnitude better than thought a random selection of protein conformations. This is a very encouraging result: it means that there are physico-chemical protein properties that will drive, at least in part, whether or not a protein conformation is more likely to be binding than non-binding. From a machine learning point of view, both LR + SMOTE − GB and LR + SMOTE − KNN classifiers demonstrated a reasonable prediction performance for binding protein conformations as described in (Table S1 through Table S3). In addition, the best enrichment is obtained for the smallest selection of binding conformations (0.5% to 1% of the conformation, see Table 2 through Table 4). This suggests that the "top" predicted binding conformations, in terms of strength protein: ligand energies as defined in Figure 2, contain the most "solid" data in terms of predictability power. This greatly simplifies the computational complexity and cost of this approach, as only a small fraction of the otherwise massive ensemble of conformations needs to be used in subsequent docking or analyzing.
There are however several challenges that limit, for the time being, the application of this approach to any protein conformation ensemble a priori. One limitation is that, as shown in Table 5 through Table 7, there does not seem to be a universal set of physicochemical properties that are consistent across all the proteins studied here. Rather, it appears that each protein has its own unique set of "preferred" physico-chemical features that go into driving conformation selection. While this is not in itself a problem (indeed, individual proteins may very well have different physical and chemical ways to interact with their ligands), it certainly complicates the application of this approach to a novel protein for which no a priori knowledge of ligands does exist, and on which the model could be trained.
In our opinion, this suggests that there is a need to continue this work with (i) other proteins for which (preferably many) known ligands exist, to see how diverse are the physico chemical features in each set, and (ii) using different physicochemical descriptors that the 40 descriptors used here. The physico-chemical descriptors used here are well validated to quantify global protein properties and discriminate between global conformation, but it may be advantageous to use physic-chemical features that are limited to the binding sites of the proteins, where the interaction of the protein with its ligands are taking place. This is not however a trivial task. However, the fundamental result of this current work, that apo-protein conformations possess properties that correlate with conformational selection, makes this effort worthwhile.
Another challenge of this approach is a consequence of the severe class imbalance of the data, which contain much more conformations that are not selected by the ligands than conformations that are selected by the ligands. As shown in Tables S1-S3, the overall model accuracy for all the test case proteins is low due to the class imbalance problem in the dataset, but despite that the model was able to identify the potential binding protein conformations with a smaller training size which is depicted by the sensitivity score in the classification tables in Supplementary Materials. Table S1 for ADORA2A shows that the sensitivity score of LR + SMOTE − KNN method was 80% for a training size of only 30%. This indicates that the LR + SMOTE + KNN method was able to correctly predict 80% of binding protein conformations with only a 30% training size. Thus, the ML-based protein conformation framework described here performs well overall in detection of binding protein conformations, which is what the current work is focusing on. These results provide a baseline against which subsequent performance improvement benefits can be evaluated, in a generalized protein conformational selection model (enrichment ratio framework) through the adoption of ML-based prediction models.
The motivation of this work is to understand the protein properties responsible for conformation selection. Applied to novel target proteins, this approach yielded an enrich-