Protein Conformational States: A First Principles Bayesian Method

Automated identification of protein conformational states from simulation of an ensemble of structures is a hard problem because it requires teaching a computer to recognize shapes. We adapt the naïve Bayes classifier from the machine learning community for use on atom-to-atom pairwise contacts. The result is an unsupervised learning algorithm that samples a 'distribution' over potential classification schemes. We apply the classifier to a series of test structures and one real protein, showing that it identifies the conformational transition with >95% accuracy in most cases. A nontrivial feature of our adaptation is a new connection to information entropy that allows us to vary the level of structural detail without spoiling the categorization. This is confirmed by comparing results as the number of atoms and time-samples are varied over 1.5 orders of magnitude. Further, the method's derivation from Bayesian analysis on the set of inter-atomic contacts makes it easy to understand and extend to more complex cases.


Introduction
The conventional description of protein dynamics asserts that proteins possess intrinsic conformational states.[1] An enzyme may cycle between catalytic and open states.[2] An ion channel may open and close its central pore.[3] A chaperone protein assists transformation of large, hydrophobic proteins from initial, linear shapes to folded ones.[4] X-ray crystallography and cryo-electron microscopy reveal conformations with small motions on the 1-5 Ångstrom level for those proteins that crystallize.[5] Neutron scattering and nuclear magnetic resonance structures of room temperature proteins show greater shape variability, but are usually able to classify structures into a few 'canonical' structures.
Advances in molecular modeling have made it possible to simulate the protein folding process, generating very large numbers of samples, free energy landscapes, and information on kinetics. Nevertheless, computational identification of distinct conformational states from molecular simulations has remained an active area of methodological research.
One of the principal methods developed for visualizing domain motions in proteins is DynDom.[6,7] DynDom works from two input structures and determines relative domain rotations. This allows predicting transition motions from experimentally observed conformers, but not conformations from observed motions. On the other hand, many structure-to-structure similarity classification methods have emerged for this problem.[8,9,10] A recent review of such dimensionality reduction methods for protein conformational spaces noted that nonlinear methods are generally better than Cartesian or linear ones, but that the complexity of assumptions behind those models makes them difficult to work with and adapt.[11,12] This work presents a complete inference method derived from one single statistical hypothesis: that conformational states are defined by sets of contacting residues. Specifically, the conformational state, k, uniquely determines which pairs of residues, u, v, will be touching. Like a weighted coin flip, the contact probability is µ_{k,(u:v)}, independently of all the other contacting pairs. Each conformational state is thus characterized by a vector, µ_k, encoding the set of contacting pairs in state k.
The statistical model derived from this problem statement is termed a Bernoulli mixture model for binary feature classification.[13] The problem setup is similar to the naïve Bayes method.[14] However, because the categories are not known in advance, this is an unsupervised learning and classification problem.
Bernoulli mixture models have been applied extensively in the fields of text subject analysis,[15] optical character recognition,[14] and image feature classification.[13] Essentially all of these applications have been successful at building extremely accurate classification models. The latter work also presents a thorough summary of sampling methods.
However, there remain difficulties in sampling the distribution over categories, µ, especially when the number of categories and reference classifications are not known in advance. The well-known expectation-maximization (EM) algorithm [16] is available in principle, but is not a replacement for sampling. Theoretical work on the EM method [17] shows that redundant categories will result in many circumstances. In this work, we have introduced a prior that eliminates redundant categorizations.
This work presents a novel sampling method for category inference, along with results demonstrating that it creates structurally meaningful categories with > 90% accuracy. Although the potential application space is vast, this work focuses on proving method robustness using well-defined synthetic test problems. Each follows a time sequence mimicking domain motions in proteins, so that the classification accuracy can be judged by correctly assigning categories in time-order.

Theory
A naïve Bayes model (for bit-strings) assumes that structural input samples, i = 1, . . ., N, are generated by first selecting a conformational state, z_i ∈ {1, . . ., K}, with probability π_{z_i}, and then independently deciding whether each point-to-point contact (x_ij, j = 1, . . ., M) is made, with probability µ_{z_i,j}. If contact j is made in sample number i, then x_ij = 1. Otherwise x_ij = 0. The model parameters are θ = (K, π, µ).
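As a concrete illustration, the generative process just described can be sketched in a few lines of Python (a hypothetical toy example, not the paper's code; the state weights `pi` and contact probabilities `mu` below are invented for two well-separated states):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_bernoulli_mixture(N, pi, mu, rng):
    """Generate N binary feature vectors from a Bernoulli mixture.

    pi : (K,) category weights; mu : (K, M) per-category contact
    probabilities. Returns the hidden labels z and the data x.
    """
    K, M = mu.shape
    z = rng.choice(K, size=N, p=pi)               # pick a conformational state
    x = (rng.random((N, M)) < mu[z]).astype(int)  # independent contact coin flips
    return z, x

# Two toy states over M = 4 contacts (invented parameters)
pi = np.array([0.5, 0.5])
mu = np.array([[0.9, 0.9, 0.1, 0.1],
               [0.1, 0.1, 0.9, 0.9]])
z, x = sample_bernoulli_mixture(1000, pi, mu, rng)
```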
It leads to a sampling distribution,

P(x, z|θ) = ∏_{i=1}^{N} π_{z_i} ∏_{j=1}^{M} µ_{z_i,j}^{x_ij} (1 − µ_{z_i,j})^{1−x_ij} = ∏_{k=1}^{K} π_k^{N_k} ∏_{j=1}^{M} µ_{kj}^{N_kj} (1 − µ_{kj})^{N_k − N_kj}.

The second line above notes that, once the categorization, z, is known, the sampling distribution is easy to express in terms of feature counts over {i : z_i = k}, the set of samples assigned to category k,

N_k = |{i : z_i = k}|,   N_kj = Σ_{i : z_i = k} x_ij.

The first is the number of samples in set k, and the second is the number of times each contact is seen in that set.
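The feature counts N_k and N_kj are simple sufficient statistics; a minimal sketch of their computation (a hypothetical helper, not the paper's implementation):

```python
import numpy as np

def count_statistics(x, z, K):
    """Per-category sufficient statistics of a Bernoulli mixture.

    N_k[k]   : number of samples assigned to category k
    N_kj[k,j]: number of those samples with contact j present
    """
    N_k = np.bincount(z, minlength=K)
    N_kj = np.zeros((K, x.shape[1]), dtype=int)
    for k in range(K):
        N_kj[k] = x[z == k].sum(axis=0)
    return N_k, N_kj

x = np.array([[1, 0], [1, 1], [0, 1]])
z = np.array([0, 0, 1])
N_k, N_kj = count_statistics(x, z, 2)
# N_k  -> [2, 1];  N_kj -> [[2, 1], [0, 1]]
```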
According to Bayes' theorem, [18] we can turn this around to predict two important things: the probability that sample i belongs to category k,

P(z_i = k | x, θ) ∝ π_k ∏_{j=1}^{M} µ_{kj}^{x_ij} (1 − µ_{kj})^{1−x_ij}

(read, z given x and θ), and also the probability distribution over all possible parameters,

P(θ, z | x) = P(x, z | θ) P(θ) / C(x),

where C(x) is an x-dependent normalization constant. Sampling this distribution provides everything: the categorizations, z, the conformational states, π, µ, and even a predicted number of categories, K.
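The first of these predictions, P(z_i = k|x, θ), is straightforward to evaluate in log space; a small sketch under the same invented two-state parameters as the earlier toy example (illustrative only, not the paper's code):

```python
import numpy as np

def category_posterior(x_i, pi, mu):
    """P(z_i = k | x_i, theta) for one binary sample, evaluated in
    log space to avoid underflow when the number of contacts M is large."""
    logp = np.log(pi) + (x_i * np.log(mu)
                         + (1 - x_i) * np.log(1 - mu)).sum(axis=1)
    logp -= logp.max()          # shift before exponentiating, for stability
    p = np.exp(logp)
    return p / p.sum()          # normalize; C(x) drops out

pi = np.array([0.5, 0.5])
mu = np.array([[0.9, 0.9, 0.1, 0.1],
               [0.1, 0.1, 0.9, 0.9]])
p = category_posterior(np.array([1, 1, 0, 0]), pi, mu)
# A sample matching state 0's contact pattern is assigned to it
# with near certainty.
```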
We choose a prior probability,

P(θ|I) ∝ P(θ|U) ∏_{k<l} (1 − B(µ_k, µ_l)),

that enforces uniqueness of the categories. Here B(µ_k, µ_l) = B_kl is the Bhattacharyya similarity between the distributions p and q defined by the two categories' contact probabilities,

B(p, q) = ∏_{j=1}^{M} [ √(µ_kj µ_lj) + √((1 − µ_kj)(1 − µ_lj)) ].

If two categories share the same distribution, then p = q and B(p, q) = 1. This forces our estimator to return zero likelihood that µ_k = µ_l for any k ≠ l.
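For independent Bernoulli features, the Bhattacharyya similarity factorizes over contacts; a minimal sketch (assuming this standard per-feature product form):

```python
import numpy as np

def bhattacharyya(mu_k, mu_l):
    """Bhattacharyya similarity between two vectors of independent
    Bernoulli contact probabilities. Equals 1 exactly when the two
    categories are identical, so the prior factor (1 - B_kl) vanishes
    only for duplicate categories."""
    return float(np.prod(np.sqrt(mu_k * mu_l)
                         + np.sqrt((1 - mu_k) * (1 - mu_l))))

same = bhattacharyya(np.array([0.3, 0.7]), np.array([0.3, 0.7]))
diff = bhattacharyya(np.array([0.3, 0.7]), np.array([0.9, 0.1]))
# same == 1.0 exactly; diff < 1, so the duplicate is assigned
# zero prior weight while distinct categories are allowed.
```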
The second part of Eq. 7, P(θ|U), is a conventional prior used for Bernoulli mixture inference in the literature. We use α = M + 1 throughout this work. The Supplementary Information contains a detailed justification for this choice. Essentially, it forces the likelihood for dividing a category into two parts to be asymptotically insensitive to the number of features, M. A proof of this fact, as well as a useful connection to the information entropy of compression, is presented in the Supplementary Information.

Sampling Method
Our sampling method is traditional Markov chain Monte Carlo using four types of moves: a recategorization move, where the categories, z, are assigned according to P(z|θxI); a reclassification move, where θ is sampled from P(θ|zxU) and accepted with probability ∏_{k<l} (1 − B_kl); and one split and one join rule. The function P(θ|zxU) referred to here is just P(θ|zxI) without the Bhattacharyya distance terms, and with a different constant prefactor.
Generating split or join trial moves was done by randomly choosing either one category to split or two categories to join. For splits, member i of category k is moved to set L with probability η^{x_ij} (1 − η)^{1−x_ij}. We used η = 0.9, but any η ∈ (0.5, 1.0) should work in principle. If all elements end up in L or R, the partitioning is redone. To concentrate splitting on productive cases, we did not attempt to split categories with N_kj = 0 or N_kj = N_k. Immediately after splitting or joining categories, a reclassification move (re-assigning θ) was performed. Category split moves were accepted using the Metropolis criterion, i.e., with probability equal to the smaller of P_gen,join P_split / (P_gen,split P_joined) and one. Explicit formulas for the move generation probabilities (P_gen,split, etc.) are provided in the supplementary information.

Figure 2: Sketch of the three synthetic motions used for scaling tests with number of material points, P, and number of frames, N. The time axis proceeds left to right. For the 'glob' system, the sphere on the right moves behind, and then back out the right again, while the top sphere moves slowly downward.
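The η-partitioning rule for split trial moves, including the redraw when one side ends up empty, might look like this (an illustrative sketch; the redraw loop assumes feature j is informative, i.e., N_kj ≠ 0 and N_kj ≠ N_k, as required above):

```python
import numpy as np

def split_category(x_k, j, eta, rng):
    """Partition the members of one category using feature j.

    Each member i goes to L with probability eta**x_ij * (1-eta)**(1-x_ij),
    so members showing contact j tend to L and the rest to R. The draw is
    repeated whenever all members land on one side, as described above.
    Requires feature j to vary within the category (otherwise no valid
    partition is ever drawn with eta close to 1).
    """
    while True:
        p_L = np.where(x_k[:, j] == 1, eta, 1.0 - eta)
        to_L = rng.random(len(x_k)) < p_L
        if 0 < to_L.sum() < len(x_k):
            return to_L

rng = np.random.default_rng(1)
x_k = np.array([[1, 0], [1, 1], [0, 0], [0, 1]])
to_L = split_category(x_k, j=0, eta=0.9, rng=rng)
```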
Fig. 1 provides a graphical summary of this inference scheme. Each conformational sample is mapped to a bit-string, which is used as the basis for inferring µ. Inference proceeds by sampling potential parameters until a good explanation for the data is found. Trial moves that re-categorize and update µ look horizontally to find better category prototypes. Trial moves that split categories based on the presence or absence of some features allow us to traverse category space vertically.

Test Systems
The ability of our sampling procedure to predict categories was tested on three geometrical systems: 'chomp' (a closing angle), 'heli' (a rotating line), and 'glob' (three rotating spheres). Each system was generated as a time-series of 1000 frames for approximately P total particles in 2- or 3-dimensional space. After generation, Gaussian random noise of width σ = 0.1 was added to every degree of freedom.
Figure 2 shows images of these three test trajectories. A complete description of the coordinate generation methods is given in the Supplementary Information.
All three systems were processed into binary feature data by calculating pairwise distances between all points. Pairs of points within 2 distance units were translated to 1 (representing contact). When forming the feature vectors, x, we removed features, j, for which every sample showed the same result (all contacting or all disconnected). It is important to note that this removal changes the M seen by the algorithm, usually decreasing it well below P(P − 1)/2.
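The distance-to-bit-string preprocessing, including the removal of uninformative (constant) features, can be sketched as follows (a hypothetical helper, not the paper's code):

```python
import numpy as np

def contact_features(coords, cutoff=2.0):
    """Map a trajectory (N frames, P points, D dims) to binary contact
    features over all P(P-1)/2 pairs, then drop columns that are
    constant across all frames (all contacting or all disconnected)."""
    N, P, _ = coords.shape
    iu, ju = np.triu_indices(P, k=1)                           # all pairs j = (u, v)
    d = np.linalg.norm(coords[:, iu] - coords[:, ju], axis=-1)  # (N, P(P-1)/2)
    x = (d < cutoff).astype(int)
    keep = x.min(axis=0) != x.max(axis=0)                       # informative only
    return x[:, keep]

# Three points in 2D over two frames: pair (0,1) is always in contact
# and is dropped; point 2 approaches the others in the second frame.
coords = np.array([
    [[0.0, 0.0], [0.5, 0.0], [10.0, 0.0]],
    [[0.0, 0.0], [0.5, 0.0], [1.0, 0.0]],
])
x = contact_features(coords)
# x.tolist() -> [[0, 0], [1, 1]]
```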
Critically, we repeated these classifications for a range of material points, P. This tested the robustness of the unsupervised classification with respect to the number of features available. Adding more points without changing the geometry of the problem shouldn't change the number of categories detected.
For each run, five independent MCMC chains were started from an assignment of all points to a single category. Each chain ran for 1000 steps. Samples were collected every 10 steps, starting from step 500. The acceptance probabilities for category split/join Monte Carlo moves varied around 10-13%.

Results
We analyzed the results of MC in two different ways. First, the categories assigned were tested for grouping in time. Since the contact lists (on which the categorization is based) varied slowly over time, we expect categories to come in 'runs.' Second, we computed histograms over the number of categories, K. This is a strong test of the method's sensitivity to the number of material points, P.
As expected, we found a high degree of correlation between categories and time for every case. Similar time-points were grouped into similar categories. To quantify these results, we counted transitions between category indices in time-order. For a perfect categorization, the number of transitions should equal the number of categories minus one. We computed the categorization accuracy in two ways. For each system, the left columns of Tables 1 and 2 are 100 minus the percent of mis-categorized frames. For lossy categorization at time-boundaries, we expect oscillation between two values. We quantified this by forming a transition matrix between categories and removing transitions along the 'most likely path.' Excluding this boundary oscillation, we found nearly 100% accuracy for the categorizations. Those are shown in the right columns of Tables 1 and 2.
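One plausible reading of the left-column metric, counting only transitions in excess of the K − 1 that a perfect time-ordered labeling requires, is sketched below (an interpretation for illustration, not the paper's exact scoring code):

```python
import numpy as np

def time_order_accuracy(z):
    """Percent accuracy of a time-ordered labeling: the fraction of
    frame-to-frame transitions beyond the K - 1 a perfect time-ordered
    categorization needs is counted as error."""
    z = np.asarray(z)
    K = len(np.unique(z))
    switches = int((z[1:] != z[:-1]).sum())
    excess = max(switches - (K - 1), 0)   # transitions beyond the required K-1
    return 100.0 * (1.0 - excess / (len(z) - 1))

# A clean 3-category run in time-order scores 100.
assert time_order_accuracy([0, 0, 1, 1, 2, 2]) == 100.0
```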

Integrating the posterior probability (Eq. 2 times Eq. 7) over π leads to factors like Γ(N + αK)^{−1}, which seem prohibitively costly as N increases. We therefore wanted to check that the number of categories doesn't decrease as features or samples are added. Fig. 3 shows the sampled probability distributions over K, the number of categories, for increasing P (3a) and N (3b). Interestingly, for every system, about five conformational states were deduced at N = 1000. As N increased, however, more categories were deduced and the distribution spread to higher numbers. This is probably reasonable, since more values of the 'time' coordinate generated a more fine-grained motion.

Adenylate kinase open/closed transition
Finally, we tested the classification method against a protein with a well-characterized conformational transition.
Adenylate kinase (ADK) interconverts ATP, ADP, and AMP by closing around substrate molecules.[19] The transition from closed to open was simulated in Ref. [5] using steered molecular dynamics on a reaction coordinate interpolating between the electron density maps of PDB IDs 1AKE (Ref. [19], closed) and 4AKE (Ref. [2], open). The simulation data we used did not contain ligands, but did contain water and ions. Our analysis only made use of the alpha carbon (Cα) positions.
Features were calculated for each of N = 3900 equally spaced frames during steered dynamics by testing whether Cα-to-Cα distances were less than 5 Å. These structures contain P = 214 Cα atoms. Sampling was carried out as described in Sec. 3, but with 8 independent MC chains sampled for 1250 steps (instead of 5 chains for 1000 steps). The acceptance probability of split/join moves was 17%. During sampling, we saved the parameters, θ, possessing maximum likelihood values at each K.
We then extracted the conformational states with the highest probability of landing in each category as representative points for that category. Since the reference open and closed PDB conformations formed extreme points, our representative structures approached them more nearly as K increased. Fig. 4 shows that the two end-point conformations ended up very close to the open and closed states from the PDB.

Conclusion
The method developed here is ideally suited to the unsupervised structural classification problem. It has been derived from a first-principles Bayesian analysis of the set of atoms which interact within a structure. This work solved the central problem of defining an appropriate prior distribution over parameter space and implementing an efficient sampling method.
Tests on sample conformational transitions identified more categories than naïvely expected, because the method generated milestones along the motion's time-coordinate. However, all categorizations were shown to have excellent accuracy as judged by picking out the correct time-sequence. Finally, a test on adenylate kinase verified that these conclusions generalize readily to protein motions.
The central results of this work showed that the method behaves well under large variations in the number of features and samples. These observations validate our choice for the prior distribution, since changes in P(θ|I) would have large effects on the distribution over the number of categories, P(K|xI). They also show the robustness of the MC sampling method itself, since relatively high acceptance rates were achieved.
Many future applications and developments of this method are possible. Changes in the pair contacts between states could be analyzed more thoroughly, as in the DynDom method.[6] Probabilities of assignment to each conformational state can be used as reaction coordinates. We are presently applying the method to classify chemical compound space using binary MACCS fingerprints,[20] and to characterize the conformational space of SARS-CoV-2 proteins simulated using replica exchange molecular dynamics.[21] Now that the principle has been demonstrated, more informative classification schemes can be devised. Adding information like hydrogen bonding, salt-bridge formation, and secondary-structure annotation will make the Bayesian framework able to recognize categories more like a biochemist would. The method can also be focused on active domains and binding sites by adding more points and shorter distance cutoffs for key residues. The insensitivity to M shown in this work provides a high degree of confidence that any amount of additional data will improve the overall categorization without spoiling the classification already achieved using coarser-level data.


Choice of Prior

Eq. 14 provides the motivation for choosing α = M + 1. With this choice, the (N_k + 1)^M term cancels, leaving behind, essentially, a relative information entropy for the N_k. At this point, different choices for f(N, K) lead to differences in the probability distribution over K, the number of categories. Rather than directly influencing these, we have chosen not to create additional assumptions about f(N, K) beyond the above choices for α and β.

Sampling Method
Since π can be trivially integrated out of the posterior distribution (Eq. 9), we re-generate π on every split/join move and consider only the integrated probability P(zµ|xI) as our sampling target. Each time a split/join move is generated, a re-categorization of the samples in category k ({x_k}) into {x_L} and {x_R} is immediately followed by generation of µ_L and µ_R (or µ_k for a join) from their respective Beta distributions (with counts N_Lj and N_L − N_Lj, etc.).
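Regenerating µ_L from its Beta distribution can be sketched as follows (a uniform Beta(1, 1) per-feature prior is assumed here for illustration; the paper's actual prior parameters may differ):

```python
import numpy as np

def resample_mu(N_kj, N_k, rng):
    """Draw new contact probabilities for one category from its Beta
    posterior, Beta(N_kj + 1, N_k - N_kj + 1) per feature.
    NOTE: the uniform Beta(1, 1) prior is an assumption made for this
    sketch; substitute the prior actually used in the inference."""
    N_kj = np.asarray(N_kj)
    return rng.beta(N_kj + 1, N_k - N_kj + 1)

rng = np.random.default_rng(2)
mu_L = resample_mu([5, 0, 9], N_k=10, rng=rng)
```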
We set up the acceptance probability for these split/join moves to satisfy the detailed balance condition,

P_acc(z′µ′|zµx) / P_acc(zµ|z′µ′x) = [P(z′µ′|xI) P_gen(zµ|z′µ′x)] / [P(zµ|xI) P_gen(z′µ′|zµx)].

Without loss of generality, assume z′ contains K + 1 categories and z contains K. We already have explicit formulas for P(zθ|xI) (modulo P(x|I)), so the only things needed to calculate P_acc are the move generation probabilities.
Given that category k and feature j were chosen to define the split (see the main text for a description of this process), the recategorization probability is,

P_gen(z′|z, kj) = (1/2) ∏_{i : z_i = k} q_i^{s_i} (1 − q_i)^{1 − s_i} / (1 − p_L − p_R),   q_i = η^{x_ij} (1 − η)^{1−x_ij},

where s_i = 1 if member i moved to set L and s_i = 0 if it moved to R. The factor of 1/2 is necessary because the labels L and R are interchangeable. p_L (p_R) is the expression in the numerator when all members of k move to L (R); dividing by 1 − p_L − p_R accounts for redrawing the partition whenever one side ends up empty.
Because any j could have been chosen, the overall probability of ending up with categories L and R is,

P_gen(z′|z) = Σ_j Q_kj P_gen(z′|z, kj) / Σ_{j′} Q_{kj′}.

To concentrate splitting on productive cases, we choose Q_kj to be one for all k, j, unless N_kj = 0 or N_kj = N_k. In those cases Q_kj is zero, so that we do not split using a non-informative feature. Generating µ_L and µ_R adds the factor,

P_gen(z′µ′|zx) = P_gen(z′|z) P({x_L}|µ_L) (N_L + 1)^M P({x_R}|µ_R) (N_R + 1)^M e^{Σ_j [S_inf(Lj) + S_inf(Rj)]}.
Here, S_inf is defined so that most of this term cancels P(zµ|xI) in the final expression for the acceptance probability. The recombination rule is attempted with frequency 1 − ξ(K + 1). A set u is chosen at random with probability 1/(K + 1) and recombined with a set v ≠ u chosen with probability 1/K, so that each unordered pair is proposed with probability 2/(K(K + 1)). The resulting generation probability is,

P_gen(z|z′x) = (1 − ξ(K + 1)) · 2 / (K(K + 1)).

Figure 1 :
Figure 1: Classification applied to a 4-site structure. Pairs of contacting points are mapped to binary fingerprints, x. The bottom images show the 2-category classification result for the inputs shown. Our category split MC move using (1:3) [or (1:4)] succeeds here because those contacts are correlated with others. The small positive probability for 1:3 contacts hints that the left structures should be lumped with conformation 1.

Figure 3 :
Figure 3: Counts (out of 250 samples each) of the number of categories, K, as P or N is varied. The shape of the distribution function remains essentially constant as P varies, but tends to spread toward higher K with more samples.

Figure 4 :
Figure 4: Adenylate kinase (ADK) closed (a) and open (b) configurations. PDB structures (magenta) are compared to the most similar molecular dynamics classification results (green) found among the five representative structures for the K = 5 classification.

Table 1 :
Categorization accuracy for each system type and number of points, P. The left column for each system is the percent of category assignments identical to the previous time point (not counting the K − 1 required transitions). The right column for each is the percent of category assignments that do not switch between runs.

Table 2 :
Categorization accuracy for each test system as the sample number, N, varies. The number of points remains fixed at P = 222. Column labels are as in Table 1.