Protein Conformational States—A First Principles Bayesian Method

Automated identification of protein conformational states from simulation of an ensemble of structures is a hard problem because it requires teaching a computer to recognize shapes. We adapt the naïve Bayes classifier from the machine learning community for use on atom-to-atom pairwise contacts. The result is an unsupervised learning algorithm that samples a ‘distribution’ over potential classification schemes. We apply the classifier to a series of test structures and one real protein, showing that it identifies the conformational transition with >95% accuracy in most cases. A nontrivial feature of our adaptation is a new connection to information entropy that allows us to vary the level of structural detail without spoiling the categorization. This is confirmed by comparing results as the number of atoms and time-samples are varied over 1.5 orders of magnitude. Further, the method’s derivation from Bayesian analysis on the set of inter-atomic contacts makes it easy to understand and extend to more complex cases.


Introduction
The conventional description of protein dynamics asserts that proteins possess intrinsic conformational states [1]. An enzyme may cycle between catalytic and open states [2]. An ion channel may open and close its central pore [3]. A chaperone protein assists the transformation of large, hydrophobic proteins from initial, linear shapes to folded ones [4]. X-ray crystallography and cryo electron-microscopy reveal conformations differing by small motions, on the 1-5 Ångstrom level, for those proteins that crystallize [5]. Neutron scattering and nuclear magnetic resonance structures of room-temperature proteins show greater shape variability, but can usually still be classified into a few 'canonical' structures.
Advances in molecular modeling have made it possible to simulate the protein folding process, generating very large numbers of samples, free energy landscapes, and information on kinetics. Nevertheless, computational identification of distinct conformational states from molecular simulations has remained an active area of methodological research. Approaches fall roughly into two classes: linear and nonlinear. Linear methods work with the real 3N-dimensional vector of atomic coordinates as a linear space, using pairwise alignments and Euclidean (RMSD) distances between structures. Nonlinear methods work with 'features' derived from nonlinear functions of the atomic coordinates, like pairwise distances between alpha-carbons.

Our model assigns each sample, i, to a category, z_i, and then independently decides whether each point-to-point contact (x_ij, j = 1, ..., M) is made with probability µ_(z_i)j. If contact j is made in sample number i, then x_ij = 1, and the bit-vector, x_i, is said to possess feature j. Otherwise x_ij = 0, and the feature is absent. The model parameters are thus θ = (K, π, µ).
This model leads to the sampling distribution,

P(xz|θ) = ∏_i π_(z_i) ∏_j µ_(z_i)j^(x_ij) (1 − µ_(z_i)j)^(1 − x_ij)
        = ∏_k π_k^(N_k) ∏_j µ_kj^(N_kj) (1 − µ_kj)^(N_k − N_kj).

The second line notes that, once the categorization, z, is known, the sampling distribution is easy to express in terms of feature counts over {i : z_i = k}, the set of samples assigned to category k:

N_k = |{i : z_i = k}| and N_kj = ∑_{i : z_i = k} x_ij.

The first is the number of samples in set k, and the second is the number of times each contact is seen in that set. According to Bayes' theorem [19], we can turn this around to predict two important things: the probability that sample i belongs to category k,

P(z_i = k | x_i θ) ∝ π_k ∏_j µ_kj^(x_ij) (1 − µ_kj)^(1 − x_ij),

(read z given x and θ), and also the probability distribution over all possible parameters,

P(zθ|xI) = C(x) P(xz|θ) P(θ|I),

where C(x) is an x-dependent normalization constant. Sampling this distribution provides everything: the categorizations, z, the conformational states, π, µ, and even a predicted number of categories, K.
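As a concrete illustration of the count statistics above, the following minimal sketch (variable and function names are illustrative, not the authors' code) computes N_k, N_kj, and the resulting log sampling probability for a toy bit-matrix:

```python
import numpy as np

def count_statistics(x, z, K):
    """Per-category counts: N_k (samples in category k) and
    N_kj (times feature j is present within category k)."""
    N_k = np.array([np.sum(z == k) for k in range(K)])          # shape (K,)
    N_kj = np.array([x[z == k].sum(axis=0) for k in range(K)])  # shape (K, M)
    return N_k, N_kj

def log_likelihood(x, z, pi, mu):
    """log P(xz|theta) for the Bernoulli mixture, expressed
    through the feature counts as in the second line above."""
    K = len(pi)
    N_k, N_kj = count_statistics(x, z, K)
    eps = 1e-12  # guard against log(0)
    return (np.sum(N_k * np.log(pi))
            + np.sum(N_kj * np.log(mu + eps)
                     + (N_k[:, None] - N_kj) * np.log(1.0 - mu + eps)))

# Tiny example: 4 samples, M = 3 features, K = 2 categories
x = np.array([[1, 0, 1], [1, 0, 0], [0, 1, 1], [0, 1, 1]])
z = np.array([0, 0, 1, 1])
pi = np.array([0.5, 0.5])
mu = np.array([[0.9, 0.1, 0.5], [0.1, 0.9, 0.9]])
print(log_likelihood(x, z, pi, mu))
```

Because the likelihood factorizes over categories and features, these two count arrays are sufficient statistics for the whole inference.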
In Bayesian probability, a prior distribution has to be assumed by the researcher. The prior characterizes the parameter space, independently from any sampled data. Our prior distribution over parameters, introduced below, is P(θ|I). Since the parameters directly determine the sampling distribution, the prior does not affect it [P(xz|θ) = P(xz|θI)]. Note that this work moves between two different priors, I and U, because the inference problem is simpler using P(θ|U), while P(θ|I) eliminates redundant solutions.
We choose a prior probability,

P(θ|I) ∝ P(θ|U) ∏_{k<l} (1 − B_kl),

that enforces uniqueness of the categories. Here B_kl = B(µ_k, µ_l) is the Bhattacharyya similarity between the contact distributions of two categories,

B(p, q) = ∏_j [ √(p_j q_j) + √((1 − p_j)(1 − q_j)) ].

If two categories share the same distribution, then p = q, and B(p, q) = 1. This forces our estimator to return zero likelihood that µ_k = µ_l for any k ≠ l.
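A sketch of this similarity, assuming the per-feature Bernoulli form written above (names are illustrative):

```python
import numpy as np

def bhattacharyya(p, q):
    """Bhattacharyya similarity between two vectors of Bernoulli
    contact probabilities; equals 1 iff p == q elementwise."""
    return float(np.prod(np.sqrt(p * q) + np.sqrt((1 - p) * (1 - q))))

p = np.array([0.9, 0.1, 0.5])
q = np.array([0.1, 0.9, 0.5])
print(bhattacharyya(p, p))  # identical distributions give 1
print(bhattacharyya(p, q))  # < 1 for distinct categories
```

The prior factor 1 − B_kl then vanishes exactly when two category prototypes coincide, which is what removes redundant solutions.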
The second part of Equation (7), P(θ|U), is a conventional prior used for Bernoulli mixture inference in the literature. We use α = M + 1 throughout this work. Appendix A.1 contains a detailed justification for this choice. Essentially, it forces the likelihood for dividing a category into two parts to be asymptotically insensitive to the number of features, M. A proof of this fact, as well as a useful connection to the information entropy of compression is also present in Appendix A.1.

Sampling Method
Our sampling method is traditional Markov Chain Monte Carlo using four types of moves: a recategorization move, where categories, z, are assigned according to P(z|θxI); a reclassification move, where θ is sampled from P(θ|zxU) and accepted with probability ∏_{k<l} (1 − B_kl); and one split and one join rule. The function P(θ|zxU) referred to here is just P(θ|zxI) without the Bhattacharyya distance terms, and with a different constant prefactor.
For the split trial move, one of the categories, k, is split into two new categories. Every member of k is re-assigned to one of two new sets, labeled L or R. Join moves are the opposite of split moves. This re-categorization changes the set labels, z. Specifically, z_i goes from k to L or R for every i formerly at z_i = k.
Split or join trial moves were generated by randomly choosing either one category to split or two categories to join. For splits, a feature, j, is chosen to define the split, and member i of category k is moved to set L with probability η if x_ij = 1, and with probability 1 − η otherwise. We used η = 0.9, but any η ∈ (0.5, 1.0) should work in principle. If all elements end up in L or R, the partitioning is re-done. To concentrate splitting on productive cases, we did not attempt to split categories with N_kj = 0 or N_kj = N_k. Immediately after splitting or joining categories, a reclassification move (re-assigning θ) was performed. Category split moves were accepted using the Metropolis criterion, min(1, P_gen,join P_split / (P_gen,split P_joined)). Explicit formulas for the move generation probabilities (P_gen,split, etc.) are provided in Appendix A.2.

Figure 1 provides a graphical summary of this inference scheme. Each conformational sample is mapped to a bit-string, which is used as the basis for inferring µ. Inference proceeds by sampling potential parameters until a good explanation for the data is found. Trial moves that re-categorize and update µ look horizontally to find better category prototypes. Trial moves that split categories based on presence or absence of some features allow us to traverse category space vertically.

Figure 1. Classification applied to a 4-site structure. Five conformers (with frequencies shown) are input to the method. Bit-strings, x, for each conformer represent pairs of contacting residues (black squares correspond to dashed lines). Sampling the posterior distribution over (K, π, µ) provides a maximum-likelihood categorization at each K. We show both category prototypes, µ, determined by the algorithm for the K = 2 classification. Darker shading in µ represents contacts with higher probability of forming.
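A minimal sketch of the split proposal (not the published implementation; it assumes the η rule acts through a single chosen feature j, and all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def propose_split(x, members, j, eta=0.9):
    """Propose splitting a category on feature j: members with the
    feature present go to L with probability eta, members with it
    absent go to L with probability 1 - eta. If every member lands
    on one side, the partitioning is re-done."""
    members = np.asarray(members)
    while True:
        u = rng.random(len(members))
        go_left = np.where(x[members, j] == 1, u < eta, u < 1 - eta)
        if 0 < go_left.sum() < len(members):
            return members[go_left], members[~go_left]

# 6 samples, 1 feature; feature 0 cleanly separates the first three
x = np.array([[1], [1], [1], [0], [0], [0]])
L, R = propose_split(x, np.arange(6), j=0)
print(sorted(L.tolist()), sorted(R.tolist()))
```

With η close to 1, the proposal usually mirrors the chosen feature, so splits land near plausible category boundaries while remaining reversible.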

Test Systems
The ability of our sampling procedure to predict categories was tested on three geometrical systems: 'chomp' (a closing angle), 'heli' (a rotating line), and 'glob' (three rotating spheres). Each system was generated as a time-series of 1000 frames for approximately P total particles in 2- or 3-dimensional space. After generation, Gaussian random noise of width σ = 0.1 was added to every degree of freedom. Figure 2 shows images of these three test trajectories. A complete description of the coordinate generation methods is given in Appendix A.3. All three systems were processed into binary feature data by calculating pairwise distances between all points. Pairs of points within 2 distance units were assigned x_ij = 1 (representing contact). When forming the feature vectors, x, we removed features, j, for which every sample showed the same result (all contacting or all disconnected). It is important to note that this removal changes the M seen by the algorithm, usually decreasing it well below P(P − 1)/2.

Figure 2. Sketch of the three synthetic motions used for testing. Lines, spheres, and colors are used to guide the eye, but the classification engine sees only the unlabeled points. The time axis proceeds left to right. For the 'glob' system, the sphere on the right moves behind, and then back to the right again while the top sphere moves slowly downward.
Critically, we repeated these classifications for a range of material point counts, P. This tested the robustness of the unsupervised classification with respect to the number of features available. Adding more points without changing the geometry of the problem should not change the number of categories detected.
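The contact-feature construction described above (distance cutoff, then removal of constant columns) can be sketched as follows; function and variable names are illustrative:

```python
import numpy as np

def contact_features(traj, cutoff=2.0):
    """Map a trajectory (N frames, P points, D dims) to binary
    pairwise-contact features, dropping constant columns."""
    N, P, _ = traj.shape
    iu, ju = np.triu_indices(P, k=1)
    # pairwise distances for every frame, upper triangle only
    d = np.linalg.norm(traj[:, iu] - traj[:, ju], axis=-1)  # (N, P(P-1)/2)
    x = (d < cutoff).astype(np.uint8)
    keep = (x.min(axis=0) != x.max(axis=0))  # drop always-on / always-off
    return x[:, keep]

# Two frames of 3 collinear points; only the (0, 2) pair changes state
traj = np.array([[[0, 0], [1, 0], [2.5, 0]],
                 [[0, 0], [1, 0], [1.5, 0]]], dtype=float)
x = contact_features(traj)
print(x.shape)  # only 1 of the 3 pair columns survives
```

Dropping the constant columns is exactly what makes the effective M depend on the motion rather than on P(P − 1)/2.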
For each run, five independent MCMC chains were started from an assignment of all points to a single category. Each chain ran for 1000 steps. Samples were collected every 10 steps, starting from step 500. The acceptance probabilities for category split/join Monte Carlo moves were around 10-13%.

Results
We analyzed the results of MC in two different ways. First, the categories assigned were tested for grouping in time. Since the contact lists (on which the categorization is based) varied slowly over time, we expect categories to come in 'runs'. Second, we computed histograms over the number of categories, K. This is a strong test of the method's sensitivity to the number of material points, P.
As expected, we found a high degree of correlation between categories and time for every case. Similar time-points were grouped into similar categories. To quantify these results, we counted transitions between category indices in time-order. For a perfect categorization, the number of transitions should equal the number of categories minus one. We computed the categorization accuracy in two ways. For each system, the left columns of Tables 1 and 2 report 100 minus the percent of mis-categorized frames. For lossy categorization at time-boundaries, we expect oscillation between two values. We quantified this by forming a transition matrix between categories and removing transitions along the 'most likely path'. Excluding this boundary oscillation, we found nearly 100% accuracy for the categorizations. Those results are shown in the right columns of Tables 1 and 2.

Integrating the posterior probability (Equation (2) times Equation (7)) over π leads to factors like Γ(N + αK)^−1, which might be expected to suppress large K as N increases. We therefore wanted to check that the number of categories does not decrease as features or samples are added. Figure 3 shows the sampled probability distributions over K, the number of categories, for increasing P (Figure 3a) and N (Figure 3b). Interestingly, for every system, about five conformational states were deduced at N = 1000. As N increased, however, more categories were deduced and the distribution spread to higher numbers. This is probably reasonable, since more values of the 'time' coordinate generated a more fine-grained motion.
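The time-ordering check described above reduces to counting category changes along the trajectory; a minimal sketch (illustrative names):

```python
import numpy as np

def count_transitions(labels):
    """Number of category changes along the time axis; a perfect
    categorization into K contiguous runs gives K - 1 transitions."""
    labels = np.asarray(labels)
    return int(np.sum(labels[1:] != labels[:-1]))

z_clean = [0, 0, 0, 1, 1, 2, 2, 2]   # 3 clean runs
z_noisy = [0, 0, 1, 0, 1, 1, 2, 2]   # boundary oscillation adds extras
print(count_transitions(z_clean), count_transitions(z_noisy))
```

Any excess over K − 1 transitions flags either mis-categorized frames or the two-value oscillation at time-boundaries discussed above.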

Adenylate Kinase Open/Closed Transition
Finally, we tested the classification method against a protein with a well-characterized conformational transition. Adenylate kinase (ADK) interconverts ATP, ADP, and AMP by closing around its substrate molecules [20]. The transition from closed to open was simulated in Reference [5] using steered molecular dynamics on a reaction coordinate interpolating between the electron density maps of PDB IDs 1AKE (Reference [20], closed) and 4AKE (Reference [2], open). The simulation data we used did not contain ligands, but did contain water and ions. Our analysis only made use of the alpha carbon (Cα) positions.
Features were calculated for each of N = 3900 equally spaced frames during steered dynamics by testing whether Cα to Cα distances were less than 5 Å. Each structure contains P = 214 Cα atoms. Sampling was carried out as described in Section 3, but 8 independent MC chains were sampled for 1250 steps (instead of 5 chains for 1000 steps). The acceptance probability of split/join moves was 17%. During sampling, we saved the parameters, θ, possessing maximum likelihood values at each K.
Our implementation (Reference [18]) is parallelized so that each thread carries out an independent Monte Carlo chain. The run-time for the ADK example was less than 30 min. Larger runs on proteins up to 400 residues using 75,000 sampled protein conformations have been carried out using this method in under an hour. The implementation also contains a series of visualization tools to highlight the protein regions responsible for detected conformation-to-conformation differences.
We then extracted the conformational states with the highest probability of landing in each category as representative points for that category. Since the reference open and closed PDB conformations formed extreme points, our representative structures approached them more closely as K increased. Figure 4 shows that the two end-point conformations ended up very close to the open and closed states from the PDB.

Conclusions
The method developed here is ideally suited for the unsupervised structural classification problem. It has been derived from a first-principles Bayesian analysis of the set of atoms which interact within a structure. This work solved the central problem of defining an appropriate prior distribution over parameter space and implementing an efficient sampling method.
Tests on sample conformational transitions identified more categories than naïvely expected, because the method generated milestones along the motion's time-coordinate. However, all categorizations were shown to have excellent accuracy, as judged by picking out the correct time-sequence. Finally, a test on adenylate kinase verified that these conclusions generalize readily to protein motions.
The central results of this work showed that the method behaves well under large variations in the number of features and samples. These observations validate our choice for the prior distribution, since changes in P(θ|I) would have large effects on the distribution over the number of categories, P(K|xI). They also show the robustness of the MC sampling method itself, since relatively high acceptance rates were achieved.
Many future applications and further development of this method are possible. Changes in the pair contacts between states could be analyzed more thoroughly, as in the DynDom method [7]. Probabilities of assignment to each conformational state can be used as reaction coordinates. We are presently applying the method to classify chemical compound space using binary MACCS fingerprints [21], and to characterize the conformational space of SARS-CoV-2 proteins simulated using replica exchange molecular dynamics [22]. Now that the principle has been demonstrated, more informative classification schemes can be devised. Adding information like hydrogen-bonding, salt bridge formation, and secondary structure annotation will allow the Bayesian framework to recognize categories more like a biochemist would. The method can also be focused on active domains and binding sites by adding more points and using shorter distance cutoffs for key residues. The insensitivity to M shown in this work provides a high degree of confidence that any amount of additional data will improve the overall categorization without spoiling the classification already achieved using coarser-level data.
Supplementary Materials: The source code for this classification method, along with the code to reproduce the test calculations in this paper, is publicly available from Reference [18].

Appendix A.1. Choice of the Prior Parameters α and β

Integrating π out of the posterior distribution (Equation (A1)) leaves the factor

f(N, K) = Γ(αK) / (Γ(α)^K Γ(N + αK))   (for β = 1).

The probability per category on the far right can be likened to an information entropy, but is offset because of the limit,

lim_{n→∞} Γ(n + a) / (Γ(n) n^a) = 1.   (A4)

Inserting this limit on the right provides the motivation for choosing β = 1. This choice for β makes the categorization closer to an information entropy, depending only on the ratio N_kj/N_k. The categorization probability is now asymptotically like an information entropy, except for an extra factor of 1/(N_k + 1). This factor is present for every feature, and creates an undesired dependence on M. We can cancel that factor by appealing to Equation (A4), this time using n = N_k + 1 and a = α − 1 to approximate Γ(N_k + α). The result (Equation (A6)) provides the motivation for choosing α = M + 1. With this choice, the (N_k + 1)^M term cancels, leaving behind a relative information entropy for the N_k. At this point, different choices for f(N, K) lead to differences in the probability distribution over K, the number of categories. Rather than directly influencing these, we have chosen not to create additional assumptions about f(N, K) beyond the above choices for α and β.

Appendix A.2. Sampling Method
Since π can be trivially integrated out of the posterior distribution (Equation (A1)), we re-generate π on every split/join move and consider only the integrated probability P(zµ|xI) as our sampling target. Each time a split/join move is generated, a re-categorization of the samples in category k ({x_k}) into {x_L} and {x_R} is immediately followed by generation of µ_L and µ_R (or µ_k for a join) from their respective Beta distributions (with counts N_Lj and N_L − N_Lj, etc.).
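The Beta re-generation step can be sketched as follows, assuming the flat β = 1 per-feature prior motivated in Appendix A.1 (names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def resample_mu(x_k):
    """Redraw one category's contact probabilities from their Beta
    posteriors: Beta(N_kj + 1, N_k - N_kj + 1) under a flat
    Beta(1, 1) per-feature prior (i.e. beta = 1)."""
    N_k = x_k.shape[0]
    N_kj = x_k.sum(axis=0)
    return rng.beta(N_kj + 1, N_k - N_kj + 1)

x_k = np.array([[1, 0, 1], [1, 0, 0], [1, 0, 1]])  # 3 samples in category k
mu_k = resample_mu(x_k)
print(mu_k)  # feature 0 tends toward 1, feature 1 toward 0
```

Because the per-feature posteriors are independent Beta distributions, this draw is exact and costs O(M) per category.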
We set up the acceptance probability for these split/join moves to satisfy the detailed balance condition,

P_acc(z′µ′|zµx) / P_acc(zµ|z′µ′x) = [P(z′µ′|xI) P_gen(zµ|z′µ′x)] / [P(zµ|xI) P_gen(z′µ′|zµx)].   (A8)

Without loss of generality, assume z′ contains K + 1 categories and z contains K. We already have explicit formulas for P(zθ|xI) (modulo P(x|I)), so the only things needed to calculate P_acc are the move generation probabilities.
Given that category k and feature j were chosen to define the split (see the main text for a description of this process), the recategorization probability contains a factor of 1/2, which is necessary because the labels L and R are interchangeable; p_L (p_R) is the expression in the numerator when all samples move from k to L (R).
Because any j could have been chosen, the overall probability of ending up with categories L and R is a sum over the possible choices of j. To concentrate splitting on productive cases, we choose Q_kj to be one for all k, j unless N_kj = 0 or N_kj = N_k. In those cases Q_kj is zero, so that we do not split using a non-informative feature.
Generating µ_L and µ_R adds a corresponding factor, written using an intuitive definition of the per-feature information entropy, S_inf(kj). Most of this term will cancel P(zµ|xI) in the final expression for the acceptance probability. The recombination rule is attempted with frequency 1 − ξ(K + 1). A set, u, is chosen at random with probability 1/(K + 1) and recombined with a set v ≠ u, so that the unordered pair is generated with probability 2/(K(K + 1)). Generating µ_k then adds the factor,

P_gen(zµ|z′x) = P_gen(z|z′x) P({x_k}|µ_k) (N_k + 1)^M e^(∑_j S_inf(kj)).
As a final aid to computing the acceptance probability ratio (Equation (A8)), we provide a simplified expression for the likelihood ratio.

Appendix A.3. Coordinate Trajectory Generation for Synthetic Test Systems
The 'chomp' system was 2D, with two line segments, both of length 4 units. The bottom segment (P/2 uniformly spaced particles) stretched along the x-axis and remained stationary. The top segment (P/2 − 1 uniformly spaced particles, since the origin is excluded) moved at angle θ(t) = (π/4) cos(t), t ∈ [0, π]. This motion mimics a harmonic oscillator, which spends the majority of its time at the two limits of its motion.
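Under the stated parameters, the 'chomp' coordinates could be generated as follows (a sketch; the function and variable names are illustrative, not the published code):

```python
import numpy as np

def chomp(P=20, frames=1000, sigma=0.1, seed=0):
    """Generate the 2D 'chomp' trajectory: a fixed bottom segment plus
    a top segment hinged at the origin, swinging with angle
    theta(t) = (pi/4) cos(t), with Gaussian noise of width sigma."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, np.pi, frames)
    r_bot = np.linspace(0.0, 4.0, P // 2)       # stationary segment
    r_top = np.linspace(0.0, 4.0, P // 2)[1:]   # origin excluded
    theta = (np.pi / 4) * np.cos(t)
    frames_xy = []
    for th in theta:
        bottom = np.stack([r_bot, np.zeros_like(r_bot)], axis=1)
        top = np.stack([r_top * np.cos(th), r_top * np.sin(th)], axis=1)
        frames_xy.append(np.vstack([bottom, top]))
    traj = np.array(frames_xy)
    return traj + rng.normal(0.0, sigma, traj.shape)

traj = chomp()
print(traj.shape)  # (frames, P - 1, 2) since the hinge point is excluded
```

The cos(t) schedule concentrates frames near the two angular extremes, which is what produces the two dominant conformational states for this system.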
The 'heli' system was 3D. It also had two line segments with P/2 equally spaced particles each. The bottom segment remained as before, while the top segment was replaced by P/2 particles displaced from the bottom one by 1 distance unit along z. The top segment was rotated in the xy plane with θ(t) = t, t ∈ [0, π].
The 'glob' system was 3D, consisting of 3 spheres, each represented by P/3 points at radius 2. The point locations were taken from Lebedev quadrature rules, which possess octahedral symmetry. Sphere 1 remained fixed at the origin. Spheres 2 and 3 started in locations displaced along z or x, respectively, by 2.1 distance units. As t traversed [0, π], sphere 2 was rotated about the origin around the x-axis by angle θ_2(t) = (π/4)(1 − cos(t)), while sphere 3 was rotated about the origin around the z-axis by angle θ_3(t) = (π/4)(1 − cos(2t)). This motion creates something like three conformational states. Since sphere 3 moves twice as fast as sphere 2, sphere 2 can be at +z or −y while sphere 3 is at +x. A third conformation is at the center where sphere 3 is at +y.