# Toward Inferring Potts Models for Phylogenetically Correlated Sequence Data


## Abstract


## 1. Introduction

## 2. Methods

- The first is that the propagator $P({\underline{A}}^{2}|{\underline{A}}^{1},\Delta t)$ associated with the Potts model is not known a priori. Many distinct microscopic dynamics can lead to the same equilibrium distribution, and the exact evolutionary processes underlying correlated protein evolution are not known. Even if we assumed some dynamics, computing the propagator for an arbitrary time difference $\Delta t$ would require summing over all possible evolutionary trajectories from ${\underline{A}}^{1}$ to ${\underline{A}}^{2}$, which is intractable in practice.
- The second problem is that each use of the recursion relation (5) involves a summation over all possible sequences for each child node of node n. This amounts to summing over $21^{L}$ terms each time, where L is the sequence length.
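
To make the cost of the recursion concrete, the sketch below applies it to a single site, where the sum over child states has only 21 terms instead of $21^{L}$. It is a minimal illustration, not the authors' implementation: the propagator is assumed to take a Felsenstein-type form, $P(b|a,\Delta t)=e^{-\mu \Delta t}\delta_{ab}+(1-e^{-\mu \Delta t})\,\omega(b)$, and the nested-dictionary tree encoding, the function names, and the parameters `mu` and `omega` are hypothetical choices made for this example.

```python
import numpy as np

Q = 21  # alphabet size: 20 amino acids plus the gap symbol


def propagator(omega, mu, dt):
    """Assumed single-site propagator, P[a, b] = P(b | a, dt).

    With probability exp(-mu * dt) the state is kept; otherwise a new
    state is drawn from the stationary distribution omega.
    """
    decay = np.exp(-mu * dt)
    return decay * np.eye(len(omega)) + (1.0 - decay) * omega[None, :]


def pruning(node, omega, mu):
    """Return the vector L^n(a) for the subtree rooted at `node`.

    Hypothetical tree encoding:
      leaf          -> {"state": int}
      internal node -> {"children": [(child_node, branch_length), ...]}
    """
    if "state" in node:
        # A leaf is observed: the likelihood is an indicator on its state.
        lik = np.zeros(len(omega))
        lik[node["state"]] = 1.0
        return lik
    lik = np.ones(len(omega))
    for child, dt in node["children"]:
        # Sum over the child's states b of P(b | a, dt) * L^m(b).
        lik *= propagator(omega, mu, dt) @ pruning(child, omega, mu)
    return lik


def site_likelihood(root, omega, mu):
    """Likelihood of the leaf states, averaging the root over omega."""
    return float(omega @ pruning(root, omega, mu))


omega = np.full(Q, 1.0 / Q)
tree = {"children": [
    ({"state": 3}, 0.5),
    ({"children": [({"state": 3}, 0.2), ({"state": 7}, 0.4)]}, 0.3),
]}
print(site_likelihood(tree, omega, mu=1.0))
```

Each internal node costs one matrix-vector product per child, so the single-site recursion stays cheap; the full-sequence version would need the same product over a state space of size $21^{L}$.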

#### 2.1. Approximating Dynamics: Independent-Site Evolution

#### 2.2. Approximating Dynamics: Independent-Pair Evolution

#### 2.3. Optimization: Maximizing the Likelihood

#### 2.4. From Corrected Frequencies to DCA Models

## 3. Results

#### 3.1. Design of a Toy Model

#### 3.2. Artificial Data

#### 3.3. Phylogenetic Inference Corrects the One- and Two-Point Statistics

#### 3.4. DCA Parameters are Recovered with Increased Accuracy

#### 3.5. Improvement in the Prediction of Single Mutant’s Energies

#### 3.6. Results on Protein Data

## 4. Discussion

## Supplementary Materials

## Author Contributions

## Funding

## Acknowledgments

## Conflicts of Interest

## Abbreviations

| Abbreviation | Meaning |
|---|---|
| DCA | Direct Coupling Analysis |
| MCMC | Markov Chain Monte Carlo |
| MSA | Multiple Sequence Alignment |
| PPV | Positive Predictive Value |


**Figure 1.** Homologous proteins constituting a multiple-sequence alignment (MSA) are related by common ancestors through a phylogenetic tree.

**Figure 2.** (**A**) Illustration of Equation (5): ${\mathcal{L}}^{n}(\underline{A})$, as represented on the left, is the probability of observing all sequences in the MSA that have node n as common ancestor, given the sequence $\underline{A}$ of this ancestor. This probability can be decomposed into a product over the contributions of node n’s children $m\in \mathcal{C}(n)$. For each child m, we have to consider the propagator $P(\underline{B}\,|\,\underline{A},\Delta {t}^{m})$ from n to m, times the probability ${\mathcal{L}}^{m}(\underline{B})$ associated with the subtree rooted in m, summed over all possible configurations $\underline{B}$ of m. Note that the sum over each child can be done independently; therefore, Felsenstein’s algorithm runs in linear time in the number of internal nodes. (**B**) Measuring Hamming distances and time separations between sequences: thanks to the stationary dynamics of Felsenstein’s model, the time dependence of the Hamming distance between a parental and a child configuration can be estimated from observed leaf configurations. To this end, for any two leaves ${\underline{A}}^{m}$ and ${\underline{A}}^{n}$, we determine the Hamming distance ${d}_{H}({\underline{A}}^{m},{\underline{A}}^{n})$ and the time separation $\Delta t$, the latter by summing the lengths of all branches on the connecting path. Time binning and averaging are used to estimate the curve ${\overline{d}}_{H}(\Delta t)$.
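
The binning and averaging step described in panel (B) of Figure 2 can be sketched as follows. This is only an illustration under stated assumptions: the leaf sequences are taken to be integer-encoded, the pairwise time separations (sums of branch lengths along the connecting paths) are assumed to be precomputed in a matrix `dt`, and the function name and the equal-width binning are hypothetical choices, not taken from the article.

```python
import numpy as np


def mean_hamming_vs_time(seqs, dt, n_bins=10):
    """Estimate the mean Hamming distance as a function of time separation.

    seqs : (M, L) integer array of leaf sequences.
    dt   : (M, M) symmetric array of path lengths between leaves.
    Returns bin centers and the mean Hamming distance per bin
    (NaN for bins that contain no pair of leaves).
    """
    seqs = np.asarray(seqs)
    dt = np.asarray(dt, dtype=float)
    i, j = np.triu_indices(len(seqs), k=1)        # all unordered leaf pairs
    d_h = (seqs[i] != seqs[j]).sum(axis=1)        # Hamming distance per pair
    t = dt[i, j]                                  # time separation per pair

    # Equal-width bins spanning the observed time separations.
    edges = np.linspace(t.min(), t.max(), n_bins + 1)
    which = np.clip(np.digitize(t, edges) - 1, 0, n_bins - 1)
    centers = 0.5 * (edges[:-1] + edges[1:])

    sums = np.bincount(which, weights=d_h, minlength=n_bins)
    counts = np.bincount(which, minlength=n_bins)
    mean = np.full(n_bins, np.nan)
    filled = counts > 0
    mean[filled] = sums[filled] / counts[filled]
    return centers, mean
```

The sketch materializes all leaf pairs at once, which is convenient for small alignments; for a large MSA one would presumably subsample pairs or process them in chunks.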

**Figure 3.** Result of the single-site phylogenetic inference for $\mu \,L\,\Delta t=3$. (**A**) Single-site statistics of a sample of ${P}^{0}$ coming from a tree, before (“Tree”) and after (“Inferred”) phylogenetic inference, against “true” single-site statistics coming from the fair i.i.d. sample. (**B**) Slope of the linear regression and Pearson correlation corresponding to the plot in panel (A), for the 30 repetitions of the experiment. The black-circled points correspond to the sample displayed in panel (A).

**Figure 4.** Result of the pairwise phylogenetic inference for $\mu \,L\,\Delta t=3$. (**A**) Pairwise frequencies ${f}_{ij}(a,b)$ of a sample of ${P}^{0}$ coming from a tree, before (“Tree”) and after (“Inferred”) the phylogenetic inference, against “true” pairwise frequencies coming from the fair sample. (**B**) Slope of the linear regression and Pearson correlation corresponding to the plot in panel (A), for the 30 repetitions of the experiment. The black-circled points correspond to the repetition displayed in panel (A). (**C**) Same as panel (A) for connected correlations ${c}_{ij}={f}_{ij}-{f}_{i}{f}_{j}$. (**D**) Same as panel (B) for connected correlations.
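
For reference, the one- and two-point statistics compared in Figures 3 and 4 can be computed from an integer-encoded alignment as in the sketch below. It uses plain empirical frequencies without pseudocounts or sequence weights; the function name and the array layout are assumptions made for this example, not the article's code.

```python
import numpy as np


def frequencies(msa, q=21):
    """Empirical single-site and pairwise frequencies of an MSA.

    msa : (M, L) integer array with entries in 0..q-1.
    Returns
      f_i  : (L, q)        with f_i[i, a]       = f_i(a)
      f_ij : (L, L, q, q)  with f_ij[i, j, a, b] = f_ij(a, b)
      c_ij : (L, L, q, q)  connected correlations f_ij - f_i f_j
    """
    msa = np.asarray(msa)
    n_seq = msa.shape[0]
    onehot = np.eye(q)[msa]                       # shape (M, L, q)
    f_i = onehot.mean(axis=0)
    f_ij = np.einsum("mia,mjb->ijab", onehot, onehot) / n_seq
    c_ij = f_ij - f_i[:, None, :, None] * f_i[None, :, None, :]
    return f_i, f_ij, c_ij
```

By construction, `c_ij[i, j]` sums to zero over either amino-acid index, which is a quick sanity check on any implementation of these statistics.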

**Figure 5.** Direct coupling analysis (DCA) models inferred after single-site or pairwise phylogenetic correction for $\mu \,L\,\Delta t=3$. (**A**) Pearson correlation between parameters of inferred and of true DCA models. y-axis: couplings ${J}_{ij}$; x-axis: fields ${h}_{i}$. One point corresponds to one repetition of the MCMC process on the tree, i.e., to one sample. (**B**) Histogram of the symmetrized Kullback–Leibler divergences between inferred and true models for all samples. (**C**) Positive predictive value for predicting non-zero couplings (i.e., “contacts”) using inferred DCA models. DCA inferred on the i.i.d. sample performs perfectly in this case.
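
The positive predictive value (PPV) used in panel (C) of Figure 5 and in Figure 8 is the fraction of true contacts among the top-ranked predicted pairs. A minimal sketch is given below; it assumes that each pair of sites already has a scalar score (for instance a norm of the inferred coupling matrix, a common choice in DCA that is assumed here rather than taken from the article), and the function name is hypothetical.

```python
import numpy as np


def ppv_curve(scores, contacts):
    """PPV as a function of the number of top-ranked predictions.

    scores   : (L, L) symmetric array, larger means stronger predicted coupling.
    contacts : (L, L) symmetric boolean array of true contacts.
    Returns an array whose k-th entry (0-based) is the fraction of true
    contacts among the k + 1 highest-scoring pairs i < j.
    """
    scores = np.asarray(scores, dtype=float)
    contacts = np.asarray(contacts, dtype=bool)
    i, j = np.triu_indices(scores.shape[0], k=1)  # each pair counted once
    order = np.argsort(-scores[i, j], kind="stable")
    hits = contacts[i, j][order]
    return np.cumsum(hits) / np.arange(1, hits.size + 1)
```

A curve that stays at 1 for the first n predictions means that the n highest-scoring pairs are all true contacts.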

**Figure 6.** Pearson correlation in predicting energies of single mutants, averaged over sets of reference sequences, for $\mu \,L\,\Delta t=3$. In the top panel, reference sequences are taken from the biased sample, i.e., among the leaves of the phylogenetic tree. In the bottom panel, reference sequences are taken from a fair sample of ${P}^{0}$. Predictions are made using four models: a profile model and a Potts model trained on the uncorrected biased sample (“Profile on tree” and “DCA on tree”, respectively), and the same two models trained using the corrected single-site frequencies (“Profile + single site inf.” and “DCA + single site inf.”, respectively). Error bars indicate the standard deviation across the 30 repetitions of the tree sampling process.

**Figure 7.** Pearson correlation in predicting energies of single mutants, averaged over sets of reference sequences, for $\mu \,L\,\Delta t=3$. In the top panel, reference sequences are taken from the biased sample, i.e., among the leaves of the phylogenetic tree. In the bottom panel, reference sequences are taken from a fair sample of ${P}^{0}$. Predictions are made using a DCA model inferred directly on biased data, using corrected single-site frequencies, or using corrected pairwise frequencies. Error bars indicate the standard deviation across the 30 repetitions of the tree sampling process.
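
The single-mutant energies compared in Figures 6 and 7 can be illustrated with the sketch below. It assumes the usual Potts convention $E(\underline{A})=-\sum_{i}h_{i}(a_{i})-\sum_{i<j}J_{ij}(a_{i},a_{j})$, a coupling array that is symmetric ($J_{ij}(a,b)=J_{ji}(b,a)$) with vanishing diagonal blocks, and hypothetical function names; the article may use a different gauge or sign convention.

```python
import numpy as np


def energy(seq, h, J):
    """Potts energy of an integer-encoded sequence (assumed sign convention)."""
    seq = np.asarray(seq)
    length = len(seq)
    e = -h[np.arange(length), seq].sum()
    for i in range(length):
        for j in range(i + 1, length):
            e -= J[i, j, seq[i], seq[j]]
    return float(e)


def single_mutant_delta_e(seq, h, J):
    """Energy change of every single mutant relative to `seq`.

    h : (L, q) fields; J : (L, L, q, q) symmetric couplings with J[i, i] = 0.
    Returns dE with dE[i, b] = E(seq with b at site i) - E(seq).
    """
    seq = np.asarray(seq)
    length, q = h.shape
    onehot = np.eye(q)[seq]                       # shape (L, q)
    # local[i, b] = h_i(b) + sum_j J_ij(b, a_j): everything that depends on site i.
    local = h + np.einsum("ijab,jb->ia", J, onehot)
    reference = local[np.arange(length), seq]     # value for the wild-type residue
    return -(local - reference[:, None])
```

Correlating such energy differences between an inferred model and the true model, over all sites and amino acids of a reference sequence, gives one Pearson coefficient per reference sequence.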

**Figure 8.** Positive predictive value for predicting contacts in representative structures for two protein families, PF00013 and PF00046. The blue lines indicate a naive DCA method without any correction for phylogeny. The orange lines show results for the sequence reweighting scheme. The green lines show results after our phylogenetic inference scheme.

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Rodriguez Horta, E.; Barrat-Charlaix, P.; Weigt, M.
Toward Inferring Potts Models for Phylogenetically Correlated Sequence Data. *Entropy* **2019**, *21*, 1090.
https://doi.org/10.3390/e21111090

**AMA Style**

Rodriguez Horta E, Barrat-Charlaix P, Weigt M.
Toward Inferring Potts Models for Phylogenetically Correlated Sequence Data. *Entropy*. 2019; 21(11):1090.
https://doi.org/10.3390/e21111090

**Chicago/Turabian Style**

Rodriguez Horta, Edwin, Pierre Barrat-Charlaix, and Martin Weigt.
2019. "Toward Inferring Potts Models for Phylogenetically Correlated Sequence Data" *Entropy* 21, no. 11: 1090.
https://doi.org/10.3390/e21111090