# Guaranteed Diversity and Optimality in Cost Function Network Based Computational Protein Design Methods


## Abstract


## 1. Introduction

## 2. Computational Protein Design

The `ref2015` and `beta_nov_16` score functions [4,20] also integrate the log-probabilities of occurrence of rotamers in natural structures, as provided in the Dunbrack library, in a specific energy term. To be optimized, the energy function should be easy to compute while remaining as accurate as possible, so that relevant sequences are predicted. To meet these requirements, additive pairwise decomposable approximations of the energy have been chosen for protein design approaches [6,21]. The decomposable energy E of a sequence-conformation $\mathbf{r}=({r}_{1},\cdots ,{r}_{n})$, where ${r}_{i}$ is the rotamer used at position i in the protein sequence, can be written as:
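To illustrate why pairwise decomposability makes the energy cheap to evaluate, the following minimal sketch assumes the usual shape of such a decomposition: a constant term, one term per position, and one term per pair of positions. That shape and the names `e_const`, `e_unary` and `e_pair` are assumptions made for this illustration and are not taken from any particular score function.

```python
from itertools import combinations

def decomposable_energy(r, e_const, e_unary, e_pair):
    """Energy of a sequence-conformation r = (r_1, ..., r_n).

    e_unary[i][r_i] is the term attached to position i.
    e_pair[(i, j)][(r_i, r_j)] is the term attached to the pair i < j;
    missing pairs or missing rotamer pairs contribute 0 (an assumption).
    """
    total = e_const
    for i, ri in enumerate(r):
        total += e_unary[i][ri]
    for i, j in combinations(range(len(r)), 2):
        total += e_pair.get((i, j), {}).get((r[i], r[j]), 0.0)
    return total

# Toy instance with three positions and two hypothetical rotamers "a" and "b".
e_unary = [{"a": 1.0, "b": 0.5}, {"a": 0.0, "b": 2.0}, {"a": 0.25, "b": 0.0}]
e_pair = {
    (0, 1): {("a", "a"): 1.0, ("b", "a"): 3.0},
    (1, 2): {("a", "b"): 0.5},
}
energy = decomposable_energy(("b", "a", "b"), -1.0, e_unary, e_pair)
```

Evaluating one sequence-conformation then costs a number of table lookups that is quadratic in the number of positions, which is what makes exhaustive reasoning over such tables practical.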

## 3. CPD as a Weighted Constraint Satisfaction Problem

**Φ** is a set of potential functions. A potential function ${\phi}_{\mathbf{S}}$ maps ${\mathbf{D}}_{\mathbf{S}}$ to $[0,+\infty ]$. The joint potential function is defined as:

- **Variables**: We add sequence variables to the network: ${\mathbf{X}}^{\prime}={\mathbf{X}}^{seq}\cup \mathbf{X}$, where ${\mathbf{X}}^{seq}=\{{X}_{i}^{seq}\mid {X}_{i}\in \mathbf{X}\}$. The value of ${X}_{i}^{seq}$ represents the amino acid type of the rotamer value of ${X}_{i}$.
- **Domains**: ${\mathbf{D}}^{\prime}={\mathbf{D}}^{seq}\cup \mathbf{D}$, where ${\mathbf{D}}^{seq}=\{{\mathbf{D}}_{i}^{seq}\mid {\mathbf{D}}_{i}\in \mathbf{D}\}$ and the domain ${\mathbf{D}}_{i}^{seq}$ of ${X}_{i}^{seq}$ is the set of available amino acid types at position i.
- **Constraints**: The new set of cost functions ${\mathbf{C}}^{\prime}$ is made of the initial functions $\mathbf{C}$ and of sequence constraints that ensure that ${X}_{i}^{seq}$ is the amino acid type of rotamer ${X}_{i}$. Such a function ${c}_{{X}_{i},{X}_{i}^{seq}}$ simply forbids (maps to cost ⊤) the pairs of values $(r,a)$ where the amino acid identity of rotamer r does not match a. All other pairs are mapped to cost 0.
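The construction above can be sketched as follows. This is only an illustration: the rotamer naming scheme (`"D_1"` for rotamer 1 of aspartic acid), the `_seq` suffix and the function name are hypothetical, and the sequence domain is assumed to be the set of amino acid types that occur among the rotamers of the position.

```python
TOP = float("inf")  # stands for the forbidden cost ⊤

def add_sequence_variables(rotamer_domains, aa_of):
    """Build sequence variables and their channelling cost functions.

    rotamer_domains: {variable name: list of rotamers}
    aa_of: function mapping a rotamer to its amino acid type
    Returns (seq_domains, channel), where channel[(X_i, X_i_seq)] maps each
    pair (rotamer, amino acid) to 0 if they match and to TOP otherwise.
    """
    seq_domains = {}
    channel = {}
    for var, rotamers in rotamer_domains.items():
        seq_var = var + "_seq"
        seq_domains[seq_var] = sorted({aa_of(r) for r in rotamers})
        channel[(var, seq_var)] = {
            (r, a): 0 if aa_of(r) == a else TOP
            for r in rotamers
            for a in seq_domains[seq_var]
        }
    return seq_domains, channel

# Hypothetical rotamer names: amino acid letter, underscore, rotamer index.
domains = {"X1": ["D_1", "D_2", "L_1"], "X2": ["R_1", "Q_1"]}
seq_domains, channel = add_sequence_variables(domains, lambda r: r.split("_")[0])
```

With these functions in place, any constraint stated on the sequence variables, such as a distance to a reference sequence, automatically restricts the rotamer variables as well.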

## 4. Diversity and Optimality

#### 4.1. Measuring Diversity

**Definition 1.**

**Definition 2.**

**Definition 3.**

- its average dissimilarity: $\overline{d}(\mathbf{Z})=\frac{2}{|\mathbf{Z}|(|\mathbf{Z}|-1)}\sum _{\mathbf{t}\ne {\mathbf{t}}^{\prime}\in \mathbf{Z}}d(\mathbf{t},{\mathbf{t}}^{\prime})$
- its minimum dissimilarity: $\check{d}(\mathbf{Z})=\underset{\mathbf{t}\ne {\mathbf{t}}^{\prime}\in \mathbf{Z}}{\min}d(\mathbf{t},{\mathbf{t}}^{\prime})$
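As a small worked example of these two measures, the sketch below uses the Hamming distance as dissimilarity and takes the sum over unordered pairs, which is the reading consistent with the normalisation factor $2/(|\mathbf{Z}|(|\mathbf{Z}|-1))$. The function names are illustrative only.

```python
from itertools import combinations

def hamming(t, u):
    """Number of positions at which two equal-length sequences differ."""
    return sum(a != b for a, b in zip(t, u))

def average_dissimilarity(Z, d=hamming):
    """Mean of d over all unordered pairs of distinct elements of Z."""
    pairs = list(combinations(Z, 2))
    return sum(d(t, u) for t, u in pairs) / len(pairs)

def minimum_dissimilarity(Z, d=hamming):
    """Smallest value of d over all unordered pairs of distinct elements of Z."""
    return min(d(t, u) for t, u in combinations(Z, 2))

Z = ["AAAA", "AABB", "BBBB"]
avg = average_dissimilarity(Z)   # pairwise distances are 2, 4 and 2
low = minimum_dissimilarity(Z)
```

On this toy set the average dissimilarity is 8/3 while the minimum is 2: a set can look diverse on average and still contain close pairs, which is why the minimum is the stricter measure.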

**Definition 4.**

#### 4.2. Diversity Given Sequences of Interest

- A native functional sequence ${\mathbf{s}}_{nat}$ is known for the target backbone. The designer wants fewer than ${\delta}_{nat}$ mutations to be introduced in some sensitive region of the native protein, to avoid disrupting a crucial protein property.
- A patented sequence ${\mathbf{s}}_{pat}$ exists for the same function, and the designed sequence must carry more than ${\delta}_{pat}$ mutations with respect to it to be usable without a license.

- ${\mathrm{D}}_{\mathrm{IST}}({\mathbf{X}}^{seq},{\mathbf{s}}_{nat},H,-{\delta}_{nat})$
- ${\mathrm{D}}_{\mathrm{IST}}({\mathbf{X}}^{seq},{\mathbf{s}}_{pat},H,{\delta}_{pat})$
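The two constraints above differ only in the sign of the threshold. The sketch below illustrates one reading of that sign convention: a positive threshold asks for a dissimilarity of at least δ, and a negative one for a dissimilarity of at most |δ|. The non-strict comparisons, the function name `dist_cost` and the toy sequences are assumptions made for this illustration.

```python
TOP = float("inf")  # stands for the forbidden cost ⊤

def hamming(t, s):
    """Number of positions at which two equal-length sequences differ."""
    return sum(a != b for a, b in zip(t, s))

def dist_cost(t, s, d, delta):
    """Cost of assignment t under a distance constraint to reference s.

    delta >= 0: t must be at dissimilarity at least delta from s.
    delta <  0: t must be at dissimilarity at most -delta from s.
    Returns 0 when the requirement holds and TOP otherwise.
    """
    value = d(t, s)
    if delta >= 0:
        return 0 if value >= delta else TOP
    return 0 if value <= -delta else TOP

# Hypothetical sequences, for illustration only.
s_nat, s_pat = "ACDEFG", "ACDKLM"
candidate = "ACDEKM"
near_native = dist_cost(candidate, s_nat, hamming, -2)     # 2 mutations from s_nat
far_from_patent = dist_cost(candidate, s_pat, hamming, 2)  # 2 mutations from s_pat
```

Here the candidate satisfies both requirements at once, whereas the patented sequence itself would be rejected by the second constraint.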

#### 4.3. Sets of Diverse and Good Quality Solutions

**Definition 5.** Given a dissimilarity matrix D, an integer M and a dissimilarity threshold δ, the problem DiverseSet$(\mathcal{C},D,M,\delta )$ consists of producing a set $\mathbf{Z}$ of M solutions of $\mathcal{C}$ such that:

- **Diversity**: For all $\mathbf{t}\ne {\mathbf{t}}^{\prime}\in \mathbf{Z}$, $d(\mathbf{t},{\mathbf{t}}^{\prime})\geqslant \delta $, i.e., ${\mathrm{D}}_{\mathrm{IST}}(\mathbf{t},{\mathbf{t}}^{\prime},D,\delta )=0$.
- **Quality**: The solutions have minimum cost, i.e., $\underset{\mathbf{t}\in \mathbf{Z}}{\sum {}^{\top}}{C}_{\mathcal{C}}\left(\mathbf{t}\right)$ is minimum.

**Definition 6.**

- The first solution $\mathbf{Z}[1]$ is the optimum of $\mathcal{C}$.
- When solutions $\mathbf{Z}[1..(i-1)]$ are computed, $\mathbf{Z}[i]$ is such that, for all $1\leqslant j<i$, ${\mathrm{D}}_{\mathrm{IST}}(\mathbf{Z}[i],\mathbf{Z}[j],D,\delta )=0$ and $\mathbf{Z}[i]$ has minimum cost. That is, $\mathbf{Z}[i]$ is the minimum cost solution among the assignments that are at distance at least δ from all the previously computed solutions.

## 5. Relation with Existing Work

**Definition 7.**

**Theorem 1.**

**Proof.**

**Theorem 2.**

- Any assignment $\mathbf{t}$ of a CFN $\mathcal{C}=(\mathbf{X},\mathbf{D},\mathbf{C})$ is a δ-mode iff it is an optimal solution of the CFN $(\mathbf{X},\mathbf{D},\mathbf{C}\cup \{{\mathrm{D}}_{\mathrm{IST}}(\mathbf{X},\mathbf{t},H,-\delta )\})$.
- For bounded δ, checking whether an assignment is a δ-mode is in P.

**Proof.**

- The function ${\mathrm{D}}_{\mathrm{IST}}(\mathbf{X},\mathbf{t},H,-\delta )$ restricts $\mathbf{X}$ to be within $\delta $ of $\mathbf{t}$. If $\mathbf{t}$ is an optimal solution of $(\mathbf{X},\mathbf{D},\mathbf{C}\cup \{{\mathrm{D}}_{\mathrm{IST}}(\mathbf{X},\mathbf{t},H,-\delta )\})$, then there is no better assignment than $\mathbf{t}$ in the $\delta $-radius Hamming ball and $\mathbf{t}$ is a $\delta $-mode.
- For bounded $\delta $ and a CFN with n variables and at most d values in each domain, there are $O\left({\left(nd\right)}^{\delta}\right)$ tuples within the Hamming ball, because from $\mathbf{t}$ we can pick any variable (n choices) and change its value (d choices), $\delta $ times. The cost of each tuple is computed in polynomial time, so the problem of checking whether $\mathbf{t}$ is optimal is in P.
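The enumeration argument of the proof can be turned into a direct check. The sketch below is a brute-force illustration under the reading given in the proof (no assignment in the δ-radius Hamming ball around the candidate has a strictly lower cost); the function names and the toy cost function are hypothetical.

```python
from itertools import combinations, product

def hamming_ball(t, domains, delta):
    """Yield every assignment at Hamming distance at most delta from t."""
    n = len(t)
    for k in range(delta + 1):
        for positions in combinations(range(n), k):
            # At each chosen position, try every value different from t's.
            alternatives = [[v for v in domains[i] if v != t[i]] for i in positions]
            for values in product(*alternatives):
                u = list(t)
                for i, v in zip(positions, values):
                    u[i] = v
                yield tuple(u)

def is_delta_mode(t, domains, cost, delta):
    """True if no assignment within distance delta of t is strictly cheaper."""
    best = cost(t)
    return all(cost(u) >= best for u in hamming_ball(t, domains, delta))

# Toy landscape: a local optimum at aaa, the global optimum at bbb.
domains = [("a", "b")] * 3
table = {("a", "a", "a"): 1, ("b", "b", "b"): 0}
cost = lambda u: table.get(u, 2)
aaa = ("a", "a", "a")
```

In this toy landscape `aaa` is a 2-mode but not a 3-mode, since the global optimum `bbb` only enters the ball at radius 3.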

## 6. Representing the Diversity Constraint

#### 6.1. Using Automata

- The alphabet is the set of possible values, i.e., the union of the variable domains: $\Sigma ={\bigcup}_{i=1}^{n}{\mathbf{D}}_{i}$
- The set of states $\mathbf{Q}$ gathers $(\delta +1)\cdot (n+1)$ states denoted ${q}_{i}^{d}$: $$\mathbf{Q}=\left\{{q}_{i}^{d}\mid 0\leqslant i\leqslant n,\ 0\leqslant d\leqslant \delta \right\}$$
- In the initial state, no value of $\mathbf{X}$ has been read, and the dissimilarity is 0: $${Q}_{0}={q}_{0}^{0}$$
- The assignment is accepted if its dissimilarity from $\mathbf{t}$ reaches the threshold $\delta $, hence the accepting state: $$\mathbf{F}=\left\{{q}_{n}^{\delta}\right\}$$
- For every value r of ${X}_{i+1}$, the transition function $\Delta :\mathbf{Q}\times \Sigma \times \mathbf{Q}$ defines a 0-cost transition from ${q}_{i}^{d}$ to ${q}_{i+1}^{\min (d+D(r,\mathbf{t}[i+1]),\delta )}$. All other transitions have infinite cost ⊤.
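Because exactly one transition has cost 0 for each state and value read, running the automaton amounts to maintaining a capped counter. The following sketch simulates it on a complete assignment; the function name is illustrative, and the example reuses the reference $\mathbf{t}=aacba$ and $\delta=2$ of the Hamming automaton in Figure 3.

```python
TOP = float("inf")  # stands for the forbidden cost ⊤

def automaton_cost(x, t, D, delta):
    """Run the distance automaton on assignment x against reference t.

    The state q_i^d is represented by the counter d after reading i values.
    The counter is capped at delta, so only (delta + 1) * (n + 1) states exist.
    Returns 0 if the run ends in the accepting state q_n^delta, TOP otherwise.
    """
    d = 0  # initial state q_0^0
    for value, reference in zip(x, t):
        d = min(d + D(value, reference), delta)  # the unique 0-cost transition
    return 0 if d == delta else TOP

hamming_matrix = lambda a, b: int(a != b)  # Hamming dissimilarity on values
t = "aacba"
```

For instance, `bbcba` differs from `t` at two positions and is accepted, while `abcba` differs at a single position and is rejected.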

#### 6.2. Exploiting Automaton Function Decomposition

#### 6.3. Compressing the Encoding

- $m{d}_{n}=0$
- For $0\leqslant i<n$, $m{d}_{i}=m{d}_{i+1}+{\max}_{v,{v}^{\prime}\in {\mathbf{D}}_{i+1}}D(v,{v}^{\prime})$
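The recurrence is a single backward pass over the variables. By construction, $m{d}_{i}$ is an upper bound on the dissimilarity that variables ${X}_{i+1},\dots ,{X}_{n}$ can still contribute. A minimal sketch follows, with the Python list `domains[i]` holding ${\mathbf{D}}_{i+1}$ because of 0-based indexing; the function name is illustrative.

```python
def max_remaining_dissimilarity(domains, D):
    """Return the list [md_0, ..., md_n] defined by the recurrence.

    domains[i] is the domain D_{i+1} (0-based list, 1-based variables).
    md[n] = 0 and md[i] = md[i + 1] + max over v, v' in D_{i+1} of D(v, v').
    """
    n = len(domains)
    md = [0] * (n + 1)
    for i in range(n - 1, -1, -1):
        widest = max(D(v, w) for v in domains[i] for w in domains[i])
        md[i] = md[i + 1] + widest
    return md

hamming_matrix = lambda a, b: int(a != b)
md = max_remaining_dissimilarity([["a", "b", "c"], ["a"], ["a", "b"]], hamming_matrix)
```

In this example the second variable has a single value and therefore cannot add any dissimilarity, which the bound reflects.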

- One variable ${X}_{\mathbf{S}}$ per constraint ${c}_{\mathbf{S}}\in \mathbf{C}$: $${\mathbf{X}}^{\prime}=\left\{{X}_{\mathbf{S}}\mid {c}_{\mathbf{S}}\in \mathbf{C}\right\}$$
- The domain ${\mathbf{D}}_{{X}_{\mathbf{S}}}$ of variable ${X}_{\mathbf{S}}$ is the set of tuples $\mathbf{t}\in {\mathbf{D}}_{\mathbf{S}}$ that satisfy the constraint ${c}_{\mathbf{S}}$: $${\mathbf{D}}^{\prime}=\left\{{\mathbf{D}}_{{X}_{\mathbf{S}}}\mid {c}_{\mathbf{S}}\in \mathbf{C}\right\}\qquad {\mathbf{D}}_{{X}_{\mathbf{S}}}=\{\mathbf{t}\in {c}_{\mathbf{S}}\}$$
- For each pair of constraints ${c}_{\mathbf{S}},{c}_{{\mathbf{S}}^{\prime}}\in \mathbf{C}$ with overlapping scopes $\mathbf{S}\cap {\mathbf{S}}^{\prime}\ne \varnothing $, there is a constraint ${c}_{{X}_{\mathbf{S}},{X}_{{\mathbf{S}}^{\prime}}}$ that ensures that the tuples assigned to ${X}_{\mathbf{S}}$ and ${X}_{{\mathbf{S}}^{\prime}}$ are compatible, i.e., they have the same values on the overlapping variables: $${\mathbf{C}}^{\prime}=\left\{{c}_{{X}_{\mathbf{S}},{X}_{{\mathbf{S}}^{\prime}}}\mid {X}_{\mathbf{S}},{X}_{{\mathbf{S}}^{\prime}}\in {\mathbf{X}}^{\prime},\ \mathbf{S}\cap {\mathbf{S}}^{\prime}\ne \varnothing \right\}$$ $${c}_{{X}_{\mathbf{S}},{X}_{{\mathbf{S}}^{\prime}}}=\left\{(\mathbf{t},{\mathbf{t}}^{\prime})\in {\mathbf{D}}_{{X}_{\mathbf{S}}}\times {\mathbf{D}}_{{X}_{{\mathbf{S}}^{\prime}}}\mid \mathbf{t}[\mathbf{S}\cap {\mathbf{S}}^{\prime}]={\mathbf{t}}^{\prime}[\mathbf{S}\cap {\mathbf{S}}^{\prime}]\right\}$$
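The three items above translate almost literally into code. The sketch below builds the dual network of a small constraint network whose constraints are given as sets of allowed tuples; the data layout (scopes as tuples of variable names) and the function name are assumptions made for this illustration.

```python
from itertools import combinations

def dual_encoding(constraints):
    """Dual network of a constraint network.

    constraints: {scope (tuple of variable names): set of allowed tuples}
    Returns (dual_domains, dual_constraints): one dual variable per scope,
    whose domain is the list of allowed tuples, and one compatibility
    relation per pair of scopes that share at least one variable.
    """
    dual_domains = {scope: sorted(tuples) for scope, tuples in constraints.items()}
    dual_constraints = {}
    for s1, s2 in combinations(constraints, 2):
        shared = [v for v in s1 if v in s2]
        if not shared:
            continue  # disjoint scopes need no compatibility constraint
        dual_constraints[(s1, s2)] = {
            (t1, t2)
            for t1 in dual_domains[s1]
            for t2 in dual_domains[s2]
            if all(t1[s1.index(v)] == t2[s2.index(v)] for v in shared)
        }
    return dual_domains, dual_constraints

constraints = {
    ("X1", "X2"): {(0, 1), (1, 0)},  # X1 != X2 over {0, 1}
    ("X2", "X3"): {(0, 0), (1, 1)},  # X2 == X3 over {0, 1}
}
dual_domains, dual_constraints = dual_encoding(constraints)
```

The two scopes share `X2`, so the only compatible pairs of tuples are those that agree on its value.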

- All the variables in $\mathbf{X}$ and the variables ${X}_{\mathbf{S}}$ from the dual network (and associated domains): $${\mathbf{X}}^{\prime \prime}=\mathbf{X}\cup {\mathbf{X}}^{\prime}$$
- For any dual variable ${X}_{\mathbf{S}}$ and each ${X}_{i}\in \mathbf{S}$, the set of constraints ${\mathbf{C}}^{\prime \prime}$ contains a function involving ${X}_{i}$ and ${X}_{\mathbf{S}}$: $${c}_{{X}_{i}{X}_{\mathbf{S}}}:(v,\mathbf{t})\in {\mathbf{D}}_{i}\times {\mathbf{D}}_{{X}_{\mathbf{S}}}\mapsto \left\{\begin{array}{cc}0\hfill & \phantom{\rule{4.pt}{0ex}}\mathrm{if}\ \mathbf{t}\left[{X}_{i}\right]=v\hfill \\ \top \hfill & \phantom{\rule{4.pt}{0ex}}\mathrm{otherwise}.\hfill \end{array}\right.$$
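A minimal sketch of the functions ${c}_{{X}_{i}{X}_{\mathbf{S}}}$ defined above, assuming dual domains stored as lists of tuples indexed by scope; the data layout and the function name are hypothetical.

```python
TOP = float("inf")  # stands for the forbidden cost ⊤

def hidden_functions(domains, dual_domains):
    """Binary functions linking each original variable to each dual variable.

    domains: {variable name: list of values}
    dual_domains: {scope (tuple of variable names): list of tuples}
    functions[(X_i, S)][(v, t)] is 0 when tuple t assigns v to X_i, else TOP.
    """
    functions = {}
    for scope, tuples in dual_domains.items():
        for position, var in enumerate(scope):
            functions[(var, scope)] = {
                (v, t): 0 if t[position] == v else TOP
                for v in domains[var]
                for t in tuples
            }
    return functions

domains = {"X1": [0, 1], "X2": [0, 1]}
dual_domains = {("X1", "X2"): [(0, 1), (1, 0)]}
functions = hidden_functions(domains, dual_domains)
```

Every function produced this way is binary, whatever the arity of the original constraint.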

## 7. Greedy DiverseSeq

1. The CFN $\mathcal{C}$ is solved using branch-and-bound while maintaining soft local consistencies [26].
2. If a solution $\mathbf{t}$ is found, it is added to the ongoing solution sequence $\mathbf{Z}$.
3. If M solutions have been produced, the algorithm stops.
4. Otherwise, the cost function ${\mathrm{D}}_{\mathrm{IST}}(\mathbf{X},\mathbf{t},D,\delta )$ is added to the previously solved problem.
5. We loop and solve the problem again (Step 1).

**Algorithm 1.** Incremental production of DiverseSeq$(\mathcal{C},D,M,\delta )$
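The greedy loop can be illustrated on toy instances with the sketch below. It is not the incremental branch-and-bound procedure: each optimization step is replaced here by an exhaustive enumeration of all assignments, which is only feasible for very small problems, and all names are illustrative.

```python
from itertools import product

def hamming(t, u):
    """Number of positions at which two equal-length assignments differ."""
    return sum(a != b for a, b in zip(t, u))

def greedy_diverse_seq(domains, cost, M, delta, d=hamming):
    """Greedy sequence of at most M solutions, pairwise at distance >= delta.

    Each iteration returns a minimum-cost assignment among those at distance
    at least delta from every solution found so far. Exhaustive enumeration
    stands in for the exact solver (an assumption of this sketch).
    """
    Z = []
    for _ in range(M):
        candidates = [
            t for t in product(*domains)
            if all(d(t, z) >= delta for z in Z)
        ]
        if not candidates:
            break  # no assignment is far enough from all previous solutions
        Z.append(min(candidates, key=cost))
    return Z

# Toy instance: three binary positions, the cost counts the number of "b".
domains = ["ab", "ab", "ab"]
cost = lambda t: t.count("b")
Z = greedy_diverse_seq(domains, cost, M=3, delta=2)
```

On this instance the successive optimum costs are 0, 2 and 2: they never decrease, because each iteration solves a more constrained problem than the previous one.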

- **Incrementality**: Since the problems solved are increasingly constrained, all the equivalence preserving transformations and pruning that have been applied to enforce local consistencies at iteration $i-1$ are still valid in the following iterations. Instead of restarting from a problem $\mathcal{C}=(\mathbf{X},\mathbf{D},\mathbf{C}\cup {\bigcup}_{1\le j<i}\{{\mathrm{D}}_{\mathrm{IST}}(\mathbf{X},\mathbf{Z}[j],D,\delta )\})$, we reuse the problem solved at iteration $i-1$ after it has been made locally consistent, add the ${\mathrm{D}}_{\mathrm{IST}}(\mathbf{X},\mathbf{Z}[i-1],D,\delta )$ constraint and enforce local consistencies again. As with incremental SAT solvers, adaptive variable ordering heuristics that have been trained at iteration $i-1$ are reused at iteration i.
- **Lower bound**: Since the problems solved are increasingly constrained, the optimal cost $o{c}^{i}$ obtained at iteration i cannot be lower than the optimal cost $o{c}^{i-1}$ reported at iteration $i-1$. When large plateaus are present in the energy landscape, this allows stopping the search as soon as a solution of cost $o{c}^{i-1}$ is reached, avoiding a useless repeated proof of optimality.
- **Upper bound prediction**: Even if there are no plateaus in the energy landscape, there may be large regions with similar variations in energy. In this case, the difference in energy between $o{c}^{i-1}$ and $o{c}^{i}$ will remain similar for several iterations. Let ${\Delta}_{i}^{h}={\max}_{\max (2,i-h)\le j<i}(o{c}^{j}-o{c}^{j-1})$ be the maximum variation observed in the last h iterations (we used $h=5$). At iteration i, we can first solve the problem with a temporary upper bound ${k}^{\prime}=\min (k,o{c}^{i-1}+2\cdot {\Delta}_{i}^{h})$ that should preserve a solution. If ${k}^{\prime}<k$, this will lead to increased determinism, additional pruning, and possibly exponential savings. If no solution is found under ${k}^{\prime}$, the problem is solved again with the original upper bound k. We call this predictive bounding.
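The computation of the temporary bound ${k}^{\prime}$ and the fallback to k can be sketched as follows. The solver is abstracted as a hypothetical callback `solve(upper_bound)` that returns a solution, or `None` when no solution exists below the bound; this callback and the function names are assumptions of the sketch, not an actual solver interface.

```python
def predicted_upper_bound(oc, k, h=5):
    """Temporary upper bound k' for the next iteration.

    oc: optimum costs of the previous iterations, oc[0] being oc^1.
    k: original upper bound.
    Returns min(k, oc^{i-1} + 2 * Delta), where Delta is the largest increase
    oc^j - oc^{j-1} for max(2, i - h) <= j < i, or k if no increase is known.
    """
    i = len(oc) + 1                      # iteration about to be solved
    js = range(max(2, i - h), i)
    if not js:
        return k                         # fewer than two optima known so far
    delta_h = max(oc[j - 1] - oc[j - 2] for j in js)  # oc^j is oc[j - 1]
    return min(k, oc[-1] + 2 * delta_h)

def solve_with_prediction(solve, oc, k, h=5):
    """Try the predicted bound first, then fall back to the original bound."""
    k_prime = predicted_upper_bound(oc, k, h)
    solution = solve(k_prime)
    if solution is None and k_prime < k:
        solution = solve(k)              # the prediction was too optimistic
    return solution

history = [10.0, 10.5, 10.75, 11.75]     # toy optimum costs oc^1 .. oc^4
k_prime = predicted_upper_bound(history, k=100.0)
```

With this toy history the largest recent increase is 1.0, so the fifth iteration is first attempted with an upper bound of 13.75 instead of 100.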

## 8. Results

`-A` in toulbar2). The computational cost of VAC, although polynomial, is high, but amortized over the M resolutions. During tree search, the default existential directional arc consistency (EDAC) was used. All experiments were performed on one core of a Xeon Gold 6140 CPU at 2.30 GHz. Wall-clock times could be further reduced using a parallel implementation of the underlying Hybrid Best-First search engine [45], currently under development in `toulbar2`.

`ref2015` score function [48]. Alternate rotamer libraries and score functions can be used if required, as the algorithms presented here are not specialized for Rosetta (and not even for CPD, see [44]). The resulting networks have from 44 to 87 rotamer variables, and maximum domain sizes range from 294 to 446 rotamers. The number of variables is doubled after sequence variables are added.

`-rgap` and `-agap` flags respectively).

## 9. Conclusions

## Abbreviations

Abbreviation | Meaning |
---|---|
CFN | Cost Function Network |
CPD | Computational Protein Design |
CSP | Constraint Satisfaction Problem |
MRF | Markov Random Field |
NSR | Native Sequence Recovery |
NSSR | Native Sequence Similarity Recovery |
WCSP | Weighted Constraint Satisfaction Problem |

## References

1. Anfinsen, C.B. Principles that govern the folding of protein chains. Science **1973**, 181, 223–230.
2. Pierce, N.A.; Winfree, E. Protein design is NP-hard. Protein Eng. **2002**, 15, 779–782.
3. Van Laarhoven, P.J.; Aarts, E.H. Simulated annealing. In Simulated Annealing: Theory and Applications; Springer: Berlin/Heidelberg, Germany, 1987; pp. 7–15.
4. Leaver-Fay, A.; Tyka, M.; Lewis, S.M.; Lange, O.F.; Thompson, J.; Jacak, R.; Kaufman, K.W.; Renfrew, P.D.; Smith, C.A.; Sheffler, W.; et al. ROSETTA3: An object-oriented software suite for the simulation and design of macromolecules. In Methods in Enzymology; Elsevier: Amsterdam, The Netherlands, 2011; Volume 487, pp. 545–574.
5. Traoré, S.; Allouche, D.; André, I.; De Givry, S.; Katsirelos, G.; Schiex, T.; Barbe, S. A new framework for computational protein design through cost function network optimization. Bioinformatics **2013**, 29, 2129–2136.
6. Allouche, D.; André, I.; Barbe, S.; Davies, J.; de Givry, S.; Katsirelos, G.; O’Sullivan, B.; Prestwich, S.; Schiex, T.; Traoré, S. Computational protein design as an optimization problem. Artif. Intell. **2014**, 212, 59–79.
7. Noguchi, H.; Addy, C.; Simoncini, D.; Wouters, S.; Mylemans, B.; Van Meervelt, L.; Schiex, T.; Zhang, K.Y.; Tame, J.R.; Voet, A.R. Computational design of symmetrical eight-bladed β-propeller proteins. IUCrJ **2019**, 6, 46–55.
8. Schiex, T.; Fargier, H.; Verfaillie, G. Valued constraint satisfaction problems: Hard and easy problems. IJCAI (1) **1995**, 95, 631–639.
9. Cooper, M.; de Givry, S.; Schiex, T. Graphical models: Queries, complexity, algorithms. Leibniz Int. Proc. Inform. **2020**, 154, 4-1.
10. Bouchiba, Y.; Cortés, J.; Schiex, T.; Barbe, S. Molecular flexibility in computational protein design: An algorithmic perspective. Protein Eng. Des. Sel. **2021**, 34, gzab011.
11. Marcos, E.; Silva, D.A. Essentials of de novo protein design: Methods and applications. Wiley Interdiscip. Rev. Comput. Mol. Sci. **2018**, 8, e1374.
12. King, C.; Garza, E.N.; Mazor, R.; Linehan, J.L.; Pastan, I.; Pepper, M.; Baker, D. Removing T-cell epitopes with computational protein design. Proc. Natl. Acad. Sci. USA **2014**, 111, 8577–8582.
13. Kirillov, A.; Shlezinger, D.; Vetrov, D.P.; Rother, C.; Savchynskyy, B. M-Best-Diverse Labelings for Submodular Energies and Beyond. In Proceedings of the Twenty-Ninth Conference on Neural Information Processing Systems, Quebec, QC, Canada, 7–12 December 2015; pp. 613–621.
14. Bacchus, F.; Van Beek, P. On the conversion between non-binary and binary constraint satisfaction problems. In Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI), Madison, WI, USA, 26 July 1998; pp. 310–318.
15. Larrosa, J.; Dechter, R. On the dual representation of non-binary semiring-based CSPs. In Proceedings of the CP’2000 Workshop on Soft Constraints, Singapore, 18 September 2000.
16. Shapovalov, M.V.; Dunbrack, R.L., Jr. A smoothed backbone-dependent rotamer library for proteins derived from adaptive kernel density estimates and regressions. Structure **2011**, 19, 844–858.
17. Lovell, S.C.; Word, J.M.; Richardson, J.S.; Richardson, D.C. The penultimate rotamer library. Proteins Struct. Funct. Bioinform. **2000**, 40, 389–408.
18. Case, D.A.; Cheatham, T.E., III; Darden, T.; Gohlke, H.; Luo, R.; Merz, K.M., Jr.; Onufriev, A.; Simmerling, C.; Wang, B.; Woods, R.J. The Amber biomolecular simulation programs. J. Comput. Chem. **2005**, 26, 1668–1688.
19. Brooks, B.R.; Brooks, C.L., III; Mackerell, A.D., Jr.; Nilsson, L.; Petrella, R.J.; Roux, B.; Won, Y.; Archontis, G.; Bartels, C.; Boresch, S.; et al. CHARMM: The biomolecular simulation program. J. Comput. Chem. **2009**, 30, 1545–1614.
20. Alford, R.F.; Leaver-Fay, A.; Jeliazkov, J.R.; O’Meara, M.J.; DiMaio, F.P.; Park, H.; Shapovalov, M.V.; Renfrew, P.D.; Mulligan, V.K.; Kappel, K.; et al. The Rosetta all-atom energy function for macromolecular modeling and design. J. Chem. Theory Comput. **2017**, 13, 3031–3048.
21. Samish, I. Computational Protein Design; Springer: Berlin/Heidelberg, Germany, 2017.
22. Gainza, P.; Roberts, K.E.; Georgiev, I.; Lilien, R.H.; Keedy, D.A.; Chen, C.Y.; Reza, F.; Anderson, A.C.; Richardson, D.C.; Richardson, J.S.; et al. OSPREY: Protein design with ensembles, flexibility, and provable algorithms. In Methods in Enzymology; Elsevier: Amsterdam, The Netherlands, 2013; Volume 523, pp. 87–107.
23. Pierce, N.A.; Spriet, J.A.; Desmet, J.; Mayo, S.L. Conformational splitting: A more powerful criterion for dead-end elimination. J. Comput. Chem. **2000**, 21, 999–1009.
24. Rossi, F.; van Beek, P.; Walsh, T. (Eds.) Handbook of Constraint Programming; Elsevier: Amsterdam, The Netherlands, 2006.
25. Koller, D.; Friedman, N. Probabilistic Graphical Models: Principles and Techniques; MIT Press: Cambridge, MA, USA, 2009.
26. Cooper, M.C.; De Givry, S.; Sánchez, M.; Schiex, T.; Zytnicki, M.; Werner, T. Soft arc consistency revisited. Artif. Intell. **2010**, 174, 449–478.
27. Cooper, M.C.; De Givry, S.; Sánchez-Fibla, M.; Schiex, T.; Zytnicki, M. Virtual Arc Consistency for Weighted CSP. In Proceedings of the Twenty-third National Conference on Artificial Intelligence (AAAI), Chicago, IL, USA, 13–17 July 2008; pp. 253–258.
28. Traoré, S.; Roberts, K.E.; Allouche, D.; Donald, B.R.; André, I.; Schiex, T.; Barbe, S. Fast search algorithms for computational protein design. J. Comput. Chem. **2016**, 37, 1048–1058.
29. Traoré, S.; Allouche, D.; André, I.; Schiex, T.; Barbe, S. Deterministic Search Methods for Computational Protein Design. In Computational Protein Design; Springer: Berlin/Heidelberg, Germany, 2017; pp. 107–123.
30. Henikoff, S.; Henikoff, J.G. Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA **1992**, 89, 10915–10919.
31. Hebrard, E.; Hnich, B.; O’Sullivan, B.; Walsh, T. Finding diverse and similar solutions in constraint programming. In Proceedings of the Twentieth National Conference on Artificial Intelligence (AAAI), Pittsburgh, PA, USA, 9–13 July 2005; Volume 5, pp. 372–377.
32. Hebrard, E.; O’Sullivan, B.; Walsh, T. Distance Constraints in Constraint Satisfaction. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, Hyderabad, India, 6–12 January 2007; Volume 2007, pp. 106–111.
33. Hadžić, T.; Holland, A.; O’Sullivan, B. Reasoning about optimal collections of solutions. In International Conference on Principles and Practice of Constraint Programming; Springer: Berlin/Heidelberg, Germany, 2009; pp. 409–423.
34. Petit, T.; Trapp, A.C. Finding diverse solutions of high quality to constraint optimization problems. In Proceedings of the Twenty-fourth International Joint Conference on Artificial Intelligence, Buenos Aires, Argentina, 25–31 July 2015.
35. Batra, D.; Yadollahpour, P.; Guzman-Rivera, A.; Shakhnarovich, G. Diverse M-best solutions in Markov Random Fields. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2012; pp. 1–16.
36. Prasad, A.; Jegelka, S.; Batra, D. Submodular meets structured: Finding diverse subsets in exponentially-large structured item sets. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; pp. 2645–2653.
37. Kirillov, A.; Savchynskyy, B.; Schlesinger, D.; Vetrov, D.; Rother, C. Inferring M-best diverse labelings in a single one. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1814–1822.
38. Chen, C.; Kolmogorov, V.; Zhu, Y.; Metaxas, D.; Lampert, C. Computing the M most probable modes of a graphical model. In Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics, Scottsdale, AZ, USA, 29 April–1 May 2013; pp. 161–169.
39. Chen, C.; Yuan, C.; Ye, Z.; Chen, C. Solving M-Modes in Loopy Graphs Using Tree Decompositions. In Proceedings of the International Conference on Probabilistic Graphical Models, Prague, Czech Republic, 11–14 September 2018; pp. 145–156.
40. Chen, C.; Liu, H.; Metaxas, D.; Zhao, T. Mode estimation for high dimensional discrete tree graphical models. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; pp. 1323–1331.
41. Chen, C.; Yuan, C.; Chen, C. Solving M-Modes Using Heuristic Search. In Proceedings of the Twenty-fifth International Joint Conference on Artificial Intelligence, New York, NY, USA, 9–15 July 2016; pp. 3584–3590.
42. Pesant, G. A regular language membership constraint for finite sequences of variables. In International Conference on Principles and Practice of Constraint Programming; Springer: Berlin/Heidelberg, Germany, 2004; pp. 482–495.
43. Allouche, D.; Bessiere, C.; Boizumault, P.; De Givry, S.; Gutierrez, P.; Lee, J.H.; Leung, K.L.; Loudni, S.; Métivier, J.P.; Schiex, T.; et al. Tractability-preserving transformations of global cost functions. Artif. Intell. **2016**, 238, 166–189.
44. Ruffini, M.; Vucinic, J.; de Givry, S.; Katsirelos, G.; Barbe, S.; Schiex, T. Guaranteed Diversity & Quality for the Weighted CSP. In Proceedings of the 2019 IEEE 31st International Conference on Tools with Artificial Intelligence (ICTAI), Portland, OR, USA, 4–6 November 2019; pp. 18–25.
45. Allouche, D.; De Givry, S.; Katsirelos, G.; Schiex, T.; Zytnicki, M. Anytime hybrid best-first search with tree decomposition for weighted CSP. In Proceedings of the International Conference on Principles and Practice of Constraint Programming, Cork, Ireland, 31 August–4 September 2015; pp. 12–29.
46. Simoncini, D.; Allouche, D.; de Givry, S.; Delmas, C.; Barbe, S.; Schiex, T. Guaranteed discrete energy optimization on large protein design problems. J. Chem. Theory Comput. **2015**, 11, 5980–5989.
47. Ollikainen, N.; Kortemme, T. Computational protein design quantifies structural constraints on amino acid covariation. PLoS Comput. Biol. **2013**, 9, e1003313.
48. Park, H.; Bradley, P.; Greisen, P., Jr.; Liu, Y.; Mulligan, V.K.; Kim, D.E.; Baker, D.; DiMaio, F. Simultaneous optimization of biomolecular energy functions on features from small molecules and macromolecules. J. Chem. Theory Comput. **2016**, 12, 6201–6212.
49. Pohl, I. Heuristic search viewed as path finding in a graph. Artif. Intell. **1970**, 1, 193–204.
50. Xu, J.; Berger, B. Fast and accurate algorithms for protein side-chain packing. J. ACM (JACM) **2006**, 53, 533–557.
51. Jou, J.D.; Jain, S.; Georgiev, I.S.; Donald, B.R. BWM*: A novel, provable, ensemble-based dynamic programming algorithm for sparse approximations of computational protein design. J. Comput. Biol. **2016**, 23, 413–424.
52. De Givry, S.; Schiex, T.; Verfaillie, G. Exploiting tree decomposition and soft local consistency in weighted CSP. In Proceedings of the Twenty-First National Conference on Artificial Intelligence (AAAI), Boston, MA, USA, 16–20 July 2006; Volume 6, pp. 1–6.

**Figure 1.** An example of two protein sequences (top) where two mutable amino acids have been redesigned. At the first position, the amino acid D (an aspartic acid) has been changed to an L (leucine), in a specific conformation (orientation). At the second position, the arginine R, with its very long and flexible sidechain, has been changed to a glutamine Q. The figure on the right illustrates the potential flexibility of the long arginine sidechain, showing a sample of several possible superimposed conformations, representing a fraction of all possible conformations for an arginine sidechain in existing rotamer libraries.

**Figure 2.** Input backbone and cost function network representation of a corresponding CPD instance with 6 mutable or flexible residues.

**Figure 3.** Weighted automaton representing ${\mathrm{D}}_{\mathrm{IST}}(\mathbf{X},\mathbf{t},H,\delta )$ where $\mathbf{X}$ is a set of 5 variables, with domains ${\mathbf{D}}_{i}=\{a,b,c\}$, $\mathbf{t}=aacba$, H represents the Hamming distance, and $\delta $ is set to 2. State ${q}_{i}^{d}$ means that values ${X}_{1}\cdots {X}_{i}$ are such that $H({X}_{1}\cdots {X}_{i},\mathbf{t}[{X}_{1}\cdots {X}_{i}])=d$ (or $\geqslant \delta $ if $d=\delta $). A labeled arrow $q\stackrel{(v,w)}{\longrightarrow}{q}^{\prime}$ means $\Delta (q,v,{q}^{\prime})=w$, i.e., there is a transition from q to ${q}^{\prime}$ with value v and weight w.

**Figure 4.** Hypergraph representation of the decomposition of a WRegular cost function with additional state variables ${Q}_{i}$ and transition-encoding ternary functions.

**Figure 6.** Comparison of the best NSR value obtained with ten 1-diverse sequences ($\delta =1$, blue curve) with the best NSR value obtained with libraries of ten sequences of increased diversity. Each plot corresponds to a specific additional value of $\delta $ ($\delta =2$ to 15, golden curve). Plots are ordered lexicographically from top-left to bottom-right, with increasing values of diversity ($\delta $). In each plot, the X-axis ranges over all tested backbones, sorted in increasing order of NSR value for the 1-diverse case, and the Y-axis gives the corresponding NSR value. As the diversity requirement $\delta $ increases, the NSR value indicated by the golden curve also visibly increases.

**Figure 7.** Comparison of the best NSSR value obtained with ten 1-diverse sequences ($\delta =1$, blue curve) with the best NSSR value obtained with libraries of ten sequences of increased diversity. Each plot corresponds to a specific additional value of $\delta $ ($\delta =2$ to 15, golden curve). Plots are ordered lexicographically from top-left to bottom-right, with increasing values of diversity ($\delta $). In each plot, the X-axis ranges over all tested backbones, sorted in increasing order of NSSR value for the 1-diverse case, and the Y-axis gives the corresponding NSSR value. As the diversity requirement $\delta $ increases, the NSSR value indicated by the golden curve also visibly increases.

**Figure 8.** The blue curves above give the absolute change in NSR (Y-axis, **left** figure) and NSSR (Y-axis, **right** figure) between the best 15-diverse and the best 1-diverse sequences found for each backbone. Backbones (on the X-axis) are ordered in increasing order of the corresponding measure. In the left figure, the bar-plot shows the difference between each backbone's 1-diverse NSR and the average 1-diverse NSR over all backbones. The corresponding NSR change scale appears on the right with $\pm 20\%$ labels. Red bars indicate a below-average 1-diverse NSR while blue bars indicate an above-average 1-diverse NSR. The most improved NSRs, on the right of the left figure, mostly appear on weak (red, below-average) 1-diverse NSRs.

**Figure 9.** Comparison of the computation times of sequence sets without diversity ($\delta =1$) with those of sequence sets with diversity ($\delta >1$). The color scale on the right indicates the corresponding value of $\delta $.

**Figure 10.** Comparison of the best NSR value obtained with ten 1-diverse sequences ($\delta =1$, blue curve) with the best NSR value obtained with libraries of ten sequences of increased diversity, all predicted with an allowed gap to the optimal energy of 3 kcal/mol. Each plot corresponds to a specific additional value of $\delta $ ($\delta =2$ to 15, golden curve). Plots are lexicographically ordered from top-left to bottom-right, with increasing values of diversity ($\delta $). In each plot, the X-axis ranges over all tested backbones, sorted in increasing order of NSR value for the 1-diverse case, and the Y-axis gives the corresponding NSR value. As the diversity requirement $\delta $ increases, the NSR value indicated by the golden curve also visibly increases.

**Figure 11.** Comparison of the best NSSR value obtained with ten 1-diverse sequences ($\delta =1$, blue curve) with the best NSSR value obtained with libraries of ten sequences of increased diversity, all predicted with an allowed gap to the optimal energy of 3 kcal/mol. Each plot corresponds to a specific additional value of $\delta $ ($\delta =2$ to 15, golden curve). Plots are lexicographically ordered from top-left to bottom-right, with increasing values of diversity ($\delta $). In each plot, the X-axis ranges over all tested backbones, sorted in increasing order of NSSR value for the 1-diverse case, and the Y-axis gives the corresponding NSSR value. As the diversity requirement $\delta $ increases, the NSSR value indicated by the golden curve also visibly increases.

**Figure 12.** Comparison of the computation times of sequence sets without diversity ($\delta =1$) with those of suboptimal sequence sets with diversity ($\delta >1$). An energy gap of 3 kcal/mol to the actual optimum is allowed.

**Table 1.** List of protein structures used in our benchmark set, for full redesign: PDB identifier, domain length n (number of variables in the resulting CFN) and maximum domain size d.

PDB ID | n | d | PDB ID | n | d | PDB ID | n | d |
---|---|---|---|---|---|---|---|---|
1aho | 56 | 378 | 3i8z | 50 | 354 | 1ten | 81 | 392 |
2fjz | 53 | 324 | 2cg7 | 82 | 380 | 1ucs | 60 | 342 |
1b9w | 78 | 386 | 3rdy | 65 | 396 | 2bwf | 69 | 347 |
2gkt | 45 | 357 | 2erw | 47 | 446 | 2evb | 68 | 323 |
1f94 | 53 | 386 | 3vdj | 67 | 391 | 2o37 | 60 | 386 |
2pne | 77 | 401 | 2fht | 64 | 346 | 2o9s | 48 | 327 |
1hyp | 66 | 385 | 1bxy | 52 | 384 | 3f04 | 87 | 356 |
2pst | 61 | 357 | 1ctf | 68 | 349 | 3fym | 70 | 348 |
1uln | 66 | 367 | 1czp | 76 | 373 | 3gqs | 67 | 344 |
1uoy | 56 | 337 | 1fqt | 85 | 377 | 3gva | 87 | 348 |
2ca7 | 44 | 348 | 1guu | 47 | 350 | 3i2z | 67 | 360 |
1yzm | 46 | 294 | 1t8k | 68 | 361 | | | |

**Table 2.** p-values for a one-sided Wilcoxon signed-rank test comparing the sample of best NSR (resp. NSSR) for each $\delta =2,\dots ,15$ with $\delta =1$, for exact and suboptimal (3 kcal/mol allowed energy gap to the real optimum) resolution.

$\delta $ | NSR (exact resolution) | NSSR (exact resolution) | NSR (subopt. resolution) | NSSR (subopt. resolution) |
---|---|---|---|---|
2 | $2.88\times {10}^{-3}$ | $1.10\times {10}^{-3}$ | $4.11\times {10}^{-1}$ | $1.35\times {10}^{-1}$ |
3 | $3.87\times {10}^{-4}$ | $1.01\times {10}^{-4}$ | $1.14\times {10}^{-1}$ | $2.60\times {10}^{-2}$ |
4 | $4.42\times {10}^{-5}$ | $6.58\times {10}^{-5}$ | $4.48\times {10}^{-3}$ | $1.06\times {10}^{-3}$ |
5 | $8.11\times {10}^{-5}$ | $1.54\times {10}^{-5}$ | $1.98\times {10}^{-3}$ | $2.15\times {10}^{-3}$ |
6 | $1.51\times {10}^{-5}$ | $4.39\times {10}^{-6}$ | $7.47\times {10}^{-5}$ | $4.49\times {10}^{-5}$ |
7 | $1.88\times {10}^{-5}$ | $4.23\times {10}^{-6}$ | $8.86\times {10}^{-6}$ | $3.50\times {10}^{-5}$ |
8 | $1.27\times {10}^{-5}$ | $1.49\times {10}^{-6}$ | $8.19\times {10}^{-6}$ | $1.77\times {10}^{-5}$ |
9 | $2.76\times {10}^{-5}$ | $5.97\times {10}^{-6}$ | $2.07\times {10}^{-5}$ | $2.80\times {10}^{-6}$ |
10 | $1.14\times {10}^{-5}$ | $1.18\times {10}^{-5}$ | $1.78\times {10}^{-5}$ | $3.06\times {10}^{-5}$ |
11 | $4.27\times {10}^{-5}$ | $5.81\times {10}^{-7}$ | $2.32\times {10}^{-5}$ | $1.73\times {10}^{-5}$ |
12 | $6.63\times {10}^{-5}$ | $2.26\times {10}^{-6}$ | $1.75\times {10}^{-5}$ | $1.18\times {10}^{-5}$ |
13 | $4.43\times {10}^{-5}$ | $2.52\times {10}^{-6}$ | $2.48\times {10}^{-6}$ | $6.15\times {10}^{-6}$ |
14 | $2.29\times {10}^{-5}$ | $5.76\times {10}^{-6}$ | $5.26\times {10}^{-6}$ | $3.89\times {10}^{-7}$ |
15 | $3.92\times {10}^{-5}$ | $1.58\times {10}^{-6}$ | $2.68\times {10}^{-5}$ | $4.86\times {10}^{-5}$ |

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Ruffini, M.; Vucinic, J.; de Givry, S.; Katsirelos, G.; Barbe, S.; Schiex, T. Guaranteed Diversity and Optimality in Cost Function Network Based Computational Protein Design Methods. *Algorithms* **2021**, *14*, 168.
https://doi.org/10.3390/a14060168
