Guaranteed Diversity and Optimality in Cost Function Network Based Computational Protein Design Methods

Proteins are the main active molecules of life. While natural proteins play many roles, for example as enzymes or antibodies, there is a need to go beyond the repertoire of natural proteins to produce engineered proteins that precisely meet application requirements, in terms of function, stability, activity or other protein capacities. Computational Protein Design aims at designing new proteins from first principles, using full-atom molecular models. However, the size and complexity of proteins require approximations to make them amenable to energetic optimization queries. These approximations make the design process less reliable, and even a provably optimal solution may fail. In practice, expensive libraries of solutions are therefore generated and tested. In this paper, we explore the idea of generating libraries of provably diverse low-energy solutions by extending Cost Function Network algorithms with dedicated automaton-based diversity constraints, on a large set of realistic full protein redesign problems. We observe that it is possible to generate provably diverse libraries in reasonable time and that the produced libraries do enhance the Native Sequence Recovery, a traditional measure of the reliability of design methods. This paper is an extended version of [1].


Introduction
Proteins are complex molecules that govern much of how cells work, in humans, plants, and microbes. They are made of a succession of simple molecules called amino acids. All amino acids share a common constant linear body and a variable side-chain. The side-chain defines the nature of the amino acid. There are 20 natural amino acid types, each having a distinct side-chain offering specific physico-chemical properties. In a protein, the successive amino acids are connected one to the other by so-called peptide bonds, defining a long linear polymer called the protein backbone. In solution, most proteins fold into a 3D shape, determined by the physico-chemical properties of their amino acid side-chains. Because of their large variety of functions, and their potential for applications in medicine, environment, biofuels, green chemistry, etc., new protein sequences are sought that present desired new or enhanced properties and functions. As function is closely related to three-dimensional (3D) structure [2], Computational Protein Design (CPD) methods aim at finding a sequence that folds into a target 3D structure, which corresponds to the desired properties and functions. A general formulation of this problem being highly intractable, simplifying assumptions have been made: the target protein structure (or backbone) is assumed to be rigid, the continuous space of flexibility of amino acid side-chains is represented as a discrete set of conformations called rotamers, and the atomic forces that control the protein stability are represented as a decomposable energy function, defined as the sum of terms involving at most two bodies (atoms). The design problem is then reduced to a purely discrete optimization problem: given a rigid backbone, one must find a combination of discrete side-chain natures and conformations (rotamers) that minimizes the energy. The resulting optimal sequence-conformation is called the Global Minimum Energy Conformation (GMEC).
Since a typical rotamer library for all 20 natural amino acids contains a few hundred conformations, the discrete search space very quickly becomes challenging to explore, and the problem has been shown to be NP-hard [3] (its decision version is NP-complete). It has naturally been approached by stochastic optimization techniques such as Monte Carlo simulated annealing [4], as in the commonly used Rosetta software [5]. Such stochastic methods offer only asymptotic optimality guarantees. Another possible approach is to use provable optimization techniques, which instead offer finite-time deterministic guarantees. In the last decade, Constraint Programming-based algorithms for solving the so-called Weighted Constraint Satisfaction Problem (WCSP) on Cost Function Networks (CFN) have been proposed to tackle CPD instances [6,7]. These provable methods have shown unprecedented efficiency at optimizing decomposable force fields on genuine protein design instances [7], leading to successfully characterized new proteins [8]. Cost Function Networks are one example of a larger family of mathematical models that aim at representing and analyzing decomposable functions, called Graphical Models [9,10].
Even if provable methods definitely remove the possibility of failed optimization, they cannot compensate for the simplifying assumptions that appear in the CPD problem formulation: the rigid backbone and discrete side-chain conformations ignore the actual continuous protein flexibility. Similarly, the optimized pairwise decomposed energetic criterion only approximates the actual molecular energy. Therefore, even with provable methods, a library of highly relevant mutants is usually produced for further experimental testing, with the hope that the probability of identifying a functional protein will be increased. Provable Branch and Bound-based WCSP algorithms have the native ability to enumerate solutions within a threshold of the optimal solution. Empirically, one can observe that the set of sequences that lie within this threshold grows very quickly in size with the energy gap, but is mostly composed of sequences that are very similar to the optimal GMEC sequence. Ideally, a design library should be a set of low-energy but also diverse solutions. The hope is that sequence diversity will improve the likelihood that a protein endowed with the desired function is found. In this paper, we are therefore interested in algorithmic methods that can provide such a set of guaranteed diverse low-energy solutions, and then in empirically checking whether enforcing diversity in a library while optimizing energy does improve the protein design process.
Because of their important applications, protein sequences can be subject to patents. Ideally, a newly designed sequence should satisfy a minimum Hamming distance constraint to existing patented sequences. This raises the need to produce sequences satisfying a minimum distance requirement w.r.t. given sequences. Finally, CPD input structures often come from existing, experimentally resolved natural (or native) proteins. In this case, a native sequence exists, which has usually acquired desirable properties over the billions of years of natural evolution and optimization it has been through. In many cases, to avoid disrupting the native protein properties (e.g., catalytic capacities), the protein designer may want to bound the maximum number of mutations introduced in the design. This raises the need to produce sequences satisfying a maximum distance requirement w.r.t. given sequences.
In this paper, given an initial rigid backbone, we consider the problem of producing a set of diverse low-energy sequences that also provably satisfy a set of minimum and maximum distance requirements w.r.t. given sequences. We observe that, beyond structural biology and bioinformatics, this problem of producing a set of diverse solutions of a Graphical Model has been considered by many authors, either on discrete Boolean Graphical Models such as Constraint Networks (used in Constraint Programming), or on stochastic graphical models such as Markov Random Fields. While this shows that the interest in the problem of diverse solution generation goes well beyond Computational Protein Design, we observe that these approaches either offer no guarantee, or are limited to specific tractable sub-classes of functions, such as submodular functions [11]. Our approach instead relies on the reduction of distance requirements to discrete automaton-based constraints that can be decomposed and compressed into ternary or binary constraints, using suitable dual and hidden encodings [12,13]. These constraints can then be processed natively by existing WCSP algorithms. We empirically evaluate this approach for the generation of a library of diverse sequences on a set of protein design problems. We first observe that this approach is capable of producing provably diverse sets of solutions on Computational Protein Design problems of realistic size in reasonable time. Going back to our initial aim, we also observe that sufficiently diverse libraries do offer better Native Sequence Recovery rates (NSR), a usual metric for evaluating protein design methods, which measures how well a method reproduces Nature's optimization.

Computational Protein Design
A CPD instance is first composed of an input target 3D structure, defined by the Cartesian coordinates of all the atoms in the protein backbone. The target protein structure can come from an existing protein backbone, that was determined experimentally on an existing protein; or from a model that can be derived from existing 3D structures of similar proteins; or from a completely new structure, as is done in de novo design. Once a backbone has been chosen, the design space must be fixed. One may choose to do a full redesign, where the amino acids at all positions of the protein can be changed, or redesign only a subset of all positions, focusing on positions that are key for the targeted function. Overall, each position in the protein sequence will be set by the designer as either fixed, flexible, or mutable. If the position is fixed, the side-chain is fixed and rigid: the amino acid type and orientation are determined in the input target structure. If the position is flexible, the residue type is fixed to the amino acid type of the input structure, but the side-chain might adopt several conformations in space. If the position is mutable, all or a restricted set of amino acid types are allowed at the position, along with different conformations of their side-chain. Because of the supposed rigidity of the backbone, the sequence-conformation search space is therefore characterized by two decision levels: the sequence space, which corresponds to all the possible sequences s enabled by the mutable positions, and the conformation space, which must be searched to identify the best side-chain conformation at each flexible or mutable position. The possible conformations, or rotamers, for each amino acid are indexed in so-called rotamer libraries, such as the Dunbrack [14] or the Penultimate libraries [15]. Each library gathers a finite set of conformations, capturing a representative subset of all frequently adopted conformations in experimentally determined structures. 
In the Rosetta design software [5], which relies on the Dunbrack library, a fully mutable position is typically associated with around 400 possible rotamers. Designing a 10-residue peptide therefore requires the exploration of 400^10 ≈ 10^26 conformations. Figure 1 shows an example of two protein sequences (top) where two mutable amino acids have been redesigned. At the first position, the amino acid D (an aspartic acid) has been changed to an L (leucine), in a specific conformation (orientation). At the second position, the arginine R, with its very long and flexible side-chain, has been changed to a glutamine Q. The figure on the right illustrates the potential flexibility of the long arginine side-chain, showing a sample of several superimposed conformations, representing a fraction of all possible conformations for an arginine side-chain in existing rotamer libraries.
Given a backbone structure and a rotamer library, the CPD problem seeks a stable and functional sequence-conformation. The protein functionality is assumed to result from its conformation, and its stability is captured by an energy function E that allows to compute the energy of any sequence-conformation on the target backbone. The task at hand is the minimization of this energy function: the optimal sequence is the best possible choice for the target rigid backbone. To model the energy, score functions are used. They can be physics-based, such as the energetic force fields AMBER [16] and CHARMM [17]. They capture various atomic interactions, including bond and torsion angle potentials, van der Waals potentials, electrostatic interactions, hydrogen bond forces and entropic solvent effects. Score functions may also be enriched by so-called "knowledge-based" energy terms, which result from the statistical analysis of known protein structures. For instance, the Rosetta ref2015 and beta_nov_16 score functions [5] also integrate, in a specific energy term, rotamer log-probabilities of occurrence in natural structures, as provided in the Dunbrack library. To be optimized, the energy function should be easy to compute while remaining as accurate as possible, so as to predict relevant sequences. To meet these requirements, additive pairwise decomposable approximations of the energy have been chosen for protein design approaches [7,18]. The decomposable energy E of a sequence-conformation r = (r_1, . . . , r_n), where r_i is the rotamer used at position i in the protein sequence, can be written as:

E(r) = E_∅ + ∑_{1 ≤ i ≤ n} E_i(r_i) + ∑_{1 ≤ i < j ≤ n} E_ij(r_i, r_j)

The term E_∅ is a constant that captures interactions within the rigid backbone. For 1 ≤ i ≤ n, the unary (or one-body) terms E_i capture the interactions between the rotamer r_i at position i and the backbone, as well as interactions internal to the rotamer r_i.
For 1 ≤ i < j ≤ n, the binary terms E_ij capture the interactions between rotamers r_i and r_j, at positions i and j respectively. These energy terms only vary with the rotamers, thanks to the rigid backbone assumption. Protein design dedicated software, such as OSPREY [19] or Rosetta [5], computes all the constant, unary, and binary energy terms, for each rotamer and combination of rotamers, for each position and pair of positions. This requires only quadratic time in the protein length. Once computed, these values are stored in so-called energy matrices. During the exploration of the sequence-conformation space, conformation energies can be efficiently computed by summing the relevant energy terms fetched from the energy matrix. CPD methods aim at finding the optimum conformation, called the global minimum energy conformation (GMEC). Despite all these simplifications, this problem remains decision NP-complete [20].
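The table-lookup evaluation described above can be sketched as follows. This is a minimal illustration, not real force-field output: the constant, unary and binary values below are hypothetical placeholders standing in for a precomputed energy matrix.

```python
# Toy pairwise-decomposable energy: E(r) = E_0 + sum_i E_i(r_i) + sum_{i<j} E_ij(r_i, r_j).
# All numeric values are hypothetical, not real force-field energies.
E_const = -12.5                        # E_0: rigid backbone self-interaction
E_unary = [                            # E_i[r]: rotamer r at position i vs. backbone
    {0: 1.2, 1: 0.4},                  # position 0: two candidate rotamers
    {0: 0.9, 1: 2.1, 2: 0.3},          # position 1: three candidate rotamers
]
E_binary = {                           # E_ij[(r_i, r_j)]: rotamer-pair interactions
    (0, 1): {(0, 0): 0.5, (0, 1): -0.2, (0, 2): 0.0,
             (1, 0): 1.1, (1, 1): 0.7, (1, 2): -0.4},
}

def conformation_energy(rotamers):
    """Energy of a full rotamer assignment, by summing matrix lookups."""
    e = E_const
    for i, r in enumerate(rotamers):
        e += E_unary[i][r]
    for (i, j), table in E_binary.items():
        e += table[(rotamers[i], rotamers[j])]
    return e

print(conformation_energy((1, 2)))     # -12.5 + 0.4 + 0.3 - 0.4 = -12.2
```

Once the matrices are filled, evaluating any conformation costs O(n^2) lookups, which is what makes branch-and-bound exploration of the sequence-conformation space practical.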

CPD as a Weighted Constraint Satisfaction Problem
A Cost Function Network (CFN) C is a mathematical model that aims at representing functions of many discrete variables that decompose as a sum of simple functions (with small arity or concise representation). It is a member of a larger family of mathematical models called graphical models [10], which all rely on multivariate function decomposability. A CFN is defined as a triple C = (X, D, C) where X = (X_1, . . . , X_n) is a set of variables, D = (D_1, . . . , D_n) is a set of finite domains, and C is a set of cost functions. Each variable X_i takes its values in the domain D_i. Each cost function c_S ∈ C is a non-negative integer function that depends on the variables in S, called the scope of the function. Given a set of variables S ⊂ X, the set D_S = ∏_{X_i ∈ S} D_i denotes the Cartesian product of the domains of the variables in S. For a tuple t ∈ D_Y, with S ⊂ Y ⊂ X, the tuple t[S] denotes the projection of t over the variables of S. A cost function c_S ∈ C maps tuples of D_S to integer costs in {0, . . . , ⊤}. In this paper, we assume, as is usual in most graphical models, that the default representation of a cost function c_S is a multidimensional cost table (or tensor) that contains the cost of every possible assignment of the variables in its scope. This representation requires space that grows exponentially with the cost function arity |S|, which explains why the arity is often assumed to be at most two. The joint function is defined as the bounded sum of all cost functions in C:

C_X(t) = ⊕_{c_S ∈ C} c_S(t[S])

where the bounded sum ⊕ is defined by a ⊕ b = min(a + b, ⊤). The maximum cost ⊤ ∈ N ∪ {+∞} is used for forbidden partial assignments and represents a sort of infinite or unbearable cost. Cost functions that take their values in {0, ⊤} represent hard constraints.
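The bounded sum ⊕ and the joint cost it induces can be sketched in a few lines. This is a minimal illustration; the value of TOP is an arbitrary placeholder for the network's maximum cost ⊤.

```python
TOP = 1000  # maximum cost (top): marks forbidden assignments

def bounded_add(a, b, top=TOP):
    """The bounded sum used to combine CFN costs: a + b, saturating at top."""
    return min(a + b, top)

def joint_cost(tuple_costs, top=TOP):
    """Joint cost of an assignment: bounded sum of each cost function's value."""
    total = 0
    for c in tuple_costs:
        total = bounded_add(total, c, top)
    return total

print(joint_cost([3, 7, 0]))    # ordinary sum while below TOP
print(joint_cost([999, 5]))     # saturates: once forbidden, always forbidden
```

The saturation property is what lets hard constraints (cost in {0, ⊤}) coexist with soft cost functions inside one additive model.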
The Weighted Constraint Satisfaction Problem consists of finding the assignment of all the variables in X with minimum cost:

min_{t ∈ D_X} C_X(t)

Notice that when the maximum cost ⊤ = 1, the cost function network becomes a constraint network [21], where cost functions encode only constraints. Tuples that are assigned cost 0 are valid, i.e., they satisfy the constraint, and tuples that are assigned cost 1 = ⊤ are forbidden. The Constraint Satisfaction Problem then consists of finding a satisfying assignment, one that satisfies all the constraints.
Graphical models also encompass stochastic networks, such as discrete Markov Random Fields (MRF) and Bayesian nets [22]. A discrete Markov Random Field is a graphical model M = (X, D, Φ) where X = (X_1, . . . , X_n) is a set of random variables, D = (D_1, . . . , D_n) is a set of finite domains, and Φ is a set of potential functions. A potential function ϕ_S maps D_S to [0, +∞]. The joint potential function is defined as:

P(t) = ∏_{ϕ_S ∈ Φ} ϕ_S(t[S])

Instead of the sum used to combine functions in Cost Function Networks, the product is used here. The normalization of the potential function P by the partition function Z = ∑_{t ∈ D_X} P(t) defines the probability function p = (1/Z)·P of the MRF. The Maximum A Posteriori probability corresponds to the assignment with maximum probability (and maximum potential): max_t p(t).
An MRF can be expressed using so-called energy functions e_S ∈ E, obtained by a logarithmic transformation e_S = −log ϕ_S of the potential functions. The joint potential function is then an exponential of the total energy: P(t) = exp(−∑_{e_S ∈ E} e_S(t[S])). While potential functions are multiplied together, energies simply add up. Therefore, Cost Function Networks are closely related to the energetic expression of Markov Random Fields. The main difference lies in the fact that CFNs deal with non-negative integers only, whereas MRF energies are real numbers. If ⊤ = +∞, a CFN can be transformed into an MRF through an exponential transformation and, given a precision factor, an MRF can be transformed into a CFN through a log transform. Zero potentials are mapped to cost ⊤, and minimum energy means maximum probability.
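The potential/energy duality is a one-line transform in each direction. The sketch below illustrates it on arbitrary example potentials; zero potential maps to infinite energy (cost ⊤ after integer conversion).

```python
import math

def potential_to_energy(phi):
    """e_S = -log(phi_S); a zero potential becomes an infinite energy."""
    return math.inf if phi == 0.0 else -math.log(phi)

def energy_to_potential(e):
    """Inverse transform: phi_S = exp(-e_S)."""
    return math.exp(-e)

phis = [1.0, 0.5, 0.0]
energies = [potential_to_energy(p) for p in phis]

# Multiplying potentials corresponds to adding energies.
assert math.isclose(phis[0] * phis[1],
                    energy_to_potential(energies[0] + energies[1]))
# Maximum potential (probability) corresponds to minimum energy.
assert energies.index(min(energies)) == phis.index(max(phis))
```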
Given a cost function network, the weighted constraint satisfaction problem can be answered by the exploration of a search tree in which nodes are CFNs induced by conditioning, i.e., domain reductions on the initial network. A Branch-and-Bound strategy is used to explore the search tree, relying on a lower bound on the optimal assignment cost in the current sub-tree. If the lower bound is higher than the best joint cost found so far, no better solution is to be found in the subtree, and it can be pruned. Each time a new solution is found, the maximum cost ⊤ is updated to the corresponding cost, as we are only interested in finding solutions with lower cost. Ordering strategies are crucial and can lead to huge improvements in the empirical computation time: decisions should be made that lead to low-cost solutions (and decrease the maximum cost ⊤) and that enable early pruning. The efficiency of the branch-and-bound strategy relies on the strength of the lower bound on solution costs. In CFNs, since cost functions are non-negative, the empty-scoped cost function c_∅ provides a naive lower bound on the optimal cost. To efficiently compute tight lower bounds, local reasoning strategies are used, which aim at pushing as much cost as possible into the constant cost c_∅, for better pruning. They are based on so-called "equivalence preserving transformations", which perform cost transfers between cost functions while keeping the solutions' joint costs unchanged [23], i.e., the joint cost function is preserved (these operations are called reparameterizations in MRFs). Specific sequences of equivalence preserving transformations can be applied to a CFN to improve the lower bound c_∅ until a specific target property is reached on the CFN. These properties, called local consistencies, aim at creating zero costs, while improving c_∅. These sequences of operations should converge to a fixpoint (closure).
Various local consistency properties have been introduced, such as node consistency, arc consistency, full directional arc consistency, existential directional arc consistency, or virtual arc consistency [23]. For binary CFNs, which involve functions of arity at most two, these properties can be enforced in polynomial time in the size of the network. Among these, virtual arc consistency has been shown to solve the WCSP on networks composed of submodular functions [24]. Note however that the complexity of local consistency enforcing remains exponential in the arity of the cost functions (as is the size of the corresponding tensors).
Sometimes, one may need to include specific functions with a large scope in the description of the joint function. Because of the exponential size of the tensor description, these so-called global cost functions must be represented with dedicated concise descriptions and require dedicated algorithms for local consistency propagation. More formally, a global cost function, denoted GCF(S, A), is a family of cost functions, with scope S and possible parameters A. A global cost function is said to be tractable when its minimum can be computed in polynomial time.
The CFN formulation of computational protein design is straightforward [6,7,25,26] (see Figure 2): given a CPD instance with pairwise decomposable energy function E = E_∅ + ∑_{1 ≤ i ≤ n} E_i + ∑_{1 ≤ i < j ≤ n} E_ij, let C = (X, D, C) be a cost function network with variables X = (X_1, . . . , X_n), with one variable X_i for each flexible or mutable position i in the protein sequence, and domains D = (D_1, . . . , D_n), where the domain D_i of the variable X_i consists of the available rotamers at position i, i.e., the amino acid types and their side-chain conformations. The cost functions directly derive from the empty-scoped, unary and binary energy terms: c_∅ = E_∅, c_i = E_i and c_ij = E_ij. Energy terms, which are floating point numbers, are transformed into non-negative integer values by being shifted by a constant, multiplied by a large precision factor, and having their residual decimals truncated.
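The shift-scale-truncate conversion of floating point energies into non-negative integer costs can be sketched as follows. The precision factor is an arbitrary example value; actual solvers choose it from the desired decimal precision.

```python
def energies_to_costs(energies, precision=10**3):
    """Shift so the minimum energy maps to cost 0, scale by a precision
    factor, and truncate the residual decimals to get integer CFN costs."""
    shift = min(energies)
    return [int((e - shift) * precision) for e in energies]

# Hypothetical energy terms (kcal/mol-like floats) -> non-negative integers.
print(energies_to_costs([-1.25, 0.0, 2.5]))   # [0, 1250, 3750]
```

Only cost differences matter for optimization, so the constant shift changes nothing about which assignment is optimal; the precision factor bounds the rounding error introduced by truncation.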
In this encoding, variables represent rotamers, combining information on the nature and the geometry (conformation) of the side-chain. In practice, it is often useful to add extra variables that represent the sequence information alone. The CFN C can be transformed into C' = (X', D', C'), which embeds sequence variables, as follows:

Variables

We add sequence variables to the network: X' = X ∪ {X_i^seq | X_i ∈ X}, where X_i^seq represents the amino acid type of the rotamer value of X_i. Its domain D_i^seq is the set of available amino acid types at position i.

Constraints
The new set of cost functions C' is made of the initial functions C, plus sequence constraints that ensure that X_i^seq is the amino acid type of rotamer X_i. Such a function c(X_i, X_i^seq) simply forbids (maps to cost ⊤) pairs of values (r, a) where the amino acid identity of rotamer r does not match a. All other pairs are mapped to cost 0.
Such sequence variables depend functionally on the rotamer variables. They do not modify the search space and merely offer a facility to define properties on the protein sequence, if needed, as will be the case here.
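The channeling function between a rotamer variable and its sequence variable can be sketched as a simple table test. The rotamer list below is a hypothetical example (amino acid type, conformation id pairs), not a real library.

```python
TOP = 10**9  # maximum cost (top), marking forbidden pairs

# Hypothetical rotamer domain at one position: (amino acid type, conformation id).
rotamers_at_pos = [("ASP", 0), ("ASP", 1), ("LEU", 0), ("ARG", 3)]

def channeling_cost(rotamer_index, aa_type):
    """c(X_i, X_i_seq): 0 if the rotamer's amino acid matches aa_type,
    TOP otherwise, so X_i_seq is functionally determined by X_i."""
    return 0 if rotamers_at_pos[rotamer_index][0] == aa_type else TOP

print(channeling_cost(1, "ASP"))   # matching pair: cost 0
print(channeling_cost(2, "ASP"))   # mismatch: forbidden (TOP)
```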

Diversity and Optimality
In this section, we assume that we have a CFN C = (X, D, C) and we want to express diversity requirements on the values taken by variables in S ⊂ X. In the case of CPD, these variables will be the previously introduced sequence variables.

Measuring diversity
The task at hand is the production of a set of diverse and low cost solutions of C. First, we need a measure of diversity between pairs of solutions, and among sets of solutions.
The simplest diversity measure between pairs of solutions is the Hamming distance, defined hereafter. It counts the number of variables in S that take different values in two solutions. In the CPD framework, sequence variables represent amino acid identities: the Hamming distance measures the number of introduced mutations (or substitutions), a very usual notion in protein engineering.

Definition 1. Given a set of variables S ⊂ X and two assignments t, t' ∈ D_S, the Hamming distance between t and t' is defined as:

d_H(t, t') = |{X_i ∈ S | t[X_i] ≠ t'[X_i]}|

The Hamming distance can be generalized to take into account dissimilarity scores between values. The resulting distance is a semi-metric, defined as a sum of variable-wise dissimilarities, as follows:

Definition 2. Given a zero-diagonal symmetric positive matrix D that defines value dissimilarities, and two assignments t, t' ∈ D_S, the weighted Hamming distance between t and t' is defined as:

d_D(t, t') = ∑_{X_i ∈ S} D[t[X_i], t'[X_i]]

In computational biology, protein sequences are often compared using dedicated similarity matrices, such as the BLOSUM62 matrix [27]. A protein similarity matrix S can be transformed into a dissimilarity matrix D such that D_i,j = (S_i,i + S_j,j)/2 − S_i,j.

Definition 3. Given a set Z of solutions, we define the minimum dissimilarity of Z as the minimum of d_D(t, t') over all pairs of distinct solutions t, t' ∈ Z, and the average dissimilarity of Z as the average of d_D(t, t') over all such pairs.

We are aiming at producing a library of solutions that is guaranteed to be diverse. The average dissimilarity does not match this need: a set of solutions might have a satisfying average dissimilarity value, with several occurrences of the same assignment and one or a few very dissimilar ones. So, to guarantee diversity, the minimum dissimilarity will be the diversity measure used throughout this paper.
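The three definitions above can be sketched directly. The similarity matrix in the test is a tiny made-up example, not BLOSUM62.

```python
def hamming(t1, t2):
    """Definition 1: number of positions where t1 and t2 differ."""
    return sum(a != b for a, b in zip(t1, t2))

def weighted_hamming(t1, t2, D):
    """Definition 2: sum of value dissimilarities D[a][b] at each position."""
    return sum(D[a][b] for a, b in zip(t1, t2))

def similarity_to_dissimilarity(S):
    """D[i][j] = (S[i][i] + S[j][j]) / 2 - S[i][j]: zero diagonal, symmetric."""
    n = len(S)
    return [[(S[i][i] + S[j][j]) / 2 - S[i][j] for j in range(n)]
            for i in range(n)]

def min_diversity(Z, dist=hamming):
    """Definition 3 (minimum dissimilarity of a solution set)."""
    return min(dist(a, b) for i, a in enumerate(Z) for b in Z[i + 1:])

print(hamming("DLRWQN", "DARWIN"))            # 2 mutations
print(min_diversity(["AAA", "AAB", "ABB"]))   # worst pair is 1 apart
```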
Producing a set of diverse solutions thus requires that all solution pairs have their distance above a given threshold. This can be encoded in cost functions representing constraints, taking their values in {0, ⊤} only:

Definition 4. Given two sets of variables S, S' of the same cardinality, a dissimilarity matrix D and a diversity threshold δ, we define the global cost function DIST(S, S', D, δ), which maps a pair of assignments (t, t') to cost 0 if the distance requirement is met and to ⊤ otherwise. When one of the two assignments is a fixed tuple t', we simply write DIST(S, t', D, δ).

Allowing both positive and negative thresholds δ allows the DIST cost function to express either minimum or maximum diversity constraints. When δ > 0, the cost function expresses a minimum dissimilarity requirement between the assignments t and t': it maps (t, t') to 0 iff d_D(t, t') ≥ δ. If δ < 0, the cost function represents the fact that t and t' must be similar, with a dissimilarity lower than the absolute value of δ: it maps (t, t') to 0 iff d_D(t, t') ≤ |δ|.

Diversity given sequences of interest
In the CPD context, minimum and maximum distance requirements with known sequences may be useful in practice in at least two situations.

• A native functional sequence s_nat is known for the target backbone. The designer requires that fewer than δ_nat mutations be introduced in some sensitive region of the native protein, in order to avoid disrupting a crucial protein property.
• A patented sequence s_pat exists for the same function, and sequences with more than δ_pat mutations are required for the designed sequence to be usable without requiring a license.

The distance here is the Hamming distance, based on the matrix H that equals 1 everywhere except on its zero diagonal. Using sequence variables, the following diversity constraint-encoding cost functions need to be added to the CFN model: DIST(X^seq, s_nat, H, −δ_nat) and DIST(X^seq, s_pat, H, δ_pat).

Sets of diverse and good quality solutions
The problem of producing a set of diverse and good quality solutions, i.e., such that all pairs of solutions satisfy the diversity constraint and the solutions have minimum cost, can be expressed as follows:

Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 9 April 2021 doi:10.20944/preprints202104.0250.v1

Definition 5 (DIVERSESET). Given a dissimilarity matrix D, an integer M and a dissimilarity threshold δ, the problem DIVERSESET(C, D, M, δ) consists in producing a set Z of M solutions of C, of minimum total cost, such that every pair of distinct solutions t, t' ∈ Z satisfies d_D(t, t') ≥ δ.

For a CFN C with n variables, solving DIVERSESET requires to simultaneously decide the value of n·M variables. It can be solved by making M copies of C, with variable sets X^1 to X^M, and adding a DIST constraint between every pair of copies. If the upper bound ⊤ is finite, all its occurrences must be replaced by M·(⊤ − 1) + 1. While very elegant, this approach yields a CFN instance where the number of variables is multiplied by the number of wanted solutions. The WCSP and CPD problems being NP-hard, one can expect that the resulting model will quickly become challenging to solve. We empirically confirmed this on very tiny instances. We therefore decided to solve a relaxed version of DIVERSESET: an iterative algorithm provides a greedy approximation of the problem that preserves most of the required guarantees.

Definition 6 (DIVERSESEQ). The problem DIVERSESEQ(C, D, M, δ) consists in producing a sequence of M solutions where the i-th solution is an optimal solution of C augmented with the constraints DIST(X, t_j, D, δ) for every previously produced solution t_j, j < i.

Solving DIVERSESEQ requires to optimally assign n variables M times, instead of n·M variables simultaneously. Given the NP-hardness of the WCSP, solving DIVERSESEQ may be exponentially faster than DIVERSESET, while still guaranteeing that distance constraints are satisfied, together with a weakened form of optimality, conditional on the solutions found in previous iterations. The solution set is still guaranteed to contain the GMEC (the first solution produced).
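The greedy DIVERSESEQ scheme can be sketched with a brute-force WCSP "solver" on a toy instance (the real approach uses a branch-and-bound CFN solver; the exhaustive enumeration below is only for illustration).

```python
from itertools import product

def greedy_diverse_seq(domains, cost, M, delta, dist):
    """Greedy DIVERSESEQ sketch: at each iteration, return an optimal
    solution at distance >= delta from all previously kept solutions.
    Brute-force enumeration stands in for the CFN branch-and-bound."""
    solutions = []
    for _ in range(M):
        feasible = [t for t in product(*domains)
                    if all(dist(t, s) >= delta for s in solutions)]
        if not feasible:
            break  # fewer than M solutions exist at this diversity level
        solutions.append(min(feasible, key=cost))
    return solutions

# Toy instance: 3 binary variables, cost = number of ones, GMEC is (0,0,0).
hamming = lambda a, b: sum(x != y for x, y in zip(a, b))
lib = greedy_diverse_seq([(0, 1)] * 3, cost=sum, M=3, delta=2, dist=hamming)
print(lib)  # first element is the GMEC; all pairs are >= 2 apart
```

Note how the first solution is always the unconstrained optimum (the GMEC), and each later solution is only conditionally optimal given its predecessors, exactly the weakened guarantee described above.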

Relation with existing work
In the case of Boolean functions (⊤ = 1), the work of [28] considers the optimization of the solution set cardinality M or the diversity threshold δ, using average or minimum dissimilarity. The authors prove that enforcing Arc Consistency on a constraint requiring a sufficient average dissimilarity is polynomial, but NP-complete for a minimum dissimilarity requirement. They evaluate an algorithm for the incremental production of a set maximizing the average dissimilarity. The papers [29] and [30] later addressed the same problems using global constraints and knowledge compilation techniques. Being purely Boolean, these approaches cannot directly capture cost (or energy), which is crucial for CPD. More recently, [31] proposed a Constraint Optimization Problem approach to provide diverse high-quality solutions. Their approach however trades diversity for quality, while diversity is a hard requirement in our case.
The idea of producing diverse solutions has also been explored in the more closely related area of discrete stochastic Graphical Models (such as Markov Random Fields). In the paper of Batra et al. [32], the Lagrangian relaxation of minimum dissimilarity constraints is shown to add only unary cost functions to the model. This approach can be adapted to cost function networks, but a non-zero duality gap remains and ultimately, no guarantee can be offered. This work was extended in [33] using higher-order functions to approximately optimize a trade-off between diversity and quality. More recently, [34] addressed the DIVERSESET problem, but using optimization techniques that either provide no guarantee or are restricted to tractable variants of the WCSP problem, defined by submodular functions [11].
In the end, we observe that none of these approaches simultaneously provides guarantees on quality and diversity. Closest to our target, [35] considered the problem of incrementally producing the set of the M best δ-modes of the joint function.

Definition 7 ([36]). A solution t is said to be a δ-mode iff there exists no better solution than t in the Hamming ball of radius δ centered in t (implying that t is a local minimum).

In [35-38], an exact dynamic programming algorithm, combined with an A* heuristic search and tree-decomposition, was proposed to exactly solve this problem with the Hamming distance. This algorithm relies however on NP-complete lower bounds and is restricted to a fixed variable order, a restriction that is known to often drastically hamper solving efficiency. It however provides a diversity guarantee: indeed, a δ-mode will always be strictly more than δ away from another one, and will be produced by greedily solving DIVERSESEQ.

Theorem 1. For M large enough, every δ-mode of C belongs to the solution set of DIVERSESEQ(C, H, M, δ + 1).

Proof. If a δ-mode t is not in the solution set of DIVERSESEQ(C, H, M, δ + 1), this must be because it is forbidden by a DIST constraint. Consider the iteration i that forbids t for the first time: a solution with a cost lower than the cost of t was produced (otherwise t would have been produced instead), but this solution is strictly less than δ + 1 away from t (since t gets forbidden). This contradicts the fact that t is a δ-mode.
For a sufficiently large M, the sequence Z of DIVERSESEQ(C, H, M, δ + 1) solutions will therefore contain all δ-modes, and possibly some extra solutions. Interestingly, it is not difficult to separate modes from non-modes.

Theorem 2.
1. An assignment t of a CFN C = (X, D, C) is a δ-mode iff it is an optimal solution of the CFN (X, D, C ∪ {DIST(X, t, H, −δ)}).
2. For bounded δ, this problem is in P.

Proof.
1. The function DIST(X, t, H, −δ) restricts X to lie within distance δ of t. If t is an optimal solution of (X, D, C ∪ {DIST(X, t, H, −δ)}), then there is no better assignment than t in the δ-radius Hamming ball and t is a δ-mode (and conversely).
2. For bounded δ, in a CFN with n variables and at most d values in each domain, there are O((nd)^δ) tuples within the Hamming ball: starting from t, we can pick any variable (n choices) and change its value (d choices), δ times. Checking that t is optimal within the ball is therefore polynomial.
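As a sanity check of this complexity argument, δ-modality can be tested by brute-force enumeration of the Hamming ball, in O((nd)^δ) time. The sketch below uses hypothetical `cost` and `domains` objects for a toy CFN; it illustrates the definition, not the toulbar2 implementation:

```python
from itertools import combinations, product

def is_delta_mode(t, domains, cost, delta):
    """Check that no assignment within Hamming distance delta of t
    has a strictly lower cost, i.e., t is a local minimum."""
    n = len(t)
    base = cost(t)
    for k in range(1, delta + 1):
        for positions in combinations(range(n), k):
            # Alternative values for each mutated position.
            alternatives = [[v for v in domains[i] if v != t[i]] for i in positions]
            for values in product(*alternatives):
                s = list(t)
                for i, v in zip(positions, values):
                    s[i] = v
                if cost(tuple(s)) < base:
                    return False
    return True

# Toy CFN with unary costs only: the optimum is (0, 0, 0).
domains = [[0, 1], [0, 1], [0, 1]]
cost = lambda t: sum(t)
print(is_delta_mode((0, 0, 0), domains, cost, 2))  # True: global optimum
print(is_delta_mode((1, 0, 0), domains, cost, 1))  # False: flipping x0 improves
```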

Representing the diversity constraint
The key to guaranteeing quality and diversity of solutions in a cost function network is the dissimilarity cost function DIST. Given a complete assignment t, a dissimilarity D and a threshold δ, we need to concisely encode the global diversity constraint DIST(X, t, D, δ).

Using automata
Given a solution t, a dissimilarity matrix D and a diversity threshold δ > 0, the cost function DIST(·, t, D, δ) needs to be added to the cost function network. Note that the function may involve all the network variables: it is a global cost function, and its representation as one huge, exponential-size tensor is not possible. To encode this function concisely, we exploit the fact that the set of authorized tuples defines a regular language that can be encoded into a finite-state automaton and then decomposed into ternary functions [39,40]. Here, we use a weighted automaton to define the weighted regular language of all tuples with their associated cost. A weighted automaton A = (Σ, Q, ∆, Q_0, F) encoding DIST(X, t, D, δ) can be defined as follows:
• The alphabet Σ is the set of possible values, i.e., the union of the variable domains.
• The set of states Q gathers (δ + 1) · (n + 1) states denoted q_i^d, where i is the number of values read so far and d the accumulated dissimilarity, capped at δ.
• In the initial state, no value of X has been read and the dissimilarity is 0: Q_0 = {q_0^0}.
• The assignment is accepted if it has dissimilarity from t at least the threshold δ, hence the accepting state F = {q_n^δ}.
• For every value r of X_{i+1}, the transition function ∆ ⊆ Q × Σ × Q defines a 0-cost transition from q_i^d to q_{i+1}^{min(δ, d + D(t[i+1], r))}. All other transitions have infinite cost. We write q —v/w→ q' for ∆(q, v, q') = w, i.e., there is a transition from q to q' reading value v with weight w.
This weighted automaton contains O(n · (δ + 1) · d) finite-cost transitions, where d is the maximum domain size. An assignment t' of X is accepted if and only if D(t', t) ≥ δ, and the automaton therefore represents the DIST cost function. An example of a DIST-encoding automaton is given in Figure 3.
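The acceptance condition of the DIST automaton can be simulated with a simple forward pass that tracks the current state q_i^d. The sketch below takes the dissimilarity as a Python function and only illustrates the automaton semantics, not the CFN encoding:

```python
def dist_accepts(t_prime, t, D, delta):
    """Simulate the DIST automaton on assignment t_prime: the state q_i^d
    tracks min(delta, dissimilarity accumulated over the first i values);
    the run is accepting iff it ends in q_n^delta."""
    d = 0  # initial state q_0^0
    for i, v in enumerate(t_prime):
        # 0-cost transition q_i^d -> q_{i+1}^{min(delta, d + D(t[i], v))};
        # all other transitions carry infinite cost.
        d = min(delta, d + D(t[i], v))
    return d == delta  # accepting state q_n^delta

hamming = lambda a, b: 0 if a == b else 1
t = "ACDE"
print(dist_accepts("ACDE", t, hamming, 2))  # False: distance 0 < 2
print(dist_accepts("AGGE", t, hamming, 2))  # True: distance 2 >= 2
```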

Exploiting automaton function decomposition
It is known that the WREGULAR cost function, encoding an automaton A, can be decomposed into a sequence of ternary cost functions [40]. The decomposition is achieved by adding n + 1 state variables Q_0, ..., Q_n and n ternary cost functions w^A_{Q_i, X_{i+1}, Q_{i+1}}, where w^A(q, x, q') = c if and only if there exists a transition from q to q' in A labeled with value x and cost c. Variable Q_0 is restricted to the starting states and variable Q_n to the accepting states. The additional variables and ternary functions are represented in Figure 5. The resulting set of ternary functions is logically equivalent to the original global constraint.
The DIST function satisfies, however, several properties that can be exploited to further improve its space and time efficiency. One can first notice that the set of forbidden solutions does not depend on the order of the variables (the distance measure is the sum of independent variable-wise dissimilarities). Therefore, the order of the variables in the automaton can be chosen freely. We can use the so-called DAC ordering [40]. This order is known to preserve the strength of the lower bounds produced by CFN local consistencies when applied to the decomposition (instead of the initial full DIST function, with its potentially exponential-size table). Then, in the case of the DIST cost function, each state variable has (δ + 1) · (n + 1) values (the number of states in the automaton) and each ternary cost table describes the costs of (δ + 1)² · (n + 1)² · d tuples, where d is the domain size. To speed up the resolution through better soft local consistency enforcing, we exploit the properties of DIST and of the dissimilarity matrix D.

Compressing the encoding
The encoding of a DIST cost function as a sequence of n ternary functions, described in cost tables of size (δ + 1)² · (n + 1)² · d, can be reduced along several lines.
First, for DIST, we know that a state q_i^d can only be reached after i transitions, i.e., the reachable states of variable Q_i are the states in the i-th column of the DIST automaton (see Figure 3). The domain of each variable Q_i can therefore be reduced to the δ + 1 states q_i^d, 0 ≤ d ≤ δ. Furthermore, our semi-metrics are defined by a non-decreasing sum of non-negative elements of D. Therefore, a state q_i^d can reach the accepting state q_n^δ if and only if the maximum dissimilarity md_i that can be achieved from variable i + 1 to variable n is at least the remaining diversity δ − d. All these maximum dissimilarities can be pre-computed in one backward pass over the variables of X, with md_n = 0 and md_i = md_{i+1} + max_{v ∈ D_{i+1}} D(t[i + 1], v). In the Hamming case, the distance can increase by at most 1 per variable, i.e., max_{v,v' ∈ D_{i+1}} H(v, v') = 1, therefore md_i = n − i.
A symmetric argument holds for the starting state q_0^0. These simplifications reduce the ternary cost tables to O((δ + 1)² · d).
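The pre-computation of the maximum reachable dissimilarities md_i amounts to a single backward pass. The sketch below uses hypothetical names and the Hamming distance for illustration:

```python
def max_future_dissimilarity(t, domains, D):
    """Backward pass computing md[i], the largest dissimilarity still
    reachable from position i to n; a state q_i^d can reach the accepting
    state only if d + md[i] >= delta, so all other states can be pruned."""
    n = len(t)
    md = [0] * (n + 1)
    for i in range(n - 1, -1, -1):
        md[i] = md[i + 1] + max(D(t[i], v) for v in domains[i])
    return md

hamming = lambda a, b: 0 if a == b else 1
t = "ACDE"
domains = [list("ACDEFG")] * len(t)
md = max_future_dissimilarity(t, domains, hamming)
print(md)  # [4, 3, 2, 1, 0]: in the Hamming case, md[i] = n - i
```

For example, with δ = 3, the state q_2^0 satisfies 0 + md[2] = 2 < 3 and can be removed from the domain of Q_2.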
For a given dissimilarity matrix D, let #D denote the number of distinct values that appear in D. If variables have domains of maximum size d and ignoring the useless zero matrix, we know that 2 ≤ #D ≤ 1 + d·(d + 1)/2. However, distance matrices are usually more structured. For example, the BLOSUM62 similarity matrix leads to #D = 12 levels.
In the Hamming case, there are #H = 2 dissimilarity levels. This means that a state q_i^d can only reach the states q_{i+1}^d or q_{i+1}^{d+1}. This sparsity of the transition matrix can be exploited, provided it is made visible. This can be achieved using extended variants of the dual and hidden encodings of constraint networks [12,13]. These transformations, detailed hereafter, are known to preserve the set of solutions and their costs.
In constraint networks, the dual representation of a constraint network X = (X, D, C) is a new network X' = (X', D', C') with:

Figure 5. Representation of the ternary decomposition and of its dual and hidden representations.

• One variable X_S per constraint c_S ∈ C. The domain of X_S is the set of tuples t ∈ D^S that satisfy the constraint c_S.
• For each pair of constraints c_S, c_S' ∈ C with overlapping scopes S ∩ S' ≠ ∅, a constraint c_{X_S, X_S'} ensures that the tuples assigned to X_S and X_S' are compatible, i.e., that they take the same values on the overlapping variables.

We apply this transformation to the reduced w^A_{Q_i, X_{i+1}, Q_{i+1}} functions (see Figure 5). The dual variable of w^A_{Q_i, X_{i+1}, Q_{i+1}} is a variable X^A_i whose domain contains all pairs (q, q') ∈ Q_i × Q_{i+1} such that there is a transition from q to q' in A. In the Hamming case, the variable X^A_i has at most 2δ + 1 values. It is connected to X_{i+1} by a pairwise function derived from the weighted transition function ∆ of the automaton A.
In this new dual representation, for every pair of consecutive dual variables X^A_{i−1} and X^A_i, we add a function on these two variables ensuring that the arriving state of X^A_{i−1} is the starting state of X^A_i. In the worst case, this function has size O(#D² · δ²) (O(δ²) in the Hamming case). Only n extra variables are required.
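In the Hamming case, the claim that a dual variable X^A_i has at most 2δ + 1 values can be checked by enumerating the feasible state pairs; the sketch below is illustrative only:

```python
def hamming_dual_domain(delta):
    """Enumerate the state pairs (d, d') kept in the domain of a dual
    variable X^A_i for the Hamming distance: the accumulated count either
    stays unchanged (value agrees with t, or the count is saturated at
    delta) or increases by one (value differs from t)."""
    pairs = [(d, d) for d in range(delta + 1)]   # unchanged count
    pairs += [(d, d + 1) for d in range(delta)]  # count increases by one
    return sorted(pairs)

delta = 4
pairs = hamming_dual_domain(delta)
print(len(pairs))  # 9, i.e., 2 * delta + 1
```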
The hidden representation of a constraint network X = (X, D, C) is a network X' = (X', D', C') with:

• All the variables in X, plus the variables X_S from the dual network (with their associated domains).
• For each dual variable X_S and each X_i ∈ S, the set of constraints C' contains a function involving X_i and X_S that channels the value of X_i with the tuples of X_S.

As before, this transformation is applied to the reduced w^A_{Q_i, X_{i+1}, Q_{i+1}} functions only (see Figure 5). In this new hidden representation, we keep the variables Q_i and create two pairwise functions involving each Q_i and, respectively, X^A_{i−1} and X^A_i. These functions ensure that the state value of Q_i is consistent with the arriving state of the transition represented in X^A_{i−1} and with the starting state of X^A_i. In the worst case, these functions have size O(#D · δ²) (O(δ²) in the Hamming case).
The dedicated dual and hidden representations require the description of

Greedy DiverseSeq
The task at hand is the resolution of DIVERSESET(C, D, M, δ), i.e., the generation of a set of M solutions of minimum cost that satisfy a minimum pairwise diversity constraint. The exact computation being too expensive, we instead tackle a greedy computation of DIVERSESEQ(C, D, M, δ), a sequence of diverse good solutions that approximates DIVERSESET. The DIVERSESEQ computation is iterative:

1. The CFN C is solved using branch-and-bound while maintaining soft local consistencies [23].
2. If a solution t is found, it is added to the ongoing solution sequence Z.
3. If M solutions have been produced, the algorithm stops.
4. Otherwise, the cost function DIST(X, t, D, δ) is added to the previously solved problem.
5. We loop and solve the problem again (Step 1).

At Step 2, if no solution exists, the sequence of solutions Z provably cannot be extended to length M and the problem has no solution (but a shorter sequence has been produced). This basic schema has been improved in three different ways (see Algorithm 1):

Incrementality Since the problems solved are increasingly constrained, all the equivalence-preserving transformations and prunings that were applied to enforce local consistencies at iteration i − 1 remain valid in the following iterations. Instead of restarting from a problem C = (X, D, C ∪ ⋃_{1≤j<i} {DIST(X, Z[j], D, δ)}), we reuse the problem solved at iteration i − 1 after it has been made locally consistent, add the DIST(X, Z[i − 1], D, δ) constraint and re-enforce local consistencies. Similarly to incremental SAT solvers, the adaptive variable ordering heuristics trained at iteration i − 1 are reused at iteration i.

Lower bound
Since the problems solved are increasingly constrained, we know that the optimal cost oc_i obtained at iteration i cannot be lower than the optimal cost oc_{i−1} reported at iteration i − 1. When large plateaus are present in the energy landscape, this allows stopping the search as soon as a solution of cost oc_{i−1} is reached, avoiding a useless repeated proof of optimality.
Upper bound prediction Even if there are no plateaus in the energy landscape, there may be large regions with similar variations in energy. In this case, the difference in energy between oc_{i−1} and oc_i will remain similar for several iterations. Let ∆_i^h = max_{max(2, i−h) ≤ j < i} (oc_j − oc_{j−1}) be the maximum variation observed over the last h iterations (we used h = 5). At iteration i, we can first solve the problem with a temporary upper bound k' = min(k, oc_{i−1} + 2·∆_i^h) that should preserve a solution. If k' < k, this leads to increased determinism, additional pruning and possibly substantial speedups.

Each of these three improvements has the capacity to offer exponential time savings, and all are used in the experiments presented in the next sections.
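Putting the loop and the predictive upper bound together, the greedy DIVERSESEQ scheme can be sketched as follows. The exhaustive `solve` below is a toy stand-in for the branch-and-bound of toulbar2, and the re-solve on a failed prediction is an assumption about how a too-tight bound would be handled, not the solver's actual behavior:

```python
from itertools import product

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def solve(domains, cost, forbidden, delta, ub):
    """Toy exhaustive stand-in for branch-and-bound: cheapest assignment
    of cost < ub lying at distance >= delta from all previous solutions."""
    best, best_cost = None, ub
    for t in product(*domains):
        if any(hamming(t, z) < delta for z in forbidden):
            continue
        c = cost(t)
        if c < best_cost:
            best, best_cost = t, c
    return best, best_cost

def diverse_seq(domains, cost, M, delta, h=5):
    Z, costs = [], []
    for _ in range(M):
        ub = float("inf")
        if len(costs) >= 2:
            # Predictive upper bound: last optimum plus twice the largest
            # recent cost jump (+1 so an equal-cost optimum survives).
            start = max(1, len(costs) - h)
            jump = max(costs[j] - costs[j - 1] for j in range(start, len(costs)))
            ub = costs[-1] + 2 * jump + 1
        t, c = solve(domains, cost, Z, delta, ub)
        if t is None and ub != float("inf"):
            # Prediction was too tight: re-solve without it.
            t, c = solve(domains, cost, Z, delta, float("inf"))
        if t is None:
            break  # Z cannot be extended to length M
        Z.append(t)
        costs.append(c)
    return Z, costs

domains = [[0, 1, 2]] * 4
Z, costs = diverse_seq(domains, cost=lambda t: sum(t), M=3, delta=2)
print(costs)  # [0, 2, 2]: costs never decrease across iterations
```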

Results
We implemented the iterative approach described above, in its direct (ternary) decomposition, hidden and dual representations, in the CFN open-source solver toulbar2 and experimented with it on various CFNs representing real Bayesian Networks [1]. All three decompositions offered comparable efficiency but, empirically and as expected, the dual encoding was almost systematically the most efficient. It is now the default diversity encoding in toulbar2. All toulbar2 preprocessing algorithms dedicated to exact optimization that do not preserve suboptimal solutions were deactivated at the root node (variable elimination, dead-end elimination, variable merging). We chose to enforce strong virtual arc consistency (flag -A in toulbar2). The computational cost of VAC, although polynomial, is high, but it is amortized over the M resolutions. During tree search, the default existential directional arc consistency (EDAC) was used. All experiments were performed on one core of a Xeon Gold 6140 CPU at 2.30 GHz.
Following our main motivation for protein design, we extracted two sets of prepared protein backbones for full redesign from the benchmark sets built by [41] and [42], with the aim of checking whether, as expected, diverse libraries can improve the overall design process. From the benchmark of [41] (monomeric proteins of fewer than 100 residues, with an X-ray structure resolved below 2 Å, with no missing or non-standard residues and no ligand), we selected the 20 proteins that had required the least CPU-time to solve, as indicated in the Excel sheet provided in the supplementary information of the paper. The harder instances from [42] correspond to proteins with diverse secondary structure compositions and fold classes. From these, we selected the 17 instances that required less than 24 hours of CPU-time for the full-redesign GMEC to be computed by toulbar2. These instances are listed in Table 1.

Table 1. Benchmark instances (PDB code, number of residues, maximum number of rotamers):

1aho 56 378   3i8z 50 354   1ten 81 392   2fjz 53 324
2cg7 82 380   1ucs 60 342   1b9w 78 386   3rdy 65 396
2bwf 69 347   2gkt 45 357   2erw 47 446   2evb 68 323
1f94 53 386   3vdj 67 391   2o37 60 386   2pne 77 401
2fht 64 346   2o9s 48 327   1hyp 66 385   1bxy 52 384
3f04 87 356   2pst 61 357   1ctf 68 349   3fym 70 348
1uln 66 367   1czp 76 373   3gqs 67
Full redesign was performed on each protein structure, and CFN instances were generated using the Dunbrack library [14] and Rosetta ref2015 score function [43]. The resulting networks have from 44 to 87 rotamer variables, and maximum domain sizes range from 294 to 446 rotamers. The number of variables is doubled after sequence variables are added.
Predictive bounding contribution For each problem from [41], M = 10 solutions were generated with diversity threshold δ = 10, with and without predictive bounding. The worst-case CPU-time without predictive bounding was 32 minutes; it was reduced to 17 minutes with predictive bounding. The average computation time was 201 s per problem. This shows that predictive bounding provides a simple and efficient boost, and that real CPD instances can be solved in reasonable time even when relatively large diversity requirements are used.

Diversity improves prediction quality
For all instances, sets of M = 10 solutions were generated with diversity threshold δ ranging from 1 to 15. For δ = 1, the set of solutions produced is just the set of the 10 best (minimum energy) sequences.
These CPD problems use real protein backbones, determined experimentally. A native sequence exists for each of these backbones; it is therefore possible to measure the improvement diversity brings in terms of recovering native sequences, known to fold and to be functional. Two measures are often used to assess computational protein design methods. The Native Sequence Recovery (NSR) is the percentage of amino acid residues in the designed protein that are identical to the amino acid residues in the native sequence that folds on the target backbone. The NSR can be enhanced by taking into account similarity scores between amino acid types. Such scores are provided by similarity matrices such as BLOSUM62 [27]. The Native Sequence Similarity Recovery (NSSR) is the fraction of positions where the designed and native sequences have a positive similarity score. NSR and NSSR measure how much the redesigned protein resembles the natural one in terms of sequence. If solution diversity helps, the maximum NSR/NSSR over the 10 sequences should improve when δ is large compared to when δ = 1, as long as the costs remain close to the optimum. A solution cost too far from the optimum, which could result from a diversity threshold set too high, would indicate a poor-quality solution. Even with δ = 15, the maximum observed energy degradation remained limited.

Figure 6. Comparison of the best NSR value among the 10 best sequences (δ = 1) with the best NSR value of sequences predicted with diversity δ = 2 . . . 15, for all protein structures. Sequences are ranked by increasing best NSR among enumerated sequences.

For each protein and each diversity threshold δ = 2 . . . 15, we compared the best NSR (resp. NSSR) that could be obtained with δ to the best obtained with diversity threshold 1, i.e., the simple enumeration of the 10 best sequences. Results are plotted in Figure 6 (resp. Figure 7).
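The two measures can be sketched as follows; the similarity values below are hypothetical stand-ins, not actual BLOSUM62 entries:

```python
def nsr(designed, native):
    """Native Sequence Recovery: fraction of identical positions."""
    return sum(a == b for a, b in zip(designed, native)) / len(native)

def nssr(designed, native, sim):
    """Native Sequence Similarity Recovery: fraction of positions whose
    designed/native pair has a positive similarity score."""
    return sum(sim[a, b] > 0 for a, b in zip(designed, native)) / len(native)

# Hypothetical similarity excerpt (NOT actual BLOSUM62 values).
sim = {("K", "K"): 5, ("R", "R"): 5, ("D", "D"): 6,
       ("K", "R"): 2, ("R", "K"): 2,
       ("K", "D"): -1, ("D", "K"): -1,
       ("R", "D"): -2, ("D", "R"): -2}

native, designed = "KRDK", "KKDD"
print(nsr(designed, native))        # 0.5 : 2 of 4 positions identical
print(nssr(designed, native, sim))  # 0.75: the K/R pair scores positive
```

For a library, the reported value is the maximum of these scores over the M produced sequences.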
While somewhat noisy, they show a clear general increase in the best NSR (resp. NSSR) when diverse sequences are produced, even with small δ. To validate the statistical significance of the diversity improvement of sequence quality, p-values for a unilateral Wilcoxon signed-rank test comparing the sample of best NSR (resp. NSSR) for each δ = 2 . . . 15 with δ = 1 were computed. They are shown in Table 2 and confirm the improvement brought by increasingly diverse libraries.
The improvements are most clearly visible when comparing the absolute improvement in NSR (and NSSR) obtained by the best sequences produced in a library with a guaranteed diversity of 15 versus a diversity of just 1, as illustrated in Figure 8. On most backbones (X-axis), the increased diversity yields a clear improvement in the NSR (Y-axis), with the largest absolute improvement exceeding 15% in NSR. For a small fraction of backbones, there is no improvement at all. These are backbones for which the GMEC actually provides the best NSR in both the 1-diverse and the 15-diverse cases. Then, an even smaller fraction of backbones shows an actual decrease in NSR: one close-to-GMEC solution did better than any of the 15-diverse sequences. The degradation here is very limited and likely corresponds to substitutions by similar amino acids. This is confirmed by the NSSR curve, which takes amino acid properties into account through the underlying BLOSUM matrix. Here, even fewer cases show a degradation in NSSR.
Diversity also has the advantage that it is more likely to provide improved sequences when the predicted 1-diverse sequences are poor. Indeed, with an initial sequence of NSR equal to r, a random mutation moves us away from the native sequence with probability r (we mutate a correct amino acid); otherwise (we mutate a wrong amino acid), it either leaves us with a wrong amino acid again (in 18/19 cases, leaving the NSR unchanged) or gets us closer to the native sequence (with probability 1/19). On average, a mutation therefore decreases the number of correct positions by r − (1 − r)/19, a quantity that increases with r and is positive as soon as the sequence has an NSR higher than 5% (1/20). Among the ten backbones with the highest improvement in NSR, nine had a 1-diverse NSR below the average 1-diverse NSR. Conversely, only 50% of the ten least-improved backbones had a below-average 1-diverse NSR. While these improvements underline the approximate nature of the energy function, showing that it is often worth exploring the sequence space beyond the GMEC, they also confirm that energy, on average, does guide us towards native sequences: instead of degrading the NSR as soon as r > 1/20, energy optimization pushes the introduced mutations to improve the NSR in most cases, even with 15 introduced mutations and initial NSRs ranging from 20 to 60%.

Figure 8. The curves above give the absolute change in NSR (left) and NSSR (right) between the best 15-diverse and the best 1-diverse sequences found for each backbone. Backbones (on the X-axis) are ordered in increasing order of the corresponding measure. On the left, in gray, the bar-plot shows the corresponding 1-diverse NSR compared to the average 1-diverse NSR over all backbones. One can observe that the most improved NSRs, on the right, mostly appear on weak (below-average) 1-diverse NSRs.
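The expected effect of a random mutation described above can be checked numerically (illustrative only):

```python
# With an initial NSR of r, a random mutation breaks a correct position
# with probability r and fixes a wrong one with probability (1 - r) / 19.
def expected_loss(r):
    """Expected decrease in the number of correct positions."""
    return r - (1 - r) / 19

for r in (0.05, 0.2, 0.6):
    print(f"r = {r}: expected loss {expected_loss(r):+.3f}")
# The loss vanishes exactly at r = 1/20 and grows with r.
```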
Each dissimilarity constraint adds n extra variables to the network (with the dual representation). These variables' domain sizes increase with the diversity threshold δ, producing increasingly large CFN instances to solve. Computation times are plotted in Figure 9. As expected, a threshold increase leads to an exponential increase in computation time. The small set of points in the top-right corner of the plot corresponds to the protein structure 3FYM. This protein, with 70 amino acids, is not the largest in our benchmark set, a reminder that size is not necessarily the best predictor of empirical computational cost in NP-hard problem solving. On this backbone, for high diversity thresholds, the 24-hour computation time limit was reached and fewer than 10 sequences were produced.
Given that the optimized function is an approximation of the real, intractable energy, solving the problem to optimality might seem excessive. The optimality requirement used in the previous experiments can be trivially relaxed to a relative or absolute approximation guarantee using artificially tightened pruning rules, as originally proposed in [44] for the Weighted A* algorithm. This pruning mechanism is already implemented in the toulbar2 solver (using the -rgap and -agap flags, respectively).
For diversity thresholds δ = 2 . . . 15, we generated sets of 10 suboptimal diverse sequences that satisfy the diversity constraints while allowing a 3 kcal/mol energy gap to the optimum. Our optimizations are still provable, but the optimality guarantee is now reduced to a bounded suboptimality guarantee. Empirically, the maximum energy degradation we observed, with respect to the global minimum energy, over the 1-diverse sequences produced never exceeded 5.65 kcal/mol (with an average energy difference of 3.86 kcal/mol). This is only slightly more than the 4.3 kcal/mol worst degradation (average 2.1 kcal/mol) observed when exact optima are used.
We compared these samples of suboptimal sequences to the set of the 10 exact best sequences. Results for NSR and NSSR are shown in Figures 10 and 11, respectively, and the corresponding p-values are displayed in Table 2 (unilateral Wilcoxon signed-rank test). For dissimilarity thresholds δ ≥ 6, it is clear that the sets of diverse suboptimal sequences have better quality than the 10 best enumerated sequences. Moreover, as shown in Figure 12, on harder instances, suboptimal diverse resolution becomes faster than exact enumeration.
So, when predicting a library of sequences on a hard instance, it seems empirically wise to generate suboptimal diverse sequences instead of enumerating minimum-energy sequences without diversity: there is a higher chance of predicting better sequences in practice, and faster.

Conclusions
Producing a library of diverse solutions is a very common requirement when an approximate or learned model is used for optimal decoding. In this paper, we show that it is possible to produce sequences of solutions with guaranteed diversity on realistic full-redesign Computational Protein Design instances, using an incremental provable CFN approach that directly tackles a series of NP-complete decision problems, with diversity constraints represented as weighted automata, densely encoded in a dedicated dual encoding, together with predictive bounding. This guarantee is obtained on dense problems with non-permutated-submodular functions, while also guaranteeing that each new solution produced is the best given the previously identified solutions.
We also showed that the stream of diverse solutions our algorithm produces can be filtered, each solution being efficiently identified as a δ-mode or not. δ-modes represent local minima, each defining its own basin in the protein sequence energy landscape. Beyond their direct use for design, the guaranteed diversity provided by our algorithm could also be of interest for more systematic analyses of such energy landscapes.
On real protein design problems, we observe that small and large diversity requirements do improve the quality of sequence libraries when native proteins are fully redesigned. Moreover, large diversity requirements on suboptimal sequences also improve the quality of sequence libraries, compared to a simple enumeration of the minimum energy sequences. In the context of optimizing an approximate or learned function, the requirement for an optimal cost solution may be considered as exaggerated. Our guaranteed suboptimal resolution is useful, given that even computationally expensive approaches with asymptotic convergence results such as Simulated Annealing may fail with unbounded error [41].
Two directions could extend this work. Beyond the language of permutated submodular functions, the other important tractable class of CFNs is the class of CFNs with a graph of bounded tree-width. This parameter is exploited in several protein packing [45] and design [46] algorithms and is also exploited in dedicated branch-and-bound algorithms, also implemented in toulbar2 [47,48]. These tree-search algorithms are able to trade space for time and are empirically usable on problems whose tree-width is too large for pure dynamic programming, mostly because of its space complexity (in O(d^w), where d is the domain size and w is the width of the tree decomposition used). On such problems, it would be desirable to show that the decomposed ternary or binary functions we use to encode DIST can be arranged in such a way that the tree-width is preserved or, more likely, not excessively increased. This would enable the efficient production of diverse solutions for otherwise unsolved structured instances. Another direction would be to identify a relaxed version of the NP-hard DIVmin constraint that would produce tighter bounds than those obtained here from the conjunction of DIST constraints. Despite non-negligible efforts in this direction using relaxed Multi-Valued Decision Diagrams, we have been unable to reach an empirically convincing result.

Figure 10. Comparison of the best NSR value among the 10 best sequences (δ = 1) with the best NSR value of sequences predicted with diversity δ = 2 . . . 15 and an allowed optimality gap of 3 kcal/mol, for all protein structures. Sequences are ranked by increasing best NSR among enumerated sequences.

Figure 11. Comparison of the best NSSR value among the 10 best sequences (δ = 1) with the best NSSR value of sequences predicted with diversity δ = 2 . . . 15 and an allowed optimality gap of 3 kcal/mol, for all protein structures. Sequences are ranked by increasing best NSSR among enumerated sequences.

Data Availability Statement: None at this point. The benchmarks we use are available on existing repositories, and the code of toulbar2, which includes our contributions, is available on GitHub at https://github.com/toulbar2/toulbar2 under an OSI MIT license.