A similar reasoning allows one to analytically compute site-specific distributions (profiles) of amino acids at different positions in a protein family. If we approximate $P(A_1 \cdots A_L)$ as the product of single-site distributions, $P(A_1 \cdots A_L) \approx \prod_i P_i(A_i)$, we can compute $P_i(A_i)$ as the distributions that maximize the entropy in sequence space for a given value of the stability, $\Delta G$. To simplify this computation, my coworkers and I adopted the hydrophobic approximation, which consists in approximating the contact interaction parameters with their main spectral component, which is related to hydrophobicity [64]:

$$U(a,b) \approx \epsilon\, h(a)\, h(b),$$

where $\epsilon < 0$ and $h(a)$ is correlated with several empirical hydrophobicity scales [65]. In this way, the energy transforms into the quadratic form $E = \epsilon \sum_{ij} C_{ij} h_i h_j$, with $h_i = h(A_i)$, and we can analytically determine the sequence that minimizes the energy for a fixed value of $\sum_i h_i^2$ and fixed average hydrophobicity, constraints imposed in order to limit the free energy of the misfolded ensemble. This is the sequence whose hydrophobicity profile, $h_i$ (a sequence signature), is proportional to the effective connectivity (EC), $c_i$ [66], a structural signature that is in turn strongly correlated with the principal eigenvector of the contact matrix [65]. The condition that the stability of the protein is fixed can then be replaced by the simpler condition that the average hydrophobicity at each site is linearly related to the EC, $\overline{h_i} = \sum_a h(a) P_i(a) = \alpha c_i + b$. The distribution $P_i(a)$ can then be computed as the distribution of maximal entropy subject to the constraint on its mean value. The result is a Boltzmann-like distribution, $P_i(a) \propto \exp(-\beta_i h(a))$. In the absence of selection, $P_i(a)$ would be the distribution given by the mutational process. Therefore, $P_i(a)$ is the distribution with minimum Kullback–Leibler divergence from the mutational distribution, $P_{\text{mut}}(a)$, subject to the same constraint, i.e., $P_i(a) \propto P_{\text{mut}}(a) \exp(-\beta_i h(a))$, where the selection coefficient, $\beta_i$, expresses the strength of natural selection at each position (the larger $|\beta_i|$, the more the distribution deviates from the one induced by mutation), and it can be determined by imposing the constraint $\sum_a h(a) P_i(a) = \alpha c_i + b$.

We have verified that this distribution is in very good agreement with the site-specific distributions obtained through simulated evolution with stability constraints [36,41], and that it is in good agreement with the distribution obtained by aligning sites of proteins with a known structure that have similar values of effective connectivity [66]. For a given protein family, the maximum likelihood fit of the observed profile, $f_i(a)$, to the above equation allows one to determine the 21 free parameters, namely the 20 values of $P_{\text{mut}}(a)$ (one of which is fixed by the normalization condition), $\alpha$ and $b$, and to compute the exponents, $\beta_i$. Note that $\beta_i$ depends on the mutational distribution, $P_{\text{mut}}(a)$. For instance, if mutations favor hydrophobic amino acids, selection will be stronger at exposed positions, where selection favors hydrophilic residues. Conversely, if mutations favor hydrophilic amino acids, selection will be stronger at bulk positions, where hydrophobic residues are preferred [67].
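As an illustration of how a selection coefficient can be obtained from the mean-hydrophobicity constraint, the following minimal sketch solves for $\beta_i$ by bisection, using the fact that the mean of $h$ under $P_{\text{mut}}(a)\exp(-\beta h(a))$ decreases monotonically with $\beta$. The four-letter alphabet, the hydrophobicity values, the uniform mutational distribution, and the values of `alpha`, `offset` (the constant $b$) and the connectivities are all hypothetical placeholders chosen for readability, not fitted quantities.

```python
import math

def site_distribution(h, p_mut, beta):
    """Return P(a) proportional to p_mut(a) * exp(-beta * h(a)), normalized."""
    weights = [p * math.exp(-beta * x) for p, x in zip(p_mut, h)]
    total = sum(weights)
    return [w / total for w in weights]

def mean_hydrophobicity(h, p_mut, beta):
    """Mean of h under the site distribution with selection coefficient beta."""
    dist = site_distribution(h, p_mut, beta)
    return sum(x * p for x, p in zip(h, dist))

def solve_beta(h, p_mut, target, lo=-50.0, hi=50.0, tol=1e-10):
    """Find beta such that the mean hydrophobicity equals target.

    The mean is strictly decreasing in beta (its derivative is minus the
    variance of h), so plain bisection on [lo, hi] is sufficient.
    """
    if not (mean_hydrophobicity(h, p_mut, hi) < target
            < mean_hydrophobicity(h, p_mut, lo)):
        raise ValueError("target mean is outside the attainable range")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean_hydrophobicity(h, p_mut, mid) > target:
            lo = mid  # mean still too high: beta must be larger
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Toy example (all numbers are illustrative assumptions).
h_scale = [-1.0, -0.3, 0.4, 1.0]   # hypothetical hydrophobicity values h(a)
p_mut = [0.25, 0.25, 0.25, 0.25]   # hypothetical mutational distribution
alpha, offset = 0.4, -0.3          # hypothetical alpha and b
connectivity = [0.5, 1.0, 1.5]     # hypothetical effective connectivities c_i

for c_i in connectivity:
    target = alpha * c_i + offset
    beta_i = solve_beta(h_scale, p_mut, target)
    print(f"c_i={c_i:.1f}  target <h>={target:+.2f}  beta_i={beta_i:+.4f}")
```

With this sign convention, a site whose target mean hydrophobicity lies above the mutational mean gets a negative $\beta_i$, and a site whose target equals the mutational mean gets $\beta_i \approx 0$, i.e., no selection.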

#### 7.1. Relationship between Chain Length and Positive Design

Surface residues form fewer contacts than bulk residues. For globular proteins, the surface-to-volume ratio decreases with chain length as $L^{-1/3}$. Therefore, longer proteins tend to have more contacts per residue, $N_c/L \approx c\,(1 - bL^{-1/3})$ (see Figure 2A), and they can more easily compensate for the loss of conformational entropy upon folding, which is proportional to $LS_U$. This observation led us to predict that proteins with a larger number of contacts per residue, and in particular longer proteins, need to optimize their native contacts less in order to achieve the same level of stability. If only the unfolded ensemble is thermodynamically relevant, as for proteins that fold with two-state thermodynamics, it holds that $\Delta G/L \approx \frac{1}{L}\sum_{i<j} C_{ij}^{\text{nat}} U(A_i,A_j) - TS_U = \langle U(A_i,A_j) | C_{ij}^{\text{nat}} \rangle N_c^{\text{nat}}/L - TS_U$, where $\langle U(A_i,A_j) | C_{ij}^{\text{nat}} \rangle = \sum_{i<j} C_{ij}^{\text{nat}} U(A_i,A_j) / \sum_{i<j} C_{ij}^{\text{nat}}$ is the mean energy of native contacts and $N_c^{\text{nat}} = \sum_{i<j} C_{ij}^{\text{nat}}$ is the number of native contacts.
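The scaling argument can be made concrete with a short numerical sketch. It evaluates $N_c/L \approx c\,(1 - bL^{-1/3})$ for a few chain lengths and, assuming that the native energy per residue needed to reach a given stability is held fixed, computes the mean energy per native contact that this requires. The values of `c`, `b` and `native_energy_per_residue` below are arbitrary placeholders, not the fitted values behind Figure 2.

```python
def contacts_per_residue(length, c=4.0, b=1.2):
    """N_c / L from the surface-to-volume scaling c * (1 - b * L**(-1/3))."""
    return c * (1.0 - b * length ** (-1.0 / 3.0))

def mean_contact_energy(length, native_energy_per_residue=-2.0, c=4.0, b=1.2):
    """Mean energy per native contact needed for a fixed native energy per residue.

    Since (native energy)/L = <U> * N_c / L, a fixed left-hand side implies
    that <U> scales as the inverse of the number of contacts per residue.
    """
    return native_energy_per_residue / contacts_per_residue(length, c, b)

for length in (50, 100, 200, 500, 1000):
    print(f"L={length:5d}  N_c/L={contacts_per_residue(length):.3f}  "
          f"<U>={mean_contact_energy(length):.3f}")
```

In this toy setting the number of contacts per residue grows with $L$ while the absolute value of the required mean contact energy shrinks, which is the qualitative trend described in the text.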

**Figure 2.** The number of contacts per residue, $N_c/L$, increases with chain length (**A**), but the mean hydrophobicity reaches a maximum and then decreases for very long proteins (**B**). The predicted native energy per contact (**C**) and the Z score of the native energy (**D**) increase with the number of contacts per residue, $N_c/L$, i.e., native contacts become weaker and less optimized for more compact and longer proteins. The same data as in [68] are used, consisting of 4,528 non-redundant proteins with a known structure.


As we have seen above, when the physical temperature is low or $\Delta G$ is very negative, so that the fitness, $f$, is close to saturation, there is a neutral evolutionary regime in which we expect that proteins achieve only the marginal stability that allows their functioning. In this regime, we expect that the absolute value of $\langle U(A_i,A_j) | C_{ij}^{\text{nat}} \rangle$ decreases with $N_c/L$ or, equivalently, with chain length; in other words, individual native contacts are expected to be weaker for longer proteins. This prediction has been verified for a representative set of proteins in the PDB [68]; see Figure 2C. Not only the absolute value of the mean native contact energy, but also minus the Z score of native interactions with respect to all possible pairwise interactions, decreases with $N_c/L$, i.e., native interactions are less optimized; see Figure 2D. Conversely, as we saw above, negative design becomes more demanding for longer proteins, since the freezing temperature of the misfolded ensemble increases with protein length. Consistently, we find that the average hydrophobicity, $\langle h \rangle$, first increases with chain length, since the proportion of bulk relative to surface residues increases, but then it reaches a maximum and decreases, i.e., very long proteins tend to be less hydrophobic [68], which has the effect of reducing the stability of the misfolded ensemble (see Figure 2B).
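For readers who want to reproduce this kind of comparison, a Z score of native interactions can be sketched as follows. The definition used here, the mean native contact energy minus the mean over all residue pairs, divided by the standard deviation over all pairs, is an assumption made for illustration; the exact normalization used in [68] is not specified in this section, and the energies below are made-up numbers.

```python
import statistics

def native_z_score(native_energies, all_pair_energies):
    """Z score of the mean native contact energy against all pairwise energies.

    Assumed definition: (mean over native contacts - mean over all pairs)
    divided by the population standard deviation over all pairs.
    More negative values mean more strongly optimized native contacts.
    """
    mean_all = statistics.mean(all_pair_energies)
    std_all = statistics.pstdev(all_pair_energies)
    if std_all == 0.0:
        raise ValueError("all pairwise energies are identical")
    return (statistics.mean(native_energies) - mean_all) / std_all

# Hypothetical energies U(A_i, A_j) for every residue pair of a toy sequence.
all_pairs = [-2.0, -1.5, -0.5, 0.0, 0.5, 1.0, 1.0, 1.5]
strong_native = [-2.0, -1.5]   # well-optimized native contacts
weak_native = [-0.5, 0.0]      # weaker, less optimized native contacts

print(native_z_score(strong_native, all_pairs))
print(native_z_score(weak_native, all_pairs))
```

Under this definition, weaker native contacts give a Z score closer to zero, so a Z score that increases with $N_c/L$ corresponds to less optimized native interactions.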