Sdphound, a Mutual Information-based Method to Investigate Specificity-determining Positions

Figure 1. Snapshot of the Excel worksheet created by SDPhound in the case of a single position run. Results of the application of the algorithm are reported in various forms, one of them is an Excel Worksheet that contains all the relevant information related to the run. Best ranking positions and run parameters are shown as well as the estimation of the conditional probability of belonging to a speci-ficity class given that any specific symbol, amino acid or " pigeonhole " , is found at the current position. Conditional probability is estimated from the alignment itself in the frequentist approximation. In case where the identity substitution matrix is used, a " −1.0 " can appear in the conditional probability cells, indicating that the specific amino acid is not present in the alignment and therefore no information can be directly inferred for that mutation. Two typical examples are shown in Figure 1 and 2.


Introduction
Considerable efforts in current biophysical research are devoted to identifying viable procedures to engineer mutations in a protein so as to manipulate its properties in desired directions. Most available methods rely on the assumption that the functional, or structural, specificities among homologs depend on relatively few crucial residues that are conserved among proteins sharing the same feature. The problem thus becomes recognizing such residues. The combinatorial number of possible mutations even in relatively small proteins often makes purely experimental approaches, such as random mutagenesis, not affordable and it compels to create reliable and efficient computational tools to assist experiments by predicting which residues are more likely to affect the desired property.
Several in silico approaches for predicting sites with some functional importance in proteins are available [1]. Some need the sequence and some need the knowledge of both sequence and structure. Of particular relevance in this context is a family of methods based on information theory, that build upon the Maximum Likelihood Estimator scheme [2]. These techniques exploit the growing amount of protein sequence data available for a wide variety of organisms. The general strategy consists in performing a statistical analysis of a multiple sequence alignment of proteins of a certain family and in relying on the assumption that since mutations at SDPs change the function of the protein, they are generally conserved between proteins with the same function, but tend to be distinct for proteins with different functions [3]. The correlation between the presence of a residue at a given position in the alignment and the inclusion of the protein in a class, identified by a specific function or quaternary structure, is evaluated through the joint probability P p (α, i) for the event "amino acid α occurs at position p and the protein belongs to the i th class". The figure of merit in the analysis is provided by the average mutual information: The sum over i spans the number of specificity classes and α covers the amino acid set. P p (α) is the probability to find residue α at position p and P (i) the probability that a sequence belongs to the i th class. The definition above establishes a measure of the specificity content of a position in the alignment: larger values of I p are expected to indicate more relevant positions in identifying a given subfamily. Unfortunately, actual values of the probabilities in eq. (1) are not known. They can only be estimated and the main difference among mutual information-based methods, lies in the choice of the most appropriate estimator for the joint probability.
In this work, we describe a protocol, called SDPhound, to identify SDPs and analyze the physical characteristics that may be responsible for their role. The core of the procedure is a prescription for constructing an estimator of P p (α, i) that is rather general and rigorous in its probabilistic interpretation. Once this estimator is available, the specificity content of the different positions can be ranked via the mutual information. Although the formal structure of the estimator is fixed, variations on the specific functions used in it make it possible to implement a set of steps that, appropriately combined, allow to first screen and then characterize putative SDPs. The steps will be described in detail in sections 2. and 3., a preliminary summary is as follows. The estimator of P p (α, i) is based on the frequency of appearance of a residue at a given position in a given alignment. We combine this basic ingredient with different smoothing functions that can dress the estimated probability to account for non relevant statistical fluctuations. A reasonable stability of positions ranking with respect to the choice of the smoothing functions is a good indicator of the robustness of the predictions with respect to, for example, bias due to the finite size of the aligned set. The structure of the estimator can also be exploited, together with the concept of substitution classes, to investigate the physical and/or chemical factors responsible for specificity both by an a priori classification according to some predefined properties and by an automatically generated partitioning that preserves the mutual information content. The procedure can also be generalized to describe possible pairwise correlation effects among SDPs. The performance of SDPhound is assessed in three applications, as illustrated in sections 4. and 5. We begin by validating the approach by comparing its performance to that of a popular alternative scheme due to Kalinina et al. [2] in the applications they chose as benchmarks [2]. In particular, we investigate positions responsible for specific features of the Major Intrinsic Protein (MIP) and the bacterial transcription factor LacI families. In both cases, after verifying that SDPhound performs as well as the more established method in ranking putative SDPs, we shall refine the analysis by examining the physical characteristics of the positions. This second step is, to the best of our knowledge, an original feature of SDPhound. We shall further apply the method to identify a set of mutations able to affect the multimeric state of the Intrinsically Fluorescent Protein (IFP) family. This is a problem of considerable interest since IFP are extremely important in molecular biology and biotechnology where they are used in a variety of in vivo visualization techniques [4][5][6]. The first discovered member of the family was the Green Fluorescent Protein (GFP) from the jellyfish AEquorea Victoria (avGFP) [7]. It has been proved that IFP are almost ideal tags for confocal microscopy and that they can be genetically fused to other proteins and expressed in living cells or organisms without disturbing their physiology [5]. For this reason, in the last 15 years keen attention was devoted to the GFP technology, leading to several dozens avGFP mutants with improved photostability and different optical properties [6]. Moreover, it was recently recognized that homologs of this protein exist in a variety of different sea animals belonging to Cnidaria and to Bilatera [8,9]. Although the average identity within the family is only 40%, all of them share the same tertiary fold. The IFP quaternary structure is particularly interesting: the wild-type (WT) proteins exist in nature as tetramers, dimers or, rarely, monomers. A clear characterization of the biological role of multimerization for the IFP is lacking. In Molecular Biology applications, however, the monomeric form is desirable in order to produce small fluorescent probes that can easily be transfected into a cell. More than a hundred fluorescent proteins (WT, homologs and mutants) are known and their structural properties are intensively studied and relatively well understood. These proteins therefore are very well suited to be studied with statistical techniques for the design of mutants with specific properties. In this work, we specialize our method to analyze the positions responsible for transforming a multimeric fluorescent protein in a monomeric one. We also look for mutations able to split a tetramer in two dimers. The predicted mutations compare well with those already recognized as effective by random mutagenesis. Furthermore, a new position, deserving experimental verification, is indicated and independently validated via the MMPBSA technique [10]. After the discussion of the results for the IFP family, conclusions and acknowledgements close the paper.

Approach
In the following, we summarize the sequence of steps performed in a typical application of SDPhound, while the precise definition of the various theoretical quantities associated with these steps is postponed to the next section. As stated in the introduction, the ultimate goal of the algorithm we propose is to rank putative specificity determining positions based on their mutual information score. To that end, prior to the application of the SDPhound protocol, a set of homologous proteins selected from the literature or public databases is divided into classes whose members share similar specificity. These homologs undergo multiple alignment and the alignment is then used as an external input in the software implementation of SDPhound to determine the mutual information content of each position. Calculating the mutual information requires to define an estimator for the joint probability in eq. (1). We introduce different realizations of a general prescription for the structure of such probability. This prescription, which contains no tunable parameters, can refer to single positions of the alignment, or consider pairwise correlation among them. The probability estimator can weigh the frequencies of occurrence of amino-acids at given positions in any class by using a scheme that employs BLOSUM substitution matrices [11] (see next section for details). This accounts for the presence of "similar" amino acids and it mitigates the effects of finite size of the available statistical sample and of background similarity of homologs on the frequency-based probability estimation. To further refine the assessment of the statistical significance of ranking, a well known problem for any mutual-information method [12], we use a Z-score criterion described at the end of next section.
Once the most interesting positions are selected, the focus is shifted on identifying the relevant physical characteristics associated with them. To this aim, the concept of residue substitution classes [13] is introduced by grouping the various amino acids in sets, "pigeonholes", identified on the basis of some predetermined properties (e. g. hydrophobicity, charge, size). High ranking positions in these "observable"-based runs are expected to be sensitive to the corresponding property, rather than to residue identity. Furthermore, to explore new properties that were not considered by the manual assignment to the pigeonholes, an automated procedure that maximizes the mutual information content of the high ranking positions with respect to pigeonhole content is outlined. As it is shown in the Supplementary Information (SI), the software implementation of our method, SDPhound, automatically provides a pictorial representation of the results, which are reported in the form of Microsoft Excel worksheets and html pages. The MATLAB code for SDPhound is available at http://homepage.sns.it/∼rocchia/

The estimator for the probability
The key step in setting up the mutual information for a given problem is the definition of the joint probability for the events whose correlation one is trying to establish. In this section, we introduce one such probability so as to make more rigorous the prescription given by Kalinina and co-workers [2] while retaining the single amino acid substitution scheme they advocate. Moreover, we generalize the joint probability in two ways: (1) by considering concerted substitutions of more than one amino acid at a time; (2) by introducing substitution classes ("pigeonholes") that correspond to specific properties of the amino acids and inferring from the presence of a residue belonging to a given pigeonhole at a given position the physical properties conferring specificity.

Probability estimator for ranking of individual positions
The estimator of the the joint probability in eq. (1) is defined as The equation above can be read as follows: the total probability to find amino acid α at position p is given by the probability, approximated by f p (β, i), to find an amino acid β at that position times the probability, m(β → α), that α substitutes β. Such a function thus describes the event "a given amino acid, α, or a similar one, β, occurs at position p and the protein belongs to class i". The sum extends to 24 due to the possible presence in the alignment of non standard symbols such as B, Z, X, and of the gap symbol "−". In our applications, the symbols B, Z, and X are included in the definition of the probability above but they are left out of the space of the events since they bring in redundant information (i.e. we are only including mutually exclusive events).
In actual calculations, f p (β, i) is the relative frequency of occurrence of β in position p of a protein belonging to class i. Bias of the overall alignment can be treated via a weighing procedure as suggested in [15]. The substitution probability m(β → α) is introduced with the intent to smooth the bare relative frequency of occurrence of amino acid α at position p by taking into account residue substitutions occurring at that position weighted according to their similarity. This quantity can be defined in different ways. A natural choice, given its interpretation, is q β,α is closely related, in a sense that will be clear shortly, to the Clustered Target Frequencies matrix provided by the Blocks WWW Server, created by S. Henikoff. It represents the probability of occurrence of two amino acids α and β in the same column of a prototypical block alignment [11]. The importance of a smoothing scheme has been recognized and used in the past, most notably in SDPpred, the approach that we choose to benchmark the performance of our method. In this previous work, however, the substitution matrix is used in the log odds form and contains an ad hoc parameter [2]. Our suggestion is parameter free and enables the right hand side of eq. 3 to be interpreted as an actual probability. Gaps, "amino acid" number 24 in the sum above, are treated according to the following prescriptions: a column in the alignment is neglected if more than 30% of its constituents are gaps. In the remaining cases, gaps are considered as an additional amino acid with substitution probabilities obtained by suitably enlarging and rescaling q β,α . Namely, one row and column are added to this matrix, in such a way that the, fictitious, substitution probabilities toward gaps are proportional to the overall percentage of gaps in the alignment. Values of the substitution probability are determined accounting for the individual propensity (or resistance) of each amino acid to mutate, which is represented by the diagonal elements of the q β,α matrix. Finally, the extended matrix thus obtained is suitably normalized. We preferred this approach to alternatives in the literature, where, for instance, m(gap → gap) = 1 and m(gap → α) = m(α → gap) = 0 ∀α, since in this latter case the "gapped" positions present spuriously high levels of mutual information. Note that the above prescriptions ensure that: (i.e. the probability to find any amino acid, or a gap, at a given position is one, as it should). Note also that, if we assume: In section 5. we will show the effects of different choices for m(α → β) on our system by employing the identity matrix and the BLOSUM45, BLOSUM62 matrices. These last two matrices are often used in the literature. In addition to these choices, we also introduced a "local" BLOSUM matrix, called Bloc, whose elements are the q α,β relative to the specific alignment that is being examined. In the presence of an alignment constructed from a sufficiently numerous family, this matrix may keep into account the family's peculiarities better than the non specialized BLOSUMs.

Probability estimator for ranking of correlated positions
Considering correlated mutations of pairs of residues at different positions involves a straightforward generalization of the scheme described above. It is in fact sufficient to replace the event "amino acid α occurs at position p" with "amino acid α occurs at position p and amino acid β occurs at position q". Eq. (2) then becomes and the mutual information related to the pair position {p, q} can be calculated as To reduce the impact of marginal mutual information contributions to the pair position score, we maximize the following expression: This definition is very closely related to the one of the "triple mutual information" [16] or "mutual interaction" [17]. We then define the "pair substitution probability" as: This choice, beside being the simplest, can be justified as follows: the joint probability of finding residue α at position p and residue β at position q in group i can be written as where P q ((β, i)|α p ) is the conditional probability of finding, in group i, β at q given that α was in p. Both probabilities on the right hand side of the equation above can be written in analogy with the definitions introduced previously: (The use of Henikoff-like matrix element in the definitions above is justified since each one of them refers to single amino acid counts.) Finally, noting that the sample estimate of eq. (10) also leads to we found the choice of eq. (9) consistent and reasonable.

Probability estimator for ranking based on physical properties of the amino acids
The concept of substitution class [13] can be used to reduce the alphabet of symbols in our alignment and it allows to capture general features required at a given position to maintain specificity. This point is particularly relevant in our work, for its practical implications. In fact, once a SPD is found, determining the chemical-physical property mainly responsible for its specificity can give hints on the mutations to be tried to push the protein toward another class. This is indeed one of the main application we envision for our method. Additionally, the concept of reducing the 20-letter amino acid alphabet to a few letter one is very general in the theoretical modeling of protein structure, especially in Coarse Grained representations [14]. As illustrated in the following subsections, this reduction can be done either based on a set of physical or chemical characteristics identified a priori or by letting SDPhound find the most appropriate set of substitution classes for a given alignment.
A priori partitioning of amino acids on a physical basis Suppose we characterize and group the amino acids based on some convenient physical property such as size, charge, hydrophobicity, etc. as detailed in the SI. Let σ α indicate one of these classes, that we shall call "pigeonholes" so as to avoid confusion with the specificity classes. Following the same line of thought of subsection 3.2., we define as the probability to find a member of pigeonhole σ α in position p in the sequence belonging to class i. For example, if σ α corresponds to "being a positively charged amino acid", this is the probability to find a basic residue in position p for a class i protein. f p (σ β , i) is the relative occurrence frequency at position p of a generic member of the pigeonhole σ β and M (σ β → σ α ) is the substitution probability of any member of pigeonhole σ β with one belonging to σ α . We define this probability as: The probability to pick the element β within the pigeonhole σ β is estimated according to the usual frequentist approach: Once the probability (13) is calculated, it can be used to obtain the mutual information Given the meaning of the probability employed, a ranking of the positions based on the mutual information above rewards positions depending on the physical characteristic associated to the pigeonhole. This generalization of standard mutual information based ranking paves the way for an efficient new way of analysing SDPs, because it is able to translate SDP relevance in terms of physical meaning. As we shall show in the applications, this suggests how to target mutations in a more focused way, alternative to purely random mutagenesis.

Optimal automatic pigeonholing
An amino acid partitioning based on some a priori chosen property, like the one introduced in the previous subsection, may not capture the characteristic interplay of phenomena responsible for specificity. Therefore, an automated procedure searching for an optimal amino acid distribution among a predefined number of otherwise arbitrary pigeonholes may be preferable. Here we suggest an algorithm that performs this search according to an iterative prescription that minimizes a suitable objective function, defined as the average mutual information of a prescribed number of best ranking positions of the standard run. An iterative cycle is composed as follows: a first partitioning is created by a random assignment of an approximately equal number of amino acids to each pigeonhole. Then, the target function is evaluated. A new partition is generated by a random displacement of an amino acid from the original pigeonhole to a different one. The new value of the target function is computed and compared with the previous one. If this value has increased, the current partitioning is retained, and used as the first step of a new iteration. Otherwise, the new partitioning is rejected and the iteration starts from the original one. The cycle is repeated until the target function becomes stationary. An a posteriori analysis of the obtained partitioning can be made to identify the relevant common feature within each pigeonhole.

Statistical significance calculation
To evaluate the statistical significance of the estimated mutual information for a given position, i, we follow the standard procedure of comparing it to a reference value < I sh i >. The latter is obtained by randomly shuffling residues within columns in the alignment, calculating the mutual information corresponding to each shuffled set and then averaging this quantity over the number of shuffles. By eliminating any correlation between the position and specificity, this procedure provides a measure of the background similarity of the protein set [18]. The computed mutual information is then ranked according to the zeta score where σ(I sh i ) is the standard deviation of the shuffled mutual information. We have also explored alternative ranking schemes that enforce the additional hypothesis of linear correlation among the values of the mutual information obtained upon shuffling, such as the background correlation removal suggested in [2,12]. These have, however, proved less informative than the simple method outlined above. Providing an a priori criterion to identify the maximum number of meaningful SDPs based solely on the statistical information gathered from the runs is a delicate point of the statistical analysis. In this work, a Z-scores frequency histogram is built to estimate their distribution. Positions with (high) Z-scores that have probability smaller than 0.1 are retained as putative SDPs. This simple indicator provided, in our preliminary study, more stable and consistent results than alternative methods, such as the Bernoulli estimator suggested by [12].

Results I: MIP and LacI families
The performance of SDPhound both in identifying putative specificity determining positions and in providing an indication of the relevant physical characteristics associated with them is evaluated on three different protein families. The first two, the MIP and LacI, are established benchmarks [2,19,20] for validating mutual information based methods. In this section we begin by comparing the rankings for the positions obtained with our method to those obtained feeding the same alignments to the SDPpred Web Server [21]. This is a well established tool in the field and, as mentioned before, we choose it as a reference method because it too is based on mutual information and introduces an estimator for the joint probability (different from ours) that allows for non-uniform weight of amino acids substitutions. As shown in the next two subsections, the performance of the two methods is very similar, corroborating the reliability of the new scheme presented in this work. After this test, we move to using SDPhound for investigating the characteristics of a suggested SDP. Validation of the results in this case is more delicate and we resort to an a posteriori analysis based on visual inspection and physical arguments. In this case too, SDPhound provides interesting information as we shall discuss in the following. The third case we study is that of the IFP family. For this family, the validation of the performance with respect to positions ranking is on quite solid grounds since it is based in large part on comparison with mutations that are experimentally known to affect the property we shall be investigating. Given the non-standard nature of the test set and the biophysical relevance of the problem, results for this case are presented separately in section 5..

The MIP family
Members of the MIP family assist the transport of both water and small neutral solutes through the cellular membrane. There are about six MIP subfamilies, the two major being the aquaporins (AQP) and the glycerol-uptake facilitators (GLP). Here, we ask SDPhound to identify SDPs responsible for the distinction between these two large subfamilies and then compare our results to those presented in the Kalinina et al. paper. The alignment used as input for the different runs is the same as in references [2,20], where all the necessary details on the nature of the set, containing 60 members of the family, can be found. In table 1 we compare the ranking of putative specificity determining positions obtained using SDPhound.  *  108  137  108  12  135  191  *  211  186  137  13  137  135  137  20  211  14  194  20  30  30  43  15  191  *  137  208  48  *  136  16  20  194  186  134  199  *  17  24  24  20  24  194  18  26  237  135  216  24  19  216  199  *  199  *  43  20  20  237  34  235  202  *  200  *  21  34  132  34  237  193  22  233  199  *  134  194  194  23  194  31  26  208  34  24  100  43  56  34  257 Columns 1-4 report the outcome of successive runs performed using the different substitution matrices described in section 3.2.: BLOSUM45 (B45), BLOSUM62 (B62), the identity (Id), and the "local" BLOSUM matrix (Bloc). The last column reports the results provided by the SDPpred web server. The asterisk in the column marked S in the table indicates a position identified as significant according to the criterion established in [2] to assess the performance of SDPpred, i.e. proximity to the glycerol molecules bound inside the pore channel in the crystal structure 1FX8 (see reference for details). The second row of the table heading indicates the number of statistically significant positions in the ranking based on the Z-score criterion illustrated in section 3.5. (columns 1-4), and by the analogous indicator in the SDPpred scheme (column 5). As it can be seen from the table, both the identity matrix based run of SDPhound and SDPpred identify 8 significant positions. The performance of the SDPhound runs that use the alternative substitution matrices is very similar, and it is in fact difficult to assess if there is a difference given the limited statistics available. Table (2) presents the ranking produced by SDPhound when the mutual information used in the ranking is constructed using the probability defined in eq (13). The method points blindly -i.e. not guided by the researcher's insight into the problemto a set of positions. Based on the ranking, an a posteriori analysis can be performed to evaluate the characteristics of the isolated positions. Among the higher ranking positions, 48 is singled out. The relevance of this site on the functioning of the pore selectivity of the transported molecule has already been remarked experimentally [22,23]. The residue at this position is part of the bottleneck of the transport channel. In GLPs, the pore cavity is almost exclusively constituted by hydrophobic and few positively charged residues, but in AQPs it is also constituted by some polar ones; in either case, the pore has a slightly different location through the protein. Pigeonholing suggests SDP 48 is important by way of its charge/polarity, while its dimension also has a minor influence. Table 2. Substitution class based runs over SDPs relevant to aquaporins and the glyceroluptake facilitators distinction. For each position, the ranking in the run related to hydrophobicity (hyd), charge (crg), and size is shown. "−" means that in that run, the position ranked below 40 th . A color code is used to help reading: the positions are colored according to which "quality" (hydrophobicity=red, charge/polarity=green or size=blue) is more relevant to that position. In ambiguous cases "intermediate" colors are used: magenta=blue+red (hyd+size), yellow=red+green (hyd+crg) and cyan=green+blue (crg+size). For practical and representation purposes, we identified a significance threshold of 10 for coloring specific positions. When a position ranks above 10 in all of the three features, it is outlined in bold. This is reasonable since, in such a narrow channel, even small changes in polarity and conformation can discriminate between different substrates, altering the balance between the affinity towards a substrate and the efficiency of its transport mechanism. For instance, in some AQPs a more bulky hydrophobic amino acid in this position may completely shut the channel. Two other interesting positions highlighted by the method are 201 and 202. These are located near the bottleneck generated by residue 48. These residues are relevant, respectively, for charge and size. While charge of 201 may modulate substrate affinity, dimension of residue 202 helps explain why the transport channel has different size and position in the GLP and AQP subfamilies.

The LacI family
As a second test of the usefulness of SDPhound, we apply it to investigating the LacI family, another classic testing ground for these methods. The LacI family is a rather large family of bacterial transcription factors, which comprises 54 orthologous and paralogous proteins divided in 15 subfamilies (AraR, KdgR, CcpA, DegA, YjmH, RbsR, PurR, CytR, GalSR, AscG, LacI, TreT, GntR, IdnR and FruR), populated with a number of members which varies from 2 (for the AraR, KdgR, YjmH, LacI and IdnR subfamilies) to 12 (for the CcpA subfamily). Further details on the nature of the set can be found in the references mentioned above, while here we summarize only the minimal data required for benchmarking SDPhound. The sequence alignment was the same as in [2,19,20]. Also for this family, key positions for discriminating among the mentioned subfamilies are searched for. And also in this case, the evaluation of positions indicated by Gelfand and coworkers in [2] is used to compare the performances of our method and of SDPpred. In that work, SDPs were mapped on 3D X-ray structures 1WET, 1JWL and 1BYK and classified into 4 groups, that we indicate with the numbering 1 to 4 in the score columns, marked as S, of Table 3. The classification is as follows: 1) experimentally validated SDPs, 2) SDPs of a domain, with a distance < 5Å from the effector, the substrate or another subunit within a multimer in at least one of the 3D structures and < 7Å in the other two, 3) SDPs in the vicinity of one of the domains, and 4) SDPs with no apparent function. As it can be seen from the Table, [24] have investigated position 190, demonstrating its role in corepressor binding. This position was not predicted by SDPpred but was identified both by the B45 and the Bloc SDPhound runs. When analyzed with the pigeonholing runs, several interesting features of the relevant positions emerge, as summarized in Table 4. The experimentally validated positions (15,16,55, and 147) rank quite high in all physical characterization, pointing to the high specificity of the residues at those locations. Position 55 hosts a residue in close contact with operator DNA. The positive charge of K55 in 1WET, as well as other polar residues able to donate a hydrogen bond found in the same position in other ortho/paralogous proteins, explains why it has been found by mutagenesis as a position influencing affinity for the operator DNA [25]. Glasfeld and coworkers demostrated that the positively charged Lys55 is able to discriminate among different operator DNA sequences. Firstly, it increases affinity for the electrostatically negative minor groove of the operator DNA; secondly, it selectively binds operator DNA's that does not contain guanine in the position Lys55 interacts with [26]. Nonpolar residues, such as alanine, appear in other paralogous proteins that can bind guanine-containing operator DNA sequences.
SDPhound not only predicts this position, but it correctly points out that the physical property determining its specificity is based on its charge/polarity. We would like to emphasize that such result is obtained by SPDhound without any structural information of the protein or the interacting DNA. Among the other positions in Table 4, 98 stands out both for charge and size relevance. Inspection of the Xray structures reveals that residue 98 (or its homologous 100 in PDB structure 1JWL) is located both at the opening of the substrate pocket and in contact with another protein in the dimer. Modulation of size and charge, then, can enforce in different proteins a sophisticated selectivity towards different substrates. Also interesting is the classification of position 107, which is pointed at as relevant due mostly to hydrophobicity: the location of the corresponding residue at the interface could be due to the need of tightening or loosening the bond between the proteins in the complex and, thus, their ability to cooperate.

Sequence alignment
The only external input necessary to perform SDPhound runs is a multiple sequence alignment. A common feature of sequence-based bioinformatic methods is the potentially high sensitivity to the quality of the given alignment. Therefore, particular care has to be devoted to maximize its accuracy before attempting the statistical analysis. This can be sometimes a difficult task (as, for example, in the case of receptors with poor structural characterization and varying sequences); the description of what we did in the IFP case to gain confidence in the reliability of the input follows. We adopted a combined structurebased and sequence-based alignment procedure. First we structurally aligned the protein sequences for which 3D structure is available using the "STAMP" multiple alignment module of VMD1.8.3 [28,29]. This procedure is based on the alignment of the structurally conserved regions, thus it is likely to bring more information and accuracy in the alignment. For each of the proteins with unknown crystal structure, we determined the closest homolog in a non redundant sub-set of the crystallized proteins set. Sequence-based alignment is done with the FASTA suite of programs.
Subsequently, we aligned with a standard multiple-alignment sequence based algorithm all of the proteins with the same crystal homolog. Finally we merged all the sequence-based alignments following the first structure-based alignment to produce the input data. When no crystallographic data are available, this procedure reduces to a standard multiple sequence based alignment. Conversely, when all the sequences correspond to a crystallographic structure, this procedure reduces to a standard structure-based alignment. Figure 2 reports a selection of the alignment. The proteins are shown and the SDPs are outlined with color coding as in Table 5. The whole alignment is given in the SI.

Monomerization of multimeric IFP
Within the IFP, different monomer to tetramer transitions can occur. Here we focus on the one that is expected to characterize close homologs of the DsRed protein, Figure 1. A set of single position runs aiming at identifying putative SDPs that favor the monomeric character was performed as a first test of the procedure. The sequences were divided in two classes consisting of 26 monomeric and 66 multimeric proteins. Average identities were 51.6%, 73.6% and 49.9% for the whole set, the first and the second class, respectively. Table 5 shows the 20 best ranking positions, estimated according to the different choices of similarity matrix described in the Methods. As in the previous section, for classification and assessment purposes, we associated a numeric code, S, to each position. Consistent with the fact that experimentally validated positions can be categorized as belonging to the multimer forming interface or not, S values are defined as: S=1 experimentally validated, at the multimer-forming interface; S=2 experimentally validated, not at the multimer-forming interface; S=3 untested, at the multimer-forming interface; S=4 untested, not at the multimer-forming interface.  Table 6 (complete alignment given in the SI). The whole IFP set has an overall average identity of 40.8%. The sequence identity is larger in the first part of the sequence up to position ∼100, and smaller in the second half, except for isolated residues such as the highly conserved E197 (numbering according to DsRed). Table 5, different choices of the substitution matrix lead to similar performances, pointing to a good stability of the scheme. We believe that the (slightly) better performance of the Identity matrix is due to the presence in the set of several point mutants at some promising positions. This enrichment of local amino acid distribution brings a level of detail that can be better seized by a sharper discrimination between the different amino acids. Position 147 is assigned a star since it has been identified in experiments, but it can lead to non-fluorescent proteins [27,30] and so it deserves a special classification. In Table 5 we also compared SDPhound results with those obtained by the SDPpred Web Server [2] using the same alignment and under the same conditions of shuffle number and of percentage of gaps exclusion. SDPpred returns a slightly lower number of hits and predicts only 9 of its positions to be statistically significant. Differences could arise from the different probability smoothing procedure and from the removal of background correlation, which does not seem to improve the results when performed on our set (see Sect. 4. of SI). The last row in the table reports how many of the 33 positions that are known to affect the aggregation state ranked higher than 33 in the various runs. As it can be seen from the Table, all runs succeed in pointing out the majority of the relevant positions with only minor fluctuations in the quality of the results. Table 6. Substitution class based runs over SPDs relevant for multimerization in IFPs family. Ranking and colors as in Table 2. role of SDPs, with no other input but the sequence alignment. Hydrophobicity is important for positions 194, 192 and 156. It is apparent that nonpolar residues located at an interface can favor multimerization by decreasing the desolvation penalty of the external surface of the protein; this is the case for F194K or Y194K mutations. Charge is relevant for positions 164, 153, 117, and 223; in particular, mutations such as A164R or Y164R can introduce an electrostatic repulsion between monomers that prevents their aggregation. Steric hindrance can be a third cause for the decrease of the rate of binding between single units and the pigeonholing procedure spots positions 117, 194 and 164. It is worth noting that different physical features can be relevant for the same SDP, as is the case with positions 117, 194 and 164, thus restricting the possible identities of the corresponding residues. A164R mutation can, for instance, disrupt the hydrophobic packing of residues on the surfaces of monomers within a tetramer. In some cases, such as 83 and 8, none of the selected physical properties exhibits a significant score. This suggests that none of the partitionings is able to capture the peculiarity relevant in those last cases and it prompts to find alternatives to an a priori definition of the characteristics of the pigeonholes. With that in mind, the "optimal automatic pigeonholing" described in section 3.4. was used to generate new substitution classes. We applied it to the set of 92 sequences analyzed in subsection 5.2., setting the number of bins to 5, aiming to maximize the average mutual information of the 10 best ranking positions, regardless of which they are. The method provided a clear-cut partitioning only for some amino acids, as shown in Figure 3, more information on the run can be found in the caption. The obtained partitioning corresponds partially to the traditional amino acid groupings, such as the Taylor's Venn Diagrams [31]. In particular, ARG and LYS residues are grouped together and the individuality of CYS is revealed. Differences in partitionings, however, would not be new and have already been discussed [13]. Since we maximized the average information content of the 10 best ranking positions, the classes we sketch represent the intermingling of individual amino acid peculiarities that are relevant to the monomerization propensity of IFP; our procedure can also be utilized to focus on the properties of one single position at a time. Figure 3. Results of the "automatic pigeonholing" procedure. The process reduces the alphabet from 20, ignoring the gaps, to 5 symbols (the rows in the table) while maximizing the average mutual information of the 10 best ranking positions. Interestingly, the corresponding reduction in the average information content is remarkably small: from 0.421 to 0.407. The space of possible partitionings was explored with 2.0 10 6 iterations. The table summarizes the results of the best 5% of amino acid partitionings: for each the percentage frequency of occupation for each class is shown. Colored residues have been unambiguously assigned, whereas the remaining ones show a more complex pattern of correlation. Table 7. SDPs inferred from 66 Di-and Tetrameric DsRed homologs. The Table shows the 24 best ranking SDPs. Scores discriminate experimentally validated positions( * ) from those responsible for generic multimerization( †) and those beneficial or even specific to the dimerto-tetramer step( ‡), according to [27]. Positions with different scores are colored differently to facilitate reading. relevant pairs of positions id given in Figure 2 of SI; it is worth mentioning that an already known position, namely 177, as well as two "new" ones, 135 and 81, appear to consistently correlate with many other positions.

From tetrameric to dimeric form
We now consider putative SDPs of the tetrameric vs. dimeric form of IFP and analyze some new mutations. To that end, two groups of 13 dimers and 53 tetramers, again sharing the same interface as in DsRed, (44.2% and 52.9% average identity, respectively) were selected from the alignment and underwent our procedure. Comparison between experiments and our results, reported in Table 7, is limited by the fact that the literature focuses on the generic distinction between monomers and multimers and that tetramer-to-monomer and tetramer-to-dimer roles very likely overlap. Out of the first 20 most significant positions predicted by the various flavours of the method implemented in SDPhound, including the BLOSUM based ones due to their potentially innovative character, we selected those not yet tried in experiments. Then, we investigated with the pigeonholing procedure (Table 8) the physical features influencing their specificity. We focused on position 158 since it ranks as the best non experimentally validated position in Table 7 that also belongs to the tetramerization interface in DsRed homologs. The results concerning SDP 158 in Table 8 indicate that dimerization is favoured by changes in size and, to a lesser extent, in charge. In particular, inspection of the sequences revealed that aspartic acid appears only in the dimeric class at position 158 (DTRKCI vs. TRKIV), leading us to select the K158D mutation as a promising candidate to further weaken dimer-dimer binding. We decided to obtain a further assessment of this mutation, validating it independently with the MMPBSA technique [10], a well-established and quite powerful tool to estimate binding free energies such as those involved in oligomerization processes. In this context, free energy is evaluated as an average over molecular dynamics trajectories of G = E M M + G P B,polar + G SA,nonpolar − T S solute .
Details concerning the meaning of the individual terms, the implementation and the actual calculation can be found in the SI. Here, we only point out that binding free energy ∆G i can be estimated for each mutant as the free energy difference between tetramers and dimers, so that ∆∆G i,W T is the binding free energy difference of the i th mutant with respect to WT DsRed. To benchmark MMPBSA, we created, in silico, three mutants, namely I125R/V127T, dimer2* and mRFP1* (see caption of Figure 4 for details), for which the tetrameric character can be confidently excluded (∆∆G i,W T > 0). Despite the rather short length (400 ps) of the dynamics (hence the relatively large error bars), the calculations, shown in Figure 4 together with the details about exact mutations, reproduce the expected trends and support the ability of the I125R mutation to disrupt the tetramer (∆∆G I125R/V 127T,W T =40.79 kcal/mol), consistently with what described in [27]. For dimer2* and mRFP1* we report slightly lower values (∆∆G dimer2 * ,W T =26.69 kcal/mol and ∆∆G mRF P 1 * ,W T =23.23 kcal/mol) than for the I125R/V127T mutant; this is reasonable since in those cases mutagenesis introduced many mutations that, in addition to providing control over oligomerization tendency, preserved other essential properties, such as correct and fast protein folding, chromophore maturation and detectable fluorescence. More interesting is the study of the tetramerization free energies of K158D mutant as well as the physical insights that emerge. ∆∆G K158D,W T =26.67 kcal/mol actually indicates a very promising mutation, not likely to affect the fluorescence due to the location of the residue far from the chromophore. Moreover, from the tetramer crystal one can see that there is a H-bond between the basic K158 of one monomer and the carbonyl group of N128 of the associating partner; such interaction only seems possible if position 158 hosts a large and positive residue, such as K or R, consistent with what indicated by Table 8. Table 8. Substitution class based runs over SDPs relevant to tetramer to dimer transformation. Ranking and color coding as in Table 2. We chose to study the positions that had not been validated experimentally, since these are the new possibilities for putative mutations. For each position, its ranking in any specific run is shown. "−" means that in that run, the position ranked below 40 th .  Structures for all mutants were generated from crystal structure of this latter (1GGX PDB code). For dimer2* (T21S, H41T, C117T, I125R, V127T and S131P) and mRFP1* (T21S, H41T, C117E, I125R, V127T,  R153E, V156A, H162K, A164R, L174D, I180T, Y192A, Y194K, H222S, L223T, F224G,  L225A), only experimentally validated mutations according to [27] corresponding to solvent exposed residues were considered.

Conclusions
SDPhound, an articulate and flexible protocol for SDP identification and analysis within homologue proteins, is presented. Results of this work are twofold: i) a methodological improvement over preexisting techniques that use the mutual information to reveal SDPs, and ii) the characterization of the role of known and unknown residues responsible for specificity. The procedure makes it possible to obtain useful insights from the data and to make sense of the results both in probabilistic and physical terms. It can be generalized to investigate correlations among position pairs. When applied to the standard test cases of the MPI and LacI families and to the original analysis of the multimerization tendency within the IFP family, SDPhound correctly identifies mutations that are recognized to be relevant and/or experimentally known to influence specificity. It further succeeds in pointing to the physical characteristics of several relevant residues in all the applications considered in this work. In the case of IFP it also suggests several previously uninvestigated positions as determinant for the multimerization state of these proteins, both in the case of the multimer to monomer transition and in that of the tetramer to dimer change. The hierarchy of operations described in sections 3. and 5. allows to draw a reliable and rather stringent profile of the set of residues that are responsible for the specific function, or structure, within a set of homologs, making SDPhound a useful tool to complement both experimental techniques and other computational biology approaches.