Fast Modeling of Binding Affinities by Means of Superposing Significant Interaction Rules (SSIR) Method

The Superposing Significant Interaction Rules (SSIR) method is described. It is a general combinatorial and symbolic procedure able to rank compounds belonging to combinatorial analogue series. The procedure generates structure-activity relationship (SAR) models and also serves as an inverse SAR tool. The method is fast and can deal with large databases. SSIR operates from statistical significances calculated from the available library of compounds and according to the previously attached molecular labels of interest or non-interest. The required symbolic codification allows dealing with almost any combinatorial data set, even in a confidential manner, if desired. The application example categorizes molecules as binding or non-binding, and consensus ranking SAR models are generated from training and two distinct cross-validation methods: leave-one-out and balanced leave-two-out (BL2O), the latter being suited for the treatment of binary properties.


Introduction
Methods exist to mine data from analogue series or combinatorial data sets, for instance, those based on SAR maps [1,2] or R-group polymorphisms [3], among others [4,5]. However, none is as simple as the Superposing Significant Interaction Rules (SSIR) method, a new systematic procedure able to rank analogue series that, in turn, constitutes an inverse SAR tool. The SSIR idea originated from recent experience in a Design of Experiments (DoE) context treating molecular families sharing a common scaffold [6][7][8].
The SSIR method conceptualizes a combinatorial family as a series of sites (factors in DoE vocabulary [9]), each one having the ability to accommodate one of a set of various residues (levels). From this knowledge, combination rules of presence/absence of certain residues in sites are categorized as being significant or not. The SAR model consists of all the rules being categorized as significant. Each rule grants an additional positive or negative vote to each molecule that matches it. Hence, each analogue collects a series of signed votes that, once added up, establish a molecular ranking scale. It is expected that the ranked molecular series will correlate with the interest/non-interest molecular tags attached to the molecules according to the values of the analyzed property.

Results and Discussion
As an application, an example of binding activity modeling is presented here. Reference [5] presents a method for obtaining SAR relationships of analogue series based on the analysis of dual-activity difference maps. As an example, the authors apply it to a set of 106 pyrrolidine bis-diketopiperazines tested against two formylpeptide receptors (FPR). Table 1 presents the substitution codifications along the four available molecular scaffold sites.
The details of both experimental endpoints can be found in the aforementioned reference. Following the notation of the original work, the binding activities (Ki) will be denoted here as FPR1 (related to antibacterial inflammation and malignant glioma cell metastasis) and FPR2 (associated with chronic inflammation in systemic amyloidosis, Alzheimer's disease and prion diseases). The goal of Medina-Franco et al. [5] was to compare both properties by working with differences arising from molecular pairs and looking for activity switches (i.e., specific substitutions that have opposite effects on the activity of the compounds against two biological targets) and selectivity switches (minor structural modifications that drastically invert the selectivity pattern of two compounds). Here, each property is modeled on its own. Table 2 shows the codified set of 106 compounds. The original compound formulations can be found in Table S1 of the supporting information of the original article [5]. It is worth noting that the SSIR method can be applied systematically without the need for special preparatory operations (other methods require molecular minimizations, alignments, descriptor calculations and so on). The method's symbolic nature means it is only necessary to codify the molecular substituents arbitrarily and decide which analogues are declared as being of interest (see the Materials and Methods section below). These characteristics allow the SSIR method to model sets confidentially by masking the original information or molecule codification.
Table 1. Molecular substitution codifications. Note that each letter represents a distinct substituent depending on the substitution site.

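Table 2. Codified set of the 106 compounds (only the last row of the table body survives: compound 106, code BDBA, values 4.000 and 4.000). a The 32 compounds of interest (Ki1 ≤ 411) are marked with an asterisk. b The 32 compounds of interest (Ki2 ≤ 410) are marked with an asterisk.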
The library has four diversity points and the expanded set covers M = 5 × 8 × 9 × 17 = 6120 compounds. In the reference, a = 106 analogues are reported. In this set, the analogues of interest have been defined as those presenting a low value of Ki, expressed as a concentration in nM units. In both cases the b = 32 compounds (ca. 30%) presenting the lowest values were chosen as being of interest (property values less than or equal to 411 and 410 nM for FPR1 and FPR2, respectively, marked in Table 2 with asterisks in columns pKi1 and pKi2). The number of rules of order 1 (negations not allowed, see the Materials and Methods section below) is 5 + 8 + 9 + 17 = 39. For orders 2-4 the numbers of rules are 531, 3029 and 6120, respectively. If negations are allowed, the numbers of possible rules increase to 78, 2124, 24232 and 97920, respectively. Figure 1 shows the distribution of p-values attached to the rules of order 4 (negation terms allowed) for both properties. It is noteworthy that property FPR2 reaches rules having much lower p-values. This behavior is also found for other rule orders. The presence of more significant rules suggests that FPR2 could be better modeled.
Table 3 shows the area under the receiver operating characteristic (AU-ROC) values [10][11][12][13] attached to the obtained ranking classification. Rules of order 1, 2 and 3 were explored, and results for the fitting and leave-one-out (L1O) tests were obtained almost immediately. For all cases, the cutoff p-value was set to p_c = 0.005. The total number of significant rules entering each calculation is given in brackets. Along the L1O or balanced leave-two-out (BL2O) cycles (see Section 3), certain rules present in the fit are sometimes automatically discarded, or some new significant rules appear, as a result of the extraction and replacement steps. Hence, the total number of significant rules found along the cycles usually increases with respect to the single training calculation.
Each BL2O calculation required 2368 cycles. In Table 3, the numbers of well classified pairs, ties and bad pair rankings encountered along the BL2O loops are explicitly indicated. For instance, regarding the FPR1 property, the BL2O involving rules of order 3 leads to 1909 correctly classified pairs, 2 ties and 457 incorrect pair rankings. For FPR2, the counts were 2253, 0 and 115, respectively. Those counts are related to AU-ROC values because it is well known that, for a single fitting calculation, given a pair of molecules (one of interest and the other of non-interest), the AU-ROC corresponds to the a posteriori probability that the classifier correctly sorts the pair [13].
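The link between the BL2O pair counts and an AU-ROC-like figure can be sketched in a few lines. This is only an illustration: the function name `pairwise_auroc` is hypothetical, and counting each tie as half a success is an assumption of the sketch, so the values it returns need not coincide with the entries of Table 3.

```python
def pairwise_auroc(well: int, ties: int, bad: int) -> float:
    """Share of (interest, non-interest) pairs sorted correctly.

    Counting a tie as half a success is an assumption of this sketch.
    """
    total = well + ties + bad
    if total == 0:
        raise ValueError("no pairs to evaluate")
    return (well + 0.5 * ties) / total


# 32 analogues of interest and 106 - 32 = 74 of non-interest
n_pairs = 32 * (106 - 32)  # one BL2O cycle per pair

# BL2O pair counts quoted in the text for rules of order 3
auroc_fpr1 = pairwise_auroc(1909, 2, 457)
auroc_fpr2 = pairwise_auroc(2253, 0, 115)

print(n_pairs, round(auroc_fpr1, 3), round(auroc_fpr2, 3))
```

Under these assumptions the number of pairs reproduces the 2368 cycles quoted above, and the pairwise estimates are about 0.807 for FPR1 and 0.951 for FPR2.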

Training and Cross-Validation
In all cases, the second property is clearly modeled better by SSIR. As mentioned, this may be because the rules for FPR2 reach more significant (i.e., smaller) p-values (see Figure 1). In other words, the analogues defined as being of interest for FPR2 seem to be much more closely related to particular substituent combinations.
Table 3. Area under the receiver operating characteristic (AU-ROC) values for several calculations for properties FPR1 and FPR2. The threshold p_c value was set to 0.005 and negation terms were allowed in rules. The number of accepted rules along the loops is given in brackets. For the balanced leave-two-out (BL2O) cross-validation process, the numbers of well classified pairs, ties and bad pair rankings encountered along the cycles are indicated between slashes. See text for more details.

For this library, rules of order 2 are well suited to reveal general patterns attached to activity values of interest. Table 4 lists the most significant rules of order 2 found for the FPR1 property. The systematic presence of the G substituent (S-benzyl) at position 2 is evident in the rules having a positive vote. Moreover, the negation of the G substituent at this position is systematically accompanied by a negative rule vote. The remaining rules mainly call for avoiding residue B (R-2-naphthylmethyl) at the same position. Other diverse combinations complete the full set of 117 selected rules having p ≤ 0.005. Inspection of the whole set of significant rules reveals that position 2 is the most relevant one when modeling the FPR1 property. This kind of information can be useful for some applications, for instance when a compound must be optimized in order to refine other molecular properties.
Table 5 lists the most relevant rules of order 2 when modeling the FPR2 property. The pattern found in this list is the presence of residue C (S-isopropyl) at the first substitution site. Again, the negation of residue C at this specific site is attached to a negative vote for the rules. Despite this particular rule behavior, molecular position 1 is not the most relevant one, as seen from an inspection of the full set of 447 selected rules. In this group of significant rules, one encounters a diversity of residues to be placed or avoided at specific positions. Hence, the FPR2 property is modeled from several "points of view" regarding combinations of substituents. This variety of choices confers better modeling options to the property and, in this case, a more robust final SSIR model.
The results in Table 3 have been checked by means of randomization tests. These tests consist of randomly scrambling all the molecules' interest/non-interest labels and redoing the modeling calculations from scratch 1000 times, i.e., generating all the rules again from the beginning and recalculating all the p-values. Figure 3 shows the fake AU-ROC values obtained for the FPR1 (Figure 3a) and FPR2 (Figure 3b) properties through L1O predictions. The calculation involves the rules of order 2 (p_c = 0.005). During the cycles, an SSIR model could only be reproduced 428 (Figure 3a) or 409 (Figure 3b) times. For the other cases, all the rules' significances were greater than the threshold p_c. Each "randomized" model is represented by a point in the graph, whereas the correct model is represented by a cross. The points always present lower AU-ROC values (vertical axis) than the correct model (except for one model for the FPR1 property). The graph also shows that the number of rules found per randomized test is lower than 117 (Figure 3a) and 447 (Figure 3b), the numbers of rules defining the correct models. A fake model consisted of 102 rules for FPR1 and of 105 rules for FPR2. All these data confirm again that the FPR2 property is much better modeled than FPR1, as the corresponding cross in Figure 3b is clearly further away from the cloud of randomized points.

Possibly, the FPR1 property is being overparametrized, as several L1O-AU-ROC values fall near that of the unscrambled test. It has to be noted that, in some cases, the random scrambling of molecular interest/non-interest tags leads to situations that can be partially modeled by SSIR. For instance, if the original tags pointing to molecules having low property values (analogues of interest) are mainly placed on molecules having higher property values during scrambling, then the situation becomes a sort of complementary version of the original one and SSIR is able to model it. Unfortunately, another undesirable possibility is left to chance: in some cases the random placement of the tags can set the analogues of interest in a partially correlated way with respect to one or more substituents. This situation will also be modeled by SSIR, generating fake rules. Therefore, the combination of both methodologies, cross-validation and randomization tests, is to be taken into account when modeling with SSIR in order to detect spurious models.
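The label-scrambling loop described above can be outlined as follows. This is a minimal sketch, not the SSIR program: `scramble_labels` and `randomization_test` are hypothetical names, and `toy_score` is a deliberately trivial placeholder standing in for "rebuild the whole model from scratch and return its L1O AU-ROC, or nothing when no rule is significant".

```python
import random


def scramble_labels(labels, rng):
    """Return a shuffled copy of the interest/non-interest tags.

    The number of compounds of interest is preserved.
    """
    shuffled = list(labels)
    rng.shuffle(shuffled)
    return shuffled


def randomization_test(labels, score_model, n_trials=1000, seed=0):
    """Scores of the models rebuilt from scrambled labels.

    score_model(tags) must redo the modeling from scratch and return a
    quality score, or None when no model could be built.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_trials):
        score = score_model(scramble_labels(labels, rng))
        if score is not None:
            scores.append(score)
    return scores


# 32 compounds of interest out of 106, as in the application example
labels = [True] * 32 + [False] * 74


def toy_score(tags):
    """Placeholder scorer (NOT the SSIR model), used only to run the loop."""
    hits = sum(tags[:32])
    return hits / 32 if hits >= 12 else None


fake_scores = randomization_test(labels, toy_score, n_trials=200)
print(len(fake_scores), "scrambled runs produced a model")
```

In a real run the scores collected this way would be compared with the score of the unscrambled model, as done graphically in Figure 3.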

Inverse Structure-Activity Relationships (SAR): New Analogue Proposals
The SSIR method is an inverse SAR tool [14] because of its ability to suggest new compounds. For the model of rules of order 3 (p_c = 0.005, negations allowed), the SSIR program was asked for predictions of new combinatorial analogues having a high number of positive votes. The SSIR program generated all the remaining 6014 items that were not training compounds. For this set of new analogues, the number of votes coming from the rules ranged from −376 to +397 for FPR1 and from −1101 to +1283 for FPR2. Each list provides a ranking attached to either property. Another calculation was conducted to generate external analogues. To prevent extrapolations outside the chemical space of the training set, consideration was only given to compounds having at most one single substitution difference (in any site) with respect to at least three training compounds (of course, other choices are possible). A total of 511 analogues fulfill this condition. For this set of new analogues, the number of collected votes ranged from −396 to +397 for FPR1 and from −1101 to +1278 for FPR2. As presented, the lists of proposed molecules ranked by SSIR help to prioritize structures for synthesis, screening, database pruning or selection. As shown, the rules can be combined to detect common prioritized structures. This optimization task need not be immediate or easy in general, especially in those cases where the multiple objectives are contradictory (negatively correlated). It is not the goal of this article to deal with the topic of multiobjective optimization [15], and the related issues will be published elsewhere.
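The neighbourhood filter used to select external analogues can be sketched as below, using the small three-site toy library of the Methods section instead of the real data. The function names and the six training codes are hypothetical, chosen only so that the example runs.

```python
from itertools import product


def hamming(a, b):
    """Number of substitution sites at which two codified analogues differ."""
    return sum(x != y for x, y in zip(a, b))


def candidate_analogues(site_sets, training, max_diff=1, min_neighbours=3):
    """Untested analogues close to at least `min_neighbours` training compounds."""
    train = set(training)
    selected = []
    for combo in product(*site_sets):
        code = "".join(combo)
        if code in train:
            continue  # already a training compound
        near = sum(hamming(code, t) <= max_diff for t in train)
        if near >= min_neighbours:
            selected.append(code)
    return selected


# Toy library with 2 x 3 x 4 = 24 analogues; the training codes are invented
sites = ["AB", "ABC", "ABCD"]
training = ["AAA", "AAB", "AAC", "ABA", "BAA", "BCD"]

untested = 2 * 3 * 4 - len(training)
print(untested, candidate_analogues(sites, training))
```

With these invented training codes only AAD differs in a single site from three training compounds. For the real library the same enumeration starts from the 6120 − 106 = 6014 untested analogues mentioned above.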

Advantages and Drawbacks of the Method
One of the advantages of the SSIR method (described in Section 3) is that the input data can be prepared fast because pre-process tasks are minimal. Starting the process is immediate because conformational analysis, molecular superpositions and index calculations are not needed. The symbolic treatment can be interpreted as a sort of encryption. Hence, the modeling procedure can be offered to a third party in a confidential manner, i.e., without revealing the molecular database being studied.
Of course, the method has drawbacks. Apart from those cited above, the molecular space cannot be explored beyond the substituents codified in the training database. This conditions the structure of any eventual test or validation set (other methods, such as Inductive Logic Programming [16], allow the application of generated rules to molecules presenting new substituents). It is also worth mentioning that the results depend on the balancing of the database and on the degree of library dilution with respect to the full definable set of molecules. The method cannot deal with libraries of analogues presenting only one single substitution site. Our current research focuses on obtaining selected rules of higher order, as exhaustive generation is not possible in many cases due to combinatorial explosion. Work is also under way to apply the method to continuous property values.
SSIR constitutes a rules search engine that has been presented here in a net context. The inner procedure can be improved, and additional benefits are attainable with the help of other techniques mainly devoted to rule management.

Libraries, Sublibraries, Rules and Negation Terms
A congeneric molecular series sharing a common scaffold with n anchorage sites is visualized by the SSIR method as a structure having n factors. In turn, each substitution point i is able to accommodate m_i residues or building blocks (relevant sites are only those for which m_i > 1). In this manner the total number of analogues definable in the library is
$$M = \prod_{i=1}^{n} m_i$$
and each analogue is identified by the list of sorted residues, for instance A_1 B_2 B_3 ... C_n or, simply, ABB...C, as the position of each letter specifies the substitution point.
The SSIR procedure assumes that the molecular property values depend on the effect of some relevant substituents placed at some relevant sites, but also that these non-linear effects can be expanded by superposing rules involving only a few sites.
For illustrative purposes, let us consider a toy library obtained by a combination of residues in 3 sites (see Figure 4). The full set of molecules belonging to the library is given by the Cartesian product R = R_1 × R_2 × R_3, where the levels for each site are represented in turn by the site sets R_1 = {A,B}, R_2 = {A,B,C} and R_3 = {A,B,C,D}. The symbols associated with each residue, despite being repeated, generally stand for distinct codified entities along the sites. If some residues are identical for two or more sites, the repeated notation always remains unambiguous because the corresponding influence over the molecular properties is distinct due to the positional effects. For this example, m_1 = 2, m_2 = 3 and m_3 = 4. Hence, the complete library R contains M = 2 × 3 × 4 = 24 analogues collected in the universal set R = {AAA, AAB, AAC, ..., BCC, BCD}. It should be noted here that the entire library is generally not at the disposal of the researcher; usually only a fraction of it is known (i.e., already synthesized and with the molecular property evaluated).
Here, the wildcard notation X will stand for any of the residues belonging to a particular anchorage point. Hence, the full database is also denoted by R = {X}_1 × {X}_2 × {X}_3 = {XXX} or, simply, XXX. When applying the SSIR method, it is very important to define partial subsets of analogues taken from the library. For instance, the rule XAX stands for all the analogues presenting residue A at the second site. This is the same as the molecular set
XAX = {AAA, AAB, AAC, AAD, BAA, BAB, BAC, BAD}
It is said that this rule embraces or condenses m_1 × m_3 = 8 analogues. Rule XAX is of order 1 because it establishes substitution restrictions in one site.
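The wildcard notation can be made concrete with a short sketch that expands a rule into the analogues it condenses. The names `SITES`, `expand_rule` and `rule_order` are hypothetical, and the sketch assumes one-letter residue codes with X reserved for the wildcard.

```python
from itertools import product

# Site sets of the toy library: R_1 = {A,B}, R_2 = {A,B,C}, R_3 = {A,B,C,D}
SITES = ["AB", "ABC", "ABCD"]


def expand_rule(rule, sites=SITES):
    """List the analogues condensed by a rule such as 'XAX'."""
    if len(rule) != len(sites):
        raise ValueError("a rule needs one term per substitution site")
    choices = []
    for term, site in zip(rule, sites):
        if term == "X":
            choices.append(site)      # wildcard: every residue of the site
        elif term in site:
            choices.append(term)      # a single required residue
        else:
            raise ValueError(f"residue {term!r} is not defined at this site")
    return ["".join(combo) for combo in product(*choices)]


def rule_order(rule):
    """Number of sites on which the rule sets a restriction."""
    return sum(term != "X" for term in rule)


print(expand_rule("XAX"))
# ['AAA', 'AAB', 'AAC', 'AAD', 'BAA', 'BAB', 'BAC', 'BAD']
print(rule_order("XAX"), len(expand_rule("XXX")))
```

The last line reports order 1 for XAX and 24 analogues for the full library XXX.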
This virtual library also admits rules of order 2 (those setting substitution restrictions in two sites, as in the case of XAD, which stands for the two analogues simultaneously presenting residues A and D at positions 2 and 3, respectively) or order 3 (such as the rule identified with the analogue BBC). The maximum rule order definable in a library is n, the number of substitution slots. The total number of rules (assuming that there are no redundant molecular symmetry issues) is
$$V = \sum_{k=1}^{n} \; \sum_{i_1=1}^{n-k+1} \; \sum_{i_2=i_1+1}^{n-k+2} \cdots \sum_{i_k=i_{k-1}+1}^{n} \; \prod_{j=1}^{k} m_{i_j} \qquad (2)$$
In Equation (2) the leftmost summation defines the rule orders, the inner k summation symbols constitute a Nested Summation Symbol (NSS) [17][18][19][20] that generates the combinations of k elements taken from the pool of n (the selection of sites involved in each rule), and the rightmost product counts how many rules are generated from the previously selected k substitution sites. This last term corresponds to the combinatorial object known as variations with repetition. Given the identification of the sites being combined (values of i_1, i_2, ..., i_k), the number of generated rules of order k arising from it is
$$\prod_{j=1}^{k} m_{i_j} \qquad (3)$$
The number of compounds being condensed by each rule of k-th order is
$$\frac{M}{\prod_{j=1}^{k} m_{i_j}} \qquad (4)$$
Regarding the toy example above, Equation (2) generates a total of 9 rules of order 1 (2 + 3 + 4). The rules of order 2 arise from the combinations of sites 1-2, 1-3 and 2-3, which generate 6 (2 × 3), 8 (2 × 4) and 12 (3 × 4) rules, respectively. There are a total of 2 × 3 × 4 = 24 rules of order 3 (the maximum order definable in this example). Hence, the complete sum (2) is V = 9 + 26 + 24 = 59.
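The count in Equation (2) is easy to check numerically. The sketch below, with the hypothetical helper `rules_per_order`, sums the products of the site sizes over every combination of k sites, for the toy library and for the 5 × 8 × 9 × 17 library of the application example.

```python
from itertools import combinations
from math import prod


def rules_per_order(m):
    """Number of positive-term rules of each order for site sizes m."""
    n = len(m)
    return {k: sum(prod(sel) for sel in combinations(m, k))
            for k in range(1, n + 1)}


toy = rules_per_order([2, 3, 4])
application = rules_per_order([5, 8, 9, 17])

print(toy, sum(toy.values()))
# {1: 9, 2: 26, 3: 24} 59
print(application)
# {1: 39, 2: 531, 3: 3029, 4: 6120}
```

The second line reproduces the rule counts quoted in the Results section for rules without negation terms.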
Equations (2)-(4) involve rules made with "positive terms", for which the presence of a residue is required. The universe of rules increases if rules involving the concept of "non-presence of a specific residue" are also taken into account. These are termed negation terms or negation operators. The negation operator of a certain residue is denoted by a bar, as in rule $X\bar{A}X$, defining a complementary term, which here stands for the set {XBX, XCX} condensing the 2 × 1 × 4 + 2 × 1 × 4 = 16 analogues not having residue A at the second position. The combination of two complementary variables can lead to other specific molecular sets. Take as an example the rule $\bar{A}\bar{B}X$, which condenses the 8 analogues not having residue A at the first site and simultaneously not having residue B at the second. That is the same as the set
$$\{B\} \times \{A, C\} \times \{A, B, C, D\} = \{B\} \times \{A, C\} \times \{X\} = \{BAX, BCX\}$$
Although the algebra allows many combinations of negation terms along the rules (see Appendix), here all the rules will only involve juxtaposed positive (A, B, ...) or negative ($\bar{A}$, $\bar{B}$, ...) terms. A word of caution must be given here: it is not always necessary to attach negation operators at every site. It is redundant to apply negation operators to binary sites (those presenting only two residues) because each level is the natural negation of the other. Specifically, for a binary site set {A,B} the complementary rule $\bar{A}$ is the same as the rule B. Conversely, the rule $\bar{B}$ is equivalent to rule A.
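Negation terms can be handled by the same kind of expansion used for wildcards. In this sketch a rule is a list with one term per site, and a negated residue is written with a leading tilde (for example "~A" for the barred A); both the notation and the function name `expand` are assumptions made for illustration.

```python
from itertools import product

SITES = ["AB", "ABC", "ABCD"]  # toy library of the text


def expand(rule, sites=SITES):
    """Analogues condensed by a rule with positive, negated or wildcard terms."""
    choices = []
    for term, site in zip(rule, sites):
        if term == "X":
            choices.append(list(site))                       # any residue
        elif term.startswith("~"):
            choices.append([r for r in site if r != term[1:]])  # all but one
        else:
            choices.append([term])                           # exactly one
    return ["".join(combo) for combo in product(*choices)]


not_a_at_2 = expand(["X", "~A", "X"])     # 16 analogues without A at site 2
not_a_not_b = expand(["~A", "~B", "X"])   # the sets BAX and BCX together

print(len(not_a_at_2), not_a_not_b)
# 16 ['BAA', 'BAB', 'BAC', 'BAD', 'BCA', 'BCB', 'BCC', 'BCD']

# On the binary first site, negating A is the same as requiring B
print(expand(["~A", "X", "X"]) == expand(["B", "X", "X"]))
```

The final comparison prints True, which is the redundancy of negation operators on binary sites noted in the text.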
All in all, systematic rule generation by computation is performed by nesting three combinatorial entities: combinations among k sites, defining the rule order; the generation of variations among the residues attached to the previously selected sites; and another variational generator taking into account the presence or absence of individual negation terms (adding at most a 2^k expanding factor). References [17][18][19][20] provide information about how to implement these discrete algorithms.
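The effect of the extra 2^k factor on the number of rules can be verified with a small count. The helper name `rules_with_negations` is hypothetical, and treating a binary site as contributing no extra negated terms is this sketch's reading of the redundancy remark made for binary sites.

```python
from itertools import combinations
from math import prod


def rules_with_negations(m):
    """Rules per order when every residue may appear plain or negated.

    A site with more than two residues offers 2 * m_i distinct terms; a
    binary site offers only m_i, since negating one level gives the other.
    """
    terms = [2 * mi if mi > 2 else mi for mi in m]
    n = len(terms)
    return {k: sum(prod(sel) for sel in combinations(terms, k))
            for k in range(1, n + 1)}


application = rules_with_negations([5, 8, 9, 17])
toy = rules_with_negations([2, 3, 4])

print(application)
# {1: 78, 2: 2124, 3: 24232, 4: 97920}
print(toy)
# {1: 16, 2: 76, 3: 96}
```

For the application library, which has no binary site, every count is the positive-term count multiplied by 2^k, matching the figures quoted in the Results section.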

Rules Significance and Votes
The relative relevance of a rule arises after a preliminary dichotomization of the library. The investigated molecular property must be codified in a binary fashion (in general tagging each analogue as being or not being of interest, i.e., being or not being active, a drug, desirable, and so on). Many times the original property is not binary, and in these cases the SSIR method requires an arbitrary threshold frontier. Once the available molecular set presents only two classes, associating a significance p-value to each rule is immediate [21][22][23]. The calculation involves the number of molecules of interest which are present (known) in the library and are condensed by the rule: if the known sublibrary consists of a ď M analogues, being b of them declared of interest, a particular rule condensing c known molecules has the following hypergeometric probability to condense d out of c being also of interest: The cumulated probabilities that the rule condenses d or more (d+) structures of interest define the significance level or p-value: p pd`, c; b, aq " p pd : min pb, cq , c; b, aq " minpb,cq For instance, in our example above, let us assume that, from the M = 24 analogues, only a = 15 are at our disposal with known activity and that, from these, b = 5 are declared of interest. Then, if a rule such as XAX condenses c = 6 known molecules (the remaining 2 are not yet synthesized) and d = 4 of them are found to be active, expression Equation (7) indicates that the probability of randomly collecting 4 or more active molecules is p(4+,6;5,15) = 0.047. If this value is (arbitrarily) considered significant, the SSIR method will assume that rule XAX is a promising one, and will include it in the SAR model. Then, the assumption is that the 2 analogues that are not yet synthesized are expected to have promising chances of being active. 
The SSIR method assumes that the superposition of many significant rules provides a ranking method that increases the chances of pointing out new active analogues not present in the known sublibrary. This approach requires one important choice: establishing an arbitrary p-value cutoff ($p_c$) that defines "significant" rules (in our example we have assumed that $p_c$ = 0.05).
The SSIR SAR model is additive in the sense that each compound (whether present or not in the known training library) is assigned a number of votes coming from the significant rules that condense it. Besides, some rules entering the model carry a negative vote. For instance, if the significance of rule BCX is $p(d^{+}, c; b, a) > p_c$, then the SSIR protocol focuses attention on the complementary probability counterpart and evaluates the event consisting of collecting d or fewer active compounds, $p(d^{-}, c; b, a)$. It is easy to demonstrate that

$$p(d^{-}, c; b, a) = 1 - p(d^{+}, c; b, a) + P(d, c; b, a) = p([b-d]^{+}, a-c; b, a)$$

If the condition $p(d^{-}, c; b, a) \le p_c$ is satisfied, the rule BCX receives a negative vote, and this vote will be inherited by all the structures condensed by the rule. In fact, this procedure is equivalent to granting a positive vote to the full negation (complementary rule) of BCX.
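The identity relating the lower-tail and upper-tail probabilities, and the resulting vote assignment, can be checked numerically. The sketch below uses exact fractions to avoid rounding issues; the function names and the three-valued return of `rule_vote` (+1, -1, or 0 for a non-significant rule) are assumptions made for illustration.

```python
from fractions import Fraction
from math import comb

def point_prob(d, c, b, a):
    """Hypergeometric probability P(d, c; b, a) as an exact fraction."""
    if d < 0 or d > min(b, c):
        return Fraction(0)
    return Fraction(comb(b, d) * comb(a - b, c - d), comb(a, c))

def p_upper(d, c, b, a):
    """p(d+, c; b, a): d or more structures of interest among the c condensed."""
    return sum((point_prob(i, c, b, a) for i in range(d, min(b, c) + 1)), Fraction(0))

def p_lower(d, c, b, a):
    """p(d-, c; b, a): d or fewer structures of interest among the c condensed."""
    return sum((point_prob(i, c, b, a) for i in range(0, d + 1)), Fraction(0))

def rule_vote(d, c, b, a, p_cut=0.05):
    """+1 for a significantly enriched rule, -1 for a significantly depleted one, else 0."""
    if p_upper(d, c, b, a) <= p_cut:
        return 1
    if p_lower(d, c, b, a) <= p_cut:
        return -1
    return 0

# Check both equalities for every feasible (c, d) with a = 15 and b = 5.
a, b = 15, 5
for c in range(a + 1):
    for d in range(max(0, c - (a - b)), min(b, c) + 1):
        lhs = p_lower(d, c, b, a)
        assert lhs == 1 - p_upper(d, c, b, a) + point_prob(d, c, b, a)
        assert lhs == p_upper(b - d, a - c, b, a)

votes = [rule_vote(d, 6, b, a) for d in (4, 2, 0)]
print(votes)  # [1, 0, -1]
```

With c = 6, a rule collecting no active compound at all has p(0-,6;5,15) = 210/5005, about 0.042, so it would receive a negative vote at $p_c$ = 0.05.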
It is assumed that the higher the number of net positive votes a structure collects, the higher the probability of it being of interest. Conversely, analogues attached to net negative votes are presumably of non-interest. The SSIR goal is to efficiently impute votes not only to training compounds but also to new test or validation analogues not present in the available sublibrary, which serves as a training set.
Once a training set library is given and the compounds of interest are defined, our SSIR code generates rules, keeps the significant ones according to the pre-established cutoff p-value, and assigns positive or negative votes to them. The SSIR SAR model consists of the whole set of selected rules and attached votes. The model is then ready to be applied to a test or external validation set. Future investigations are to be conducted in order to establish the efficiency of SSIR relative to the known (synthesized) percentage of the library and also to the effect of the actual distribution of substituent combinations.
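The train-then-apply workflow can be illustrated with a deliberately small sketch. It is restricted to order-1 rules without negation terms, and the two-site toy library (active whenever the first site carries residue A), the dictionary-based rule representation, and the names `train` and `net_votes` are assumptions made here, not the authors' implementation.

```python
from math import comb

def p_upper(d, c, b, a):
    """Probability of d or more analogues of interest among c condensed ones."""
    return sum(comb(b, i) * comb(a - b, c - i) for i in range(d, min(b, c) + 1)) / comb(a, c)

def p_lower(d, c, b, a):
    """Probability of d or fewer analogues of interest among c condensed ones."""
    return sum(comb(b, i) * comb(a - b, c - i) for i in range(0, d + 1)) / comb(a, c)

def matches(rule, molecule):
    """A rule is a dict {site index: residue}; sites not listed are unconstrained."""
    return all(molecule[site] == residue for site, residue in rule.items())

def train(known, p_cut=0.05):
    """known maps each molecule (tuple of residues) to True/False (of interest).
    Returns the model as a list of (rule, vote) pairs for the significant rules."""
    a, b = len(known), sum(known.values())
    n_sites = len(next(iter(known)))
    model = []
    for site in range(n_sites):
        for residue in sorted({mol[site] for mol in known}):
            rule = {site: residue}
            condensed = [mol for mol in known if matches(rule, mol)]
            c, d = len(condensed), sum(known[mol] for mol in condensed)
            if p_upper(d, c, b, a) <= p_cut:
                model.append((rule, +1))
            elif p_lower(d, c, b, a) <= p_cut:
                model.append((rule, -1))
    return model

def net_votes(model, molecule):
    """Superpose the votes of all significant rules that condense the molecule."""
    return sum(vote for rule, vote in model if matches(rule, molecule))

# Hypothetical known sublibrary: 8 of the 10 analogues of a 2 x 5 family.
known = {(s1, s2): s1 == "A" for s1 in "AB" for s2 in "ABCD"}
model = train(known)
# Two analogues absent from the known sublibrary are ranked by their net votes.
predictions = {mol: net_votes(model, mol) for mol in [("A", "E"), ("B", "E")]}
print(model)
print(predictions)
```

In this toy case only the two first-site rules are significant (p = 1/70 each), so the unseen analogue AE collects one positive vote and BE one negative vote.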

Cross-Validation
As a way to test the model's predictive capabilities, several cross-validation procedures are implemented in our SSIR program code. Here the leave-one-out (L1O) and the balanced leave-two-out (BL2O) tests have been considered. Both tests are iterative and require the generation of predictions either for single analogues (L1O) or pairs of them (BL2O). All the L1O and BL2O cross-validation cycles have been designed according to the Internal Test Sets (ITS) method [24][25][26], which consists of generating all the models from the beginning, i.e., selecting all the rules from scratch as if the left-out cross-validated analogue(s) were not present in the original library. This constitutes a realistic cross-validation simulation, which helps to detect overparameterization effects. Although this is in principle a time-consuming task, the cumulative nature of SSIR rules allows the final results to be obtained through fast counting procedures. For instance, in order to implement an L1O procedure it is only necessary to perform the first full training involving all the analogues and keep on disk or in computer memory the number of relevant compounds condensed by each rule and the number of these which are of interest. Then, during the cross-validation cycles, for each virtually left-out structure it is only necessary to do a simple re-count of the condensed and active structures that would be obtained if the model were rebuilt from scratch without the cross-validated analogue. In essence, this is a way to see what effect the momentary absence of each cross-validated analogue has on each rule. This allows immediate recalculation of the rule significances and reconsideration of the vote assignment.
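The fast L1O recount can be sketched as follows. The per-rule counts (c, d) are stored once; for each left-out analogue the totals a and b and the counts of the rules that condense it are decremented, and the votes are recomputed. Only rules condensing the left-out analogue are revisited because only they can vote for it. The toy library, the order-1 rule set, and the function names are assumptions made for this illustration.

```python
from math import comb

def vote(d, c, b, a, p_cut=0.05):
    """+1 / -1 / 0 vote of a rule condensing c known analogues, d of them of interest."""
    total = comb(a, c)
    upper = sum(comb(b, i) * comb(a - b, c - i) for i in range(d, min(b, c) + 1)) / total
    lower = sum(comb(b, i) * comb(a - b, c - i) for i in range(0, d + 1)) / total
    if upper <= p_cut:
        return 1
    return -1 if lower <= p_cut else 0

def matches(rule, molecule):
    """Order-1 rule given as (site index, residue)."""
    site, residue = rule
    return molecule[site] == residue

def fast_l1o(known, rules, p_cut=0.05):
    """Leave-one-out predictions obtained by re-counting instead of retraining."""
    a, b = len(known), sum(known.values())
    # Full-training counts, computed once: rule -> (condensed, condensed and active).
    counts = {
        rule: (sum(matches(rule, m) for m in known),
               sum(y for m, y in known.items() if matches(rule, m)))
        for rule in rules
    }
    predictions = {}
    for mol, active in known.items():
        net = 0
        for rule, (c, d) in counts.items():
            if matches(rule, mol):
                # Counts as they would be if `mol` had never been in the library.
                net += vote(d - active, c - 1, b - active, a - 1, p_cut)
        predictions[mol] = net
    return predictions

# Hypothetical 2 x 5 library, active whenever the first site carries residue A.
known = {(s1, s2): s1 == "A" for s1 in "AB" for s2 in "ABCDE"}
rules = [(0, r) for r in "AB"] + [(1, r) for r in "ABCDE"]
predictions = fast_l1o(known, rules)
print(predictions)
```

In this toy library every active analogue keeps one net positive vote and every non-active analogue one net negative vote after being left out, which matches what a full retraining without that analogue would give.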
The same applies to the BL2O cross-validation procedure. The method consists of looping over all the pairs formed by one active and one non-active analogue and cross-validating them. This avoids generating all the combinations of molecular pairs: only $n_{act} \times n_{nact}$ pairs are considered, i.e., the product of the number of active analogues ($n_{act}$) by the number of non-actives ($n_{nact}$). For every left-out pair of molecules, two prediction votes are obtained, one per molecule, and each is accumulated for the corresponding molecule. At the end of the full procedure, the sum of votes of the active and non-active compounds is divided by $n_{nact}$ and $n_{act}$, respectively, giving a set of homogenized ranking votes. As for the L1O case, a fast version of the BL2O procedure has also been implemented, which obtains for each rule the new attached vote (if any) after carefully managing the respective condensation counts.
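A possible reading of the BL2O loop, including the count-based shortcut, is sketched below. Each (active, non-active) pair is virtually removed, the counts of the affected rules are corrected, and the two resulting predictions are accumulated and finally averaged. The toy library, the order-1 rule set, and the function names are assumptions introduced for this sketch.

```python
from math import comb

def vote(d, c, b, a, p_cut=0.05):
    """+1 / -1 / 0 vote of a rule condensing c known analogues, d of them of interest."""
    total = comb(a, c)
    upper = sum(comb(b, i) * comb(a - b, c - i) for i in range(d, min(b, c) + 1)) / total
    lower = sum(comb(b, i) * comb(a - b, c - i) for i in range(0, d + 1)) / total
    if upper <= p_cut:
        return 1
    return -1 if lower <= p_cut else 0

def matches(rule, molecule):
    """Order-1 rule given as (site index, residue)."""
    site, residue = rule
    return molecule[site] == residue

def bl2o(known, rules, p_cut=0.05):
    """Balanced leave-two-out: average prediction of each analogue over all
    (active, non-active) left-out pairs in which it takes part."""
    a, b = len(known), sum(known.values())
    actives = [m for m, y in known.items() if y]
    inactives = [m for m, y in known.items() if not y]
    counts = {
        rule: (sum(matches(rule, m) for m in known),
               sum(y for m, y in known.items() if matches(rule, m)))
        for rule in rules
    }
    sums = {m: 0 for m in known}
    for act in actives:
        for ina in inactives:
            for rule, (c, d) in counts.items():
                hit_act, hit_ina = matches(rule, act), matches(rule, ina)
                if not (hit_act or hit_ina):
                    continue  # the rule votes for neither left-out analogue
                # One active and one non-active analogue leave the library.
                v = vote(d - hit_act, c - hit_act - hit_ina, b - 1, a - 2, p_cut)
                sums[act] += v * hit_act
                sums[ina] += v * hit_ina
    # Each active takes part in n_nact pairs, each non-active in n_act pairs.
    return {m: sums[m] / (len(inactives) if known[m] else len(actives)) for m in known}

# Hypothetical 2 x 5 library, active whenever the first site carries residue A.
known = {(s1, s2): s1 == "A" for s1 in "AB" for s2 in "ABCDE"}
rules = [(0, r) for r in "AB"] + [(1, r) for r in "ABCDE"]
scores = bl2o(known, rules)
print(scores)
```

Because every left-out pair contains exactly one active and one non-active analogue, the proportion of actives in the remaining library stays close to the original one, which is the balancing property that makes the test suitable for binary properties.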

Conclusions
SSIR, a systematic procedure used to rank series of combinatorial analogues, has been described. In addition to this general description of the method, an illustrative application example has been provided. The method has been shown to be fast and systematic, leading to good predictions in some cases. Some overparameterization features can be easily detected by relying on cross-validation or randomization test procedures. It has also been shown that the SSIR method constitutes an inverse (Q)SAR tool engine. The balanced leave-two-out (BL2O) cross-validation procedure has also been described.