Protein Quadratic Indices of the “Macromolecular Pseudograph’s α-Carbon Atom Adjacency Matrix”. 1. Prediction of Arc Repressor Alanine-mutant’s Stability

This report describes a new set of macromolecular descriptors of relevance to protein QSAR/QSPR studies, protein's quadratic indices. These descriptors are calculated from the macromolecular pseudograph's alpha-carbon atom adjacency matrix. A study of the protein stability effects for a complete set of alanine substitutions in Arc repressor illustrates this approach. Quantitative Structure-Stability Relationship (QSSR) models allow discriminating between near wild-type stability and reduced-stability A-mutants. A linear discriminant function gives rise to excellent discrimination between 85.4% (35/41)and 91.67% (11/12) of near wild-type stability/reduced stability mutants in training and test series, respectively. The model's overall predictability oscillates from 80.49 until 82.93, when n varies from 2 to 10 in leave-n-out cross validation procedures. This value stabilizes around 80.49% when n was > 6. Additionally, canonical regression analysis corroborates the statistical quality of the classification model (Rcanc = 0.72, p-level <0.0001). This analysis was also used to compute biological stability canonical scores for each Arc A-mutant. On the other hand, nonlinear piecewise regression model compares favorably with respect to linear regression one on predicting the melting temperature (tm)of the Arc A-mutants. The linear model explains almost 72% of the variance of the experimental tm (R = 0.85 and s = 5.64) and LOO press statistics evidenced its predictive ability (q2 = 0.55 and scv = 6.24). However, this linear regression model falls to resolve t(m) predictions of Arc A-mutants in external prediction series. Therefore, the use of nonlinear piecewise models was required. The tm values of A-mutants in training (R = 0.94) and test(R = 0.91) sets are calculated by piecewise model with a high degree of precision. A break-point value of 51.32 degrees C characterizes two mutants' clusters and coincides perfectly with the experimental scale. For this reason, we can use the linear discriminant analysis and piecewise models in combination to classify and predict the stability of the mutants' Arc homodimers. These models also permit the interpretation of the driving forces of such a folding process. The models include protein's quadratic indices accounting for hydrophobic (z1), bulk-steric (z2), and electronic (z3) features of the studied molecules. Preponderance of z1 and z3 over z2 indicates the higher importance of the hydrophobic and electronic side chain terms in the folding of the Arc dimer. In this sense, developed equations involve short-reaching (k < or = 3), middle- reaching (3 < k < or = 7) and far-reaching (k= 8 or greater) z1, 2, 3-protein's quadratic indices. This situation points to topologic/topographic protein's backbone interactions control of the stability profile of wild-type Arc and its A-mutants. Consequently, the present approach represents a novel and very promising way to mathematical research in biology sciences.

<0.0001).This analysis was also used to compute biological stability canonical scores for each Arc A-mutant.On the other hand, nonlinear piecewise regression model compares favorably with respect to linear regression one on predicting the melting temperature (t m ) of the Arc A-mutants.The linear model explains almost 72% of the variance of the experimental t m (R = 0.85 and s = 5.64) and LOO press statistics evidenced its predictive ability (q 2 = 0.55 and s cv = 6.24).However, this linear regression model falls to resolve t m predictions of Arc A-mutants in external prediction series.Therefore, the use of nonlinear piecewise models was required.The t m values of A-mutants in training (R = 0.94) and test (R = 0.91) sets are calculated by piecewise model with a high degree of precision.A break-point value of 51.32 o C characterizes two mutants' clusters and coincides perfectly with the experimental scale.For this reason, we can use the linear discriminant analysis and piecewise models in combination to classify and predict the stability of the mutants' Arc homodimers.These models also permit the interpretation of the driving forces of such a folding process.The models include protein's quadratic indices accounting for hydrophobic (z 1 ), bulk-steric (z 2 ), and electronic (z 3 ) features of the studied molecules.Preponderance of z 1 and z 3 over z 2 indicates the higher importance of the hydrophobic and electronic side chain terms in the folding of the Arc dimer.In this sense, developed equations involve short-reaching (k ≤ 3), middle-reaching (3 < k ≤ 7) and far-reaching (k

Introduction
Proteins are the major functional molecules of life whose properties are so useful that we employ them as therapeutic agents, catalysts, and materials.Many diseases stem from mutations in proteins that cause them to lose function; some 50% of human cancers are caused by mutations in the tumor suppressor p53 that primarily lower its stability [1,2].Enzymes and receptors are the usual targets of drugs, either to restore function or to destroy infectious agents or cancers.The ultimate goal of protein science is to be able to predict the structure and activity of a protein de novo and how it will bind to ligands.When this is achieved, we will be able to design and synthesize novel catalysts, materials, and drugs that will eliminate disease and minimize ill health [1].
There are now significant advances toward this goal.Experimentalists are able to alter the activity and stability of proteins by protein engineering, and the first tentative steps in protein design are under way.The advent of this approach allows the structure of proteins to be modified in a manner similar to small molecules so that structure-(stability)-activity relationships may be studied.In addition, theoreticians are able to simulate many aspects of folding and catalysis with increasing detail and reliability [3,4].In these studies, the data derived from protein engineering experiments are being used, to benchmark the computer calculations that will eventually be used for designing rational changes in protein stability and allow the modest redesign of proteins [1].
Anfinsen's experiment with ribonuclease A and staphylococcal nuclease discovered that aminoacid sequence of these small proteins encode their final folded structure and also encode the information on how to get to the structures [5,6].However, the "folding problem (prediction of the three-dimensional structure of a protein from its amino-acid sequence)" still remains as one of the greater unsolved problems of protein science.The folding problem is so important due to the large number of the genome sequences completed in recent years.This fact has provoked a large gap between the sharply increasing number of protein sequences entering into data banks and the slow accumulation of known structure.Thus, predicting the spatial structure based on a given protein primary-sequence information could play a significant role in conjunction with experimental methods [7].
Many researchers worldwide have worked on the development of models in order to predict the stability of mutants of a wild protein.For instance, Shortle has studied 118 mutants of Staphylococcal nuclease.Similarly, other researchers have modelled the stability of 145 mutants of T4 Lysozyme, 96 mutants of Barnase and 71 mutants of Chymotrypsin in what seems to be the models with the largest mutated proteins.Other important studies included modelling of the stability of 66 mutants of GeneV, 65 mutants of Human lysozyme and 58 mutants of protein L. In addition, they stand out the studies with 40 mutants of Trypsin inhibitor, 38 mutants of TNFn3 and 31 mutants of FKBP12.They have been also reported models for proteins with more than 10 mutants but less than 30 such as ACBP, Ribonuclease T1, Ribonuclease H, α Lactalbumin, hen Lysozime, Subtilisin inhibitor, U1A, ISO-1 cytochrome C, Trp synthase.Other less-mutated studied proteins are CD2, Calbindin, Apomyoglobin, Adrenodoxin, Cold shock, ribonuclease A, λ-CRO and so on.As summarized by Zhou and Zhou's excellent work, a total of 35 proteins with their respective 1023 mutants have been studied including all the examples above.In this work, Zhou and Zhou not only do an excellent review of the topic but also use the data on the 1023-mutant stability to develop what seem to be one of the largest unified models up to date [8].
Much work is currently underway to determine the contribution of individual residues to the overall fold and stability of a protein [9][10][11][12][13].This is a very challenging problem due to the complexity of both the native and unfolded states, and the transition between them.Robert Sauer has done some of the seminal work in this area on the Arc repressor [14,15].This protein provides an attractive system in which to address this issue because it is small (53 AAs), and amenable to genetic and biophysical studies [16][17][18].This is a homodimer protein with a globular domain formed by the intertwining of their monomers.It's secondary structure consists on two anti-parallel β-sheets from residues 8-14, and α-helices formed by residues 15-30 and 32-48 [15].Nevertheless, until our concern, neither Zhou and Zhou's work nor other reported in the literature, predict the stability of Arc Repressors [8].
Recently, a novel scheme to the rational -in silico-molecular design (or selection/identification of chemicals) and to QSAR/QSPR studies has been introduced by one of the present authors.It is the socolled TOpological MOlecular COMputer Design (TOMOCOMD) [19].This method has been developed to generate molecular descriptors based on the linear algebra theory.In this sense, atom, atom-type and total quadratic and linear indices have been defined in analogy to the quadratic and linear mathematical maps, respectively [20,21].This approach has been successfully employed in QSPR and QSAR studies [20][21][22][23][24][25][26][27][28][29][30], including studies related to nucleic acid-drug interactions [31].The approach describes changes in the electron distribution with time throughout the molecular backbone.
The TOMOCOMD-CARDD (acronym of the Computed-Aided 'Rational' Drug Design) strategy is very useful for the selection of novel subsystems of compounds having a desired property/activity [24,[28][29][30], which can be further optimized by using some of the many molecular modeling methods at the disposition of the medicinal chemists.The method has also demonstrated flexibility in relation to many different problems.In this sense, the TOMOCOMD-CARDD approach has been applied to the fasttrack experimental discovery of novel anthelmintic [28,30] and antimalarials [29] compounds.The prediction of the physical, chem-physical and chemical properties of organic compounds is a problem that can also be addressed using this approach [20,25,27].Codification of chirality and other 3D structural features constitutes another advantage of this method [26].The latter opportunity has allowed the description of the significance-interpretation and the comparison to other molecular descriptors [21,25].Additionally, promising results have been found in the modeling of the interaction between drugs and HIV packaging-region RNA in the field of bioinformatics using TOMOCOMD-CANAR (Computed-Aided Nucleic Acid Research) approach [31].
Therefore, describing an extended TOMOCOMD-CAMPS (Computed-Aided Modelling in Protein Science) approach to account for protein structure constitutes the main aim of this paper.In the present study, we propose a total and local definition of protein quadratic indices of the "macromolecular pseudograph's α-carbon atom adjacency matrix".In order to validate the method, protein's total macromolecular indices were used to develop quantitative models.In this sense, protein stability effects are described for a complete set of alanine substitutions in Arc repressor.The present result allows us to predict the melting temperature referred to unfolding Arc dimer.

Arc Dimer Structure and Melting Temperature of a Complete Set of A-Substitution Mutants
Arc is a homodimer in which each monomer intertwines with the other to form a single, globular domain with a well-defined core.Several side-chain hydrogen bond and salt bridge interactions are involved in the Arc crystal structure.An exhaustive representation of these interactions can be found in some detail elsewhere (see Figure 1b in Reference 15).Nevertheless, an overview of these electrostatic interactions in Arc repressor structure will be given.Hydrogen-bond interactions take place [15]: i) Between side chain in the same subunit (R16-D20, D20-R23, N29-E36, E36-R31, E36-R40, E43-K46, E43-K47) and; those between side chains in different subunits (E28-R50, R40-S44, R40-F48).ii) Between a side chain and main-chain atom intersubunit (W14-N34, N34-R13) and; those between a side chain and main-chain atom intrasubunits (E17-E17, S32-S35, S44-R40).The data of Arc repressor mutant was taken from the literature [15].In this paper, Alanine substitutions were constructed at each of the 51 non-alanine positions in the wild-type Arc sequence.To avoid intracellular proteolysis and purification difficulties, these authors constructed the alanine substitution mutant in backgrounds containing the carboxy-terminal extensions (His) 6 (designated st6) or (His) 6 -Lys-Asn-Gln-His-Glu (designated st11) [18,32].These tail sequences allow affinity purification, reduce degradation and cause no significant changes in protein stability [33].
Milla et al. subjected each purified mutant of Arc to thermal and urea denaturation experiments.Stability of the proteins was checked by melting temperature (t m ) [15].The values of t m for 53 Arc homodimers reported by these authors are given in Table 1 (see sixth column).In this Table, the Arc mutants are grouped into two categories: 1) mutants with near wild-type stability and, 2) mutants with reduced stability.The first group also includes one mutant with increased stability (PA8-st6).
Otherwise, the second one includes five unfolded mutants, even at low temperatures (< 20 o C) and absence of denaturant.
In equilibrium and kinetic unfolding-refolding studies only native Arc dimers and denatured monomers are significantly populated.Thus, folding and dimerization are concerted processes [15][16][17].For this reason, it is important to remember that t m refers to unfolding of the Arc homodimer.Then, one must take into consideration that each single mutation changes two side chains in the Arc dimer, being stability effects roughly twice these observed for monomeric proteins.Moreover, changes in stability may arise due to mutation disrupts of a native interaction, when the native structure of the mutant undergoes relaxation, or because of the change on the properties of the denatured mutant protein [9,[11][12][13]15].

Protein Quadratic Indices of the "Macromolecular Pseudograph's α-Carbon Atom Adjacency Matrix"
The major constituent of proteins is an umbranched polypeptide chain consisting of L-α-amino acids linked by amide bonds between the α-carboxyl group of one residue and the α-amino group of the next.The sequence of the amino acids defines the primary structure [1,[34][35][36][37][38].As previously outlined, the genetically encoded sequence of a protein determines its three-dimensional structure [5,6].That is to say, if the side chain of each amino acid within a protein is removed, the secondary structure of the protein is obtained.It is constructed around planar units of peptide bond.Closer examination reveals regions where the secondary structure is organized into repetitive and regular elements.
Afterwards, the side chains can be added back to the backbone, and it is then seen how the ternary structure of the proteins is formed by the packing of the regular elements of secondary structure by way of their side chains.For this reason, the structure of each protein can be expressed in a quantitative way by side chain amino-acid properties.Subsequently, Charton and Charton determined the dependence of protein conformation upon the side chain structure of the amino-acid residues using Chou-Fasman parameters [39].
In other approach about structure-activity studies, Hellberg et al. developed the so-called principal properties or z-values [40].This peptide QSAR methodology is based on a parametrization of each amino-acid occurring in a peptide chain with three z-values, which are linear combinations of the original measured variables.These values are proposed to be related to hydrophilicity, bulk, and electronic properties.The principal properties have been successfully used to seek peptide QSARs [40][41][42].Other descriptors used in peptides QSAR studies have been derived from the side-chain surface area and atomic charges of the amino acids [43].
On the other hand, the general principles of the quadratic indices of the "molecular pseudograph`s atom adjacent matrix" for small-to-medium sized organic compounds have been explained in some detail elsewhere [20,[22][23][24][25][26]28,31].However, an extended overview of this approach will be given in this work.
First, in analogy to the molecular vector X used to represent organic molecules we introduce here the macromolecular vector (X m ).The components of this vector are numeric values, which represent a certain side-chain amino-acid property.These properties characterize each kind of amino-acid (R group) within the protein.Such properties can be z-values [40], side-chain isotropic surface area (ISA) and atomic charges (ECI) of the amino acid [43], and so on.For instance, the z 1(AA) scale of the amino acid AA takes the values z 1(V) = -2.69 for valine, z 1(A) = 0.07 for alanine, z 1(M) = 2.49 for methionine and so on [40,43].Table 2 depicts descriptors scales z 1 , z 2 , and z 3 for the natural amino acids.Table 2. Descriptor Scales z 1 , z 2 and z 3 for the Natural Amino Acids [40,43].Where n is the dimension of the real sets ( ℜ n ).
If a protein consists of n amino acids (vector of ℜ n ), then the k th (k = 10) protein's total quadratic indices, q k (x m ) are defined by a q application (q: ℜ n → ℜ ).Where, X m can be expressed by a linear combination X m = x 1 a 1 +...+x n a n , being the vectors (a i ) 1≤i≤ n a base of ℜ n [20,[22][23][24][25][26]28,31].In this context, the k-th protein's total quadratic indices q k (x m ) are calculated afterwards from this macromolecular vector as Eq. 1 shows, where, k a ij = k a ji (symmetric square matrix), n is the number of amino acids of the protein (α-carbon atom in the protein's backbone) and m X 1 ,…, m X n are the coordinates of the macromolecular vector X m in the base a i .In this case, the canonical base of ℜ n {e 1 ,…,e n } is used as the quadratic form's base.
Thereafter, the coordinates of any vector X m coincide with the components of this vector.For that reason, such coordinates can be considered as weights of the vertices (α-carbon atoms) of the pseudograph of the protein's backbone.The coefficients k a ij are the elements of the k th power of the macromolecular matrix M(G m ) of the protein's pseudograph (G m ).The term pseudograph in chemical graph-theory was introduced by Frank Harary [44].According to him, a pseudograph is a graph with multiple edges or loops between the same vertices or the same vertex.Loop-multigraph [45] or general graphs [46] are other terms also used in this research area [47]. Here, where n is the number of α-carbon atoms in protein's backbone.The elements a ij are defined as follows: (2) = 1 if i = j and the amino acid i has a hydrogen bond between its side chain and its main-chain atom = 0 otherwise where, E(G m ) represents the set of edges of G m .In this adjacency matrix M(G m ) the row i and column i correspond to vertex v i from G m .The elements a ii = 1 are loops in v i .On the other hand, the element a ij of this matrix represents a bond between an α-carbon atom i and other j.Here, we consider only covalent interaction (peptidic bond) and hydrogen-bond interaction (within a chain as well as between chains).As a first approximation, we considered both interactions equivalent, taking into account the "connectivity of the protein".The matrix M k (G m ) provides the number of walks of length k linking the α-carbon atom of the amino acids i and j.Additionally, proteins containing amino acids that present hydrogen bond between its side chain and its main-chain atom are represented like a pseudograph.Specifically, the Arc repressor presents this kind of interaction for the amino acid E17, where the presence of this intrasubunit hydrogen bond is accounted by means of a loop in its α-carbon atom of the protein's backbone [15].
We can obtain q k (x m ) by means of the matrix expression Being, [ m X] the column vector (an nx1 matrix) of the coordinates of X m in the canonical base of ℜ n , [ m X] t the transpose of [ m X] (an 1xn matrix) and M k (G m ) the k th power of the matrix M(G m ) (quadratic form's matrix).Table 3 exemplifies the calculation of q k (x m ) for bradykinin-potentiating pentapeptides previously used in QSAR studies [43].In addition to total protein quadratic indices, computed for the whole-molecule, local-fragment (both aminoacid and aminoacid-type) formalisms can be developed.The q kL (x m ) are graph-theoretical invariants for a given fragment (F R ), where F R is a connected subgraph and represents a specific group or set of amino acids in a protein.The definition of these descriptors is as follows: where m is the number of amino acids (α-carbon atoms) of the fragment of interest and k a ijL is the element of the file i and column j of the matrix M k L (G m ).This matrix is extracted from M k (G m ) and contains information referred to the vertices of the specific protein fragments (F R ) and also of the molecular environment.
The matrix M k L (G m ) = [ k a ijL ] with elements k a ijL is defined as follows: where, the k a ij are the elements of the k th power of M(G m ).These local analogues can also be expressed in matrix form by the expression: Note that for every partition of a protein into Z macromolecular fragments there will be Z local macromolecular-fragment matrices.That is to say, if a protein is partitioned into Z macromolecular fragments, the matrix M k (G m ) can be partitioned into Z local matrices M k L (G m ), L = 1,... Z.The k th power of the matrix M(G m ) is exactly the sum of the k th power of the local Z matrices.
In the same way, M k (G m ) = [ k a ij ] where, (7) and the total protein's quadratic indices are the sum of the macromolecular quadratic indices of the Z molecular fragments (see Table 3), Aminoacid and aminoacid-type quadratic indices are specific cases of local protein quadratic indices.In this sense, the k th aminoacid quadratic indices are calculated by summing the k th aminoacid quadratic indices of all aminoacids of the same aminoacid type in the protein.In the aminoacid-type quadratic indices formalism, each aminoacid in the molecule is classified into an aminoacid-type (fragment), such as apolar, polar uncharged, positive charged, negative charged, aromatic, and so on.For all data sets, including those with a common molecular scaffold as well as those with very diverse structure, the k th aminoacid-type quadratic indices provide important information.
Any local protein's quadratic index has a particular meaning, especially for the first values of k, where the information about the structure of the fragment F R is contained.Higher values of k relate to the environment information of the fragment F R considered within the macromolecular pseudograph (G m ).
In any case, a complete series of indices performs a specific characterization of the chemical structure.The generalization of the matrices and descriptors to "superior analogues" is necessary for the evaluation of situations where only one descriptor is unable to bring a good structural characterization [48,49].The local macromolecular indices can also be used together with total ones as variables for QSAR/QSPR modeling of properties or activities that depend more on a region or a fragment than on the macromolecule as a whole.

Ca
Amino acid residue (side chain R) Here, we consider only covalent interaction (peptidic bond), but noncovalent interaction (hydrogen-bond and salt bridge interaction) can be taken into consideration (within a chain as well as between chains) Macromolecular Vector: In the definition of the X m , as macromolecular vector, the one letter symbol of the amino acids indicates the corresponding side-chain amino-acid property, e.g., z 1 -values.That is to say, if we write V it means z 1 (V), z 1 -values or some amino acid property, which characterizes each side chain in the polypeptide.Therefore, if we use the canonical bases of R 5 , the coordinates of any vector X m coincide with the components of that macromolecular vector Total (whole molecule) protein quadratic indices of zero, first and second order are a quadratic maps; q k (x m ): ℜ n → ℜ such that, q 0 (V, K, W, A, A) = (V 2 +K 2 +W 2 +A 2 +A 2 ) = 37.874 Table 3. Cont.

The zero, first and second powers of the local (amino-acid) matrix
( ) and the total (whole-molecule) quadratic indices are the sum of the macromolecular quadratic indices of the 5 amino-acids,

TOMOCOMD Software
TOMOCOMD is an interactive program for molecular design and bioinformatics research [19].The program is composed by four subprograms, each one of them dealing with drawing structures (drawing mode) and calculating 2D and 3D molecular descriptors (calculation mode).The modules are named CARDD (Computed-Aided 'Rational' Drug Design), CAMPS (Computed-Aided Modelling in Protein Science), CANAR (Computed-Aided Nucleic Acid Research) and CABPD (Computed-Aided Bio-Polymers Docking).In this paper we outline salient features concerned with only one of these subprograms: CAMPS.This subprogram was developed based on a user-friendly philosophy without prior knowledge of programming skills.
The calculation of total and local macromolecular quadratic indices for any peptide or protein was implemented in the TOMOCOMD-CAMPS software [19].The main steps for the application of this method in QSAR/QSPR can be briefly resumed as follows: 1. Draw the macromolecular pseudographs for each protein of the data set, using the software's drawing mode.This procedure is carried out by a selection of the active aminoacid symbol belonging to 'natural' aminoacid code.Here, we consider only covalent interaction (peptidic bond) and hydrogen-bond interaction (within a chain as well as between chains).Afterward, we draw the mutants by changing an AA for alanine and considering that this change only affect the possibility of this region of the protein to form polar interaction (because we suppressed the hydrogen interaction if the former AA had it). 2. Use appropriated amino acid weights in order to differentiate the side chain of each amino acid.
In this work, we used as amino-acid property the three z-values [40,43]. 3. Compute the protein quadratic indices of the "macromolecular pseudograph's α-carbon atom adjacency matrix".They can be performed in the software calculation mode, in which one can select the side chain properties and the family descriptor previously to calculate the molecular indices.This software generates a table in which the rows and columns correspond to the compounds and the q k (x m ), respectively.4. Find a QSPR/QSAR equation by using statistical techniques, such as multilinear regression analysis (MRA), Neural Networks (NN), Linear Discrimination Analysis (LDA), and so on.That is to say, we can find a quantitative relation between a property P and the q k (x m ) having, for instance, the following appearance, where P is the measurement of the property, q k (x m ) [or q kL (x m )] is the k th total [or local] macromolecular quadratic indices, an the a k 's are the coefficients obtained by the statistical analysis.5. Test the robustness and predictive power of the QSPR/QSAR equation by using internal and external cross-validation techniques, 6. Develop a structural interpretation of the obtained QSAR/QSPR model using macromolecular quadratic indices as molecular descriptors.

Statistical Analysis
Linear Discrimination Analysis (LDA), Linear Multiple Regression (LMR) and the nonlinear estimation analysis, Piecewise Linear Regression (PLR) were used to obtain quantitative models.These statistical analyses were carried out with the STATISTICA software package [50].Forward stepwise was fixed as the strategy for variable selection in the case of LDA and LMR analysis.The tolerance parameter (proportion of variance that is unique to the respective variable) used was the default value for minimum acceptable tolerance, which is 0.01.
LDA is used in order to generate the classifier function on the basis of the simplicity of the method [51].To test the quality of the discriminant functions derived we used the Wilks' λ and the Mahalanobis distance.The Wilks' λ statistic for overall discrimination can take values in the range of 0 (perfect discrimination) to 1 (no discrimination).The Mahalanobis distance indicates the separation of the respective groups.It shows whether the model possesses an appropriate discriminatory power for differentiating between the two respective groups.The classification of cases was performed by means of the posterior classification probability, which is the probability that the respective case belongs to a particular group, i. e., mutants with near wild-type stability (H) or mutants with reduced stability (P) (see Table 1, second column).In developing this classification function the values of 1 and -1 were assigned to H and P mutants.The quality of the ADL-model was also determined by examining the percentage of good classification and the proportion between the cases and variables in the equation.We also consider the linear discriminant canonical analysis statistics such as: canonical regression coefficient (R canc ), chi-squared and p-level [p(χ 2 )].Validation of the discriminant function was corroborated by means of leave-n-out cross-validation procedures.
A simple linear and other more complex nonlinear model was obtaining using LMR and PLR as statistic techniques, respectively.The quality of the models was determined examining the statistic parameters of multivariable comparison of regression and cross-validation procedures.In this sense, the quality of models was determined by examining the regression coefficients (R), determination coefficients (R 2 ), Fisher-ratio's p-level [p(F)], standard deviations of the regression (s) and the leaveone-out (LOO) press statistics (q 2 , s cv ) [52].In recent years, the LOO press statistics (e.g., q 2 ) have been used as a means of indicating predictive ability.Many authors consider high q 2 values (for instance, q 2 > 0.5) as indicator or even as the ultimate proof of the high-predictive power of a QSAR model.In a recent paper, Golbraikh and Tropsha demonstrated that a high value of LOO q 2 appears to be a necessary but not the sufficient condition for the model to have a high predictive power [53].
In addition, to assess the robustness and predictive power of the found models, external prediction (test) sets were also used.This type of model validation is very important, if we take into consideration that the predictive ability of a QSAR model can only be estimated using an external test set of compounds that was not used for building the model [52,53].

Classification Model
The development of a discriminant function that permits the classification of mutants as near wildtype stability or reduced stability is a key of the present approach to describe the protein stability effects of a complete set of alanine substitutions in Arc repressor.The overall performance of the current method critically depends on the selection of cases of the training set used to build the classifier model.Here we consider a general data set of 53 A-mutants, 28 of them having near wildtype stability (1-28) and the rest being mutants with reduced stability .This data set was randomly divided into two subsets, one containing 41 mutants (21 having near wild-type stability and 20 reduced stability) was used as a training set, and the other containing 12 mutants (7 having near wild-type stability and 5 reduced stability) was used as a test set.These mutants were never considered in the development of the quantitative model.
The principle of parsimony (Occam's razor) was taken into account as strategy for model selection.In its original form, the Occam's razor states that "Numquam ponenda est pluritas sin necessitate", which can be translated as "Entities should not be multiplied beyond necessity" [54].In this case simplicity is loosely equated with the number of parameters in the model.If we understand predictive error to be the error rate for unseen examples, the Occam's razor can be stated for the selection of QSAR/QSPR models as ("QSAR/QSPR Occam's Razor"): Given two QSAR/QSPR models with the same predictive error, the simpler one should be preferred because simplicity is desirable in itself [54].In this connection, we select the functions with higher statistical signification but having as few parameters (a k ) as possible.Equation (10) shows the linear classification model obtained together with the LDA's statistical parameters: where N is the number of mutants, λ is the Wilks's statistic, D 2 is the squared Mahalanobis distance and F is the Fisher ratio.
These statistics indicate that model (10) is appropriate for the discrimination of near wild-type stability/reduced stability mutants studied here.It classifies correctly 85.0% (18/21) of near wild-type stability mutants and 85.7% (17/20) of reduced stability mutants in the training set, for a global good classification of 85.4% (35/41).The percentages of false mutants in training set are the same for both groups: 7.32% (3/41).False near wild-type stability mutants are those reduced-stability mutants that model classifies as near wild-type stability mutants, and the false reduced-stability mutants are near wild-type stability mutants classified as reduced-stability mutants by the model.In Table 1 we give the classification of mutants in the training set together with their posterior probabilities calculated from the Mahalanobis distance.
To assess the predictability of the classification model (10), a leave-n-out cross-validation was carried out using the classification tree module.The selected conditions for the validation procedure were the following: discriminant-based linear combination as split method, prune on misclassification error as stopping rule and the same prior probabilities than in equation (10) (proportional to group size).Once the selected conditions were applied to the classification tree module, the equation (10) was obtained and varying the folding parameter of the cross-validation, a leave-n-out routine could be developed.This model shown an 82.93, 82.93, 80. 49, 80.49, 82.93, 80.49, 80.49, 80.49, 80.49 and 80.49% of global good classification when n varied from 2 to 10 in the leave-n-out cross validation procedures.The model was stabilized around 80.49% when n was > 6 (see Figure 1).The most important criterion to accept or not of a discriminant model, such as model (10), is based on the statistics for the test set.Model (10) classifies correctly 11 of 12 mutants, for a global classification of 91.67%.In Table 1, we give the classification of mutants in the test set.If we considered the data set and the test set (full set) the percentage of good classification was 86.79% (46/53).
Canonical analysis is used here to test both the ability of protein's quadratic indices to discriminate between the two groups of Arc A-mutants and to order these mutants accordingly with their stability profile.
When LDA analysis is applied to solve the two-group classification problem we ever find two classification functions [55,56].Medicinal chemists used to report the function obtained by taking the difference between these two functions when develop QSAR studies [57][58][59][60][61][62][63].
However, we cannot use these two classification functions to evaluate all the compounds and obtain a bivariate stability map because they are not orthogonal [55,56].To solve this problem we used canonical analysis in this case the dimensional reduction caused by canonical analysis makes possible to obtain a one-dimension stability map [56].
That is the same that we can order all compounds taking into account its canonical scores.The canonical scores of all A-mutants of Arc repressor appear in Table 1 (fifth column).We can detect an overall ascendant tendency of canonical scores when they are plotted in the same order in which stability (t m ) increases (see Figure 2).As it is expected, the over all mean of canonical root scores for the group of near wild-type stability mutants has an opposite sign (+) with respect to the other group (-) [56].
Model (12) explains almost 72% of the variance of the experimental t m .The predictive ability of model ( 12) is evidenced by the value of the LOO press statistics (for example q 2 > 0.5 and s cv , which is only 10.64% higher than that of the regression model) [52].Taken into account that a high value of LOO q 2 (for instance, q 2 > 0.5) appears to be a necessary but not a sufficient condition for the model to have a high predictive power [53], a test set was also used to access the predictive ability of the equation (12).When linear regression model (12) was applied to resolve t m predictions of Arc Amutants in the prediction set, poor results were found (see Table 1; the last two columns).Thus, this model ( 12) has a low predictive power.
Different protein folding may be the reason for the lack of linear regression between protein's quadratic indices and stability (t m ); leading to a nonlinear dependence between t m and protein's quadratic indices.In this case other terms should be taken into consideration such as cooperative saltbridges and hydrogen-bonds formation, hydrophobic forces, steric terms, and so on.In this sense, far from strong quantitative correlations between stability and structural factors have been obtained in previous study [15].For example, when the set of t m values were tested for linear correlations with fractional side-chain solvent accessibility, with changes in buried surface area, with average side-chain B-factors, and with the number of side-chain atoms or total atoms within 6 Å of the atoms deleted by the alanine substitution, the pairwise correlation coefficient (r 2 ) ranges from 0.21 to 0.38 [15].Thus, even though most substitution of alanine for hydrophobic-core residues are destabilizing, there is no simple relationship between the size of the replaced core residue and the destabilizing effect [15].
As we previously pointed out, the quality of a QSAR/QSPR model is mainly expressed by its predictive power, measured to a test set of mutants not included in the training set.In Table 1, we depicted the observed, predicted, and residual values of t m for the training and test set.As can be appreciated, the piecewise model found to describe the stability of Arc A-mutants has a rather good predictive power (R = 0.91, R 2 = 0.82, s = 4.249).In developing this model only one mutant (1PA8-st6) was detected as statistical outlier.This is a logic result because only this mutant (PA8) is significantly more stable than wild type.The t m of this mutant protein is about 15 o C higher than that of the wild-type parent (see Table 1), and the free energy of unfolding is increased by 2.9 kcal mol -1 compared with wild type [15].
The main difficulty of the regression non-linear piecewise, is its limitation in the prediction of neither new mutants whose profiles of stability are nor known.The problem here is: which equation should be applied to a new mutant not considered in this study?The Bkpt value (51.32), perfectly agrees with an experimental scale previously proposed [15].The same scale was used for grouping mutants into the two studied groups in our ADL approach.For this reason, we can use the ADL and piecewise models in combination to classify and to predict the stability of the mutans' Arc homodimers.

Interpretation of Obtained Models
At present it is known that the folding of Arc repressor is influenced by different kinds of interactions [14-16, 18, 22, 23].An overwhelming role is played by the Van der Waals forces [15].The hydrophobic interaction is another factor influencing the stability due to the hydrophobic nature of the Arc wild-type core [15][16][17].Another factor is related to electrostatic force, mainly due to intra and intersubunit salt bridges and hydrogen bonds [15][16][17].
However, most of these factors are interrelated to each other, and it is difficult to determine the contribution of each one by separate.For instance, hydrophobic interaction is intimately related to van der waals forces, and the electrostatic interactions are also related to dispersion interactions, which are part of the Van der Waals forces.In addition, Arc wild-type and its mutants showed a cooperative behaviour in folding/dimerization processes [15][16][17].
As can be observed in the obtained models, the included variables are related with the factors that influence on the stability and this one with the structural features of Arc dimer.In this sense, the protein's quadratic indices calculated using z 1 , z 2 , or z 3 values, as amino-acid (side-chain) properties are included in most of the developed models.These z-values are related to hydrophilicity, bulk, and electronic properties, respectively.For this reason, it is possible to determine the nature of the driving forces of the Arc repressor folding, e.g., hydrophobic, steric, or electronic.
The preponderance of hydrophobic and electronic effects in the obtained equations (10-13) over other types of protein's quadratic indices clearly indicates the importance of the hydrophobic and electronic side chain factor in the folding of Arc dimer.
It must be pointed out that developed equations (10-13) involve short-reaching (k ≤ 3), middlereaching (3 < k ≤ 7) and far-reaching (k = 8 or greater) protein's quadratic indices.This situation means that the stability profile of wild-type Arc and its A-mutants results in topologic/topographiccontrolled protein's backbone interactions.

Conclusions
In this study a new set of macromolecular descriptors relevant to protein QSAR/QSPR studies is present.These descriptors, total and local protein's quadratic indices, are calculated from the macromolecular pseudograph's α-carbon atom adjacency matrix using z-values and canonical bases as side chain of amino-acid property and quadratic form's bases, respectively.Their derivation is straightforward, and it is easy to interpret the QSARs/QSPRs that include them.The total protein's quadratic indices and LDA, LMR and PLR have been used in QSSR studies of 53 Arc A-mutants.The resulting quantitative models are significant from a statistical point of view.A LOO cross-validation procedure (internal validation) and an external predicting series (external validation) revealed that the QSSR models had a good predictability.
The models found to describe the stability profile of wild type Arc and its A-mutants include protein's quadratic indices accounting for hydrophobic (z 1 ), bulk-steric (z 2 ), and electronic (z 3 ) features of the studied molecules.These models using such combination of molecular descriptors are better than any other model that can be found by using only one type of the studied descriptors.We interpret these results as suggesting that many of the Arc mutations affect stability in more than one way and: by disrupting specific electronic interaction, by changing hydrophobic burial, and/or by changing the structure of the native or the denatured protein [9][10][11][12][13].Thus, we have proved that the combined use of z 1, 2, 3 -protein's quadratic indices is an appropriate approach to QSSR studies.These models are not only good enough to predict thermodynamic parameter of the folding of mutants of Arc dimer repressor, but also permit the interpretation of the driving forces of such folding processes.
The approach described here represents a novel and very promising way to bioinformatics research.We would expect computational protein science to have a similar effect on the search for new vaccines, receptors, drugs, and so on as molecular modelling and QSAR have had on the search for new drugs.

5 (
69 2.84 -4.75 0.07 0.07] [ m X] t = transposed of [ m X] and it means the vector of the coordinates of X m in the canonical basis of R 5 (an 1x5 matrix) [ m X]: vector of coordinates of X m in the canonical basis of R an 5x1matrix) If the peptide is partitioned into each (5) amino acid, the matrix M k (G m ) can be partitioned into 5 local matrices M k L (G m ), L = 1,... 5.The k th power of the matrix M(G m ) is exactly the sum of the k th power of the local (5) matrices: M k (G m ) N = 41 λ = 0.476 D 2 = 4.40 F(4,36) = 9.8965 p(F) < 0.0001

Figure 1 .
Figure 1.Behavior of the global or total percentage of good classification (accuracy) in different n-fold cross-validation analysis.

Figure 2 .
Figure 2. Overall ascendant tendency of canonical scores plotted in the same order in which t m increases.Blocks I and III contain misclassified Arc A-mutants.

Table 1 .
Results of the ADL, PLR and LMR Analyses of the Arc A-Mutants in the Training and Test Sets.

Table 1 . Cont. Protein Class b P% (P) c P% (H) c Score d t m (Obs) e t m (Pred) f Res g t m (Pred) h Res g
(12)siduals: t m (Obs) -t m (Pred).hCalculatedt m values by the linear regression model(12).i Statistical outlier.

Table 3 .
Definition and Calculation of Three (k = 0-2) Total and Local (Side Chain Amino Acid) Protein Quadratic Indices of the "Macromolecular Pseudograph's α-Carbon Atom Adjacency Matrix" of a Bradykinin-Potentiant Pentapeptide.