Functional and Structural Features of Disease-Related Protein Variants

Modern sequencing technologies provide an unprecedented amount of data of single-nucleotide variations occurring in coding regions and leading to changes in the expressed protein sequences. A significant fraction of these single-residue variations is linked to disease onset and collected in public databases. In recent years, many scientific studies have been focusing on the dissection of salient features of disease-related variations from different perspectives. In this work, we complement previous analyses by updating a dataset of disease-related variations occurring in proteins with 3D structure. Within this dataset, we describe functional and structural features that can be of interest for characterizing disease-related variations, including major chemico-physical properties, the strength of association to disease of variation types, their effect on protein stability, their location on the protein structure, and their distribution in Pfam structural/functional protein models. Our results support previous findings obtained in different data sets and introduce Pfam models as possible fingerprints of patterns of disease related single-nucleotide variations.


Introduction
The elucidation of the association between individual phenotypes and personal genome is the major goal of modern genomic research. This research program is heavily supported by the fast improvement of sequencing technologies and it mostly focuses on the dissection of the genetic basis of diseases [1]. Although many phenotypic traits depend on the combination of multiple genetic and environmental factors, a number of relationships between single genes and diseases have been collected in databases [2,3]. Among the different types of variations, the most investigated are single-nucleotide mutations occurring in coding regions and leading to the substitution of a single residue in the expressed proteins. The challenge is then to assess the effects of single residue variations (SRVs) on protein structure and function [4]. Different studies searched for features that can help the discrimination of disease related variations in terms of type of mutation, localization on the protein structure, and functional characterization of the protein carrying the variation [5][6][7][8][9][10][11][12][13]. As a general trend, disease-related SRVs are often found in buried regions [9,10] or in surface patches likely to mediate interaction with other proteins [6,8], while a small number occurs in catalytic and metal binding sites [6,9]. Proteins carrying disease related mutations perform many different functions, and a relevant number of enzymes is involved [9].
Relation to protein stability has been investigated with different approaches and on different datasets. Since the thermodynamic effect of disease-related SRVs has been characterized in few

HVAR3D: A Dataset of Protein Variants with Structural and Functional Information
We extracted from Humsavar March 2018, (https://www.uniprot.org/docs/humsavar) a set of single residue variations (SRVs), comprising 30,221 disease-related and 39,894 neutral polymorphisms. Then, we mapped SRVs into available protein structures in the Protein Data Bank (PDB) (https://www. rcsb.org/), restricting our 3D set to some 2070 protein chains, of which 764 contains 8127 disease SVRs ( Figure S1).
The number of protein structures mapping disease SRVs depends on the relative sequence to structure coverage ( Figure S1), and only extended coverage (e.g., ≥70%) allows a correct analysis of their structural and functional features. When considering only proteins having at least one disease-related SRV and structure covering at least 70% of the corresponding UniProtKB sequence (with resolution lower than 3 Å), the PDB dataset reduces to 386 entries, including 1044 neutral and 5577 disease SRVs, in 392 protein chains (Table 1). We refer to this dataset as HVAR3D (Human Variants 3D). Our HVAR3D data set includes protein chains containing disease related (protein is labeled D) and disease-related and neutral (D/N) variations Table 1. Summary statistics of the HVAR3D dataset.

Structural Characterization of HVAR3D
About 62% of the proteins in the data set have multimeric biological assemblies (Table 2). Some 43% of the multimers are homodimers. This allows mapping disease-related variations also at the interface region (Table 2). Table 2. Biological assemblies in HVAR3D. The total number of residues in monomers is 54107; the total number of residues in multimers is 72012, 10678 of which are protein-protein interaction sites.

Monomers
Following CATH structural classification (http://www.cathdb.info/), HVAR3D protein chains group into five structural classes (Table S1). Most of the protein chains group into the Alpha and Beta class, accounting for some 2207 disease related and 266 neutral variations. Disease-related variations similarly distribute in all the main structural classes (α, β and α + β).
As previously described [8,9], the contribution of missense mutations to human diseases is also due to specific interface residues which may affect complex stability when variated. For this reason, in our data set, we discriminate functional monomers from functional multimers, finding that about 9% of disease SRVs reside at interface (Table 2).

Functional Characterization of HVAR3D
With the aim of highlighting some association among disease-related variations and variant structural and functional features, we first inspect which functional annotations are present in the data set. We extracted the enzyme commission (EC) numbers associated to the 392 protein chains (corresponding to the 386 entries) in HVAR3D dataset and counted, for each enzymatic class (i.e., first-digit EC number), the number of disease-related and neutral SRVs. Furthermore, we also count the frequency of appearance of an SRV in an annotated active or binding site (Table 3). We found that 248 out of 392 proteins (63%) are endowed with at least one EC number, accounting for 852 disease-related variations in 83 protein chains (D) and 3406 disease-related and 455 neutral variations in 165 protein chains (D/N). Among the different enzymatic classes, Oxidoreductases, Transferases and Hydrolases contain the highest numbers of protein chains and disease-related SRVs. Only seven protein chains are annotated with multiple ECs.  Oxidoreductases  64  1045  107  2  20  0  0  Transferases  71  1058  122  5  20  0  0  Hydrolases  69  1342  143  7  10  0  1  Lyases  19  260  36  0  3  0  0  Isomerases  7  236  14  0  7  0  0  Ligases  11  138  15  0  7  0  2  Multiple ECs  7  179  18  0  0  0  0   Total  248  4258  455  14  67  0  3 (a) EC = enzyme commission; (b) D-SRVs = disease-related SRVs; (c) N-SRV = neutral SRVs. Residue annotation (Active site, Binding site) is taken from the corresponding protein file in UniprotKB.
It appears that the number of disease-related variations falling into an active site or a ligand-binding site is only 1% of the total number of disease-related variations in the set. This figure is strictly dependent on the extent of annotation present in UniprotKB and it differs from previous studies where prediction methods were adopted [8,9].
To complement the above analysis, we further investigated the set of 144 protein chains lacking EC annotation by extracting, when available, the gene ontology (GO) molecular function (MF) terms.
In doing this, we restricted our attention to the first-level GO molecular function terms (i.e., terms that are direct descendant of the GO molecular function ontology root (http://geneontology.org/page/ download-ontology). The gene ontology lists 19 first-level molecular function terms: we found that 10 GO MFs are sufficient to annotate our 144 protein chains. One hundred and twenty protein chains can be annotated with the first level "binding (GO:0005488)" GO MF term: in 29 protein chains the term is the only annotation, whereas in 91 protein chains it couples with at least one of the remaining nine GO MFs (Table 4).   (Table 4), for a total sum of 1319 disease related and 589 neutral variations. Table 5 also lists disease-related and neutral SRVs, and the number of times SRVs of the two categories fall into a ligand-binding site. Similarly to what we find in the Enzyme set (Table 3), it appears that the number of variations falling in binding sites is a minor fraction of the total (1%). Reactome pathways including protein chains of HVAR3D are listed in Table S3, according to the first-level biological pathways as extracted from the Reactome database (https://reactome.org/) (along with the number of associated disease-related and neutral variants). Nineteen out of 25 first level Reactome pathways are populated by proteins contained in the database (even in the case of the 97 protein chains with multiple Reactome pathways). Forty-four percent of the protein chains in HVAR3D are involved in metabolic pathways (collectively accounting for 168 chains and 2796 disease-related variations). The observation is consistent with the fact that 63% of the protein chains are enzymes.
The second most populated Reactome first level pathway is the Immune system, accounting for 27 protein chains with 461 disease related variations (Table S2).

Structural and Functional Characterization of HVAR3D
In HVARD3, 370 protein chains have 357 Pfam functional domains (https://pfam.xfam.org/), mapping some 4895 disease-related (87.7% of the total) and 798 neutral (76.4% of the total) variations. Table 5 lists the most populated and the three less populated Pfam domains, the number of covered protein chains and the number of disease-related and neutral variations present in the domain. Fifty Pfam models contain 67.5% of disease SVRs in the data base ( Figure S2).
Overall, HVAR3D is a database of fairly well categorized and annotated proteins offering an ample variety of structural and functional features to associate with disease-related variations.

Chemico-Physical Characterization of Disease Related Variations in HVAR3D
We start our analysis following a long-standing discussion, aiming at understating whether possible features of disease-related variations can emerge from the change in chemico-physical properties of the lateral side chain [7,17,20,21]. In this regard, our database is a possible source of information, in which the position of the variation can immediately be evaluated in the context of the wild type protein structure. Our analysis investigates what HVAR3D contains in relation to the different types of variations. When necessary for statistical robustness, the number of neutral variations is increased by adopting a reference data set comprising 2667 neutral variations from 728 protein chains covering more than 70% of corresponding UniprotKB sequences (see Material and Methods).
In Figure 1A, the frequency of disease-related variation types emerges from the color code: the darker the red, the more frequent the disease-related variation type, the lighter the red the less frequent the disease-related variation type. In Figure 1B we show the measure of the strength of association (Log-odds ratio) of a given variation type to disease or neutral polymorphism (for details about the computational method, see Materials and Methods).  The heatmap indicates that variations mostly associated to disease are G → C, followed by S→W, F→C, W→C, Y→C and I → S, whereas S→A, T→ E, Y→ F, F→Y are more strongly associated to neutral variations. Interestingly, the values of Figure 1B well correlates (0.71 of Pearson correlation) with our previous probability of having a variation associated to disease (P d index) [20], derived from a much larger data set comprising some 5300 protein sequences and 9896 disease-related variations.  Figure 3 shows the strength of association to perturbation of protein stability (Log-odd ratio, computed using the classification of variations into perturbing/non-perturbing) of the different variation types grouped according to their chemico-physical properties, similarly to Figure 2. It appears that the most perturbing variations correspond to a change from aromatic side chains into polar and charged ones and to a less extent also into apolar ones. Less perturbing variations are those where a charged residue turns into aromatic or charged lateral side chains.

G A V P L I M F W Y S T C N Q H D E K R G A V P L I M F W Y S T C N Q H D E K R G
The regression plot of the association strengths depicted in Figures 2 and 3, is presented in Figure  4. Each data point is reported with the associated standard errors alongside both axes. The regression line is shown in red, while grey dotted lines are drawn at an ordinate distance from the regression For sake of comparison with previous results, we investigated the relation among the type of variation and the extent of protein perturbation, as measured from the computed change in Gibbs free energy (∆∆Gs) upon single residue variation. To this aim, we adopted INPS-3D, a state-of-the-art prediction method previously described [20,23] and suited to compute the extent of protein stability change upon variation. INPS-3D computes ∆∆Gs for all variations (disease and neutral) in our HVAR3D dataset complemented with the set of neutral variations in other 728 protein chains. Then, we divided the set of variations into two subsets: perturbing variations, having |∆∆G| ≥ 1 kcal/mol and non-perturbing ones, with |∆∆G|< 1 kcal/mol. Figure 3 shows the strength of association to perturbation of protein stability (Log-odd ratio, computed using the classification of variations into perturbing/non-perturbing) of the different variation types grouped according to their chemico-physical properties, similarly to Figure 2. It appears that the most perturbing variations correspond to a change from aromatic side chains into polar and charged ones and to a less extent also into apolar ones. Less perturbing variations are those where a charged residue turns into aromatic or charged lateral side chains. Two main outliers can be identified: (i) variations from aromatic to polar residues, which are slightly associated to neutral polymorphisms (association strength is -0.08) but strongly associated to perturbation (association strength is 2.44); (ii) variations from charged to aromatic residues, which are moderately associated to disease (association strength is 0.76) but strongly not associated to perturbation (association strength is -3.23). Pearson's correlation value is 0.59, suggesting, as previously reported [20], that the association of disease related variations to perturbing ones is moderate (Figure 4). The regression plot of the association strengths depicted in Figures 2 and 3, is presented in Figure 4. Each data point is reported with the associated standard errors alongside both axes. The regression line is shown in red, while grey dotted lines are drawn at an ordinate distance from the regression line equal to the root mean square of the residuals (RMSE = 1.89) (see Materials and Methods for details).

Variant
Two main outliers can be identified: (i) variations from aromatic to polar residues, which are slightly associated to neutral polymorphisms (association strength is -0.08) but strongly associated to perturbation (association strength is 2.44); (ii) variations from charged to aromatic residues, which are moderately associated to disease (association strength is 0.76) but strongly not associated to perturbation (association strength is -3.23). Pearson's correlation value is 0.59, suggesting, as previously reported [20], that the association of disease related variations to perturbing ones is moderate (Figure 4).

Residue Solvent Exposure and Chemico-Physical Features of Disease SRVs
In Figure 5, we analyze in more details whether the solvent exposure of the variated site influences the strength of the association between disease and variation type. We group variations as a function of their chemico-physical properties. Increasing red and green color intensities indicate that variation types are more associated to disease or neutral polymorphisms, respectively. The solvent exposed disease related variations are compared to the buried ones ( Figure 5A and 5B, respectively). Major differences are: (i) apolar residues turning into apolar or aromatic ones are associated with disease when occurring in exposed sites and with neutral polymorphisms when they are located in buried sites; (ii) charged residues turning into polar or charged residues are associated with disease when occurring in buried and to neutral polymorphisms in exposed sites; (iii) similarly, polar residues turning into charged residues are associated with disease when occurring in buried and to neutral polymorphisms in exposed sites.
These observations are consistent with the general view that on average, buried regions in proteins are routinely richer in apolar and aromatic residues, while polar solvent accessible surfaces include more polar and charged residues. From our data, it appears that the distribution density of disease-related variations differs depending on whether the variation types is or not exposed to the polar solvent. These results are consistent with previous observations [8,9,21]

Residue Solvent Exposure and Chemico-Physical Features of Disease SRVs
In Figure 5, we analyze in more details whether the solvent exposure of the variated site influences the strength of the association between disease and variation type. We group variations as a function of their chemico-physical properties. Increasing red and green color intensities indicate that variation types are more associated to disease or neutral polymorphisms, respectively. The solvent exposed disease related variations are compared to the buried ones ( Figure 5A,B, respectively). Major differences are: (i) apolar residues turning into apolar or aromatic ones are associated with disease when occurring in exposed sites and with neutral polymorphisms when they are located in buried sites; (ii) charged residues turning into polar or charged residues are associated with disease when occurring in buried and to neutral polymorphisms in exposed sites; (iii) similarly, polar residues turning into charged residues are associated with disease when occurring in buried and to neutral polymorphisms in exposed sites.
These observations are consistent with the general view that on average, buried regions in proteins are routinely richer in apolar and aromatic residues, while polar solvent accessible surfaces include more polar and charged residues. From our data, it appears that the distribution density of disease-related variations differs depending on whether the variation types is or not exposed to the polar solvent. These results are consistent with previous observations [8,9,21].

Pfam Protein Domains and Chemico-Physical Features of Disease SRVs
We analyze the distribution of SRVs chemico-physical types in Pfam domains and find out a useful representation. In Figure 6, the frequency of each disease SRV type is color coded with respect to Pfam domains which contain at least 30 variations. Fifty Pfam domains cover some 67.5% of the overall disease-related SRVs in HVAR3D and 87 protein chains ( Figure S2). Disease SRV type frequency distribution ( Figure 1A) is now associated to structural/functional Pfam domains. This makes it possible to retrieve per functional model where the most harmful MIM mutations are located at a structural level. As expected, disease SVR types are not homogenously distributed among the different Pfams. When compared to background distribution (last two rows of Figure 6, Total = contained in 357 Pfam domains, and whole data set), it appears that the most frequent disease variation type, apolar into apolar (top left quadrant of Figure 1A) is also spread among Pfams, with few exceptions. From the background distribution, PF02414, corresponding to Thrombospondin type 3 repeat, has disease-related variations when a charged residue turns into apolar, aromatic, and polar residues. The domain derives from PDB 3FBY, which represents in HVAR3D the Cartilage oligomeric matrix protein. In the corresponding P49747 (COMP_HUMAN) Uniprot file, 28 Aspartic acid residues (D) changing to apolar, aromatic and polar are associated to Pseudoachondroplasia and Multiple epiphyseal dysplasia 1, both Mendelian skeletal disorders. It also turns out that D residues in the native protein are ion Calcium stabilized. Another interesting example, is PF00476, DNA polymerase (EC: 2.7.7.7), which contains variations related to changes of lateral side chains from polar to polar and from charged to polar (3/3 and 3/4 quadrants from top left of Figure 1A, respectively).

Pfam Protein Domains and Chemico-Physical Features of Disease SRVs
We analyze the distribution of SRVs chemico-physical types in Pfam domains and find out a useful representation. In Figure 6, the frequency of each disease SRV type is color coded with respect to Pfam domains which contain at least 30 variations. Fifty Pfam domains cover some 67.5% of the overall disease-related SRVs in HVAR3D and 87 protein chains ( Figure S2). Disease SRV type frequency distribution ( Figure 1A) is now associated to structural/functional Pfam domains. This makes it possible to retrieve per functional model where the most harmful MIM mutations are located at a structural level. As expected, disease SVR types are not homogenously distributed among the different Pfams. When compared to background distribution (last two rows of Figure 6, Total = contained in 357 Pfam domains, and whole data set), it appears that the most frequent disease variation type, apolar into apolar (top left quadrant of Figure 1A) is also spread among Pfams, with few exceptions. From the background distribution, PF02414, corresponding to Thrombospondin type 3 repeat, has disease-related variations when a charged residue turns into apolar, aromatic, and polar residues. The domain derives from PDB 3FBY, which represents in HVAR3D the Cartilage oligomeric matrix protein. In the corresponding P49747 (COMP_HUMAN) Uniprot file, 28 Aspartic acid residues (D) changing to apolar, aromatic and polar are associated to Pseudoachondroplasia and Multiple epiphyseal dysplasia 1, both Mendelian skeletal disorders. It also turns out that D residues in the native protein are ion Calcium stabilized. Another interesting example, is PF00476, DNA polymerase (EC: 2.7.7.7), which contains variations related to changes of lateral side chains from polar to polar and from charged to polar (3/3 and 3/4 quadrants from top left of Figure 1A, respectively).
In HVAR3D, this Pfam model represents protein 3IKM, the human mitochondrial DNA polymerase subunit gamma-1, (P54098, DPOG1_HUMAN), whose mutations are associated to seven different OMIM mitochondrial diseases (P54098). Interestingly, the protein (1239 residue long) is characterized by the presence of two domains, the C-terminal PF00476, containing the polymerase catalytic site, and the N-terminal PF18136, a DNA mitochondrial polymerase exonuclease domain, which acts as a proofreading exonuclease. This last domain is not included in Figure 6, for sake of brevity and it is shown in Figure 7, where the distribution of the disease variation types is compared. The protein contains 59 disease related variations, 30 in the first and 11 in the second Pfam domain. The remaining 18 variations are located in the interdomain region. Although the two Pfams share a common pattern of charged to apolar, aromatic and polar residue variations associated to disease, they differ for the apolar to apolar frequency that is relevant only in the DNA mitochondrial polymerase exonuclease (Figure 7). In HVAR3D, this Pfam model represents protein 3IKM, the human mitochondrial DNA polymerase subunit gamma-1, (P54098, DPOG1_HUMAN), whose mutations are associated to seven different OMIM mitochondrial diseases (P54098). Interestingly, the protein (1239 residue long) is characterized by the presence of two domains, the C-terminal PF00476, containing the polymerase catalytic site, and the N-terminal PF18136, a DNA mitochondrial polymerase exonuclease domain, PF00206  PF00089  PF01227  PF01379  PF16499  PF00149  PF01229  PF02781  PF00083  PF01208  PF00007  PF00576  PF00764  PF04209  PF00349  PF02744  PF00476  PF01593  PF00067  PF00185  PF00029  PF00224  PF02729  PF02310  PF02771  PF02412  PF17211  PF00884  PF01642  PF00441  PF02770  PF01847  PF00171  PF04952  PF05053  PF00728  PF00351  PF03727  PF17450  PF02055  PF00135  brevity and it is shown in Figure 7, where the distribution of the disease variation types is compared. The protein contains 59 disease related variations, 30 in the first and 11 in the second Pfam domain. The remaining 18 variations are located in the interdomain region. Although the two Pfams share a common pattern of charged to apolar, aromatic and polar residue variations associated to disease, they differ for the apolar to apolar frequency that is relevant only in the DNA mitochondrial polymerase exonuclease (Figure 7).  P54098 (DPOG1_HUMAN). The color span from yellow (low frequency) to red (high frequency). Kullback-Leibler divergence between distributions of the two Pfams domains is 2.5 bits.

Variant Data Collection and Mapping to Protein Data Bank (PDB)
SRVs were collected from the release of March 2018 of the Humsavar. In order to perform structural and functional analyses, we firstly mapped SRVs from protein sequences on positions in corresponding PDB structures. For this task, we used the PDBSWS website (last updated on July 2018) [24] and SIFTS mapping [25]. PDB coverage was computed dividing the number of residues covered in the PDB by the length of the UniprotKB sequence deprived of signal or transit peptides, when present.
In the HVAR3D dataset, we collected all disease-related and neutral variations occurring in protein chains covering more than 70% of the corresponding UniprotKB sequence and endowed with at least one disease-related variation (i.e. we retained only chains having both disease-related and neutral or only disease-related variations).

Retrieving Protein Annotations
Enzyme commission (EC) numbers and Gene Ontology Molecular Function (GO-MF) terms were retrieved for each protein using cross references directly available from the UniprotKB entry. In computing statistics for variations, all available EC numbers (at any level of annotation) were collapsed into their first-digit classification.
For each GO-MF term annotated, we reconstructed the full path to the ontology root node using the GO-TermFinder Perl software package (https://metacpan.org/release/GO-TermFinder) [26] and using the latest version of GO obo ontology file available at the Gene Ontology Consortium website (http://geneontology.org/page/download-ontology). Complete DAGs for all GO-MF annotated on each protein were then merged together into a single functional annotation. Each protein chain (and corresponding variants therein) were then classified into one (or more) terms from the first-level GO-MF annotation (overall comprising 19 terms).
Positions of ACTIVE and BINDING sites for each protein chain were retrieved directly from features annotated on the UniprotKB entry.
Reactome pathway annotation was retrieved for each protein using the all-level pathway hierarchy mapping file (available at https://reactome.org/download-data). This file assigns, to each UniprotKB accession, the full Reactome annotation at all levels of the hierarchy. From these, we extracted the first-level pathways used to perform statistics.  P54098 (DPOG1_HUMAN). The color span from yellow (low frequency) to red (high frequency). Kullback-Leibler divergence between distributions of the two Pfams domains is 2.5 bits.

Variant Data Collection and Mapping to Protein Data Bank (PDB)
SRVs were collected from the release of March 2018 of the Humsavar. In order to perform structural and functional analyses, we firstly mapped SRVs from protein sequences on positions in corresponding PDB structures. For this task, we used the PDBSWS website (last updated on July 2018) [24] and SIFTS mapping [25]. PDB coverage was computed dividing the number of residues covered in the PDB by the length of the UniprotKB sequence deprived of signal or transit peptides, when present.
In the HVAR3D dataset, we collected all disease-related and neutral variations occurring in protein chains covering more than 70% of the corresponding UniprotKB sequence and endowed with at least one disease-related variation (i.e., we retained only chains having both disease-related and neutral or only disease-related variations).

Retrieving Protein Annotations
Enzyme commission (EC) numbers and Gene Ontology Molecular Function (GO-MF) terms were retrieved for each protein using cross references directly available from the UniprotKB entry. In computing statistics for variations, all available EC numbers (at any level of annotation) were collapsed into their first-digit classification.
For each GO-MF term annotated, we reconstructed the full path to the ontology root node using the GO-TermFinder Perl software package (https://metacpan.org/release/GO-TermFinder) [26] and using the latest version of GO obo ontology file available at the Gene Ontology Consortium website (http://geneontology.org/page/download-ontology). Complete DAGs for all GO-MF annotated on each protein were then merged together into a single functional annotation. Each protein chain (and corresponding variants therein) were then classified into one (or more) terms from the first-level GO-MF annotation (overall comprising 19 terms).
Positions of ACTIVE and BINDING sites for each protein chain were retrieved directly from features annotated on the UniprotKB entry.
Reactome pathway annotation was retrieved for each protein using the all-level pathway hierarchy mapping file (available at https://reactome.org/download-data). This file assigns, to each UniprotKB accession, the full Reactome annotation at all levels of the hierarchy. From these, we extracted the first-level pathways used to perform statistics. CATH classification was performed by downloading, from the CATH website (http://www. cathdb.info/), annotation file mapping CATH domains on UniprotKB/PDB entries.
Annotation of Pfam domains was derived from Pfam/UniprotKB mapping files available at the database website (ftp://ftp.ebi.ac.uk/pub/databases/Pfam/).

Computing Residue Surface Exposure and Protein-Protein Interfaces
We computed Absolute Solvent Accessibility (ASA) for each residue using the DSSP program [27]. Relative Solvent Accessibility (RSA) was computed by dividing the ASA value by the maximum theoretical accessible area as given in [28].
Protein complexes were analyzed in order to identify protein-protein interfaces. We defined protein-protein interaction sites as those residues that undergo a change ≥1Å 2 in the accessible surface area upon formation of the complex.

Prediction of the Impact of SRVs on Protein Stability
The effect of different SRVs on protein stability has been computationally evaluated with INPS-MD [17], a method based on Support Vector Regression (SVR) for estimating the difference folding Gibbs free energy between the wild-type and the mutated forms of the proteins (∆∆G). INPS-MD analyses two sets of descriptors extracted from protein sequence and structure, respectively. The former includes BLOSUM62 scores, hydrophobicity, Dayhoff mutability index and evolutionary information derived from multiple sequence alignments, while the latter consists of the residue relative solvent exposure and the difference between the native and the variant structure in terms of pairwise contact potential computed in a local structural environment (for further details, refer to [22,23]). When assessed on a dataset comprising 2648 variations whose ∆∆G has been experimentally determined, INPS-MD performs with a Pearson's correlation index of 0.58 and a root mean error of 1.2 kcal/mol.
Each variation can be classified as perturbing or non-perturbing the protein stability according to the value of predicted ∆∆G. In particular, we define perturbing variations as those having |∆∆G| ≥ 1, non-perturbing otherwise (i.e., |∆∆G| < 1).

Computing the Association Strength of Each Variation Type to Disease and Pertubation
The association strength of each variation type to disease has been evaluated from the HVAR3D dataset. In order to strengthen our statistical analysis, HVAR3D was complemented using an additional dataset comprising 2667 neutral variations occurring on protein chains endowed with only neutral variations and covering more than 70% of the corresponding UniprotKB sequence (this set of neutral variations was not considered in all other analyses). Moreover, from disease-related variations in HVAR3D, we excluded those falling on ACTIVE or BINDING sites as well as those localized on known protein-protein interfaces.
After this procedure, we obtained a dataset of 8701 variations comprising 4990 and 3626 disease-related and neutral variations, respectively. We used this extended dataset to perform association analysis.
Let V be the resulting set of variations comprising N d and N n disease-related and neutral variations, respectively. For each variation type X → Y from residue type X to Y, we can compute the frequencies of occurrence f d and f n in disease-related and neutral subsets, respectively, as follows: where #(X → Y) d and #(X → Y) n are the number of occurrences of the variation type X → Y in the disease-related and neutral subsets, respectively. The association strength s d X→Y (Log-odd ratio) of variation type X → Y to disease is then computed as: An estimate of the standard error on the log odd ratios in Equation (3) can be computed as follows: Analogously, associations of variation types to perturbation of protein stability has been evaluated on the same set V of variations as above (i.e., HVAR3D + complement of neutral variations). First, variations in V were classified into perturbing and non-perturbing using INPS-3D as explained in the Section 3.4. Then, we computed frequencies of occurrence f p and f t of variation type X → Y on the perturbing and non-perturbing subsets, respectively: where #(X → Y) p and #(X → Y) t are the number of occurrences of the variation type X → Y in the perturbing and non-perturbing subsets, respectively, and N p and N t are the total number of perturbing and non-perturbing variations, respectively. The association strength s t X→Y (Log-odd ratio) of variation type X → Y to perturbation is then computed as: Standard error is computed for s t X→Y as done for s d X→Y (Equation (4)).

Computing Linear Regression and Residuals
Regression lines were computed using the ordinary least square method as implemented in the Linear Regression model of the Python Sklearn package (https://scikit-learn.org). In regression plots, lines parallel to the regression line are drawn at an ordinate distance equal to the root mean square error (RMSE) of the residuals, defined as: where y i is the i-th experimental variable, y * i is the prediction of the variable as obtained by the linear regression model and n is the number of variables.

Kullback-Lieber Divergence Between Distributions
Differences between probability distributions are evaluated using the Kullback-Leibler divergence. This is defined between two discrete probability distributions p and q defined on the same probability space X as:

Conclusions
In this paper, we introduce a database (HVAR3D) that collects 392 protein chains with a well solved 3D structure, covering at least 70% of the corresponding protein sequence. This allows mapping to the protein structures some 6621 variations, 5577 of which are disease-related.
The database allows a detailed statistical analysis of which type of variations are mostly associated with diseases, whether they are in the core or at the surface, and how they relate to protein structure and function.
For the analysis, we group residues into four chemico-physical types and adopt a sixteen-item code to include all the possible single-residue variation types ( Figure 1A). The strength of disease association for each variation type highlights that those mostly associated to disease in our dataset turn aromatic into charged or polar residues (Figure 2), confirming previous observation in different data sets [7,9,20,21]. The correlation among the strength of association to disease and the strength of association to perturbation is significant (Pearson's correlation value is 0.59), however moderate [20,21,29].
When variation types are sorted out by being on the protein surface or buried, we find that apolar residues turning into apolar or aromatic residues are more associated with disease than when they occur in exposed sites; in turn, they are more associated with neutral polymorphisms when they occur in buried sites [8].
We map disease SRV types to Pfam domains and find that their distribution varies across different structural/functional models. This suggests that the newly introduced mapping can be adopted to define each Pfam a specific a priori distribution patterns of disease related SRV types.