SAV-Pred: A Freely Available Web Application for the Prediction of Pathogenic Amino Acid Substitutions for Monogenic Hereditary Diseases Studied in Newborn Screening

Next Generation Sequencing (NGS) technologies are rapidly entering clinical practice. A promising area for their use lies in the field of newborn screening. The mass screening of newborns using NGS technology leads to the discovery of a large number of new missense variants that need to be assessed for association with the development of hereditary diseases. Currently, the primary analysis and identification of pathogenic variations is carried out using bioinformatic tools. Although extensive efforts have been made in the computational approach to variant interpretation, there is currently no generally accepted pathogenicity predictor. In this study, we used the sequence–structure–property relationships (SSPR) approach, based on the representation of protein fragments by molecular structural formula. The approach predicts the pathogenic effect of single amino acid substitutions in proteins related with twenty-five monogenic heritable diseases from the Uniform Screening Panel for Major Conditions recommended by the Advisory Committee on Hereditary Disorders in Newborns and Children. In order to create SSPR models of classification, we modified a piece of cheminformatics software, MultiPASS, that was originally developed for the prediction of activity spectra for drug-like substances. The created SSPR models were compared with traditional bioinformatic tools (SIFT 4G, Polyphen-2 HDIV, MutationAssessor, PROVEAN and FATHMM). The average AUC of our approach was 0.804 ± 0.040. Better quality scores were achieved for 15 from 25 proteins with a significantly higher accuracy for some proteins (IVD, HADHB, HBB). The best SSPR models of classification are freely available in the online resource SAV-Pred (Single Amino acid Variants Predictor).


Introduction
Newborn screening (NBS) is a meaningful, priority, globally-accepted public health program. All born infants are advised to undergo blood spot screening, also known as the heel prick test, to find any inherited diseases that are severe after an asymptomatic period. The overall detection rate is up to 1 in 500 births [1]. The testing is intended to provide an early diagnosis and treatment before significant, inevitable damage ensues. The core conditions panels mainly include monogenic autosomal recessive disorders, most of which are inborn errors of metabolism. The conditions may be indicated by biochemical analysis, tandem mass spectrometry and immunoassay techniques as well as DNA-based methods [2].
Over the past few years, next-generation sequencing (NGS) technologies have been actively implemented in the clinic. As the cost of sequencing decreased, the field of application increased, leading to the first cases of using NGS in NBS [3,4]. Since NGS has a high throughput and can identify the majority of genetic defects, DNA sequencing has the capability to become a suitable NBS method. At the same time, the increasing screening rate and the availability of NGS technologies contribute to the detection of new variants without a clinical interpretation. In addition, NGS may expand the existing panels to other diseases as it does not require special protocols and reagents to obtain a result.
Variants of clinical interpretation involve multiple evidence categories: population data, functional studies, and clinical presentations. As an outcome, a genetic variant is assigned a pathogenic class if it causes a disease, or a benign class if it is proven to have no such relationships. Quite often, the criteria produce an opposite interpretation, e.g., eventually causing a variant of uncertain significance (VUS) or some conflict of interpretation [2]. Such variants cannot assist in making medical decisions.
Preliminarily, for variants with a VUS classification as well as unclassified ones, predicted pathogenicity estimates can be obtained using computational tools (e.g., PolyPhen-2 [5], SIFT [6], MutationAssessor [7]). The most common genetic alterations happening and requiring clinical classification are missense. Missense variants modify codons, resulting in an encoded amino acid (a.a.) alteration. In turn, the alteration affects protein primary structures, the basis of the secondary, tertiary, and quaternary structures, and may disrupt their implementing function. The existing bioinformatics predictors are trained on heterogenic datasets, which may lead to a decreased prediction accuracy in specific clinically important genes [8,9].
Here, we introduce SAV-Pred-a public web-application to predict the effect of single amino acid variants (SAVs) for 25 core conditions from a newborn screening panel. This work is intended to present the sequence-structure-property relationships (SSPR) analysis of a.a. substitutions and their surroundings in specific proteins to predict the clinical effect of the variants as an additional interpretation.

SAV-Pred Contents and Comparison with Other Bioinformatic Tools
Disease-related proteins were selected from the Uniform Screening Panel for Major Conditions recommended by the Advisory Committee on Hereditary Disorders in Newborns and Children and approved by committees of the American College of Obstetricians and Gynecologists (ACOG) [2]. The panel includes the following disease groups: congenital organic acid/amino acid/fatty acid metabolic errors, hemoglobinopathies, and various multisystem disorders such as cystic fibrosis or hypothyroidism. For these monogenic diseases the benefits of screening and treatment availability have been confirmed. Thus, the SSPR approach can be applied to them.
The data selection scheme is shown in Figure 1. The final set included 25 proteins with a total of 2124 missense variants. These variants were initially found with clinical classification (see Material and Methods). It turned out that for many of the proteins the databases contained few benign variants, insufficient for training classifiers. For instance, there was only one benign variant for PAH and two benign variants for the HADHB and HMGCL genes ( Table 1, column "B"). Therefore, 8397 polymorphisms unrelated to pathological conditions were added as a negative class in the curated manual analysis. The resulting number of SAVs in the training datasets are shown in Table 1. For each of the proteins, 195 SSPR models were created (with different levels of the multi-level neighborhoods of atoms (MNA) descriptors (15 levels, from 1 to 15)) and peptide length (13 size options with an odd number of a.a. in a peptide, from 7 to 31) (see Material and Methods). The most accurate SSPR models in terms of the area under the receiver operating characteristic curve (AUC) obtained in leave-one-out (LOO-CV) and 20-fold cross-validation (20F-CV) procedures were selected, and their parameters are presented in Table 1. Twenty-four SSPR models exceeded the accuracy threshold of 0.7. For such conditions as isovaleric acidemia, hemoglobinopathies, and trifunctional protein deficiency, the AUC values of the created models were greater than 0.9. Only the SSPR model for galactose-1-phosphate uridylyltransferase displayed an AUCF20-CV value of less than 0.7 (0.686). This may be linked to the presence of contradictions in the clinical classification data due to the existence of Duarte galactosemia, which differs from classical galactosemia in that patients with Duarte galactosemia have a partial GALT deficiency.
The best created SSPR models were compared with known bioinformatic tools: SIFT 4G, Polyphen-2 HDIV, MutationAssessor, PROVEAN, and FATHMM [5][6][7]10,11] (Table 2.). The same approach had been used in our previous study [12]. For the aforementioned methods, we obtained scores of SAV effects from dbNSFP4.1a [13] for almost all proteins and calculated AUC. In quantitative terms, our approach (SAV-Pred) was the most accurate for 15 proteins. For several genes, HADHB, HBB, and IVD, the prediction accuracy was over 0.9, while for the alternative methods it was kept at 0.796. The performances of the rest of the models are inferior to the other methods but are not much lower and are roughly in the average accuracy range. At the same time, the highest average AUC (0.804 ± 0.040; CI95%) was achieved and corresponds to the previous results [12]. For each of the proteins, 195 SSPR models were created (with different levels of the multi-level neighborhoods of atoms (MNA) descriptors (15 levels, from 1 to 15)) and peptide length (13 size options with an odd number of a.a. in a peptide, from 7 to 31) (see Material and Methods). The most accurate SSPR models in terms of the area under the receiver operating characteristic curve (AUC) obtained in leave-one-out (LOO-CV) and 20-fold cross-validation (20F-CV) procedures were selected, and their parameters are presented in Table 1. Twenty-four SSPR models exceeded the accuracy threshold of 0.7. For such conditions as isovaleric acidemia, hemoglobinopathies, and trifunctional protein deficiency, the AUC values of the created models were greater than 0.9. Only the SSPR model for galactose-1-phosphate uridylyltransferase displayed an AUC F20-CV value of less than 0.7 (0.686). This may be linked to the presence of contradictions in the clinical classification data due to the existence of Duarte galactosemia, which differs from classical galactosemia in that patients with Duarte galactosemia have a partial GALT deficiency.
The best created SSPR models were compared with known bioinformatic tools: SIFT 4G, Polyphen-2 HDIV, MutationAssessor, PROVEAN, and FATHMM [5][6][7]10,11] (Table 2.). The same approach had been used in our previous study [12]. For the aforementioned methods, we obtained scores of SAV effects from dbNSFP4.1a [13] for almost all proteins and calculated AUC. In quantitative terms, our approach (SAV-Pred) was the most accurate for 15 proteins. For several genes, HADHB, HBB, and IVD, the prediction accuracy was over 0.9, while for the alternative methods it was kept at 0.796. The performances of the rest of the models are inferior to the other methods but are not much lower and are roughly in the average accuracy range. At the same time, the highest average AUC (0.804 ± 0.040; CI95%) was achieved and corresponds to the previous results [12]. B-Benign variants in the sets; P-Pathogenic variants in the sets; B+-benign variants that initially did not have clinical classification; AUC LOO-CV -AUC obtained by leave-oneout validation procedure; AUC 20F-CV -AUC obtained by twenty-fold cross-validation procedure; PL (peptide length) and MNA (the level of MNA descriptors)-parameters of sequence-structure-property relationships (SSPR) models.

SAV-Pred Web Application
The best SSPR models became the basis for the creation of the freely available web application, SAV-Pred (Single Amino acid Variants Predictor), hosted at the way2drug.com portal (http://www.way2drug.com/SAV-Pred/) (accessed on 29 December 2022). Figure 2 illustrates an example of the output window with predictions for three single amino acid substitutions. The substitutions were published in the ClinVar [14] database after May 2022 and did not belong to the training sets. The predicted effect shown in the "Annotation" column is consistent with the current clinical classification. The data in the values in the Confidence column are calculated as Pa-Pi (see Materials and Methods) for the prediction of the pathogenic effect. Positive values of Confidence mean that the queried a.a. substitutions may belong to the class of pathogenic substitutions. The higher the Confidence value, the higher the probability that the variant is pathogenic. Negative values of Confidence mean that the queried a.a. substitutions may belong to the class of benign substitutions. The more negative the Confidence value, the more likely the variant is benign. During the analysis of the prediction results, one should also take into account the value of the prediction accuracy in the last column (AUC) for the appropriate SSPR model. The columns in the table with prediction results may by sorted. Moreover, the appropriate fields for filtration of the data are under each column. Here, one can also see the references to the description of diseases in OMIM as well as protein identifiers in UniProt [15]. The left side of the screen shows the protein sequence with the highlighted location and replacement of the letter. The user can select the protein and substitution of interest manually with the "Input" button, or they can load a query list of substitutions in the following format: <gene name> <position> <a.a. substitution> The prediction results can be saved as a file in the CSV or XLS formats, or simply copied. The data on composition, the datasets, and AUC values are also provided.

SAV-Pred Web Application
The best SSPR models became the basis for the creation of the freely available web application, SAV-Pred (Single Amino acid Variants Predictor), hosted at the way2drug.com portal (http://www.way2drug.com/SAV-Pred/) (accessed on 29.12.2022). Figure 2 illustrates an example of the output window with predictions for three single amino acid substitutions. The substitutions were published in the ClinVar [14] database after May 2022 and did not belong to the training sets. The predicted effect shown in the "Annotation" column is consistent with the current clinical classification. The data in the values in the Confidence column are calculated as Pa -Pi (see Materials and Methods) for the prediction of the pathogenic effect. Positive values of Confidence mean that the queried a.a. substitutions may belong to the class of pathogenic substitutions. The higher the Confidence value, the higher the probability that the variant is pathogenic. Negative values of Confidence mean that the queried a.a. substitutions may belong to the class of benign substitutions. The more negative the Confidence value, the more likely the variant is benign. During the analysis of the prediction results, one should also take into account the value of the prediction accuracy in the last column (AUC) for the appropriate SSPR model. The columns in the table with prediction results may by sorted. Moreover, the appropriate fields for filtration of the data are under each column. Here, one can also see the references to the description of diseases in OMIM as well as protein identifiers in Uni-Prot [15]. The left side of the screen shows the protein sequence with the highlighted location and replacement of the letter. The user can select the protein and substitution of interest manually with the "Input" button, or they can load a query list of substitutions in the following format: <gene name> <position> <a.a. substitution> The prediction results can be saved as a file in the CSV or XLS formats, or simply copied. The data on composition, the datasets, and AUC values are also provided.

Discussion
In this paper, we present a new freely available web-based application, SAV-Predtwenty-five SSPR models were created to identify amino acid substitutions related to monogenic heritable diseases recommended for universal newborn screening by calculating and interpreting pathogenicity scores. The models are Naïve Bayesian classifiers trained on describing the structural properties of peptide fragments, thus linking the effect to the primary structures of the proteins. Since the secondary/tertiary/quaternary structures, physicochemical, and functional properties of proteins also depend on the primary sequence, SSPR models take them into account indirectly.
In summary, the SSPR models obtained comparable accuracy, often exceeding the accuracy of the individual methods. For example, the developed predictors outperformed the widely used tools: SIFT 4G in 16/24 cases and PolyPhen-2 HDIV 16/22 cases, respectively. Depending on the method and the protein, SSPR models and individual bioinformatics tools outperform each other to diverse degrees, in keeping with the previous studies [16,17]. However, protein-specific datasets are often unbalanced due to a lack of annotated variants and this may cause a negative impact on protein-specific predictors. The absence of differences in AUC in the leave-one-out and twenty-fold cross-validations, as well as the similar average accuracy with the previous study, suggest the robustness of the obtained classifiers (Table 1).
Based on the best SSPR models, we have created a web application SAV-Pred, which is freely available at http://www.way2drug.com/SAV-Pred/ (accessed on 28 December 2022). In the prospective application, SAVs features such as secondary structure parameters and evolutionary data are going to be used as descriptors to increase the predictor's accuracy. Additionally, we going to apply the approach to the secondary conditions table and other similar diagnostic panels.

Datasets Collection
Of 32 core conditions from the ACOG screening panel, 24 monogenic diseases were chosen and 25 associated genes were found based on the OMIM database (accessed on 10 January 2022) ( Table 1). The annotated data on missense variants related to the known genes, including clinical significance, variant supporting evidence, and protein allele were obtained from ClinVar [14] (accessed on 14 January 2022), humsavar [15] (accessed on 14 January 2022), LOVD [18] (accessed on 12 January 2022), and dbSNP [19] (accessed on 14 January 2022) databases using the BioMart data mining tool [20] (accessed on 14 January 2022) (Figure 1). SAVs currently classified as pathogenic or likely pathogenic constituted the positive class, and substitutions that were interpreted as benign/likely benign, as well as all those that were in no way related to the phenotype/disease, constituted the negative class. Based on the known annotated SAVs and an appropriate protein sequence, we created the datasets containing fix length peptides (from 7 to 31 a.a. in the peptide) from the substitution and its a.a. surroundings in the form of structural formulas in the MOL V3000 format, plus their effect indicators (0-benign, 1-pathogenic). A similar algorithm was used earlier for the prediction of phosphorylation sites in proteins [21]. Amino acid surroundings were taken from canonical reference protein sequences from the UniProt [15] (accessed on 3 February 2022) database by related positions.

Building the SSPR Models
Classification models were created and validated in the modified command line version of the Prediction of Activity Spectra for Substances (PASS) software [12,[21][22][23]-MultiPASS (version 2022, Institute of Biomedical Chemistry, Moscow, Russia)-which allows one to use different levels (up to 15) of Multilevel Neighborhoods of Atoms (MNA) descriptors to describe the structural formula of peptides [19]. Each of the fifteen MNA levels was used to build the individual SSPR model on each of thirteen different peptide fragment length datasets. Originally, PASS prediction results are a list of predicted characteristics of molecules with Pa (probability of "to be active") and Pi (probability of "to be inactive") values. In this study, the Pa value is the probability that the peptide with the a.a. substitution belongs to the class of pathogenic variants, and the Pi value is the probability that the peptide with the a.a. substitution does not belong to the class of pathogenic variants.
Multilevel Neighborhoods of Atoms (MNA) descriptors were used for the descriptions of molecular structures. The MNA descriptor is a representation of an atom-centered fragment of a molecule in the form of a string of characters. The level of the MNA descriptor reflects the order of proximity. Figure 3 shows an example of the representation of the first three levels for a carbon atom marked with a gray circle. Thus, the structural and physicochemical properties of molecules are embedded in the MNA descriptors. Similar to our previous work [12], descriptors from levels 1 to 15 were used for the creation of SSPR models. fragment length datasets. Originally, PASS prediction results are a list of predicted characteristics of molecules with Pa (probability of "to be active") and Pi (probability of "to be inactive") values. In this study, the Pa value is the probability that the peptide with the a.a. substitution belongs to the class of pathogenic variants, and the Pi value is the probability that the peptide with the a.a. substitution does not belong to the class of pathogenic variants. Multilevel Neighborhoods of Atoms (MNA) descriptors were used for the descriptions of molecular structures. The MNA descriptor is a representation of an atom-centered fragment of a molecule in the form of a string of characters. The level of the MNA descriptor reflects the order of proximity. Figure 3 shows an example of the representation of the first three levels for a carbon atom marked with a gray circle. Thus, the structural and physicochemical properties of molecules are embedded in the MNA descriptors. Similar to our previous work [12], descriptors from levels 1 to 15 were used for the creation of SSPR models.

Validation and Performance Assessment
SSPR models based on datasets with an appropriate length of peptides and a level of MNA descriptors were created and selected based on the leave-one-out and 20-fold crossvalidation procedures implemented in MultiPASS. For every disease (protein), the best SSPR model was chosen with the highest the area under the ROC curve (AUC) value. We used individual methods (SIFT 4G, Polyphen-2 HDIV, MutationAssessor, PROVEAN and FATHMM) to compare against the SSPR models, and we used the scores from the dbNSFP (accessed on 09.10.2022) and sklearn.metrics package [24] in Python 3.9 to calculate AUC as a statistical indicator of accuracy. In doing so, we used the thresholds recommended by authors to obtain protein-related AUC values.

Validation and Performance Assessment
SSPR models based on datasets with an appropriate length of peptides and a level of MNA descriptors were created and selected based on the leave-one-out and 20-fold cross-validation procedures implemented in MultiPASS. For every disease (protein), the best SSPR model was chosen with the highest the area under the ROC curve (AUC) value. We used individual methods (SIFT 4G, Polyphen-2 HDIV, MutationAssessor, PROVEAN and FATHMM) to compare against the SSPR models, and we used the scores from the dbNSFP (accessed on 9 October 2022) and sklearn.metrics package [24] in Python 3.9 to calculate AUC as a statistical indicator of accuracy. In doing so, we used the thresholds recommended by authors to obtain protein-related AUC values.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.