Annotation of Human Exome Gene Variants with Consensus Pathogenicity

A novel approach is developed to address the challenge of annotating with phenotypic effects those exome variants for which relevant empirical data are lacking or minimal. The predictive annotation method is implemented as a stacked ensemble of supervised base-learners, including distributed random forest and gradient boosting machines. Ensemble models were trained and cross-validated on evidence-based categorical variant effect annotations from the ClinVar database, and were applied to 84 million non-synonymous single nucleotide variants (SNVs). The consensus model combined 39 functional mutation impact scores, a cross-species conservation score, and a gene indispensability score. The indispensability score, which accounts for differences in variant pathogenicity between, e.g., essential and mutation-tolerant genes, considerably improved the predictions. The consensus combination is consistent with as many input scores as possible while minimizing false predictions. The input scores are ranked by their ability to predict effects. The score rankings and categorical phenotypic variant effect predictions are intended for direct use in clinical and biological applications, to prioritize human exome variants and mutations.


Introduction
With the growing availability of NGS technology in medical diagnostics, quantitatively exemplified by the 68,000 genetic tests currently offered in clinics according to GeneTests [1], the amount of experimentally sequenced data on human genetic variation in both healthy and patient populations is rapidly increasing. Accurate and exhaustive variant annotation is important for every application of NGS technology, including the development of therapies, the selection of effective individualized therapy, and the comparison of multiple samples in clinical studies, to mention a few; however, there are currently major challenges associated with sequencing data analysis and interpretation [2]. While most of the prevalent or common variants (about one million currently sequenced variants) have been annotated as benign (neutral) based on their high frequency of occurrence in the healthy population [3], the remaining majority are ultra-rare Variants of Unknown Significance (VUS). In total there are almost 100 million possible human Single Nucleotide Variants (SNVs) in the gene-coding 'exome' space, of which around 83 million are non-synonymous (nsSNVs) and around 15 million are short splice-site variants (ssSNVs). However, only a tiny fraction (ca 0.1% of the total) of variants have definitive clinical annotations. The current inability to annotate and prioritize the vast number of sequenced VUSs by variant effect significance is an obstacle to the reliable and wider use of NGS in the diagnostics and treatment of cancer and other diseases.
As an alternative to the scarce annotations based on clinical and biological evidence alone, and to the plethora of predicted real-valued pathogenicity scores, in the following we propose an approach for combining both types of data into two consensus scores: a continuous-value score and a categorical classification score. The two types of score have their pros and cons, which are compensated for by their combination. While a number of continuous ensemble scores are already in common use, they share a key disadvantage: the lack of determined thresholds for different genes, leading to uncertainty or errors when differentiating between pathogenic and benign variants on the basis of the score value alone. Such thresholds need to be determined for each of the continuous-value pathogenicity prediction scores, and are in most cases not known. Furthermore, it may not be correct to use a single pathogenicity score threshold for all genes; for example, variants in essential genes typically require a higher threshold than those in mutation-tolerant genes. To remedy this situation, here we use a gene indispensability score as one of the training inputs to account for such differences in gene pathogenicities. In contrast, categorical classes, such as 'Pathogenic', 'Uncertain_significance' or 'Likely_benign', are easier to interpret and can be used directly in clinical practice, without the need to determine any thresholds. To the best of our knowledge, no previous study has developed an ensemble method able to classify variants into pathogenicity classes across the entire coding genome with sufficiently high accuracy. In this study we achieve a classification error of around 1%. We propose to use the predicted pathogenicity classes in the same way as the currently used evidence-based classification data where the latter are missing, i.e. for the large majority of variants.
The classification (termed consensus variant effect prediction, or cVEP) of gene variants by pathogenicity uses a stacked ensemble of supervised learners as its machine learning approach. The classification scheme is in accordance with the standards and guidelines of the American College of Medical Genetics and Genomics (ACMG) and the Association for Molecular Pathology (AMP) [4]. The variants are classified into five categories: pathogenic, likely pathogenic, uncertain significance, likely benign, and benign. These evidence-based data, used as our input, are rather accurate, since the expanded ACMG/AMP guideline requires certainty from experimental clinical/biological evidence of better than 90% [5] for inclusion in the above classes. It is worth mentioning another disadvantage of the majority of existing continuous-value pathogenicity predictions: they use only a subset of the evidence-based classes for training, namely only 'benign' and 'pathogenic' class data points are used to fit model predictions into the 0 to 1 range, and data from the other classes are simply not used.
Recently, there has been a rapid increase in the use of variant allele frequency as a proxy for differentiating between disease-causing (pathogenic) and benign (neutral) variants, based on the assumption that when a variant has an allele frequency above 0.1% it is very likely to be benign. For example, Minor Allele Frequency (MAF) and Alternative Allele Frequency (AAF) annotations, obtained in large population sequencing projects such as ExAC [6] and GnomAD [7], are routinely used in various clinical and diagnostic applications in relation to Mendelian and other diseases to distinguish between common and rare variants. MAF is the frequency of the second-most common allele, while AAF describes the frequencies of all possible nonsynonymous mutations at a gene locus. Alternatively, one can compute the population-maximal frequency (e.g. AAFpopmax), calculated as the maximum of the AAF values across individual sub-populations. It has been shown [6] that using AAFpopmax produces considerably fewer variants after filtering with the same threshold, compared to using AAF.
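The popmax calculation described above can be sketched in a few lines; note that the sub-population names, frequencies and field names below are hypothetical and do not correspond to any real dataset release.

```python
# Illustrative only: sub-population labels and frequencies are hypothetical.

def aaf_popmax(subpop_aafs):
    """Population-maximal alternative allele frequency across sub-populations."""
    return max(subpop_aafs.values())

# Hypothetical per-sub-population AAFs for one variant (gnomAD-style groups).
subpop = {"nfe": 0.00004, "afr": 0.0009, "eas": 0.0, "sas": 0.00002}

overall_aaf = 0.00012          # AAF in the pooled population (assumed value)
popmax = aaf_popmax(subpop)    # 0.0009: driven by one sub-population

# Filtering on popmax at the same threshold removes more variants than
# filtering on the pooled AAF, consistent with the observation in the text.
threshold = 5e-4
print(overall_aaf >= threshold, popmax >= threshold)  # False True
```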
While it is possible to a certain extent to use the AAF value alone for benign/pathogenic classification, this has clear limitations. Firstly, since the histogram of the known AAF values follows a gamma distribution, it is intrinsically difficult to choose where to put a threshold. Consequently, it is problematic to achieve a clear separation between benign and pathogenic variants using a single AAF threshold for all genes. The problem has recently been alleviated by the proposed use of different AAF thresholds for each gene [8], with the threshold values selected manually to aim for the best discrimination in the ClinVar database. Secondly, despite the proliferation of variant frequency use, these annotations are currently available for only a small proportion of the total number of possible variants (~6%). Consequently, a large majority of germline and somatic rare variants (i.e. below 0.1% AAF) discovered in WES samples currently have unknown a priori functional significance.
Whole Exome Sequencing (WES) of a tumor sample usually reveals a few mutation variants that are important for cancer initiation and survival, called 'drivers', hundreds of 'passenger' variants associated with cancer evolution and differentiation, and hundreds of thousands of neutral variants that have nearly no impact [9]. The small number of causative variants can be effective therapy targets, but they first need to be distinguished from the bulk of VUSs and common variants by filtering and prioritization using annotations. To remedy this situation, many in silico methods for variant effect prediction (VEP) have been developed based on gradually improving machine learning methods. Generally, rare variants are less evolutionarily conserved and have higher predicted pathogenicity than frequent ones [10]. This relationship is utilized in a number of VEP tools (e.g., GERP++, SiPhy, bStatistic) [11] that are based on nucleotide conservation in alignments of genetic sequences across mammalian species. Another group of tools, which aim to predict the functional impact of mutations (e.g. MutationTaster, Eigen, fitCons), are trained on a slowly growing number of pathogenic/benign annotations derived from clinical and biological studies. In addition, a group of VEP methods based on structural approaches has shown better performance in finding cancer-causing driver variants, based on predictions of structural stability, ligand binding or protein-protein network interactions. Some of the methods in this group are based on structural clustering and classify a mutation as a driver if it is close to a cluster of known driver mutations [12]. As an example of such a tool, CADD [13] uses mapping of mutation variants to protein 3D structure for the prediction of mutation effects.
Ensemble methods, which combine predictive scores of several types, have shown better performance than any individual tool. In The Cancer Genome Atlas (TCGA), a dataset comprising 10,000 tumor samples across 33 cancer types, it was shown [14] that improved prediction results were obtained by combining sequence-conservation and population-data-based tools with structural approaches. In that study [14], in vitro functional tests confirmed the predictions of four structural tools in 78% of cases, a considerably better rate than for methods that do not use structural information. Recent advances in ensemble methods include combining a larger number of tools, greater methodological diversity, using more training data, and using better machine learning methods such as random forests or gradient boosting machines, which overcome over-fitting and other computational problems such as highly unbalanced datasets.
Despite these advances, prediction errors are common even with the best available tools, particularly for rare variants [15]. For example, a recent quantitative comparison of the performance of current tools on rare cancer variants showed [15] that they have adequate sensitivity of 84-95% but low specificity in the range of 25-83%, implying a high probability of prediction errors. The level of accuracy currently achievable would translate to thousands of false annotations in a patient tumor sample with a high mutation rate, which is unacceptable. As a consequence, there remains a lack of consensus regarding which method is best for predicting variant effects.
In this study, we first illustrate how variant class discrimination improves with an increasing number of input data dimensions. We then describe a meta-learning consensus method which combines a selection of 39 of the best-performing functional impact, sequence conservation and pathogenicity scores, and one gene-level score. We train and cross-validate the consensus ensemble model on a dataset of known pathogenicity annotations derived from the ClinVar database. We provide rankings of the ability of these scores to predict pathogenicity, both individually and in combination, derived from the feature importance in the base models. We apply the ensemble model to predict pathogenicity scores and classes for the human exonic variants in all chromosomes. Finally, to demonstrate the potential of the method for variant effect prediction in clinical applications, we provide data on its application to several highly pathogenic genes.

Machine Learning algorithm
A Stacked Ensemble of Supervised Deep Learners (SESDL) is a machine learning algorithm which combines several prediction algorithms, so-called base-learners, using a process called stacking (also known as meta- or super-learning). Unlike the bagging and boosting approaches, also used for combining base-learners, the goal in stacking is to combine a diverse set of learners. It has been shown [16] that a super-learner ensemble represents an asymptotically optimal system for learning. Among stacking algorithms, super-learning is distinguished by the direct use of cross-validation for tuning-parameter selection.
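The paper's ensemble is built with H2O in R (see below); purely as a language-agnostic illustration, the stdlib-only Python toy below (all learners, data and numbers are hypothetical, not the paper's implementation) mimics the stacking recipe: base-learner predictions are generated out-of-fold to form the level-one data, and the meta-learner then combines them with nonnegative, normalized weights.

```python
import math
import random

def logloss(p, y):
    """Binary logloss of predicted probability p for true label y (0/1)."""
    eps = 1e-12
    return -(y * math.log(max(p, eps)) + (1 - y) * math.log(max(1 - p, eps)))

def train_threshold_learner(rows, labels, feat):
    """A deliberately weak base learner: a single threshold on one feature."""
    best_t, best_loss = None, float("inf")
    for t in (x[feat] for x in rows):
        preds = [0.9 if x[feat] > t else 0.1 for x in rows]
        loss = sum(logloss(p, y) for p, y in zip(preds, labels))
        if loss < best_loss:
            best_t, best_loss = t, loss
    return lambda x, t=best_t: 0.9 if x[feat] > t else 0.1

random.seed(0)
rows = [(random.random(), random.random()) for _ in range(200)]
labels = [1 if x0 + 0.3 * x1 > 0.7 else 0 for x0, x1 in rows]

# Level-one data: cross-validated (out-of-fold) predictions of each base
# learner, so the meta-learner never sees a base learner's training-set fit.
k = 5
level_one = [[0.0, 0.0] for _ in rows]
for fold in range(k):
    train_idx = [i for i in range(len(rows)) if i % k != fold]
    test_idx = [i for i in range(len(rows)) if i % k == fold]
    for j, feat in enumerate((0, 1)):
        model = train_threshold_learner([rows[i] for i in train_idx],
                                        [labels[i] for i in train_idx], feat)
        for i in test_idx:
            level_one[i][j] = model(rows[i])

# Meta-learner: nonnegative weights summing to 1 that minimize logloss on
# the level-one data (a coarse grid search stands in for the nonnegative GLM).
best_w, best_loss = None, float("inf")
for step in range(21):
    w = (step / 20, 1 - step / 20)
    loss = sum(logloss(w[0] * p0 + w[1] * p1, y)
               for (p0, p1), y in zip(level_one, labels))
    if loss < best_loss:
        best_w, best_loss = w, loss
print("meta-learner weights:", best_w)
```

The out-of-fold construction of the level-one data is the essential design choice: it is what makes the super-learner's weight selection an honest, cross-validated model comparison rather than a fit to the base learners' training error.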
As a basis for our approach, we use the H2O tool [17], which is an open source, in-memory, distributed, fast, and scalable machine learning and predictive analytics tool that allows users to build machine learning models for big data analysis, implemented as a package in the statistical environment R. The 'meta-learning' algorithm, implemented here using the h2o.stackedEnsemble function of the H2O tool (version 3.0), finds an optimal combination of the following three base learners: generalized linear model (GLM) [18], gradient boosting machine (GBM) [19] and distributed random forest (DRF) [20].
The popular GLM method is a well-known generalization of linear regression for response variables. The more recently developed DRF and GBM methods both build collections of decision trees. A decision tree captures interactions among different features, with the interaction order controlled by a tree-depth parameter: higher depths allow higher-order interactions. There are significant differences between these algorithms: briefly, DRF averages multiple decision trees created on different random samples of rows and columns, while GBM builds the model in a stage-wise fashion. Both algorithms are non-linear, robust to noisy data, and provide the importance of each predictor in the model. Here the base models use the 'multinomial' distribution for class prediction; the distribution types used for the continuous combination score are given in the Supplementary Materials.
The key parameters (max_depth, ntrees, max_after_balance_size, learn_rate/sample_rate, min_rows) for DRF, GBM and GLM were chosen using a grid search in the parameter space, to produce fits with the lowest logarithmic score that remained stable across the five-fold cross-validations. The logarithmic score [21] is connected to Shannon entropy and the Kullback-Leibler divergence; here it is denoted by the term logloss (Eq. 1), as in the H2O documentation. It measures the performance of a multinomial classification model, and the goal of the machine learning model is to minimize it. A perfect model would have a logloss of 0; logloss increases as the predicted probability diverges from the actual label. For multi-class classification, the logloss values for each class prediction are summed over the observations:

logloss = -(1/N) * Σ_o Σ_c y_c,o * log(p_c,o)    (Eq. 1)

where y_c,o is a binary indicator (0 or 1) of whether class label c is the correct classification for observation o, and p_c,o is the model's predicted probability that observation o is of class c.
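Eq. 1 translates directly into code. A minimal sketch (the class probabilities below are hypothetical):

```python
import math

def multiclass_logloss(y_true, probs, eps=1e-15):
    """Mean multinomial logloss: -(1/N) * sum_o sum_c y_{c,o} * log(p_{c,o}).

    y_true: list of integer class indices; probs: per-observation lists of
    predicted class probabilities. With one-hot y, only the probability of
    the true class contributes to each observation's term.
    """
    n = len(y_true)
    total = 0.0
    for c_true, p in zip(y_true, probs):
        total -= math.log(min(max(p[c_true], eps), 1 - eps))  # clip for safety
    return total / n

# Three observations over five pathogenicity classes (probabilities assumed).
y = [0, 2, 4]
p = [
    [0.70, 0.10, 0.10, 0.05, 0.05],   # confident and correct
    [0.20, 0.20, 0.20, 0.20, 0.20],   # uninformative
    [0.05, 0.05, 0.10, 0.10, 0.70],   # confident and correct
]
print(round(multiclass_logloss(y, p), 3))  # 0.774
```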
Since the classes contained different numbers of data points, learning was performed with class balancing. The meta-learner algorithm was a GLM with nonnegative weights. When the algorithms were learning categorical values, McFadden's pseudo R-squared, also known as the likelihood-ratio index, was used instead of R^2. It is a statistical measure of how close the fitted predictions are to the input data. The log likelihood of the intercept-only model is treated as the total sum of squares, and the log likelihood of the full model as the sum of squared errors; one minus their ratio, which lies in the 0-1 range, gives the level of improvement of the full model over the intercept model. When comparing two models on the same data, McFadden's measure is higher for the model with the greater likelihood.
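McFadden's measure can likewise be sketched in a few lines. The predictions below are hypothetical, and the intercept-only (null) model is taken here to predict each class by its overall frequency, which is an assumption of this sketch:

```python
import math

def mcfadden_r2(y_true, probs):
    """McFadden's pseudo R-squared: 1 - LL(full model) / LL(intercept model).

    Assumption of this sketch: the intercept model predicts every class
    by its overall frequency in y_true.
    """
    n = len(y_true)
    freq = {c: y_true.count(c) / n for c in set(y_true)}
    ll_null = sum(math.log(freq[c]) for c in y_true)
    ll_full = sum(math.log(p[c]) for c, p in zip(y_true, probs))
    return 1 - ll_full / ll_null

# Hypothetical predictions for a balanced three-class problem.
y = [0, 0, 1, 1, 2, 2]
p = [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1], [0.1, 0.8, 0.1],
     [0.2, 0.6, 0.2], [0.1, 0.1, 0.8], [0.1, 0.2, 0.7]]
print(round(mcfadden_r2(y, p), 3))  # 0.713
```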

Datasets
The GnomAD [7] dataset extends the ExAC [6] dataset from 60,706 to 125,748 unrelated individuals of a diverse set of ethnicities including European, African, Latino, South Asian, East Asian, and Other. It contains allele frequencies of the sequenced genetic (exome and whole genome) variants.
ClinVar [22] is a freely accessible, public archive of the relationships among human variations and phenotypes, with supporting evidence. Its interpretations of the clinical significance of germline and somatic variants for reported conditions are based on "ground truth": clinical and biological evidence. The archive, located at the National Center for Biotechnology Information (NCBI), is maintained according to the variant classification standards and guidelines of the American College of Medical Genetics and Genomics (ACMG) and the Association for Molecular Pathology (AMP) [5]. The ClinVar dataset is the subset of the GnomAD dataset with non-empty ClinVar annotations, comprising a total of 227,232 nsSNVs.
The dbNSFP [23] is a database integrating annotations from multiple sources, including allele frequencies from ExAC and GnomAD, clinical annotations from ClinVar, and many state-of-the-art functional and conservation annotations for the comprehensive collection of human SNVs. Its current academic version (the "a" branch) includes a total of 84,013,490 non-synonymous SNVs (Table S2).
Each of the curated training, prediction and validation datasets contained the following annotations from the dbNSFP4.0 database [23]: 40 functional annotation scores (Table 1; the scores' sources are listed in Table S1); the gene-level score Gene_indispensability_score, a probability prediction of whether the gene is essential (from 0 to 1); and the two AAF and two AAFpopmax fields for the exome and control samples: gnomAD_exomes_AF is the alternative allele frequency in all of the gnomAD exome samples (125,748 samples); gnomAD_exomes_POPMAX_AF is the maximum allele frequency across populations (excluding samples of Ashkenazi, Finnish, and indeterminate ancestry); gnomAD_exomes_controls_AF is the alternative allele frequency in the controls subset of the gnomAD exome samples (54,704 samples); gnomAD_exomes_controls_POPMAX_AF is the maximum allele frequency across populations (with the same exclusions) in the controls subset.
The gene indispensability score was included because it was expected to lead to a significant improvement in predictive performance, particularly for rare variants and VUSs. It is well known that a moderately deleterious variant in a non-essential gene can often manifest as non-pathogenic, e.g. when it is located in a mutation-tolerant gene, or in a truncated dysfunctional gene that has a functional duplicate; whereas such a variant in an essential gene is usually highly pathogenic, since essential genes are involved in several cellular networks, so that their disruption leads to a major loss of cell function. The original indispensability model [24] builds a 'multinet' from several networks and then derives an 'indispensability' score from the degree centralities and other parameters of various types of networks: protein-protein interactions (PPI), phosphorylation, signaling, metabolic, genetic, regulatory, essentiality, LoF-tolerance, the number of networks in which a gene is involved, the number of protein interaction interfaces, dN/dS ratios, paralogs, etc. The score value is high, e.g., for variants in oncogenes and tumor suppressors, and very low in duplicated and mutation-tolerant genes.

Training, prediction and validation procedures
The training and prediction method used the 40 scores (Table 1) and the gene indispensability score derived from the dbNSFP4 database. Because the LINSIGHT_rankscore ("x31") was missing for most of the variants in GnomAD, it was omitted and replaced with the mean of the (up to 39) scores available for each variant, termed "Mean_rankscore".
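The substitution can be sketched as follows (the score names and values below are hypothetical):

```python
def mean_rankscore(scores):
    """Per-variant mean of the available (non-missing) rank-scores.

    `scores` maps score names to values in [0, 1]; missing scores are None.
    """
    present = [v for v in scores.values() if v is not None]
    return sum(present) / len(present)

# Hypothetical variant: LINSIGHT ("x31") missing, three scores available.
variant = {"x10_VEST4": 0.91, "x20_CADD": 0.84,
           "x24_Eigen": 0.77, "x31_LINSIGHT": None}
variant["x31_LINSIGHT"] = mean_rankscore(variant)  # substitute the mean
print(round(variant["x31_LINSIGHT"], 2))  # 0.84
```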
The combined "GnomAD-AAF" dataset consisted of 5,502,403 variants, each with Gnomad_exome_AF annotations. This dataset was further subdivided into 'Control', with non-zero Gnomad_exome_control_AF annotations (3,451,486 variants obtained from 54,704 samples from individuals not associated with any disease), and 'Disease', which included the remaining 2,050,917 variants (i.e. annotated with Gnomad_exome_AF but not with Gnomad_exome_control_AF). There were two AAF versions used: Gnomad_exome_AF and Gnomad_exome_AFpopmax annotations.
All nsSNV variants listed in the dbNSFP4.0 database were classified into the established categories of ClinVar annotations. The training/cross-validation dataset comprised the GnomAD variants with ClinVar annotations (133,514 variants) across five classes: 'Benign' (35,659 variants, denoted by the label "2" in calculations), 'Likely_benign' (21,931 variants, labelled "1", including Benign/Likely_benign), 'Uncertain_significance' (18,716 variants, labelled "0"), 'Likely_pathogenic' (21,510 variants, labelled "-1", including Pathogenic/Likely_pathogenic), and 'Pathogenic' (35,698 variants, labelled "-2"). Note that to improve the balance between the classes, the original ClinVar categories Benign/Likely_benign and Likely_benign were merged into the 'Likely_benign' class and, similarly, ClinVar's Pathogenic/Likely_pathogenic and Likely_pathogenic were merged into the 'Likely_pathogenic' class; the 'Uncertain_significance' class was randomly down-sampled seven-fold. The 2D histograms for the resulting five ClinVar categories were plotted as heat maps; the histogram plot illustrates the distribution of variants across the 41 score dimensions separately for each class. The models were then applied to predict cVEP classes for the prediction dataset comprising all nsSNVs contained in the dbNSFP4.0 database, i.e. 84,013,490 human nsSNVs in the human coding genome (exome), split by chromosome regions. The resulting database is provided via a URL link. In addition, we illustrate the dependence of the cVEP classifications on gene indispensability, and differences in individual genes of two pathogenic types.
Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 31 July 2020 doi:10.20944/preprints202007.0735.v1
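The merging and labelling of ClinVar categories into the five training classes can be sketched as a simple mapping (a sketch of our reading of the text, not the actual pipeline code):

```python
# ClinVar categories mapped to the five training labels used in the text;
# combined categories are merged to reduce class imbalance.
CLASS_LABELS = {
    "Benign": 2,
    "Likely_benign": 1,
    "Benign/Likely_benign": 1,            # merged into 'Likely_benign'
    "Uncertain_significance": 0,
    "Likely_pathogenic": -1,
    "Pathogenic/Likely_pathogenic": -1,   # merged into 'Likely_pathogenic'
    "Pathogenic": -2,
}

def to_training_label(clinvar_category):
    """Integer label (-2 .. 2) for a ClinVar significance category."""
    return CLASS_LABELS[clinvar_category]

print(to_training_label("Benign/Likely_benign"))   # 1
print(to_training_label("Pathogenic"))             # -2
```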

Visualization in 1-and 2-dimensional space
For the purpose of illustrating the usefulness of combining different scores, we show how the variants belonging to different pathogenicity classes can be progressively better discriminated, going from 1- to 2-dimensional space using one or two selected features, respectively. Subsequently, we show how the classification in nD space (41 features) is learnt by our proposed ensemble machine learning approach.
3.1.1. One-dimensional discrimination of ClinVar variants using AAF, Indispensability and VEST4 annotations

Figure 1 displays histograms of the five classes of variants from the GnomAD database with available ClinVar annotations, in relation to three fit features: AAF, Indispensability and VEST4, where the latter is the best-performing functional annotation (see Section 3.2.4). For AAF, a threshold between the benign and pathogenic classes is clearly identifiable on the plot at 0.0003, but the corresponding "Likely" classes overlap more. The histograms in the middle panel are very similar, but there are twice as many pathogenic as benign variants at low indispensability (histogram bin at 0.1), and the opposite is true at 0.95. This translates to there being twice as many benign variants in mutation-tolerant genes (i.e. those with a low indispensability score), and twice as many pathogenic variants in essential genes (i.e. those with high indispensability). There is a very clear separation in the right panel, where pathogenic and benign variants (as well as the respective "Likely" classes) can be distinguished by a threshold at a VEST4 score of 0.64. There is no differentiation for the "Uncertain_significance" class ('Undefined' label).

3.1.2. 2D discrimination of variants by pathogenicity using the population allele frequencies

Figure 2 illustrates the distribution of the prevalence of all AAF and AAFpopmax values among the nsSNV variants sequenced in the GnomAD (Figure 2, left panel) and ClinVar (Figure 2, right panel) databases. A variant lies above the diagonal on the plots when AAFpopmax > AAF, signifying that the variant has a higher prevalence in some sub-population than in the entire population. The 'control' group variants (cyan dots) span the entire range of AAF above the diagonal; rare variants are found both above and below the diagonal to a similar extent.
Conversely, 'disease' variants, i.e. the GnomAD exome variants that are not control variants, are mostly very rare (AAF range 10^-4 to 10^-6) and are situated above the diagonal on the plot. Similarly to the 'control' group, the ClinVar annotations in the 'Benign' and 'Likely_benign' groups span the entire range of frequencies, while, similarly to the 'disease' group, the 'Likely_pathogenic' and 'Pathogenic' groups are clustered in the very rare range and situated above the diagonal. Overall, the majority of common, rare and very rare variants are above the diagonal on both plots, while a large proportion of rare variants are below the diagonal. It is noteworthy that both the 'Disease' (brown dots, Figure 2, left) and 'Pathogenic' variants (red circles, Figure 2, right) lie mostly above the diagonal. Consequently, the difference measure (AAFpopmax - AAF > 0) is potentially useful, in combination with AAF < 10^-4, as an additional means to distinguish very rare pathogenic variants from very rare benign variants. This highlights the need to combine multiple scores with a flexible machine learning approach that can, e.g., automatically leverage such difference information.
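The suggested combined filter can be expressed as a one-line predicate; the cutoffs follow the text, while the example frequencies are hypothetical:

```python
def popmax_excess_flag(aaf, aaf_popmax, rare_cutoff=1e-4):
    """Flag very rare variants enriched in some sub-population:
    AAF < rare_cutoff (very rare) and AAFpopmax - AAF > 0 (above diagonal)."""
    return aaf < rare_cutoff and (aaf_popmax - aaf) > 0

# Hypothetical variants, given as (AAF, AAFpopmax):
print(popmax_excess_flag(2e-5, 3e-4))   # True: very rare, sub-population excess
print(popmax_excess_flag(2e-5, 2e-5))   # False: no sub-population excess
print(popmax_excess_flag(5e-3, 8e-3))   # False: not rare enough
```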
Each functional prediction tool produced a converted rank-score for each variant, with values ranging from 0 (benign) to 1 (pathogenic). To illustrate the average score values in relation to AAF, Figure 3 shows the means of the 40 scores for the GnomAD dataset, computed for six equally separated AAF ranges on the log10 scale. As the ranges progress from common to very rare (from bottom to top on the left panel), the score means increase, for example from around 0.2 for the range [1.0-0.1] to 0.5 for the range [1e-5 - 1e-6]. The right panel shows that for most scores there is a close-to-linear relationship between the score mean values and the AAF value (the lower bound of the range), while some scores show a step-function dependence.

Ranking of predictive capacity of individual pathogenicity features
To obtain a ranking of the features based on cross-correlation between the score values and the predicted class values, we used the ClinVar dataset as described in the Methods section, training only the single base classifier "GLM" to predict classes when inputting only one feature, i.e. each of the scores individually. The rankings, sorted by McFadden's pseudo R-squared (which quantifies explained variation), are given in Table 2. The most highly correlated score was "x10" (VEST4) with R^2 = 0.82, closely followed by "x20" and "x24" with R^2 of 0.81. The other R^2 values were in the range 0.75-0.81.

Table 2. Ranking of individual features by McFadden's pseudo R-squared.

  x10  0.818  VEST4
  x20  0.811  CADD
  x24  0.811  Eigen
  x31  0.808  Mean
  x6   0.806  MutationTaster
  x25  0.805  Eigen-PC
  x15  0.799  MutPred
  x14  0.798  REVEL
  x13  0.798  M-CAP
  x12  0.797  MetaLR
  x19  0.793  Deogen2
  x16  0.793  MVP
  x18  0.792  PrimateAI
  x4   0.791  Polyphen2_HVAR
  x22  0.790  fathmm-MKL
  x11  0.789  MetaSVM
  x17  0.787  MPC
  x9   0.787  Provean
  x3   0.786  Polyphen2_HDIV
  x7   0.786  MutationAssessor
  x1   0.785  SIFT
  x33  0.784  phyloP100_vertebrate
  x2   0.781  SIFT4G
  x5   0.781  LRT
  x39  0.781  SiPhy_29way_logOdds
  x36  0.780  phastCons100vertebrate
  x21  0.779  DANN
  x23  0.779  fathmm-XF
  x32  0.775  phyloP100_vertebrate
  x8   0.771  fathmm
  x37  0.770  phastCons30mammalian
  x34  0.768  phyloP30_mammalian
  x38  0.767  phastCons17_primate
  x26  0.765  GenoCanyon
  x41  0.763  Gene_indispensability
  x35  0.762  phyloP17way_primate
  x27  0.751  integrated_fitCons
  x29  0.750  H1-hESC_fitCons
  x40  0.750  bStatistic
  x30  0.750  HUVEC_fitCons
  x28  0.750  GM12878_fitCons

Figure 5 shows 2D histograms of the values of the 41 scores for the nsSNVs, divided into the five ClinVar-derived pathogenicity classes (see Methods). To aid interpretation, the scores are sorted in descending order along the x-axis according to Table 2, i.e. by the individual feature correlation with the predicted classes.
Consequently, it is apparent that the score values in the left half of the plot are considerably more correlated with and similar to each other than in the right half, where the histogram patterns are much more random or less correlated with the predicted class. In particular, in the left half the pure 'Benign' class has high (normalized) histogram counts for score values close to 0.0 (i.e. towards the bottom of the class panel), and low counts for medium and high score values. The 'Likely_benign' class has a very similar pattern to the pure 'Benign' class. In contrast, the 'Likely_pathogenic' and 'Pathogenic' classes have high variant prevalence for score values close to 1.0 (towards the top of the class panel). The histogram pattern of the 'Uncertain_significance' class in the left half is quite indistinct, while in the right half the pattern is similar to the 'Benign' class. In the right half of each panel the histogram patterns tend to be somewhat similar across classes, and thus poorly correlated with the predicted class.
Figure 5. 2D histograms of the 41 score values for the five ClinVar classes, with scores sorted along the x-axis according to Table 2. The y-axis on the left side denotes the five ClinVar classes, and the y-axis on the right corresponds to the histogram bins for the values of each feature (on the x-axis), i.e. 10 equally 0.1-spaced bins between 0 and 1. The colour from white (0) to blue (0.7) corresponds to the histogram counts in each bin, normalized so that the variant counts sum to 1 for each class and each score.

Training and cross-validation of the ensemble model
The prediction method uses for training the annotation features (listed in Table 1) derived for the GnomAD variants having ClinVar annotations, as described in the Methods section. The accuracy measures for the three base-learners ("GLM", "GBM", "DRF") and the Superlearner ("Ensemble") from five-fold cross-validation (80% training, 20% testing, with data points selected randomly using the 'modulo' algorithm from the h2o package) are given in Tables S2-5, with the main test metrics summarized in Table 3a.
The similar performance measures and the small standard deviations across the five cross-validation runs (Tables S2-5) point to the robustness of the prediction of cVEP classes. The resulting ensemble performance is a mean-per-class error of 8.4%, with 7.4% of the total number of predictions being misclassifications according to the confusion matrix (Table S4), i.e. predictions falling into a class other than the correct one. This accuracy corresponds to a "Hit ratio 1" of 93% for predicting the correct class exactly, and a "Hit ratio 2" of 98.8% for predicting either the correct or the nearest class. Since 'nearest' is along the class sequence ("-2" Pathogenic, "-1" Likely_pathogenic, "0" Uncertain_significance, "1" Likely_benign, "2" Benign), misclassification into the nearest class is acceptable, while misclassification by three or more classes is not (Hit ratios 3-5). The errors are attributable to ambiguity in the classes themselves, e.g. mixing between the 'likely' and pure classes. As can be seen from the confusion matrix (Table S4), most of the misclassifications occur between "Pathogenic" and "Likely_pathogenic", and between "Likely_benign" and "Benign". In addition, inaccuracies are due to errors in the input scores, and to missing data for some scores (in the GnomAD database the number of scores available per variant ranges between a minimum of 23 and a maximum of 40). Table 3b shows the accuracies for real-valued pathogenicity prediction. Here the training/testing datasets were exactly the same, except that the predicted column was treated as real-valued, i.e. -2, -1, 0, 1, 2 (instead of the symbolic classes "-2", "-1", ... used above). The majority of the training parameters were also the same, except for the distribution types.
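A minimal sketch of the hit-ratio computation as we read it (k = 1 counts exact hits, k = 2 also counts nearest-class hits; the example predictions are hypothetical):

```python
CLASS_ORDER = [-2, -1, 0, 1, 2]  # Pathogenic ... Benign, as in the text

def hit_ratio(y_true, y_pred, k):
    """Fraction of predictions within k-1 steps of the true class along the
    ordered class sequence (k=1: exact; k=2: exact or nearest class)."""
    idx = {c: i for i, c in enumerate(CLASS_ORDER)}
    hits = sum(abs(idx[t] - idx[p]) <= k - 1 for t, p in zip(y_true, y_pred))
    return hits / len(y_true)

# Hypothetical predictions: 8 exact, 1 off by one class, 1 off by two.
y_true = [-2, -2, -1, -1, 0, 0, 1, 1, 2, 2]
y_pred = [-2, -1, -1, -1, 0, 0, 1, 1, 2, 0]
print(hit_ratio(y_true, y_pred, 1))  # 0.8
print(hit_ratio(y_true, y_pred, 2))  # 0.9
```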

Ranking by importance of the features in the consensus models
Ranking the scores by their correlations with the predicted class (Table 2, Figure 5) suggested a high level of redundancy among the features, since all of the correlations were similarly high. In contrast, ranking by feature importance within the models more distinctly reveals the most important and unique contributions made by each feature.
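The difference between the two rankings can be illustrated on synthetic data: two near-duplicate informative features correlate equally with the target, yet a tree-ensemble importance ranking (here scikit-learn's random forest as a stand-in for DRF, not the study's h2o models) still separates informative from uninformative features:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Two highly correlated (redundant) informative features plus pure noise.
# Impurity-based importances (analogous to DRF variable importance)
# rank the noise feature last, which plain target correlations
# cannot distinguish between the two redundant features.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 500)
x1 = y + rng.normal(0, 0.5, 500)      # informative
x2 = x1 + rng.normal(0, 0.05, 500)    # near-duplicate of x1
x3 = rng.normal(0, 1, 500)            # uninformative noise
X = np.column_stack([x1, x2, x3])

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
imp = model.feature_importances_      # sums to 1; x3 ranks lowest
```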

Variation in cVEP by gene indispensability score
To assess differences in predicted variant pathogenicities as a function of the gene indispensability score I, the variants were selected from the entire GnomAD dataset in three ranges (see Table 5).
From the table it is clear that the mutation-tolerant group has a considerably higher proportion of predicted benign or likely benign variants, at around 80%. In contrast, the essential group has a smaller proportion of predicted benign variants, with most classified as uncertain or likely benign. 66% of the intermediate group's variants were predicted to be benign or likely benign.

Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 31 July 2020 doi:10.20944/preprints202007.0735.v1

Table 5. Proportions of predicted pathogenicity classes for three gene indispensability ranges.

The proportion of variants predicted to be pathogenic in a selection of known pathogenic genes (Table 6) was, as expected, higher than in the total GnomAD dataset. The highest proportions of pathogenic or likely pathogenic variants among the cardiac disorder genes were found for MYH7 (61%) and FBN1 (52%). The proportion of variants with predicted uncertain significance is relatively large in this group, the highest being 74% for DSP and 69% for PKP2. For recessive genes the pathogenic proportion is highest for HBB (70%) and GJB2 (62%).
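The tabulation underlying such proportions is a straightforward per-range tally. A sketch in plain Python; note that the bin edges (0.3, 0.7) below are hypothetical placeholders, not the indispensability cut-offs actually used for Table 5:

```python
from collections import Counter

# Tally predicted-class proportions within gene-indispensability ranges.
# The bin edges (0.3, 0.7) are illustrative placeholders only.
def bin_of(score, edges=(0.3, 0.7)):
    if score < edges[0]:
        return "mutation-tolerant"
    if score < edges[1]:
        return "intermediate"
    return "essential"

def class_proportions(variants):
    """variants: iterable of (indispensability_score, cVEP_class) pairs."""
    groups = {}
    for score, cls in variants:
        groups.setdefault(bin_of(score), Counter())[cls] += 1
    return {g: {c: n / sum(cnt.values()) for c, n in cnt.items()}
            for g, cnt in groups.items()}

demo = [(0.1, "Benign"), (0.1, "Benign"), (0.2, "Uncertain_significance"),
        (0.9, "Pathogenic"), (0.8, "Uncertain_significance")]
props = class_proportions(demo)
# props["mutation-tolerant"]["Benign"] == 2/3
```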

Discussion
The approach presented here enables two-step prioritization of variants, whereby initial classification into the cVEP classes is followed by filtering within the selected class, e.g. the 'pathogenic' class, using the continuous-valued consensus score or any of the best-performing scores. The study ranks the features by importance, which can be used as guidance when selecting scores for variant filtering. Many NGS applications, especially in the clinical domain, would benefit from directly using a predicted pathogenicity classification that is compatible with the clinically and experimentally determined ClinVar database classes and defined according to the ACMG/AMP standards and guidelines for use in clinical practice. The presented results demonstrate that the class prediction is highly accurate, with an error rate of around 1% for cVEP 2-class classification. Moreover, prediction performance is likely to improve over time with more data available for training and with the development of better machine learning tools. Clearly, it would be preferable to use actual clinical, experimental in vitro or in vivo test results to determine variant pathogenicity instead of a prediction score. However, such data are currently unavailable for more than 99% of all potential nsSNV mutations. This situation is likely to improve in the future: for example, it might become possible to obtain direct measures of pathogenicity using high-throughput in vitro cell screening assays of the effects of synthetic mutations, e.g. using the CRISPR approach [25], leading to an increase in the percentage of known annotations. However, some types of variants, such as null mutations, appear to be too pathogenic to exist in a living cell or will fall below a screen's detection limit.
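The two-step prioritization scheme can be sketched as a class filter followed by a sort on the continuous score. The field names below ("cvep_class", "consensus") are illustrative, not the published column names:

```python
# Two-step prioritization sketch:
# (1) keep variants predicted in the class of interest,
# (2) rank them by the continuous consensus score; on the -2..2 scale,
#     more negative means more pathogenic, so sort ascending.
def prioritize(variants, keep_class="Pathogenic", top=None):
    kept = [v for v in variants if v["cvep_class"] == keep_class]
    kept.sort(key=lambda v: v["consensus"])   # most pathogenic first
    return kept[:top] if top is not None else kept

variants = [
    {"id": "v1", "cvep_class": "Pathogenic", "consensus": -1.2},
    {"id": "v2", "cvep_class": "Benign",     "consensus": 1.8},
    {"id": "v3", "cvep_class": "Pathogenic", "consensus": -1.9},
]
ranked = prioritize(variants)   # v3 ranks ahead of v1; v2 is filtered out
```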
The rankings by feature importance in the three base-models provide three differing selections of the best scores for variant filtering. The GBM approach selects features sequentially: each newly selected feature is orthogonal to those preceding it in the sequence and contributes the most to classification. Thus, the GBM list of top-ranking features has much less redundancy than the DRF list, where the most-contributing features are selected without an orthogonality requirement. It is interesting that the GLM list top-ranks the 'Mean' feature, which is computed as a simple average of the features available for each variant. Due to noise averaging, the mean score should indeed have a theoretically lower error than any individual score feature used alone (a noise standard deviation smaller by a factor of up to √40 ≈ 6.3, assuming independent errors), which is consistent with its high rank in the linear regression model. However, this mean score is less important for the DRF and GBM base approaches. The features contributing most to the resulting ensemble, or consensus, scores are VEST4, CADD and MutPred, which form the intersection of the top-5 most important features for GBM and DRF. These, or any of the top-5 features in the base models, are proposed for use in variant filtering.
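The noise-averaging argument can be checked numerically: the standard deviation of the mean of N independent noisy estimates shrinks by a factor of √N, which for N = 40 scores is about 6.3. A simulation with synthetic Gaussian noise (purely illustrative, not the study's score data):

```python
import numpy as np

# Averaging N independent noisy estimates of the same signal shrinks the
# noise standard deviation by a factor of sqrt(N); for N = 40 that is
# sqrt(40) ~ 6.3 (the variance, not the std, drops by a factor of 40).
rng = np.random.default_rng(1)
n_variants, n_scores, sigma = 20000, 40, 1.0
noise = rng.normal(0.0, sigma, size=(n_variants, n_scores))

single_sd = noise[:, 0].std()         # ~ 1.0 (one score alone)
mean_sd = noise.mean(axis=1).std()    # ~ 1.0 / sqrt(40) ~ 0.158
ratio = single_sd / mean_sd           # ~ 6.3
```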
The novel gene indispensability score was included among the predictors for several reasons. On the one hand, a considerable number of potentially deleterious variants (i.e. variants with high pathogenicity scores) are located in mutation-tolerant genes. Such variants, if they have high allele frequencies, could be assumed neutral on that basis, or could be associated with no negative effects in cells, and as a result were assigned to the 'Benign' class in the ClinVar database. On the other hand, moderately deleterious variants (i.e. those with medium-level pathogenicity scores) in essential genes can be associated with very adverse effects in cells, and were included in the 'Pathogenic' ClinVar class. A relatively large number of such cases could be a major source of errors when training is based on pathogenicity scores alone. Thus, it is not surprising that the indispensability score contributed significantly to the improvement in accuracy: this score ("x31") was listed 10th in the DRF ranking by importance. In our initial classification tests (data not shown), adding this score reduced the total classification error more than fivefold, from ca 10% (without indispensability) to the current <2%.
The method presented here could be subject to criticism on a number of fronts. Firstly, one might argue that the effect of a particular mutation in a living cell depends on many other factors not necessarily caused by the mutation per se: it may be determined by gene regulation or by a combination of many mutations, including in other genes, by an accumulated history of direct and indirect effects, or by other external factors. It is known that the pathogenicity of some hereditary mutations manifests itself only with age, such as in the BRCA1 and BRCA2 genes. Each mutation might contribute only a small fraction to the overall risk of developing a pathological condition. This can render predictions based on the association of a mutation with its causative manifestation either infeasible or simply too ambiguous. In our approach this translates into many predictions falling into the "uncertain significance" class, in which variants could be conditionally pathogenic depending on other unknown factors. Secondly, some of the pathogenicity prediction scores, in particular those based on gene sequence conservation, are measures obtained from alignments of genetic sequences across mammalian families, of which the human population is only a part. This implies an indirect relationship between the values of mammalian conservation scores and the likelihood of observing clinical or biological pathogenicity effects in humans. Nonetheless, when a gene variant is conserved (and common) in a mammalian species family, it might be common in humans as well, and thus is more likely to be considered benign. The indirect negative connection between a variant's allele frequency and its pathogenicity, however, stems from a long history of negative evolutionary selection pressure on dysfunctional or pathogenic variants, and positive selection for those that are beneficial.
Overall, the evolutionary link between all three characteristics of an allele variant (human frequency, mammalian family conservation and pathogenicity) is undeniable: pathogenic variants are less conserved within families and less frequent in genetic sequences, while beneficial and benign mutations are more prevalent. Thirdly, although these results are already useful today, we suggest treating this attempt to classify all variants as preliminary, and using the predictions with a degree of caution, similar to the use of allele frequencies and the impact-effect predictive scores. Nonetheless, the consensus VEP class has better accuracy than each of the individual input scores, and might be preferable to using population-derived AAFs for ultra-rare and very rare variants, which currently suffer from a significant bias in the sampling of human populations. Finally, the genome-wide pathogenicity class predictions may become a novel source of leads on causative pathogenic variants and genes suitable for further testing, in addition to those discovered in GWAS.

Conclusions
In this work, predicted pathogenicity consensus scores and classifications were obtained for all potentially possible non-synonymous human exome SNVs, using a machine learning method with 39 functional impact scores as primary inputs. The gene indispensability score input accounted for differences in variant pathogenicity between essential and mutation-tolerant genes. We showed that nearly all of the functional, conservation and impact scores used in this study correlate with each other and, on average, with the evidence-based pathogenicity data, and anti-correlate with human population allele frequencies.
This suggests that these measures are indeed highly related, and that a combination approach is an appropriate method for predicting variant pathogenicity. Assessment of the methods' performance using five-fold cross-validation on the ClinVar dataset confirmed the reliability and robustness of the predictions. The good performance of the method can be attributed to the consensus combination of a wide array of conservation and impact effect scores, each based on its own respective domain of evidence, e.g. all protein 3D structures for CADD. Another advantage of the method is that the models are trained on multiple available pathogenicity classes, in contrast to some existing methods that utilize only two (pathogenic/benign) classes. Finally, the predicted consensus classifications are intended to help distinguish between pathogenic, uncertain_significance and benign variants for all human exome nsSNVs in biological and clinical applications of Whole Exome Sequencing.

Supplementary Materials:
The following are available online at www.mdpi.com/xxx/s1-3: Table S1, with the list of the impact effect scores used as input in the machine learning procedure; Table S2, with the individual parameters of the base- and meta-learners; Tables S3a-e, with the outputs of the model fits containing cross-validation metrics for the base- and ensemble-learners. Availability: the predicted consensus VEP (cVEP) class scores for the 84,013,118 human exome nsSNVs are available at https://cvep.imbi.uni-freiburg.de/cvep.tgz as a 1.2 Gb tar-gzip archive in BED (Browser Extensible Data) format, split by chromosome regions. The archive can be searched using the provided Java program by hg19 or hg38 coordinates (1-based), gene, or chromosome; it is compatible with any variant annotation tool that can use the dbNSFP database.
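For readers who prefer not to use the provided Java program, a position lookup over standard BED records (chrom, 0-based start, end, name, ...) is straightforward to script. The sketch below is a generic illustration of the BED coordinate convention, with invented example records, and is not the project's search tool:

```python
# Minimal region query over BED-format records. BED uses 0-based,
# half-open intervals, while the query position here is 1-based
# (as for the hg19/hg38 coordinates mentioned above).
def query_bed(lines, chrom, pos1):
    """Return BED records overlapping a 1-based genomic position."""
    pos0 = pos1 - 1                      # convert 1-based -> 0-based
    hits = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == chrom and int(fields[1]) <= pos0 < int(fields[2]):
            hits.append(fields)
    return hits

# Invented example records for illustration:
bed = ["chr7\t117559592\t117559593\tvariantA",
       "chr7\t117600000\t117600001\tvariantB"]
hits = query_bed(bed, "chr7", 117559593)   # matches only the first record
```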