Computational Approaches to Prioritize Cancer Driver Missense Mutations

Cancer is a complex disease that is driven by genetic alterations. Genome-wide techniques have developed rapidly during the last decade and the cost of gene sequencing has fallen significantly, which has made cancer genomic data widely available. However, the interpretation of genomic data and the prediction of the association of genetic variations with cancer and disease phenotypes still require significant improvement. Missense mutations, which can render proteins non-functional and provide a selective growth advantage to cancer cells, are frequently detected in cancer. Effects caused by missense mutations can be pinpointed by in silico modeling, which makes it more feasible to find a treatment and reverse the effect. Specific human phenotypes are largely determined by stability, activity, and interactions between proteins and other biomolecules that work together to execute specific cellular functions. Therefore, analysis of the effects of missense mutations on proteins and their complexes would provide important clues for identifying functionally important missense mutations, understanding the molecular mechanisms of cancer progression and facilitating treatment and prevention. Herein, we summarize the major computational approaches and tools that provide not only the classification of missense mutations as cancer drivers or passengers but also the molecular mechanisms induced by driver mutations. This review focuses on the discussion of annotation and prediction methods based on structural and biophysical data, analysis of somatic cancer missense mutations in 3D structures of proteins and their complexes, predictions of the effects of missense mutations on protein stability, protein-protein and protein-nucleic acid interactions, and assessment of conformational changes in proteins induced by mutations.


Introduction
Cancer is a complex disease that is driven by genetic alterations. Cancer genome sequencing projects have revealed vast numbers of somatic mutations [1,2], and the majority of these are expected to be passenger mutations (i.e., mutations having no direct or indirect effect on a selective growth advantage of tumor cells) [3]. A group of key mutations, called drivers, significantly alter normal cellular systems [4,5], providing a selective growth advantage to cancer cells [3] that becomes apparent during different stages of oncogenesis. A large number of mutations detected in cancer are single nucleotide variants (SNVs), and those that alter amino acid sequences are called missense mutations. These mutations may affect protein structure/stability and disrupt protein interactions with other biomolecules, rendering proteins non-functional and potentially promoting tumor progression. Some missense mutations have been identified as drivers, such as the BRAF V600E mutation in melanoma [6] and the KRAS G12D and G12V mutations in colorectal cancer [7].
The key challenge in cancer research is to determine which mutations are likely to be drivers. Although mutations that are observed very frequently can be classified as drivers, many mutations discovered thus far are observed in a relatively small fraction of tumors [8]. Thus, methods that can identify driver or passenger mutations without explicitly relying on observed frequency counts are clearly needed [9]. Experimental methods, including functional studies in model organisms or in cultured cells using gene knockout or siRNA, are extremely useful for elucidating the function of individual mutated genes. However, they have limitations with respect to analyzing a large number of gene candidates from large-scale cancer genome projects. For missense mutations, one can considerably decrease the number of potential driver candidates by determining the functional impact of each mutation on proteins [10]. In addition, mutations that confer drug resistance should be identified. Overall, the binary driver-passenger model can and should be adjusted by taking into account additive pleiotropic effects of mutations [11,12]. Subcellular localization of proteins is also important to their biological functions and aberrant protein subcellular localization is closely correlated to cancer, such as primary human liver tumors [13] and breast cancer [14]. Knowing where a protein resides within a cell can give insight into identification of drug targets and drug design [15]. Several computational methods have been developed to determine the subcellular localization of proteins that deal with large-scale proteomic data [16,17].
In this review, we focus on the description of computational approaches and tools to annotate cancer driver missense mutations. We divide the process of annotating functional and driver variants into five independent, but related, approaches (Figure 1). The first consists of analyzing the distribution of cancer somatic missense mutations in 3D structures of proteins and protein complexes with protein-binding partners, nucleic acids and low molecular-weight ligands. These resources can help identify cancer drivers, drug biomarkers, or rationalize the mechanism of action. The second approach introduces computational methods for predicting the effects of missense mutations on protein stability, which may directly relate to their functional activity. Computational methods that accurately predict the effects of variations on protein stability may help identify functionally important mutations. The third approach describes computational methods for predicting the effects of missense mutations on protein-protein and protein-nucleic acid interactions. A protein's ability to establish highly selective interactions with macromolecular partners is a crucial prerequisite for proper biological function. A missense mutation affecting protein interactions may cause significant perturbations or complete abolition of protein function, potentially leading to disease. The fourth approach introduces molecular dynamics simulations to assess changes in proteins and their conformations induced by mutations, which may aid in the detection of cancer drivers and elucidation of molecular mechanisms. The fifth approach discusses several statistical methods for identifying potential functional impacts of cancer missense mutations and signs of positive selection across the patient cohort.

Data Resources for Cancer Missense Mutations
The progress in this rapidly developing field has induced unprecedented growth in databases on genetic variants, such as cancer-oriented databases and databases storing different types of human genetic variations [18][19][20][21]. These databases provide important resources for detecting disease-causing or cancer-driving mutations and serve as training sets or testing benchmarks for the development of in silico prediction methods. Cancer genome sequencing projects have revealed vast numbers of somatic missense mutations in protein coding regions. The Cancer Genome Atlas (TCGA) was jointly supervised by the National Cancer Institute (NCI) and the National Human Genome Research Institute (NHGRI). COSMIC is the world's largest repository of somatic cancer mutations [18]. It includes mutations not only from patients' whole-genome and whole-exome sequencing projects but also from cancer cell lines, which offers a comprehensive resource for exploring the impact of somatic mutations in human cancer. However, not all cancer mutations provide a selective growth advantage to cancer cells. Large efforts dedicated to the detection of cancer driver mutations have yielded significant improvements in precision cancer medicine. In connection with this, several databases of cancer alterations were subsequently developed [22][23][24]. The Database of Curated Mutations (DoCM) is a public repository of disease-causing somatic cancer mutations with established relevance to cancer biology, comprehensively curated from the literature [25]. DoCM v3.2 includes 1364 variants and 1276 missense mutations from 122 cancer subtypes, enabling the cancer research community to aggregate, store, and track biologically important cancer variants that are essential for clinical annotation.
Clinical Interpretations of Variants in Cancer (CIViC) is a community-edited web resource for discovering clinical interpretations of variants in cancer [26], which provides an educational forum for the dissemination of knowledge and active discussion of the clinical significance of cancer genome alterations. As of 22 February 2018, CIViC included 1767 variants, enabling precision medicine in cancer treatment. It should be mentioned that all these databases largely overlap in terms of their entries and may contain predictions as well as experimental validations mostly reporting potential driver mutations with a consistent lack of cancer somatic neutral variants.
In summary, the aforementioned data resources (Table 1) provide a variety of data for systematically exploring genomic, epigenomic and transcriptomic characteristics of tumor samples. These data not only allow for, but also call for, the development of methods and tools that can efficiently detect cancer-related mutations and genes.

3D Spatial Distributions of Cancer Missense Mutations
Three-dimensional (3D) structures of proteins and their complexes can provide crucial information for identifying cancer-driving mutations. Thus, servers or databases for exploring and building the relationship between cancer-related missense mutations and structures are useful for deciphering the biological consequences of these mutations (see Table 2). NCBI resources provide different platforms to map and analyze single nucleotide polymorphisms (SNPs) or cancer mutations with respect to protein structures (see the detailed description in [19]). In addition, the Cancer3D database helps users analyze the distribution of cancer somatic missense mutations from TCGA and CCLE (The Cancer Cell Line Encyclopedia project) in the context of 3D protein structures [29], allowing users to predict novel cancer drivers or drug biomarkers. dSysMap is a resource that maps disease/cancer-related mutations obtained from UniProt onto protein structures and interactions in the human interactome [30]. This resource helps in rationalizing the mechanism of action of these mutations by putting them in a systemic context. The StructMAn server provides annotation of human and non-human non-synonymous single-nucleotide variants in a structural context [31]. It analyzes the spatial location of mutated sites in protein 3D structures relative to binding partners such as other proteins, nucleic acids or low molecular-weight ligands. This tool provides structural context for up to 60% of non-synonymous single-nucleotide variations (nsSNVs) in genes related to human diseases by searching for all structures of the corresponding proteins and their homologs.
Table 2. A summary of online and free software resources for analyzing 3D spatial distribution of cancer missense mutations, predicting the effects of mutations on protein stability, protein-protein and protein-nucleic acid binding affinity. All resources need structure as an input except those with "*".
Several methods have been recently developed for identifying cancer drivers using protein 3D structure information. Hotspot regions have been shown to have biological relevance in cancer. Kamburov et al. proposed a method to detect cancer genes using significant 3D clustering of mutations in the corresponding protein structure [57]. They applied this approach to pan-cancer somatic mutations from thousands of tumors falling within 18,356 proteins, among them 5140 human proteins with known 3D structures (51,980 3D structures). Eight well-established oncogenes (PIK3CA, PTPN11, BRAF and HRAS) and tumor suppressors (PTEN, TP53, FBXW7 and CDKN2A) with significant 3D clustering of missense mutations were detected. They concluded that systematic consideration of 3D structure can aid in identifying cancer genes and in understanding the functional role of their mutations. For example, mutations that cluster at protein-protein interfaces may disturb key molecular interactions and functions. Tokheim et al. also presented a novel and stringent algorithm using 3D protein structures to detect missense mutation hotspot regions in human cancer [58], enabling the discovery of hotspot regions in more genes. In addition to experimentally determined protein structures, they also considered high-quality structural models, so the genomic coverage increased from 5000 to more than 15,000 genes. This study can help cancer researchers investigate the biological functions of cancer somatic missense mutations by linking them to the corresponding 3D protein structures. For example, the identified hotspot region in RAC1 overlaps with the binding site. The region contains a melanoma mutation that has been identified as dysregulating RAC1 through a fast-cycling mechanism [59]. A computational tool, HotSpot3D, was developed by Niu et al. to identify protein 3D spatial hotspots (clusters) and to interpret the potential function of variants within them [60]. They applied HotSpot3D to more than 4000 TCGA tumors across 19 cancer types and discovered more than 6000 intra- and intermolecular clusters. In addition, they identified 369 rare mutations and 99 medium-recurrent mutations, all residing within clusters having potential functional implications. Furthermore, the predictions were validated in EGFR using high-throughput phosphorylation data and cell-line-based experimental evaluation. Their mutation-drug cluster and network analysis predicted over 800 promising candidates for druggable mutations, providing new possibilities for designing personalized treatments.
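The general idea behind 3D clustering tests can be illustrated with a minimal sketch. It is not the algorithm of Kamburov et al., Tokheim et al. or HotSpot3D; it only assumes that each residue has one representative coordinate (for example its C-alpha atom), scores a set of mutated residues by the number of residue pairs closer than a distance cutoff, and compares that score with random placements of the same number of mutations. The function names, the 8 angstrom cutoff and the toy coordinates are all hypothetical choices made for illustration.

```python
import math
import random
from itertools import combinations


def close_pair_score(coords, mutated, cutoff=8.0):
    """Count pairs of mutated residues whose coordinates lie within `cutoff` angstroms."""
    score = 0
    for i, j in combinations(mutated, 2):
        if math.dist(coords[i], coords[j]) <= cutoff:
            score += 1
    return score


def clustering_p_value(coords, mutated, cutoff=8.0, n_perm=2000, seed=0):
    """Empirical p-value: share of random placements scoring at least as high as observed."""
    rng = random.Random(seed)
    observed = close_pair_score(coords, mutated, cutoff)
    residues = list(coords)
    hits = 0
    for _ in range(n_perm):
        # Place the same number of mutations on randomly chosen distinct residues.
        shuffled = rng.sample(residues, len(mutated))
        if close_pair_score(coords, shuffled, cutoff) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return observed, (hits + 1) / (n_perm + 1)


# Toy input (hypothetical): 30 residues on a straight line, 3.8 angstroms apart.
toy_coords = {i: (3.8 * i, 0.0, 0.0) for i in range(1, 31)}
observed, p_value = clustering_p_value(toy_coords, [10, 11, 12])
print(observed, p_value)
```

A uniform null model such as this one ignores mutation recurrence, sequence context and solvent exposure, all of which published methods treat more carefully.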

Assessing Changes in Protein Conformation Induced by Mutations
Mutations can alter macromolecular conformational dynamics [61]. Such changes, especially for proteins whose function is activated by conformational changes, can cause disease [62][63][64]. The effects of cancer mutations on oncogene conformations and functions have been studied extensively, both experimentally and computationally. L858R is an activating mutation in EGFR that is found in a large fraction of cancer patients. The mutant protein shows up to a 50-fold increase in activity compared to the wild type [65]. According to the proposed mechanisms, the L858R mutation can lock the kinase in the active state by preventing formation of the inactive-state helical conformation [65] and/or reduce the intrinsic disorder content, favoring dimerization and stabilization of the active conformation [66]. Extensive molecular dynamics simulations with enhanced sampling demonstrated that L858R stabilizes the active conformation of EGFR more than the inactive conformation and rigidifies the αC-helix. Interestingly, the L858R and T790M double mutants exhibit significant positive epistasis [67,68].
Proteins may adopt different conformations during a biochemical reaction, and their intrinsic flexibility and ability to assume alternative conformations are crucial for protein function. Mutations might shift the equilibrium between different conformations and, as a result, the conformation of a mutated protein can differ in structure, stability and functional activity from the wild-type conformation. It is extremely difficult to model structural changes in a protein backbone produced by mutations. In fact, most algorithms discussed in the previous sections do not account for backbone flexibility. If several conformations are available in the structural databank for the same protein, ideally all of them should be used to provide a complete picture of dynamic and energetic mutational effects.
All-atom molecular dynamics (MD) simulation is a commonly used approach to study the conformational dynamics of biological macromolecules [69][70][71]. Using MD, one can simulate changes in conformations and hydrogen-bond networks [72][73][74][75][76][77]. Atomistic molecular dynamics simulation is based on Newton's equations of motion, in which the force on each atom is the negative gradient of the potential energy with respect to that atom's position. The potential energy of the system is estimated from a set of empirical parameters and equations, called a force field. The output of an MD simulation can be used to compute physical observables for a system, such as distances between atoms or residues, changes in hydrogen-bond networks, or secondary structures. The accuracy of an MD simulation largely depends on the quality of the starting 3D structures of the biomolecules. Existing molecular dynamics packages and force fields have been rather successful in revealing these changes for mutations that do not induce dramatic structural alterations. The most widely used packages are NAMD [78], CHARMM [79] and Amber [80]. NAMD, for example, is fast and easy to use, and it can be applied in conjunction with the CHARMM or Amber force fields.
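To make the integration step concrete, the following minimal sketch propagates a single particle with the velocity Verlet scheme, a standard integrator for Newton's equations of motion. A one-dimensional harmonic potential stands in for a force field here; this is an assumption made purely for illustration, and real packages such as NAMD evaluate far richer energy functions over many atoms. All function names and parameter values are hypothetical.

```python
def harmonic_energy(x, k=1.0, x0=0.0):
    """Toy potential energy U(x) = 0.5 * k * (x - x0)^2."""
    return 0.5 * k * (x - x0) ** 2


def harmonic_force(x, k=1.0, x0=0.0):
    """Force is the negative derivative of the potential: F = -dU/dx."""
    return -k * (x - x0)


def velocity_verlet(x, v, mass, dt, n_steps, force=harmonic_force):
    """Integrate Newton's equation m * a = F(x) and return the (x, v) trajectory."""
    trajectory = [(x, v)]
    f = force(x)
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass   # half-step velocity update
        x = x + dt * v_half                # full-step position update
        f = force(x)                       # force at the new position
        v = v_half + 0.5 * dt * f / mass   # second half-step velocity update
        trajectory.append((x, v))
    return trajectory


# Start displaced from equilibrium and at rest; units are arbitrary.
trajectory = velocity_verlet(x=1.0, v=0.0, mass=1.0, dt=0.01, n_steps=1000)
x_end, v_end = trajectory[-1]
print(x_end, v_end, harmonic_energy(x_end) + 0.5 * v_end ** 2)
```

Observables such as interatomic distances are then computed from the stored trajectory, and the near-constant total energy gives a quick sanity check on the chosen time step.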
Mutations can either change the global conformation of an entire molecule or have a more localized effect. With respect to oncogenic mutations, for example, MD simulations and energy calculations were performed to study the effects of several mutations from the same DNA-binding loop of the NFAT5 transcription factor [81]. The results illustrated that the effects of these mutations on protein conformations and binding with DNA were drastically different, although all mutations were located very close to each other in both sequence and structure. In particular, a phosphomimetic mutation, T222D, made the overall complex very rigid, whereas other mutations increased its flexibility. Demir et al. studied a variety of missense mutants by measuring their functional activity and thermodynamic stability [82]. In parallel, they performed molecular dynamics simulations for each mutant and calculated the number of distinct conformations in the dynamic landscape as a global measure of protein flexibility. They found that the number of individual protein conformations obtained from a simulation trajectory correlated well with thermodynamic stability and protein functional activity, indicating that mutations can lead to protein loss-of-function by increasing protein flexibility.
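One simple way to count distinct conformations in a trajectory is greedy "leader" clustering with an RMSD cutoff, sketched below. This is not necessarily the procedure used by Demir et al.; it assumes that frames are already superimposed on a common reference, and the cutoff and the two-atom toy frames are hypothetical.

```python
import math


def rmsd(frame_a, frame_b):
    """Root-mean-square deviation between two pre-aligned frames of (x, y, z) tuples."""
    squared = sum(
        (p - q) ** 2
        for atom_a, atom_b in zip(frame_a, frame_b)
        for p, q in zip(atom_a, atom_b)
    )
    return math.sqrt(squared / len(frame_a))


def count_conformations(frames, cutoff):
    """Greedy leader clustering: a frame starts a new cluster if it is
    farther than `cutoff` from every existing cluster representative."""
    leaders = []
    for frame in frames:
        if all(rmsd(frame, leader) > cutoff for leader in leaders):
            leaders.append(frame)
    return len(leaders)


def toy_frame(dy):
    """Two-atom toy frame whose second atom is displaced by `dy` along y."""
    return [(0.0, 0.0, 0.0), (1.0, dy, 0.0)]


rigid = [toy_frame(dy) for dy in (0.0, 0.1, 0.05, 0.0, 0.1)]
flexible = [toy_frame(dy) for dy in (0.0, 2.0, 4.0, 0.1, 2.1)]
print(count_conformations(rigid, cutoff=0.5), count_conformations(flexible, cutoff=0.5))
```

The count depends on the cutoff and on the frame order, so in practice it is only meaningful when the same settings are used for every mutant being compared.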

Estimating the Effects of Mutations on Protein Stability
One can considerably decrease the number of potential cancer driver mutation candidates by determining the functional impact of each mutation on its corresponding protein. Protein stability may directly relate to functional activity, and changes in stability or incorrect folding could be major consequences of pathogenic missense mutations. It was previously shown that missense mutations destabilize tumor suppressors significantly more than SNPs do, but the same effect was not observed for oncogenes [83]. In most cases, missense mutations are deleterious because they decrease the stability of the corresponding protein [67,84]. For example, oncogenic mutations disrupt Casitas B-lineage lymphoma (CBL) function by decreasing the stability of CBL proteins [85]. Six mutations in the tumor suppressor gene phosphatase and tensin homolog (PTEN) found in patients with cancer associated with PTEN hamartoma tumor syndrome (PHTS) show a global decrease in structural stability and increased dynamics across the domain interface [86]. In other cases, missense mutations may cause disease by enhancing the stability of the corresponding protein [87].
Computational methods that accurately predict the effects of variations on protein stability may help to identify functionally important mutations. Typically, the magnitude of mutational effects on stability is quantified by the change in unfolding free energy, ∆∆G_fold. The ProTherm database is a collection of thermodynamic parameters for wild-type and mutant proteins [27]. It includes unfolding Gibbs free energy, enthalpy and heat capacity changes, among other quantities, which provide important clues for understanding the relationship among structure, stability and function of proteins and their mutants. This database also contains information on the experimental conditions and methods used for measuring these data, and it is frequently used as a training set for developing the in silico prediction methods described below (Table 1). Table 2 lists major computational approaches and tools for predicting quantitative changes in unfolding free energy in response to mutations. They differ in the algorithms used for training models, the procedures used for optimization and sampling of protein conformations, and the terms of their energy functions. These terms range from physics-based force fields to knowledge-based potentials that combine different structure-based or sequence-based physicochemical properties of amino acids. In addition, some methods take into account experimental conditions, such as salt concentration, pH and temperature, which are important for assessing the free energy at near-physiological conditions. For example, FoldX uses an empirical force field to evaluate the effects of mutations on stability, folding and dynamics in proteins and DNA [37]. One of the core functionalities of FoldX is the calculation of the unfolding free energy of a macromolecule based on its 3D structure. Its energy function is parametrized on experimental changes in unfolding free energy. FoldX is distributed as a software package that runs easily on Linux and allows users to process large datasets. It has become a standard tool for predicting the effects of both single and multiple mutations on protein stability. SAAFEC is an approach that uses weighted MM-PBSA (Molecular Mechanics - Poisson-Boltzmann Surface Area) methods and various biophysical terms parametrized on thousands of experimental values [38]. Its energy terms are calculated using minimized wild-type and mutant structures. In particular, SAAFEC can add residues that are missing from the 3D structures.
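For orientation, the quantity being predicted can be written out explicitly. The sign convention below is an assumption chosen for illustration, since the text above does not fix one: the unfolding free energy is taken as the free energy of the unfolded state minus that of the folded state, so a stable protein has a positive value.

```latex
\Delta G_{\mathrm{fold}} = G_{\mathrm{unfolded}} - G_{\mathrm{folded}},
\qquad
\Delta\Delta G_{\mathrm{fold}}
  = \Delta G_{\mathrm{fold}}^{\mathrm{mut}} - \Delta G_{\mathrm{fold}}^{\mathrm{wt}}
```

With hypothetical values of 5.0 kcal/mol for the wild type and 3.5 kcal/mol for the mutant, this gives a change of -1.5 kcal/mol, which is a destabilizing mutation under this convention. Some tools report the opposite sign (mutant minus wild type for the folding rather than the unfolding free energy), so the convention should be checked before predictions from different methods are compared.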
The majority of the above-mentioned methods require coordinates of protein 3D structures as input. Prediction accuracy can be influenced by different factors, including protein class and structural flexibility, the types of the wild-type and substituted amino acids, and the structural environment of the substituted site. The performance of these predictors was assessed and compared in different studies using datasets of experimentally characterized mutants [88][89][90][91][92]. In the first study [92], the performance of six different methods was evaluated on a large set of 2156 single mutations, and the mutations used for training each model were excluded. The following performance ranking was reported: EGAD > CC/PBSA > I-Mutant2.0 > FoldX > Hunter > Rosetta, with correlation coefficients between predicted and experimental ∆∆G values ranging from 0.59 down to 0.26 and standard deviations ranging from 0.95 to 2.32 kcal mol⁻¹. However, the servers with the top performances, EGAD and CC/PBSA, are no longer available. In the second study [91], 11 online stability predictors (CUPSAT, Dmutant, FoldX, I-Mutant2.0, two versions of I-Mutant3.0 (sequence and structure versions), MultiMutate, MUpro, SCide, Scpred, and SRide) were compared by performing a systematic analysis on 1784 single mutations, excluding those used for training each program. I-Mutant3.0, Dmutant, and FoldX were found to be the most reliable predictors. Furthermore, Kepp evaluated the relative performance of these methods by calculating the stability changes of SOD1 and myoglobin variants [89,90]. Five methods, CUPSAT, I-Mutant2.0, I-Mutant3.0, PoPMuSiC and SDM, were tested on 54 SOD1 mutations. The results showed that PoPMuSiC was the most accurate approach, with a correlation coefficient R ≈ 0.5 and a mean absolute error (MAE) ≈ 1.0 kcal mol⁻¹, followed by I-Mutant. Kumar et al. extended this study of SOD1 stability changes upon mutation using three different structures and four additional protein stability predictors (PoPMuSiC 3.1, FoldX, mCSM and ENCoM) [88]. Overall, PoPMuSiC and FoldX were shown to be the best methods.
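The benchmark statistics quoted above (correlation coefficient, mean absolute error and root-mean-square error between predicted and experimental values) can be computed as in the following sketch. The function names are hypothetical and the numbers are made-up toy values, not data from any of the cited studies.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


def mean_absolute_error(predicted, experimental):
    """Average absolute deviation, in the units of the inputs (e.g. kcal/mol)."""
    return sum(abs(p - e) for p, e in zip(predicted, experimental)) / len(predicted)


def root_mean_square_error(predicted, experimental):
    """Square root of the mean squared deviation."""
    squared = sum((p - e) ** 2 for p, e in zip(predicted, experimental))
    return math.sqrt(squared / len(predicted))


# Made-up toy values standing in for predicted and measured stability changes.
predicted = [0.4, -1.2, -2.5, 0.1, -0.8]
experimental = [0.2, -1.9, -1.7, 0.5, -1.1]
print(pearson_r(predicted, experimental))
print(mean_absolute_error(predicted, experimental))
print(root_mean_square_error(predicted, experimental))
```

A high correlation can coexist with a large absolute error when predictions are systematically scaled or shifted, which is one reason the cited studies report both kinds of measure.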

Estimating Quantitative Effects of Mutations on Protein-Protein or Protein-Nucleic Acid Interactions
A protein's ability to establish highly selective interactions with macromolecular partners is a crucial prerequisite for proper biological function. A missense mutation affecting protein interactions [93][94][95] may cause significant perturbations or complete abolition of protein function, potentially leading to disease. The binding free energy change, ∆∆G_bind, quantifies the magnitude of mutational effects on protein-protein or protein-nucleic acid interactions. The SKEMPI database (Table 1) [28] includes experimentally measured changes in thermodynamic parameters for binding affinity and in kinetic rate constants upon single and multiple amino acid substitutions for protein-protein interactions with experimentally determined heterodimeric complex structures. It was derived from the scientific literature and contains binding free energy, enthalpy and rate constant changes in response to mutations. The ProNIT database [27] is a collection of experimentally determined thermodynamic interaction parameters between proteins and nucleic acids, including binding constants and changes in free energy, enthalpy and heat capacity, with experimentally determined complex structures. These two databases were used for training and benchmarking the prediction methods described below. Table 2 lists several methods to estimate ∆∆G_bind values. These methods require all-atom or at least protein backbone atom coordinates of the wild-type structure. BeAtMuSiC is a coarse-grained predictor of binding affinity changes in response to point mutations that uses different statistical potentials trained on known protein structures [49]. The BeAtMuSiC server provides an option for rapidly calculating the binding affinity changes for all possible mutations in a protein chain, although it does not build a model of the mutant structure. MutaBind is a web-based method for evaluating the effects of sequence variants and disease mutations on protein-protein interactions [48]. The MutaBind method relies on a combination of molecular mechanics force fields, statistical potentials and fast side-chain optimization algorithms. It can map mutations on a protein complex structure, calculate the associated changes in binding affinity, determine the deleterious effects of a mutation, estimate the confidence of this prediction and produce a mutant structure model for download. MutaBind was compared with BeAtMuSiC and FoldX on two independent test sets, and the results showed that MutaBind performs better than the other methods, as evidenced by the values of the correlation coefficients and root-mean-square errors. The MutaBind server was applied to estimate the putative changes in binding affinity of Spalax p53 interactions with other DNA damage response (DDR) proteins [96]. The calculated results supported the possibility that Spalax's stress-related substitutions in transactivation domain 2 (TAD2) decrease the binding affinity of p53 to other DDR proteins as compared to humans. Another similar method, SAAMBE, is based on modified MM/PBSA-based components along with a set of statistical terms derived from physico-chemical properties of protein complexes [50,51].
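Experimental binding data are often reported as dissociation constants, which relate to the binding free energy through a standard thermodynamic identity. The worked relation below is added for orientation; the mutant-minus-wild-type sign convention and the standard concentration written as c with a degree superscript are stated here as assumptions, since individual databases and tools may define their values differently.

```latex
\Delta G_{\mathrm{bind}} = RT \ln\!\left(\frac{K_d}{c^{\circ}}\right),
\qquad
\Delta\Delta G_{\mathrm{bind}}
  = \Delta G_{\mathrm{bind}}^{\mathrm{mut}} - \Delta G_{\mathrm{bind}}^{\mathrm{wt}}
  = RT \ln\!\left(\frac{K_d^{\mathrm{mut}}}{K_d^{\mathrm{wt}}}\right)
```

At 298 K, RT is about 0.593 kcal/mol, so a hypothetical mutation that raises the dissociation constant tenfold corresponds to a change of about 0.593 x ln 10, or roughly 1.36 kcal/mol, which is a weakening of binding under this convention.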
Protein-protein interactions can be modulated by small-molecule drugs and biologics, such as peptides and antibodies. They are often considered druggable targets in anticancer therapy. As coverage of protein families with structural protein-protein interactions remains limited [97], integrative studies identifying key interactions in cancer pathways using protein structural similarity and homology to infer potential drug-protein interactions represent a promising data-driven strategy [98][99][100]. It is indeed instrumental and essential to have information about the locations of binding site residues on protein-protein interfaces, as well as binding specificity of interfaces with respect to interaction partners.
There are very few methods available for predicting the effects of mutations on protein-nucleic acid interactions. mCSM-NA, for example, performs this task by relying on graph-based signatures that encode distance patterns between atoms [55]. mCSM-NA was trained on the entire ProNIT database and did not consider some special cases, such as a mismatch between the nucleic acid sequence used in measuring binding affinity changes experimentally and the sequence present in the 3D protein-nucleic acid structure used for developing the model. Another method, SAMPDI, uses a combination of modified MM/PBSA-based energy and knowledge-based terms to predict changes in binding affinity in response to mutations, in particular for protein-DNA complexes [56]. SAMPDI was benchmarked against purged experimental data on protein-DNA interactions from the latest ProNIT database and data from recent references. Compared with mCSM-NA, SAMPDI provides the relative contribution of each energy term and additional structural information. For the majority of these methods, rational choices of structure optimization protocols, energy terms and solvation models are determinants for achieving reasonable prediction accuracy. Moreover, prediction accuracy depends on the mutation type and its location in a protein-protein or protein-nucleic acid complex [101]. For example, interfacial mutations exhibit larger effects on protein-protein or protein-nucleic acid interactions than non-interfacial mutations [48,93,101,102]. Although available methods for structure modeling and analysis, energy calculations, assessment of conformational dynamics and functional annotations still need considerable improvement, they can provide meaningful results if they are applied correctly to the problems they aim to solve [62,67,84,103-118].
Herein, we present an example of a detailed analysis of the molecular mechanisms by which cancer mutations affect the activity of the Casitas B-lineage lymphoma (CBL) protein [119]. The Cbl RING finger ubiquitin ligase (E3) plays both positive and negative regulatory roles in tyrosine kinase signaling and is aberrantly activated in many cancers. Oncogenic mutations in the CBL gene have been found in many tumors [85], but the mechanistic significance of these mutations and their impacts on CBL function were largely unknown [85,120]. Four CBL structures have been solved, representing snapshots of different stages of the CBL activation cycle. Computational modeling was applied to all four stages. First, cancer-related missense mutations in the CBL gene were extracted from the COSMIC database and mapped to all four CBL and CBL-E2 complex structures. All possible single-nucleotide substitutions resulting in amino acid changes in the CBL gene were generated as a reference set. Second, wild-type and mutant structures were optimized using a previously developed optimization protocol [101], which was run with the NAMD program using the CHARMM27 force field [121]. Third, the unfolding free energy changes in response to mutations were calculated using the optimization procedure implemented in the FoldX program. Fourth, the binding free energy changes were calculated according to the previously introduced approach [101]. Finally, in vivo experiments on CBL-mediated EGFR ubiquitination were performed for 15 mutations in three human cell lines. The results indicated that computational approaches incorporating multiple protein conformations, stability, and binding affinity evaluations can successfully predict the magnitude of the effects of mutations and further help in understanding their mechanisms of action.
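The reference set described in the first step above, namely all amino acid changes reachable by a single-nucleotide substitution, can be illustrated with a minimal sketch. This is not the code used in the cited study: the function name missense_reference_set is hypothetical, the input is assumed to be an in-frame coding DNA sequence, the standard genetic code is assumed, and substitutions that create a stop codon are excluded because they are nonsense rather than missense changes.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code listed in TCAG codon order; '*' marks stop codons.
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def missense_reference_set(cds):
    """Return sorted, unique missense changes (e.g. 'M1V') reachable by one SNV.

    `cds` is assumed to be an in-frame coding sequence (illustrative input).
    """
    cds = cds.upper()
    if len(cds) % 3:
        raise ValueError("coding sequence length must be a multiple of 3")
    changes = set()
    for start in range(0, len(cds), 3):
        codon = cds[start:start + 3]
        wild_type = CODON_TABLE[codon]
        if wild_type == "*":
            continue  # skip stop codons
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue
                mutant_codon = codon[:pos] + base + codon[pos + 1:]
                mutant = CODON_TABLE[mutant_codon]
                # Keep only amino acid changes; drop synonymous and nonsense.
                if mutant != wild_type and mutant != "*":
                    changes.add((start // 3 + 1, wild_type, mutant))
    return [f"{wt}{resnum}{mt}" for resnum, wt, mt in sorted(changes)]


if __name__ == "__main__":
    print(missense_reference_set("ATGTGG"))
```

Observed cancer mutations can then be compared against this set, since amino acid changes that would require two or three nucleotide substitutions in one codon never appear in it.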

Assessing Driver Status of Cancer Mutations
Many methods and tools have been developed over the past several years for predicting the functional impact of missense mutations [9,122], such as MutationAssessor [123] and PROVEAN [124]. These methods utilize a variety of features that describe the properties of a mutation in terms of evolutionary conservation, physicochemical attributes, or sequence context. Among them, several approaches are specifically designed for cancer missense mutations. The functional analysis through hidden Markov models (FATHMM)-cancer [125] is an algorithm that predicts the potential functional impact of cancer missense mutations. It uses cancer-associated mutations (germline and somatic) from the CanProVar database [126] and putative neutral polymorphisms from the UniProt database as the training set and features of conservation and epigenomic signals. CHASM [127] is another approach for prioritizing cancer driver mutations, based on a random forest classifier [128,129] that was trained on 49 predictive features. The training set used for developing CHASM includes missense mutations from the COSMIC database and from breast, colorectal, and pancreatic tumor resequencing studies [8,130–132]. Passenger mutations were synthesized by sampling from eight multinomial distributions that depend on dinucleotide context and tumor type. CHASM, like other approaches of this kind, focuses on the properties of individual mutations and does not explicitly rely on the frequency at which mutations appear in a gene, so it can potentially detect driver mutations occurring at low frequencies.
In addition, CHASM is trained in a cancer-type-specific fashion and can be adapted to different cancer types. CanDrA [133] is a weighted support vector machine (SVM)-based tool for prioritizing somatic missense mutations by incorporating 95 structural and evolutionary features generated by over 10 functional prediction algorithms. The driver and passenger mutations used to train the model were selected based on their observed frequency and were taken from glioblastoma multiforme and ovarian carcinoma patients from COSMIC. The authors precomputed CanDrA scores for almost all possible missense mutations across the whole genome, allowing users to obtain predictions very efficiently.
A recent systematic study compared 15 such methods, including FATHMM-cancer, CHASM and CanDrA introduced here, on 849 non-neutral and 140 neutral mutations affecting 15 cancer genes. Cancer-specific mutation effect predictors displayed agreement ranging from none to almost perfect in their predictions for these SNVs, and none of them was yet sufficiently reliable to guide high-cost experimental or clinical follow-through [134]. ParsSNP [135] is an unsupervised functional impact predictor that uses an innovative, parsimony-based approach to prioritize cancer driver mutations. ParsSNP does not use predefined training labels, which can introduce biases, but rather utilizes an expectation-maximization framework to find mutations that explain tumor incidence, so it can be applied to problems that lack sufficient training samples for supervised methods. In particular, ParsSNP can identify truncation events in tumor suppressors, while methods like CHASM and CanDrA are designed to work only with missense mutations. In their study, ParsSNP was reported to outperform the existing tools (CanDrA, CHASM and FATHMM-cancer) across five distinct benchmarks. In addition, the authors applied ParsSNP to an independent dataset of 30 patients with diffuse-type cancer, and ParsSNP identified many known and likely driver mutations that other methods did not detect.
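Pairwise agreement between predictors, as discussed for the comparison of 15 methods above, is commonly summarized with a chance-corrected statistic. The sketch below assumes Cohen's kappa as that statistic, which the text here does not confirm for the cited study; the function name and the driver/neutral labels ("D", "N") are illustrative only.

```python
from collections import Counter


def cohen_kappa(calls_a, calls_b):
    """Chance-corrected agreement between two predictors' categorical calls."""
    if len(calls_a) != len(calls_b) or not calls_a:
        raise ValueError("both predictors must score the same non-empty set")
    n = len(calls_a)
    # Fraction of mutations on which the two predictors make the same call.
    observed = sum(a == b for a, b in zip(calls_a, calls_b)) / n
    # Agreement expected by chance from each predictor's label frequencies.
    freq_a, freq_b = Counter(calls_a), Counter(calls_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / n ** 2
    if expected == 1.0:
        return 1.0  # both predictors always emit the same single label
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    predictor_1 = ["D", "D", "N", "N"]  # hypothetical calls, D = driver
    predictor_2 = ["D", "N", "N", "N"]
    print(cohen_kappa(predictor_1, predictor_2))
```

A value near 0 indicates agreement no better than chance and a value near 1 indicates almost perfect agreement, which is why two tools with similar overall accuracy can still disagree on which individual mutations they flag.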
DNA context-dependent mutability is an important factor affecting the frequencies at which cancer mutations recur in tumor samples [136]. Therefore, it is necessary to integrate context-dependent mutability into cancer-specific mutational models. For this purpose, the MutaGene server (https://www.ncbi.nlm.nih.gov/research/mutagene/) provides tools for analyzing the expected mutability of mutations in cancer-specific and pan-cancer cases, and for ranking mutations and predicting whether they are drivers or passengers [137,138].
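The idea of context-dependent mutability can be illustrated with a minimal sketch that looks up the expected mutability of a substitution from its trinucleotide context. This is not the MutaGene implementation or its API: the function names are hypothetical, the profile values are made-up placeholders, and the pyrimidine-centered trinucleotide representation is an assumed convention. How expected mutability is combined with observed recurrence to rank mutations is not shown.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def normalize(context, alt):
    """Fold a trinucleotide change onto the strand with C or T in the centre."""
    if context[1] in "AG":
        # Reverse-complement the context and complement the alternate base.
        context = context.translate(COMPLEMENT)[::-1]
        alt = alt.translate(COMPLEMENT)
    return context, alt


def expected_mutability(seq, index, alt, profile):
    """Look up the per-site probability of mutating seq[index] to alt.

    `profile` maps (pyrimidine-centred trinucleotide, alt base) to a
    probability; a context missing from the profile yields 0.0.
    """
    if not 0 < index < len(seq) - 1:
        raise ValueError("site needs a flanking base on each side")
    context, alt = normalize(seq[index - 1:index + 2].upper(), alt.upper())
    return profile.get((context, alt), 0.0)


if __name__ == "__main__":
    # Placeholder values for illustration only, not a real mutational profile.
    toy_profile = {("TCG", "T"): 0.02, ("ACA", "A"): 0.001}
    print(expected_mutability("ATCGA", 2, "T", toy_profile))  # C>T in TCG
    print(expected_mutability("CGA", 1, "A", toy_profile))    # same change, other strand
```

Under this kind of model, a mutation that recurs in many tumors despite low expected mutability is a stronger driver candidate than one that recurs at a highly mutable site.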
This review attempts to outline the current development of computational approaches for prioritizing cancer driver missense mutations using various biophysical characteristics, including stability, binding affinity, and conformational dynamics. These biophysics-based approaches have been shown to identify functionally important missense mutations and to facilitate understanding of the molecular mechanisms underlying their effects in human cancer. In addition, we present a collection and introduction of the most comprehensive databases, ranging from those that store different types of sequencing data on cancer somatic missense mutations to highly curated, literature-derived databases with established relevance to cancer biology and clinical annotation. It is important to emphasize that these approaches have limited capacity to identify driver mutations for tumor development directly, primarily because very few mutations have been validated as causative. Rather, they are able to prioritize candidates for follow-up experiments that may illustrate the actual physiological relevance of these mutations in cancer.