Robust Prediction of Single and Multiple Point Protein Mutations Stability Changes

Accurate prediction of protein stability changes resulting from amino acid substitutions is of utmost importance in medicine, since it helps to distinguish deleterious, disease-causing mutations from neutral ones. Because wet lab experiments on protein mutations are costly and time consuming, and because the number of possible mutations is huge, computational methods that can accurately predict the effects of amino acid mutations are greatly needed. In this research, we present a robust methodology to predict the energy changes of a protein upon mutation. The proposed prediction scheme is based on a two-step algorithm: a Holdout Random Sampler followed by a neural network model for regression. The Holdout Random Sampler is utilized to analyze the energy change and the corresponding uncertainty, and to obtain a set of admissible energy changes, expressed as a cumulative distribution function. These values are then used to train a simple neural network model that predicts the energy changes. Results were blindly tested (validated) against experimental energy changes, giving Pearson correlation coefficients of 0.66 for Single Point Mutations and 0.77 for Multiple Point Mutations. These results support the effectiveness of our method, since it outperforms the majority of previous studies in this field.


Introduction
The amino acid sequence of a protein is the most important factor that determines its secondary and tertiary structure, dynamics and, ultimately, its function. Understanding the mechanisms that determine protein stability is one of the forefront challenges in proteomics and transcriptomics, since even a single amino acid substitution can be the cause of a devastating disease [1]. Experiments are utilized to engineer or design proteins with specific mutations to examine the effect of that specific amino acid substitution [2]. The effect of a mutation is assessed by ∆∆G, a measure of the change in free energy between the folded and unfolded states when a point mutation is present. This has been found to be an excellent indicator of whether a point mutation is favorable in terms of protein stability. A comprehensive database of experimentally obtained mutations and their associated free energy changes is available in the ProTherm database [3], the Thermodynamic Database for Proteins and Mutants, which contains more than 10,000 data entries on several thermodynamic parameters for wild type and mutant proteins. Each entry in ProTherm includes data for unfolding Gibbs free energy change, mutant. This sampler has been used earlier in different fields to assess the intrinsic uncertainty in the inverse problem and in various classification problems [40][41][42].
The energy distribution of the mutant, the observed change of energy, and information on the specific amino acid substitution and the residue position at which the substitution takes place are combined to train a simple Neural Network that predicts the effect of a mutation on a given protein (∆∆G). Results are later validated utilizing experimental data extracted from the ProTherm database.
The data set of proteins with two mutations in the ProTherm database is smaller than the set of proteins with a single mutation; however, it is sufficiently large to give us good insight into the uncertainty associated with the energy change caused by a mutation.
The advantage and robustness of our methodology are achieved via the use of a large number of Holdouts and through multiple trainings and validations of the Neural Network to compute the frequency histogram of accuracies. This approach leads to better prediction accuracies than most machine learning methods used in the past. Furthermore, our method achieves a robust performance in the prediction of the effect of single and multiple mutations while being fast enough to be run in a few minutes. Remarkably, our methodology is distinguished from others since our machine learning-based method does not require the computation of energy or other physical properties. Therefore, as in the work of Dehghanpoor et al. [28], it is possible to predict the effect of a mutation without the need for biophysical information: the method does not depend on hydrogen bond energies or van der Waals forces and, in general, no force field is needed. Our present study extends past work in this field by introducing the concepts of the distribution of mutation energy and its uncertainty and by the parametrization of a Neural Network in order to accurately predict the observed energy. Therefore, it represents a highly novel approach to the prediction of changes in protein stability due to mutations.

Protein Datasets
The protein dataset has been derived from the Thermodynamic Database for Proteins and Mutants (ProTherm), which contains approximately 14,500 data entries on several thermodynamic parameters for wild type and mutant proteins, along with detailed information on experimental methods and conditions. The thermodynamic data is linked to sequence and structural data in the Protein Data Bank (PDB), the Protein Information Resource (PIR) and SWISS-PROT [43].
In this paper, two different datasets are utilized. The databases utilized are randomly split for learning and testing for both the uncertainty analysis of mutant proteins and for learning and blind validation of the Neural Network. That way, the machine learning process conforms to the norms that are accepted in the AI community of independent training and generalization.
The databases have been derived from ProTherm according to the following criteria: (1) the change in protein free energy (∆∆G) has been measured experimentally and deposited in the ProTherm database; (2) the protein structure is known and has been deposited in the Protein Data Bank (PDB) [44]; (3) the data is limited to single point and double point mutations.
The Single Point Mutation Dataset is shown in Supplementary Material 1, and the Double Point Mutation Dataset in Supplementary Material 2.

Single Point Mutations
In this paper, we propose a protein mutation prediction scheme composed of two steps. Our methodology is based on the assumption that every amino-acid substitution has its own Cumulative Distribution Function (CDF) of changes of energy. Therefore, regardless of the protein in which that substitution takes place, the produced energy change can take on any value provided by CDF. In other words, CDF of changes of energy provides a set of admissible changes of energy that a given mutation may cause in a specific protein.
In this sense, the first step consists of sampling the uncertainty of the energy change landscape for a given amino acid substitution in order to obtain the overall distribution of energy changes for the mutant protein (also considered as the uncertainty). This initial step is carried out via the Holdout Random Sampler, which was proposed by Cernea et al. [42] in application to the problem of phenotype prediction. It is based on the bootstrapping technique. This algorithm quantifies the cumulative distribution function of the change of energy of a mutant protein on a validation dataset with the aid of the landscape of possible energy changes that a mutation causes, regardless of the protein and of the position of the mutated residue in the protein. This mutation energy landscape was previously extracted from a learning dataset [40,41].
The second step consists of predicting the energy change produced by the amino acid substitution with a Neural Network composed of one hidden layer and 10 neurons. The Neural Network is trained considering CDF of changes of energy of mutant proteins extracted from the Holdout Sampler, and, later on, validated by comparing with the real energy changes extracted from ProTherm database.
An illustrative general workflow of the protein mutation prediction scheme is shown in Figure 1. Further details of the algorithm are given in the following sections.
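The notion of a per-substitution CDF of energy changes can be illustrated with a minimal sketch. It assumes the data are available as (substitution, ∆∆G) pairs; the function names and the numerical values below are hypothetical and are not taken from ProTherm or from the authors' implementation.

```python
from bisect import bisect_right
from collections import defaultdict


def build_substitution_cdfs(records):
    """Group energy changes by substitution and sort them.

    `records` is an iterable of (substitution, ddg) pairs, e.g. ("A-S", -0.4).
    The sorted list of values is all that is needed to evaluate an empirical CDF.
    """
    grouped = defaultdict(list)
    for substitution, ddg in records:
        grouped[substitution].append(ddg)
    return {sub: sorted(values) for sub, values in grouped.items()}


def empirical_cdf(sorted_values, x):
    """Fraction of observed energy changes that are <= x."""
    return bisect_right(sorted_values, x) / len(sorted_values)


# Made-up illustrative values (kcal/mol), not experimental data.
records = [
    ("A-S", -1.2), ("A-S", -0.4), ("A-S", 0.3), ("A-S", -0.4),
    ("F-L", -2.0), ("F-L", -0.9),
]
cdfs = build_substitution_cdfs(records)
print(empirical_cdf(cdfs["A-S"], -0.4))  # 0.75
```

Every value between the smallest and largest observation of a substitution is "admissible" in this picture; the CDF only states how often each range has been observed.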

Multiple Point Mutations
Prediction of protein stability for Multiple Point Mutations follows a similar scheme as in the prediction of stability for single point mutations. It is also based on the fact that every amino acid substitution has its own CDF of energy changes. Consequently, if the single point amino acid distributions are taken and Holdouts of multiple point mutations are generated, it is possible to quantify the cumulative distribution function of energy changes for a mutant protein using a validation dataset with the aid of the landscape of possible energy changes that a mutation causes, regardless of protein, and the position of the mutated residue in the protein.

The CDF of changes of energy consists of a set of parameters that improve the accuracy of a Neural Network composed of one hidden layer with 10 neurons. The Neural Network is trained by considering the CDFs of changes of energy upon amino acid mutation extracted from the Holdout Sampler, and validated with the experimentally observed energy changes extracted from the ProTherm database.

The Holdout Sampler-Based Uncertainty Predictor
The purpose of our algorithm is to explore the landscape of energy changes for every mutation. The sampling of this landscape is later used to compute the energy landscape for a specific mutant protein. The simplest way of carrying out this task is by utilizing random data bags with different datasets for training, followed by a testing procedure. This is comparable to modifying the evidence of $c^{obs}$ with respect to the classifier/regressor $L^*$, since part of the samples used for blind testing/validation have not been used for training. Here, $L^*$ corresponds to the classifier/regression model utilized to predict the free energy changes or the free energy uncertainty, while $c^{obs}$ is the observed free energy change, coming from experiments. The problem of finding the uncertainty of protein mutation free energy changes relative to $L^*(g)$ can be interpreted as a generalized regression problem of the observed free energy change, $c^{obs}$, with respect to the predicted CDF.
This method is based on the statistical technique of bootstrapping, or random sampling with replacement [45], which is used to build confidence intervals for sample estimates and to estimate the sampling distribution of any statistic via a random sampler.
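As a reminder of how bootstrapping works in its plainest form, the sketch below resamples a list of energy changes with replacement and reports a percentile interval for the median. It is a generic textbook illustration, not the Holdout Random Sampler itself; the function name, the number of resamples and the example values are assumptions made for the illustration.

```python
import random
import statistics


def bootstrap_median_interval(values, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the median of `values`."""
    rng = random.Random(seed)
    medians = []
    for _ in range(n_resamples):
        # Sampling with replacement: each resample has the same size as the data.
        resample = [rng.choice(values) for _ in values]
        medians.append(statistics.median(resample))
    medians.sort()
    lower = medians[int((alpha / 2) * n_resamples)]
    upper = medians[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Made-up energy changes (kcal/mol) for a single substitution.
energy_changes = [-1.2, -0.4, 0.3, -0.4, -2.0, -0.9, 0.1]
print(bootstrap_median_interval(energy_changes))
```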
In prior works for other applications, this methodology was utilized to optimally sample the posterior distribution of the model parameters via the least squares fitting of different data bags [40,41]. In this case, the approach is similar; the idea is to sample the model parameters in order to obtain the effect of a mutation on the distribution of energy. The Holdout Sampler samples the CDF of the energy change of the mutant protein as follows:
Data bagging: We randomly divide the data set into 75/25 data bag holdouts, where 75% of the data is used for learning and 25% for testing/validation. In this case, 100 different bags were generated. For each holdout, let us consider an amino acid substitution, $m_i$, which is present in a set of proteins, $p_k$. That set of proteins has an associated free energy change, $E_{p_k}$, which is known; therefore, it is possible to obtain the distribution of energy for that specific amino acid substitution, $E_{p_k}/m_i$, that is, the energy changes $E_{p_k}$ conditioned on the substitution $m_i$. This energy distribution accounts for the different energy values a specific substitution may have depending on the external conditions (pH, temperature, pressure, etc.). It also accounts for how the substitution influences the protein structure in its surroundings.
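The data bagging step can be sketched as follows. The 75/25 proportion and the 100 bags come from the text; the function name, the use of index lists and the dataset size of 200 are assumptions made only for the example.

```python
import random


def make_holdouts(n_samples, n_bags=100, train_fraction=0.75, seed=0):
    """Return `n_bags` random (learning, validation) index splits."""
    rng = random.Random(seed)
    n_train = int(round(train_fraction * n_samples))
    bags = []
    for _ in range(n_bags):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # a fresh random permutation for every bag
        bags.append((indices[:n_train], indices[n_train:]))
    return bags


bags = make_holdouts(n_samples=200)
print(len(bags), len(bags[0][0]), len(bags[0][1]))  # 100 150 50
```

Within each bag, the substitution CDFs would be learned from the first index list only, so that the second list remains unseen during learning.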
Data testing: After completing the learning stage, where the amino acid substitution CDFs are obtained, a testing stage is carried out. Every change of energy in the mutant protein, $E_{cdf,MC,k}$, in the testing bag is predicted with a Monte Carlo algorithm, where the rejection or acceptance of a simulation is based on the previously computed CDF. The energy distribution computed with the Monte Carlo method is compared with the experimentally measured energy, and a residual $\Omega_k$ with weight $k_\Omega$ is computed to fit the observations.
Holdout selection: Once the CDFs of energy change for the mutant protein are computed in the training dataset $T_k$, the Holdout accuracy is computed and the best Holdout predictors are chosen to predict, afterwards, the CDFs of energy changes for the entire protein dataset. To compute the Holdout accuracy, we take the sign of the median predicted energy change, $S_{k,pred}$, and the sign of the observed energy in the training dataset, $S_{k,obs}$. The accuracy is defined as the fraction of the $n$ proteins whose predicted energy change sign coincides with the observed energy change sign:
$$Acc_{HD,i} = \frac{1}{n}\sum_{k=1}^{n} \begin{cases} 1, & S_{k,pred} = S_{k,obs} \\ 0, & S_{k,pred} \neq S_{k,obs} \end{cases}$$
The Holdouts that fulfill the condition $Acc_{HD,i} > 0.99 \cdot Acc_{HD,min}$ are selected to compute the cumulative distribution function of energy changes for the entire protein dataset.
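The sign-agreement accuracy and the selection rule can be written compactly. The sketch below assumes that each protein in a holdout is represented by a list of Monte Carlo energy samples; the function names and the numbers are hypothetical, and the threshold is taken literally from the condition stated in the text (0.99 times the minimum holdout accuracy).

```python
import statistics


def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def holdout_accuracy(predicted_samples, observed):
    """Fraction of proteins whose median predicted sign matches the observed sign.

    predicted_samples[k] is a list of sampled energy changes for protein k,
    observed[k] is the experimentally measured energy change.
    """
    hits = sum(
        1
        for samples, obs in zip(predicted_samples, observed)
        if sign(statistics.median(samples)) == sign(obs)
    )
    return hits / len(observed)


def select_holdouts(accuracies, factor=0.99):
    """Indices of holdouts whose accuracy exceeds factor * min(accuracies)."""
    threshold = factor * min(accuracies)
    return [i for i, acc in enumerate(accuracies) if acc > threshold]


# Made-up samples for three proteins in one holdout.
predicted = [[-1.0, -0.5, 0.2], [0.4, 0.9, 1.1], [-0.3, 0.1, 0.2]]
observed = [-0.8, 1.0, -0.2]
print(holdout_accuracy(predicted, observed))  # two of three signs agree
```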
Computing the distribution of protein energy changes: After selecting the best Holdouts that fulfill our threshold condition, the residual distribution, $\Omega_{k,cdf}$, and the weight, $k_\Omega$, are averaged over the holdouts. The energy change is then predicted with a Monte Carlo algorithm, where the rejection or acceptance of a simulation is based on the holdout-learned CDFs, and adjusted with the weight and residual according to the expression:
$$E_{cdf,pred,k} = E_{cdf,MC,k} + k_\Omega \cdot \Omega_{k,cdf}$$
This part of the algorithm is general, and the prediction of CDFs of energy changes for both Single and Multiple Point Mutations follows the same procedures and equations.
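One possible reading of the Monte Carlo step is a rejection sampler driven by the histogram of learned energy changes, followed by the correction with the weighted residual (sampled energy plus weight times residual). The sketch below is written under that assumption; the binning, the function names and the element-wise pairing of samples with residuals are illustrative choices, not details given in the text.

```python
import random


def rejection_sample(values, n_draws, n_bins=10, seed=0):
    """Draw energy changes whose histogram follows that of `values`."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("need at least two distinct energy values")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        counts[min(int((v - lo) / width), n_bins - 1)] += 1
    peak = max(counts)
    rng = random.Random(seed)
    draws = []
    while len(draws) < n_draws:
        candidate = rng.uniform(lo, hi)  # propose an energy change
        bin_index = min(int((candidate - lo) / width), n_bins - 1)
        # Accept with probability proportional to the learned frequency.
        if rng.random() * peak < counts[bin_index]:
            draws.append(candidate)
    return draws


def corrected_distribution(mc_draws, residual_draws, k_omega):
    """Apply E_pred = E_MC + k_omega * Omega element by element (assumed pairing)."""
    return [e + k_omega * w for e, w in zip(mc_draws, residual_draws)]


# Made-up learned energy changes (kcal/mol) for one substitution.
learned = [-2.0, -1.5, -1.2, -0.9, -0.4, -0.4, 0.1, 0.3]
draws = rejection_sample(learned, n_draws=50)
print(corrected_distribution([1.0, 2.0], [0.5, -0.5], k_omega=2.0))  # [2.0, 1.0]
```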

The Neural Network Based Predictor
Artificial neural networks are computing systems composed of simple processors whose layered, interconnected architecture resembles the structure of neurons in the brain. A neural network is capable of learning from data, so it can be trained to recognize patterns, classify data, or perform regressions [46]. A neural network divides the input data into layers of abstraction, and it can be trained over many input datasets to perform predictions. The performance of a neural network depends on the connectivity of individual neurons, their weights and the strengths of these connections. All these parameters are automatically adjusted and updated at every step during the training/learning process. This is carried out until the neural network performs its task (classification, regression, pattern recognition) with a high degree of accuracy for the training/learning dataset [47]. For this reason, neural networks are especially well suited to solve classification and regression problems.
The neural network presented in this paper combines one input layer, where 100 parameters corresponding to the CDF, the mutation type and the protein code are provided, one hidden layer with 10 neurons, and an output layer with one neuron, which provides the free energy change prediction. The analysis showed that the architecture of the Neural Network is optimum when the hidden layer consists of 10 nodes (see Figure 2). This tuning was performed with 20% of the data in order to sample the neural network hidden layer architecture and select the best configuration possible. The layers are interconnected via nodes, with each layer using the output of the previous one as its input. The neural network performs supervised learning, since it is trained in order to produce the desired targets (observed energy) according to a set of inputs (Protein ID, mutation type and CDF of the change of energy).
In this sense, the algorithm can perform a classification of the proteins and their mutations while carrying out a regression by modelling the response between the CDF of the energy change and position and the observed energy in the ProTherm database.
The neural network is trained and validated 100 times by randomly splitting the data set such that 70% of the mutated proteins and their CDFs are used for learning, 15% are used for testing and 15% for blind validation. This Neural Network utilizes the Levenberg-Marquardt algorithm as a training method [48] and the ReLU activation function.
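The architecture and the data split described above can be sketched in a few lines. Only the layer sizes (100 inputs, 10 hidden ReLU neurons, one output) and the 70/15/15 proportions come from the text; the random initialization, the function names and the dataset size are assumptions, and the Levenberg-Marquardt training itself is deliberately not shown.

```python
import random


def split_indices(n_samples, fractions=(0.70, 0.15), seed=0):
    """Random learning/testing/validation split; the remainder is validation."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_train = int(round(fractions[0] * n_samples))
    n_test = int(round(fractions[1] * n_samples))
    return (indices[:n_train],
            indices[n_train:n_train + n_test],
            indices[n_train + n_test:])


def init_network(n_inputs=100, n_hidden=10, seed=0):
    """Randomly initialized weights for a network with one hidden layer."""
    rng = random.Random(seed)
    w1 = [[rng.gauss(0.0, (2.0 / n_inputs) ** 0.5) for _ in range(n_inputs)]
          for _ in range(n_hidden)]
    b1 = [0.0] * n_hidden
    w2 = [rng.gauss(0.0, (2.0 / n_hidden) ** 0.5) for _ in range(n_hidden)]
    b2 = 0.0
    return w1, b1, w2, b2


def forward(network, x):
    """Forward pass: ReLU hidden layer followed by a linear output neuron."""
    w1, b1, w2, b2 = network
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w1, b1)]
    return sum(w * h for w, h in zip(w2, hidden)) + b2


network = init_network()
train, test, validation = split_indices(200)
print(len(train), len(test), len(validation))  # 140 30 30
```

Repeating the split with 100 different seeds and retraining each time would yield the histogram of accuracies that the text refers to.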

Results
In the ProTherm database, mutations in the protein sequence are related to the protein stability change (∆∆G). Therefore, every single mutation in the database is taken alongside the energy change it causes, in order to construct the associated cumulative distribution of energy changes. Figure 3 shows the cumulative distribution of energy changes for the amino acid substitutions A-S (where Alanine is substituted with Serine) and F-L (where Phenylalanine is substituted with Leucine). As can be observed, the problem is highly uncertain, since a wide range of energy changes is observed with high frequency. Therefore, the prediction of energy changes in protein mutants should always be accompanied by a proper uncertainty analysis and model parametrization to ensure that the models perform according to the highest accuracy standards.
Biomolecules 2019, 9, x FOR PEER REVIEW 8 of 17
The Holdout Random Sampler was designed in this paper as a consensus classifier. The utilization of consensus classifiers generally leads to a significant improvement of prediction performance and, at the same time, represents a good tool for predicting the effects of protein mutations. In this sense, it confirms that consensus prediction is an accurate and robust alternative to classical and individual machine learning tools [49].
Therefore, the Holdout Random Sampler employs a consensus system connecting different decision boundaries. In this sense, the Holdout Random Sampler removes and does not consider protein mutations with low statistics, which would lead to inaccuracies in the evaluation of (∆∆G) in the CDF. In addition, the Holdout Random Sampler computes the sign of the median (∆∆G) in the CDF in order to decide whether the specific mutation will lead to protein stability or instability.
The accuracy of each Holdout is computed as reported in the previous section, utilizing the expression:
$$Acc_{HD,i} = \frac{1}{n}\sum_{k=1}^{n} \begin{cases} 1, & S_{k,pred} = S_{k,obs} \\ 0, & S_{k,pred} \neq S_{k,obs} \end{cases}$$
The results are combined in Figure 4, where both the CDFs of accuracies and distribution histograms are presented. Once the accuracy of each Holdout is calculated, those which satisfy the condition $Acc_{HD,i} > 0.99 \cdot Acc_{HD,min}$ are selected to compute the distribution of energy changes for the entire protein dataset.
This distribution of energy changes is obtained from $E_{cdf,pred,k} = E_{cdf,MC,k} + k_\Omega \cdot \Omega_{k,cdf}$, where the parameter $k_\Omega$ and the distribution $\Omega_{k,cdf}$ were learned in the Holdouts and predicted by the selected ones. The distribution of energy change $E_{cdf,MC,k}$ is predicted by the best Holdout through a Monte Carlo simulation, in which the acceptance or rejection of an energy change is determined by the landscape of possible energy changes that a mutation causes regardless of the residue, its position or the protein. This energetic landscape for mutations was previously found in the Holdout learning dataset and predicted by the selected best Holdouts. Figures 5 and 6 show the cumulative distribution function of energy changes for selected sets of proteins with a Single Mutation and with Multiple Mutations. It can be observed that a given protein mutation can lead to any energy change that is admissible by the CDF, since such energy change may be affected by various factors, such as structural features, amino acid interactions in the neighborhood of the mutation site, temperature, pH, solubility, etc. The boxplots in Figures 5 and 6
In a wide range of problems, the median is a promising predictor since it is robust and not highly influenced by outliers. However, the Holdout Random Sampler tends to overpredict stability for protein mutations, as shown in Figure 7. This is consistent with the fact that the CDF of energy change upon mutation has a much larger region of stability than instability; consequently, during random sampling of the energy landscape of a mutant protein, it is more likely to predict a stable energy change than an unstable one. In that sense, increasing the number of training cases of unstable mutations would contribute to improving the quality of predictions made with the Holdout Sampler alone; otherwise, a posterior regression model, such as the one implemented in this research, is required.
However, the Holdout Random Sampler was designed, and should be understood, as a simple methodology to sample the uncertainty space of the energy changes upon mutation for a specific protein, not to accurately predict the energy change. Therefore, it is a very simple but powerful tool whose outcome can be used to parametrize a Neural Network in order to dramatically reduce its complexity.
Neural Networks experience a highly varying performance, which depends on the initial random conditions, when relatively small datasets are utilized. This poor performance is especially found on training datasets, which suggests that their hyper-parameter tuning processes normally underfit, rather than overfit, in contrast with other methods such as Random Forest or Support Vector Machines. Consequently, semi-supervised methods or "a priori" parametrization of the model seem to be the best fit when approaching this problem, which agrees with Dehghanpoor et al. [28].
Figure 9 shows the averages for the training, testing and validation datasets that were randomly split 100 times such that 70% of the mutant proteins and their CDFs were used for learning, 15% were used for testing and 15% were used for blind validation. The Neural Network classifies the protein type and mutation type, and performs a regression with the cumulative distribution function of energy change to fit the target values of energy changes.
The assessment of the accuracy of predicting the values of energy changes of mutations in the protein sequence is carried out by analyzing the Pearson Correlation Coefficients and the Root Mean Square Error (RMSE) shown in Figure 10.
This figure shows that the Pearson Correlation Coefficient increases linearly as the RMSE decreases, which is the expected result. Nevertheless, for the same RMSE value, the Pearson Correlation Coefficient is much higher for the Multiple Point mutations and the slope of the regression line is smaller (has a larger absolute value). The results presented in Figures 9 and 10 show that our algorithm outperforms other approaches reported in the literature [35,50] and summarized in Table 1. Our Pearson Correlation Coefficient of 0.77 ranks top, followed by the ProMaya, mCSM and ELASPIC servers, which have achieved correlation coefficients of 0.74, 0.76, and 0.77, respectively. Supplementary Material 7 shows the raw results from the neural network, which predicts the value of free energy changes upon mutations. Despite the different cross-validation approaches or pre-processing schemes, the results can be reasonably compared, as supported by the hypothesis of Biological Invariance reported by Alvarez et al. [61]; that is, the analysis of genomics, metabolomics and proteomics data should be independent of the sampling methodology and the classifier utilized for their inference. Only the differences in the size of the datasets might affect the comparison, since each set has its inherent noise and peculiarities. In addition, the utilization of binary classification data to exclude neutral models (∆∆G = 0 ± 0.5 kcal/mol) might affect the ultimate performance of this prediction scheme. Another major point worth mentioning with respect to our prediction methodology is that no direct energetic calculations were performed; the methodology was solely based on generalizing the landscape of energetic changes for each amino acid substitution regardless of the residue position and the protein.
Later, this generalization was applied to specific proteins in order to obtain the set of admissible values of energy changes to parametrize a neural network.
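For reference, the two figures of merit used throughout this section, the Pearson Correlation Coefficient and the RMSE, can be computed as in the following minimal sketch. The function names and the toy numbers are assumptions for illustration and do not reproduce any reported result.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


def rmse(predicted, observed):
    """Root mean square error between predictions and observations."""
    squared = sum((p - o) ** 2 for p, o in zip(predicted, observed))
    return math.sqrt(squared / len(observed))


# Made-up predicted and observed energy changes (kcal/mol).
predicted = [-1.0, -0.2, 0.4, 1.1]
observed = [-1.3, 0.1, 0.2, 1.5]
print(pearson(predicted, observed), rmse(predicted, observed))
```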

Conclusions
In this work, we present a pre-parametrized machine learning-based methodology to infer the effects of single and multiple point mutations on the stability of a protein. More specifically, our approach can predict the change of free energy of unfolding upon mutation (∆∆G) by using the Holdout Random Sampler to compute the distribution of the change of free energy of unfolding upon mutation of a protein and a Neural Network composed of 10 nodes to predict this effect. This distribution consists of a set of admissible values of free energy changes of a protein and it is an indication of the uncertainty behind this prediction, since the change of energy may be affected by a wide range of factors, from structural to external ones, such as temperature, pH, secondary structure, amino acid interactions within the vicinity of the mutation site, etc.
The Neural Network is trained and tested by randomly splitting the data set. This procedure has been repeated 100 times in order to assess the robustness of the modeling. Our average Pearson Correlation Coefficient is 0.6630 in the case of single point mutations and 0.7747 in the case of multiple point mutations, which indicates that our method predicts the effects of mutations with high accuracy and a low root mean square error (RMSE), outperforming other algorithms currently available in the literature.