A Deep-Learning Sequence-Based Method to Predict Protein Stability Changes Upon Genetic Variations

Several studies have linked disruptions of protein stability and its normal functions to disease. Therefore, during the last few decades, many tools have been developed to predict the free energy changes upon protein residue variations. Most of these methods require both sequence and structure information to obtain reliable predictions. However, the lower number of protein structures available with respect to their sequences, due to experimental issues, drastically limits the application of these tools. In addition, current methodologies ignore the antisymmetric property characterizing the thermodynamics of the protein stability: a variation from wild-type to a mutated form of the protein structure (XW→XM) and its reverse process (XM→XW) must have opposite values of the free energy difference (ΔΔGWM=−ΔΔGMW). Here we propose ACDC-NN-Seq, a deep neural network system that exploits the sequence information and is able to incorporate into its architecture the antisymmetry property. To our knowledge, this is the first convolutional neural network to predict protein stability changes relying solely on the protein sequence. We show that ACDC-NN-Seq compares favorably with the existing sequence-based methods.


Introduction
Predicting protein stability changes upon genetic variations is still an open challenge. It is essential to understand the impact of the alterations in the amino acid sequence, mainly due to non-synonymous (or missense) DNA variations leading to the disruption or the enhancement of the protein activity, on human health and disease [1][2][3][4]. In particular, protein stability perturbations have already been associated to pathogenic missense variants [5] and they were shown to contribute to the loss of function in haploinsufficient genes [6].
The protein stability changes upon variations of the amino acid sequence is usually expressed as the Gibbs free energy of unfolding (∆∆G), which is defined as the difference between the energy of the mutated structure of the protein and its wild-type form (∆∆G = ∆G M − ∆G W ). Thermodynamics imposes an antisymmetry relationship on ∆∆G that can be summarized as follows: given the wild-type (W) and mutated (M) protein structures, differing by one residue in position X, the quantity ∆∆G W M (= ∆G W − ∆G M ) represents the change in the protein stability caused by the amino acid substitution X W → X M . Similarly, given the symmetry between the two molecular systems M and W, for the reverse variation X M → X W the corresponding change in Gibbs free energy has the opposite sign: Since experimental measurement of ∆∆G is a time-consuming and complex task, during the last decades several computational tools have been developed to predict ∆∆G values. Some methods are structure-based, requiring the knowledge of the protein tertiary structure [7][8][9][10][11][12], others are sequence-based, either relying only on protein sequences [13][14][15] or optionally taking advantage of the protein structure when available [16,17]. However, most of these methods violate the antisymmetry property and suffer from high biases in predicting reverse variations [10,[18][19][20][21][22][23]. To address this problem we recently introduced ACDC-NN, a novel structure-based method that satisfies the physical property of antisymmetry, while reaching comparable performance to the state-of-the-art methods [24]. However, the experimental structure determination and characterization of protein thermodynamical features are still limited [18], while a dramatic increase in protein sequence databases has occurred as genomic and metagenomic sequencing efforts have expanded in the last years. The latest release of UniProtKB/TrEMBL protein database contains 214,406,399 sequence entries in all, including 175,817 human proteins, while the Protein Data Bank contains 177,655 entries, 52,485 of them human. Hence, computational approaches able to predict the impact of genetic variations on the protein stability using only sequence information are needed. To this aim, we created ACDC-NN-Seq, a sequence-based version of ACDC-NN that, like its predecessor, achieves accurate predictions satisfying the antisymmetry property, without the need for tertiary protein structures. Here we show that ACDC-NN-Seq compares well with both sequence-based and structure-based methods. We tested the antisymmetry of the predictions on an unbiased dataset and the accuracy on three clinically relevant proteins, avoiding overfitting by filtering out sequence similarities greater than 25%.

Datasets and Cross-Validation
Since the application of a deep learning technique requires a large amount of data to achieve the best performance, we pre-trained ACDC-NN-Seq using the predictions of another method, DDGun3D, which has shown to achieve antisymmetry with good performance. DDGun3D was not trained on experimental data, but it is based on evolutionary information and statistical potentials [14]. The pre-training phase was performed on an artificial set of variants created from the Ivankov dataset [21], that we named IvankovD-DGun. The ∆∆G values for this dataset were calculated using DDGun3D and the obtained scores were learnt by ACDC-NN-Seq.
For training/testing the network we considered the observed experimental ∆∆G variants reported by some of the most widely used datasets extracted from Protherm [25] database, already cleaned for redundancies and inaccuracies that are known to affect this database, which are: S2648 [26] and Varibench [27]. The ACDC-NN-Seq performance was tested on the following datsets: Ssym [26], p53 [28], myoglobin [29] and frataxin mutants from the CAGI5 challenge [30]. The datasets are summarized in Table 1 and a complete description can be found in our previous study [24]. Since these datasets often contain proteins with a high degree of sequence similarity between them, we avoided the related overfitting issues by filtering out similarities above 25%, as done in ACDC-NN; we generated cross-validation folds with blastclust algorithm [31] to create clusters of protein with sequence identity lower than 25% (command blasclust -i infile.fasta -o out.custers -p T -L 0.5 -b F -S 25).

Sequence Profiles
ACDC-NN-Seq uses the evolutionary information derived from multiple sequence alignments. Through the alignment it is possible to obtain a profile, i.e., an N × 20 matrix, where each entry Prof(i,j) corresponds to the frequency of the j-th residue in the i-th sequence position. N represents the protein length and 20 are the different aminoacids residues. Multiple sequence alignments were computed using the Uniprot database (release 2016) by the hhblits tool [32] with default parameters.

ACDC-NN-Seq Architecture
The ACDC-NN-Seq architecture is the sequence-based counterpart of the 3D version, already published [24]. Here, we report only the relevant points, leaving out some mathematical details.
ACDC-NN-Seq is an Antisymmetric Convolutional Differential Concatenated Neural Network (ACDC-NN) that takes as inputs both direct and reverse variations, processes them by convolution operations and then uses the extracted features as input for two siamese neural networks [33,34]. Specifically, each of the two ACDC-NN-Seq inputs consists of 160 elements to code variation and sequence evolutionary information: • Variation (V): 20 features (one for each amino acid) coding for the variation by setting all the entries to 0 with the exception of the wild-type and the variant residue positions set to −1 and 1, respectively. This input corresponds to a one-dimensional matrix V ∈ R 20×1 ; • Sequence (S or 1D-input): 140 features representing protein profile information of the variation neighbourhood. Considering i as the variant position in the sequence, we used a window of 3 residues, i.e., [i − 3; i + 3], so to obtain 20 × 7 elements, with the profile information of these 7 positions. This input then corresponds to a sequence of 7 vectors taken from the protein profile.
A 2D Convolutional layer was applied on the S matrix using a kernel equal to (1, 20) and stride (1, 1) (reported as Keras-style parameters [35]) and generating a 20 × 20 filter matrix K S . After the convolution, a dot product was performed between the variant vector V and the resulting 2D convolution matrix, obtaining 7 processed features, computed for both the direct variation and its reverse. These features were then concatenated with the variant vector V and used as input to a Differential Siamese Networks [33,34]. Finally, their outputs were combined in two Lambda layers of "difference/2" and average. To incorporate the antisymmetric property into the network structure, we designed a specific loss function that minimizes both the absolute value of the average output and the distance between the difference output and the true ∆∆G values. The ACDC-NN-Seq architecture is displayed in Figures 1 and 2.
The choice of an input window of 7 residues, three for each side of the position of interest, was made to learn the DDGun3D mapping (in the pre-training phase) since we adopted the same DDGun3D sequence neighborhood. Although we experimented with some other input window sizes, there was not much difference up to 11 residues, where we found a performance decrease.   Figure 1 is used for both direct and reverse variations. Given a variation, we provide to the network its coding to the left, and the coding to the reverse variation to the right. A final layer takes the average and the difference between the two outputs. The difference computes (∆∆G direct − ∆∆G inverse )/2, which in case of perfect antisymmetry is exactly equal to ∆∆G. The average computes (∆∆G direct + ∆∆G inverse )/2, which in case of perfect antisymmetry is equal to 0. The ACDC-NN-Seq outputs are estimations of the target ∆∆Gs learned during the training phase. The two Siamese networks have shared weights, both in the convolutional and the dense parts of the network.

Pre-Training Phase
Due to the lack of experimental data to train a network from scratch, we first pretrained the model on an artificial dataset and then performed transfer learning on real datasets. The pre-training phase was performed on the unlabeled artificial dataset IvankovD-DGun, which includes all the possible direct and reverse variations in every sequence position. We chose to perform the predictions on the IvankovDDGun dataset with a 3D method to internally encode 3D information from the sequence; DDGun3D was selected among the other predictors for its near-perfect antisymmetry. The training, validation and test sets used by DDGun3D include 400,000, 100,000 and 100,000 protein variants, respectively. The Differential Siamese Network consists of two hidden layers with 128 and 64 units (complete description in Table 2).

Transfer Learning on Experimental Data
After the pre-training phase, we applied transfer learning using S2648 [26] and Varibench [27] datasets of experimental ∆∆G values, splitting the data and removing the sequence similarity among training, validation and test sets. Specifically, the weights of the Convolutional layer were fixed while the Differential Siamese network part was re-trained selecting the best parameters. To increase the size of the training set, the unlabeled Ivankov2000 dataset with DDGun3D predictions was also considered. The complete description of the optimal sets of parameters is shown in Table 2.

Performance Evaluation
Pearson correlation (indicated by r) and root mean square error (RMSE) were estimated between the predicted and observed ∆∆G values to evaluate the performance of the methods. Two scoring indices were adopted to assess the antisymmetric property of ∆∆G predictors: r d−i and δ . r d−i is the Pearson correlation coefficient between the direct and the corresponding reverse variations: where Cov is the covariance and σ is the standard deviation. δ is the average bias quantifying the prediction shift: A perfectly antisymmetric method should have r d−i equal to −1 and δ equal to 0.

Learning 3D Properties on Artificial Data
A proper training of a neural network requires a huge amount of experimental ∆∆G values that are not currently available; we addressed this problem by performing a pretraining phase on the artificial dataset IvankovDDGun and then applying a transfer learning on the experimental datasets S2648 and Varibench, as described in the Materials and Methods section.
In Table 3, we show the results on the IvankovDDGun test set. It is worth noticing that the DDGun3D values were computed using the protein structures, while the ACDC-NN-Seq predictions are only based on sequence information. Thus, this approach obtained a sequence-based method capable of internally encoding the 3D statistical potentials that maintain the antisymmetric property.

Prediction of the Experimental ∆∆G Values
After training ACDC-NN-Seq on the IvankovDDGun set, we fine-tuned the network by retraining the last layers on the experimentally-derived ∆∆G values from S2648 and Varibench through 10-fold cross-validation. In Figure 3 we showed the experimental ∆∆G values versus the ACDC-NN-Seq predicted ones on Varibench and S2648 datasets both combined and alone, and for both direct and reverse variations. These results were obtained in cross-validation as explained in Benevenuta et al. [24]. ADCD-NN-Seq achieved both consistent performance with the state-of-the-art methods (measured in terms of r and RMSE) and perfect antisymmetry (r d−i = −1 and δ = 0.0).
We also compared the predictions of both ACDC-NN-Seq and DDGun with their corresponding stucture-based versions, i.e., ACDC-NN and DDGun3D. Figure 4 reports the comparison performance (in cross-validation for ACDC-NN and ACDC-NN-Seq) on the Ssym dataset [10], which was specifically built to assess the antisymmetry and it contains ∆∆G experimental values for direct and reverse variants. The performance of ACDC-NN-Seq is balanced and close to those obtained using the protein structures. This makes ADCD-NN-Seq ideal for genome variant analyses.
In order to evaluate the effect of the neural network design, we compared ACDC-NN-Seq with a feed-forward neural network (FFNN) trained and optimized in the same conditions. The structure of the optimized FFNN consists of an input layer of 140 input neurons (window of 7 residues coded with 20-element vector profiles), a sequence of hidden layers consisting of (128,64,32,16) neurons, and an output neuron coding for the ∆∆G value. Thus the main difference is due to the anty-symmetric construction of ACDC-NN-Seq (FFNN Figure 4). FFNN performance is quite good, and the neural networks learned most of the antisymmetry from the data provided (direct and reverse variations). However, ACDC-NN-Seq outperforms FFNN both in the prediction task and antisymmetry reconstructions. Regarding the results presented in Figures 3 and 4, it is worth noticing that the maximum achievable Pearson's correlation is not necessarily equal to 1, as usually thought. It may be far lower depending on the experimental uncertainty and the ∆∆G distributions [36,37]. In particular, when considering the different experiments on the same variants included in the Protherm database or in manually-cleaned datasets, the expected Pearson upper bound is in the range of 0.70-0.85 [36]. Significantly higher Pearson correlations can be obtained in small sets or might be indicative of overfitting issues [36].

Comparison with Other Sequence-Based Machine-Learning Methods
As mentioned above, few available methods can predict the effect of the variants on the protein stability starting from sequence only. We therefore compared ACDC-NN-Seq on three datasets with the following sequence-based methods: DDGun [14], INPS [13], I-Mutant2.0 [16], MUpro [17] and the recent SAAFEC-SEQ [15].
The obtained results are reported in Table 4; ACDC-NN-Seq predicts equally well both direct and reverse variants with nearly perfect antisymmetry (−0.99). ACDC-NN-Seq performance is higher than the one obtained by INPS, which is the only machine-learning method proven to be antisymmetric in past tests [22,38]. Table 4. Results on Ssym: The performance on both direct and reverse variants was measured in terms of Pearson correlation coefficient (r) and root mean square error (RMSE). The antisymmetry was assessed using the correlation coefficient r d−i (2) and the bias δ (3). RMSE and δ are expressed in kcal/mol. The results of INPS were taken from Montanucci et al. [14] and Fariselli et al. [13]; the results of SAAFEC-SEQ and I-mutant2.0 were obtained using their stand-alone code, those of MUpro were obtained using the webserver available. Only Inps-NoSeqId and ACDN-NN-Seq were trained in cross-validation addressing the sequence identity issue (sequence similarity <25%). .0 and MUpro do not respect the antisymmetry property since this issue was not properly addressed or known at the time the two models were created. However, it must be noted that both I-Mutant2.0 and MUpro do not use evolutionary information making them extremely fast predictors, as compared to ACDC-NN-Seq, which requires a multiple sequence alignment.

Method
Another significant point is that, looking at all the methods reported in Tables 4-6, only Inps-NoSeqId and all the versions of ACDN-NN were trained in cross-validation removing the sequence identity (i.e., sequence similarity <25%).

Frataxin CAGI 5 Challenge
The Critical Assessment of Genome Interpretation (CAGI) is a community experiment aimed at fairly assessing the computational methods for genome interpretation [30]. In CAGI 5, data providers measured unfolding free energy of a set of variants with far-UV circular dichroism and intrinsic fluorescence spectra on Frataxin (FXN), a highly conserved protein fundamental for the cellular iron homeostasis in both prokaryotes and eukaryotes. These measurements were used to calculate the change in unfolding free energy between the variant and wild-type proteins at zero denaturant concentrations (∆∆G). In addition, the experimental dataset [39], including eight amino acid substitutions, was used to evaluate the performance of the web-only tools, based on protein structure information, for predicting the value of the associated ∆∆G [40]. Here we compare the available machinelearning sequence-based predictors on the dataset (Table 7), showing the consistency of the prediction performance of ACDC-NN-Seq. Table 7. Results on Frataxin Challenge in CAGI5 [30]. The INPS, SAAFEC-SEQ and I-mutant2.0 results were obtained using their stand alone code, those of MUpro were obtained using the webserver available.

Discussion
Few sequence-based methods are currently available to predict the ∆∆G and only some of them satisfy the principle of antisymmetry imposed by thermodynamics. Given the huge amount of sequencing data currently available and their possible applications, we have proposed a reliable sequence-based predictor that could be applied to a wide range of biological problems, even when the crystallized protein structures are not available. Moreover, ACDC-NN-Seq addresses the antisymmetry physical property by exploiting a specifically-designed loss function which minimizes the absolute difference between the direct and reverse predictions [24]. We showed that ACDC-NN-Seq compares well with with the other state-of-the-art sequence-based machine-learning predictors and with some of the structure-based ones, as shown in Tables 4-7. To avoid overfitting issues due to sequence similarity, the correct procedure to assess the performance of a method on any dataset should use variants in proteins which are not similar to those used for the training [18]. Specifically, this means to remove proteins with a sequence identity >25% between the training and testing data. In this study, we followed this procedure and we divided all the datasets used into 10 non-similar subsets during cross-validation. In conclusion, ACDC-NN-Seq is a sequence-based tool able to show comparable or better performance with respect to the state-of-the-art methods while preserving perfect thermodynamic antisymmetry. In future studies, this method can be extended to predict the ∆∆Gs of multiple site variations.
Funding: This research received no external funding.

Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.

Data Availability Statement:
The data presented in this study and the code to reproduce the results are freely available in https://github.com/compbiomed-unito/acdc-nn (accessed on 11 June 2021).

Acknowledgments:
We thank the Italian Ministry for Education, University and Research for the program "Dipartimenti di Eccellenza 20182022D15D18000410001" and for MIUR-PRIN-201744NR8S.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: ACDC-NN Antisymmetric Convolutional Differential Concatenated -Neural Network NGS Next Generation Sequencing