# Exploiting Sparse Statistics for a Sequence-Based Prediction of the Effect of Mutations

## Abstract


## 1. Introduction

^{4} sequences were absent in the sequence set of proteins with known structure that was used in the development of the score. The present work is based on recognizing that the absence of an n-tuple in the sequences of folded proteins is also significant information; it is therefore worth examining what can be learned from the, obviously limited, statistics of amino acid (AA) pentuplets, hextuplets and heptuplets.

## 2. Materials and Methods

The score SC_{p} was defined by Equation (1), in which PN_{i} and PR_{i} are the probabilities of finding the construct i in the experimental set and in the RW set, respectively. For constructs i that were missing from the experimental set, PN_{i} was set to 0.5/20^{f} (f = 3, 4 and 5 for triplets, quadruplets and pentuplets, respectively).
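As a minimal sketch of the probability estimate described here, the following Python counts overlapping n-tuples in a set of sequences and falls back to the floor value 0.5/20^{f} for constructs that never occur. The function names, the toy sequences, and the normalization by the total number of windows are assumptions made for illustration; the sketch covers only the PN_{i} estimate, not the score formula itself.

```python
from collections import Counter


def ntuple_probabilities(sequences, f):
    """Estimate the probability of each overlapping n-tuple of length f.

    Assumption: probability = occurrences / total number of windows.
    """
    counts = Counter()
    for seq in sequences:
        for start in range(len(seq) - f + 1):
            counts[seq[start:start + f]] += 1
    total = sum(counts.values())
    return {construct: n / total for construct, n in counts.items()}


def construct_probability(probs, construct):
    """Return PN_i, using the floor 0.5/20^f for constructs absent from the set."""
    f = len(construct)
    return probs.get(construct, 0.5 / 20 ** f)


# Hypothetical toy "experimental set"; real input would be the PDB-derived sequences.
toy_set = ["ACDEFGHIK", "ACDEAAAAA"]
pn = ntuple_probabilities(toy_set, 3)
print(construct_probability(pn, "ACD"))  # observed triplet
print(construct_probability(pn, "WWW"))  # missing triplet, floor value
```

The floor is half of the uniform probability 1/20^{f}, so a missing construct is treated as rarer than any construct expected under a uniform random model without being assigned a probability of zero.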

The simplified score SSC_{p} was defined by Equation (2), in which the quantity indexed by i is the number of occurrences of the construct i in the experimental set. This choice was motivated by the fact that the hextuplet and heptuplet counts are so sparse that they cannot be considered a reasonable approximation of their probability of occurrence.
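To give a sense of scale for the sparsity argument, the short sketch below compares the number of possible n-tuples (20^n) with the number of windows available in a data set. The data-set size used here is a hypothetical placeholder, not a figure from the paper, and `max_coverage` is an illustrative helper name.

```python
def max_coverage(total_windows, n):
    """Upper bound on the fraction of the 20**n possible n-tuples that can be observed."""
    possible = 20 ** n
    return min(1.0, total_windows / possible)


# Assumed number of sequence windows in the data set (illustrative only).
hypothetical_windows = 1_000_000

for n in range(3, 8):
    print(n, 20 ** n, max_coverage(hypothetical_windows, n))
```

Even under this generous bound, a set with a million windows could contain at most a small fraction of all hextuplets and heptuplets, which is why raw occurrence counts rather than probability estimates are the natural quantity at these lengths.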

While SSC_{p} was used only to study mutations, the pentuplet score SC_{p} was also tested for its ability to predict the foldability of a sequence, employing the methods used in Reference [2]. This test consisted of the following steps:

- Determine the distribution of scores over the PDB set.
- Determine the distribution of scores over the RW set, consisting of 100,000 sequences of 200 residues.
- Calculate the overlap between the two distributions to see how well the scores can distinguish between folding and non-folding sequences.
- For a given sequence, calculate its score and see where it lies with respect to the score value at the intersection of the two distributions, resulting in the prediction regarding the foldability of that sequence.
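The four steps above can be sketched as follows. This is a hedged illustration, not the procedure of the cited reference: the histogram binning, the way the crossing point is located, and the assumption that folded sequences score higher than random ones are all choices made here for the example, and the function names are hypothetical.

```python
def histogram(scores, lo, hi, nbins):
    """Normalized histogram (fractions summing to 1) on nbins equal-width bins."""
    width = (hi - lo) / nbins
    counts = [0] * nbins
    for s in scores:
        idx = min(int((s - lo) / width), nbins - 1)
        counts[idx] += 1
    return [c / len(scores) for c in counts]


def overlap_and_threshold(folded_scores, random_scores, nbins=50):
    """Return (overlap fraction, score at which the two histograms cross).

    Assumes the two score sets are not all identical (hi > lo).
    """
    lo = min(min(folded_scores), min(random_scores))
    hi = max(max(folded_scores), max(random_scores))
    width = (hi - lo) / nbins
    h_fold = histogram(folded_scores, lo, hi, nbins)
    h_rand = histogram(random_scores, lo, hi, nbins)
    # Overlap: shared area under the two normalized histograms.
    overlap = sum(min(a, b) for a, b in zip(h_fold, h_rand))
    # Assumption: folded sequences score higher. Scan from the random mode
    # toward the folded mode for the first bin where the folded histogram
    # is at least as high as the random one.
    start = max(range(nbins), key=lambda k: h_rand[k])
    stop = max(range(nbins), key=lambda k: h_fold[k])
    crossing = stop
    for k in range(start, stop + 1):
        if h_fold[k] >= h_rand[k]:
            crossing = k
            break
    return overlap, lo + crossing * width


def predict_folded(score, threshold):
    """Predict 'folded' when the score lies on the folded side of the crossing."""
    return score >= threshold


# Hypothetical, well-separated score sets standing in for the PDB and RW sets.
folded = [1.0 + 0.01 * i for i in range(-50, 51)]
rand = [-1.0 + 0.01 * i for i in range(-50, 51)]
overlap, threshold = overlap_and_threshold(folded, rand)
print(overlap, threshold, predict_folded(0.8, threshold))
```

A smaller overlap means the score separates the two sets better; when the distributions do not touch at all, as in this toy input, the crossing point falls somewhere in the empty gap between them.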

## 3. Results

#### 3.1. Foldability Prediction Using Pentuplet Statistics

#### 3.2. Prediction of Protein Stability Change Upon Mutation

For the triplet, quadruplet and pentuplet scores, SC_{p} (Equation (1)) was used, while for the hextuplet and heptuplet scores, SSC_{p}, the simplified formula of Equation (2) was used.

## 4. Discussion

## Funding

## Acknowledgments

## Conflicts of Interest

## References

- Mezei, M. On predicting foldability of a protein from its sequence. Proteins **2019**, 87. in print.
- De Lucrezia, D.; Slanzi, D.; Poli, I.; Polticelli, F.; Minervini, G. Do natural proteins differ from random sequences polypeptides? Natural vs. Random proteins classification using an evolutionary neural network. PLoS ONE **2012**, 7, e36634.
- El Hage, K.; Mondal, P.; Meuwly, M. Free energy simulations for protein ligand binding and stability. Mol. Simulat. **2018**, 44, 1044–1061.
- Berman, H.M.; Westbrook, J.; Feng, Z.; Gilliland, G.; Bhat, T.N.; Weissig, H.; Shindyalov, I.N.; Bourne, P.E. The protein data bank. Nucleic Acids Res. **2000**, 28, 235–242.
- Pucci, F.; Bourgeas, R.; Rooman, M. High-quality thermodynamic data on the stability changes of proteins upon single-site mutations. J. Phys. Chem. Ref. Data **2016**, 45, 023104.
- Lavelle, D.T.; Pearson, W.R. Globally, unrelated protein sequences appear random. Bioinformatics **2010**, 26, 310–318.

**Figure 1.** Pentuplet score distributions for the experimental set (full line), for sequences randomly generated from the uniform amino acid distribution (short dashes), and for sequences randomly generated with amino acids sampled according to their natural propensity (long dashes).

| Prediction | Score Source | New PDB Set (N) | New PDB Set (%) | Uniform Random (N) | Uniform Random (%) | Weighted Random (N) | Weighted Random (%) |
|---|---|---|---|---|---|---|---|
| Folded | Quadruplets, RW | 3855 | 79.3% | 3964 | 4.0% | 4257 | 4.3% |
| Random | Quadruplets, RW | 980 | 20.7% | 96,036 | 96.0% | 95,745 | 95.7% |
| Folded | Pentuplets, RU | 2874 | 60.7% | 3628 | 3.6% | 499 | 0.5% |
| Random | Pentuplets, RU | 1861 | 39.3% | 96,372 | 96.4% | 99,501 | 99.5% |
| Folded | Pentuplets, RW | 3760 | 79.4% | 30,822 | 30.8% | 281 | 0.3% |
| Random | Pentuplets, RW | 975 | 20.6% | 69,177 | 69.2% | 99,719 | 99.7% |

**Table 2.** Comparison of the foldability predictions using combinations of quadruplet and pentuplet scores.

| Prediction | Score Source | New PDB Set (N) | New PDB Set (%) | Uniform Random (N) | Uniform Random (%) | Weighted Random (N) | Weighted Random (%) |
|---|---|---|---|---|---|---|---|
| Folded | Quad, RW-Pent, RW | 4184 | 88.4% | 31,142 | 31.1% | 4154 | 4.2% |
| Random | Quad, RW-Pent, RW | 551 | 11.6% | 68,858 | 68.9% | 95,846 | 95.8% |
| Folded | Quad, RW-Pent, RU | 4004 | 84.6% | 5125 | 5.1% | 4579 | 4.6% |
| Random | Quad, RW-Pent, RU | 731 | 15.4% | 94,875 | 94.9% | 95,421 | 95.4% |

**Table 3.** Number of matches between the signs of score changes and melting temperature changes upon mutation.

| n-Tuplet | N_{match} | % Match | N_{no data}^{1} | n-Tuplets | N_{match} | % Match | N_{consensus}^{2} |
|---|---|---|---|---|---|---|---|
| 3 | 799 | 50.8% | | 3 + 4 | 690 | 55.2% | 1251 |
| 4 | 904 | 57.4% | | 3 + 4 + 5 | 572 | 65.5% | 872 |
| 5 | 1069 | 67.9% | | 5 + 6 | 983 | 72.7% | 1353 |
| 6 | 1117 | 71.4% | 10 | 6 + 7 | 1082 | 72.1% | 1500 |
| 7 | 1085 | 70.6% | 16 | 5 + 6 + 7 | 956 | 73.7% | 1298 |

^{1} Number of mutations where the hextuplet or heptuplet scores for both the wild type and the mutant were zero; these mutations were not used.

^{2} Number of mutations where all the n-tuplets involved showed the same sign of change; only these were used for counting matches.
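The counting behind Table 3 can be illustrated with a short sketch. It assumes that the score change is taken as mutant minus wild type and that a positive score change is meant to correspond to a positive melting temperature change; the function names and the toy numbers are hypothetical and are not data from the paper.

```python
def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def count_sign_matches(score_pairs, tm_changes):
    """Count mutations whose score change has the same sign as the Tm change.

    score_pairs holds (wild_type_score, mutant_score) tuples. Mutations where
    both scores are zero are skipped as 'no data' (see footnote 1).
    Returns (number of matches, number of mutations used).
    """
    used = matches = 0
    for (wt, mut), d_tm in zip(score_pairs, tm_changes):
        if wt == 0 and mut == 0:
            continue
        used += 1
        if sign(mut - wt) == sign(d_tm):
            matches += 1
    return matches, used


def count_consensus_matches(score_changes_by_n, tm_changes):
    """Count matches only where all n-tuplet scores change with the same sign.

    score_changes_by_n is a list of per-n-tuplet lists of score changes,
    one entry per mutation (see footnote 2).
    Returns (number of matches, number of consensus mutations).
    """
    consensus = matches = 0
    for per_n, d_tm in zip(zip(*score_changes_by_n), tm_changes):
        signs = {sign(d) for d in per_n}
        if len(signs) == 1 and 0 not in signs:
            consensus += 1
            if signs.pop() == sign(d_tm):
                matches += 1
    return matches, consensus


# Hypothetical toy data for four mutations.
tm = [2.0, -1.5, 0.5, -3.0]
pent = [0.4, -0.2, -0.1, -0.6]
hexa = [0.3, -0.1, 0.2, -0.5]
matches, consensus = count_consensus_matches([pent, hexa], tm)
print(matches, consensus, 100.0 * matches / consensus)
```

Restricting the count to consensus mutations trades coverage for reliability: fewer mutations receive a prediction, but the percentage of matches is computed only over cases where the different n-tuplet scores agree.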

© 2019 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Mezei, M. Exploiting Sparse Statistics for a Sequence-Based Prediction of the Effect of Mutations. *Algorithms* **2019**, *12*, 214. https://doi.org/10.3390/a12100214