# Data Size and Quality Matter: Generating Physically-Realistic Distance Maps of Protein Tertiary Structures


## Abstract


## 1. Introduction

## 2. Related Work

## 3. Preliminaries

#### 3.1. 2D Representations of Tertiary Structure

#### 3.2. VAE Architecture

#### 3.3. CVAE-SPP

## 4. Methods

#### 4.1. $\beta $CVAE-SPP

#### 4.2. Training Dataset Compositions

#### 4.3. Input Datasets

#### 4.4. Metrics to Evaluate Generated Data

#### 4.5. Statistical Significance Tests

#### 4.6. Implementation Details

## 5. Results

#### 5.1. Impact of $\beta $ in $\beta $CVAE-SPP

#### 5.2. Comparison of $\beta $CVAE-SPP to CVAE-SPP

#### 5.3. Evaluating the Impact of Data Size, Quality, and Composition

#### 5.4. Visualization of Distance Matrices as Heatmaps

#### 5.5. Statistical Significance Analysis

## 6. Conclusions

## Supplementary Materials

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## Abbreviations

| Abbreviation | Definition |
| --- | --- |
| ANOVA | One-way Analysis of Variance |
| $\beta $VAE | Beta Variational Autoencoder |
| CA atom | Central Carbon atom |
| C-terminus | Carboxyl terminus |
| CASP | Critical Assessment of Protein Structure Prediction |
| CVAE | Convolutional Variational Autoencoder |
| EMD | Earth Mover's Distance |
| GAN | Generative Adversarial Network |
| KL divergence | Kullback–Leibler divergence |
| LR | Long-Range Contact |
| N-terminus | $N{H}_{2}$ (Amino) terminus |
| PDB | Protein Data Bank |
| PISCES | Protein Sequence Culling Server |
| SPP | Spatial Pyramid Pooling |
| SR | Short-Range Contact |
| ResNet | Residual Neural Network |
| 2-stage FDR | Two-stage False Discovery Rate |
| VAE | Variational Autoencoder |


**Figure 1.** The key components of the proposed CVAE-SPP network are depicted in the schematic. Convolutional layers in both the encoder and decoder allow the network to learn patterns over distance matrices. An SPP layer enables the encoder to accommodate distance matrices of various sizes. The proposed $\beta $CVAE-SPP network uses the same schematic but adds a hyperparameter, denoted by $\beta $, to regulate the loss function.
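The $\beta$-weighted objective that Figure 1 alludes to combines a reconstruction term with $\beta$ times the KL divergence between the encoder's diagonal-Gaussian posterior and a standard normal prior. Below is a minimal NumPy sketch of that objective; the function names, the MSE-style reconstruction term, and the array shapes are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def kl_standard_normal(mu, logvar):
    """Closed-form KL(q(z|x) || N(0, I)) for a diagonal-Gaussian encoder posterior."""
    return -0.5 * np.sum(1.0 + logvar - mu**2 - np.exp(logvar), axis=-1)

def beta_cvae_loss(x, x_recon, mu, logvar, beta=1.0):
    """Per-sample beta-weighted VAE objective: reconstruction + beta * KL."""
    # Sum of squared errors over all non-batch dimensions (MSE-style reconstruction).
    recon = np.sum((x - x_recon) ** 2, axis=tuple(range(1, x.ndim)))
    return recon + beta * kl_standard_normal(mu, logvar)
```

With $\beta = 1$ this reduces to the standard VAE objective; larger $\beta$ puts more pressure on the latent code to match the prior, at some cost in reconstruction quality.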

**Figure 2.** Varying $\beta $ in $[5,50]$ in increments of 5. Each resulting $\beta $CVAE-SPP model is trained separately on each of the 3 training dataset configurations built over the res0.0-2.5 input dataset. Data generated by each model are compared to the respective training dataset via EMD over % Backbone (top panel), SR-Score (middle panel), and LR-Score (bottom panel).
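For one-dimensional score distributions such as those in Figure 2, EMD reduces to the first Wasserstein distance, which SciPy computes directly. A short sketch; the score values here are synthetic placeholders, not data from the paper.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
# Synthetic stand-ins for SR-Score distributions over training vs. generated maps.
train_scores = rng.normal(loc=0.50, scale=0.10, size=500)
gen_scores = rng.normal(loc=0.55, scale=0.12, size=500)

# EMD between two 1D empirical distributions (lower = generated data closer to training data).
emd = wasserstein_distance(train_scores, gen_scores)
```

Identical samples yield a distance of 0, so smaller EMD values indicate generated score distributions that better match the training ones.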

**Figure 3.** Average of % Backbone (over 3 independent runs) over distance matrices generated by CVAE-SPP and $\beta $CVAE-SPP. Each model is trained on a training dataset configuration (x-axis) of an input dataset. The three panels relate the different input datasets.

**Figure 4.** Average of EMD values over SR-Score distributions (over 3 independent runs), computed over generated versus training distance matrices by CVAE-SPP and $\beta $CVAE-SPP. Each model is trained on a training dataset configuration (x-axis) of an input dataset. The three panels relate the different input datasets.

**Figure 5.** Average of EMD values over LR-Score distributions (over 3 independent runs), computed over generated versus training distance matrices by CVAE-SPP and $\beta $CVAE-SPP. Each model is trained on a training dataset configuration (x-axis) of an input dataset. The three panels relate the different input datasets.

**Figure 6.** The (average) percentage of backbone over generated data for the CVAE-SPP (top panel) and $\beta $CVAE-SPP (bottom panel) models is compared along the five training dataset configurations for each of the three input datasets.

**Figure 7.** The EMD over SR-Score distributions of generated versus training data is tracked separately for CVAE-SPP (top panel) and $\beta $CVAE-SPP (bottom panel). In each panel, the comparison is made along the five training dataset configurations for each of the three input datasets.

**Figure 8.** The EMD over LR-Score distributions of generated versus training data is tracked separately for CVAE-SPP (top panel) and $\beta $CVAE-SPP (bottom panel). In each panel, the comparison is made along the five training dataset configurations for each of the three input datasets.

**Figure 9.** We relate here distance matrices selected at random from a training dataset (leftmost column) and from generated datasets (other columns) for each model and two of the training dataset configurations. The rows separate the three input datasets utilized. A yellow-to-blue color spectrum indicates high-to-low distance values.

| Training Dataset Configurations | Descriptions |
| --- | --- |
| Configuration 1: $64\times 64$ | All contact maps extracted from an input dataset are of size $64\times 64$. |
| Configuration 2: $64(70\%)+72(30\%)$ | $70\%$ of the contact maps extracted from an input dataset are of size $64\times 64$; the rest are of size $72\times 72$. |
| Configuration 3: $64(50\%)+72(50\%)$ | $50\%$ of the contact maps extracted from an input dataset are of size $64\times 64$; the rest are of size $72\times 72$. |
| Configuration 4: $64(40\%)+72(30\%)+90(30\%)$ | $40\%$ of the contact maps extracted from an input dataset are of size $64\times 64$, $30\%$ are of size $72\times 72$, and the rest are of size $90\times 90$. |
| Configuration 5: $64(34\%)+72(33\%)+90(33\%)$ | $34\%$ of the contact maps extracted from an input dataset are of size $64\times 64$, $33\%$ are of size $72\times 72$, and the rest are of size $90\times 90$. |
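One way to realize a configuration from the table above is to crop fixed-size submatrices along the diagonal of larger distance matrices, with crop sizes drawn according to the configuration's fractions. The sketch below makes that concrete under stated assumptions (diagonal crops of NumPy arrays, random crop positions); the paper does not specify this exact procedure.

```python
import numpy as np

def crop_distance_map(dist, size, rng):
    """Crop a size x size submatrix along the diagonal of a larger distance matrix."""
    start = rng.integers(0, dist.shape[0] - size + 1)
    return dist[start:start + size, start:start + size]

def build_configuration(dist_maps, fractions, rng):
    """Assemble a training set from a mix of crop sizes.

    fractions maps crop size to its share, e.g. {64: 0.4, 72: 0.3, 90: 0.3}
    for Configuration 4.
    """
    sizes = list(fractions)
    probs = np.array([fractions[s] for s in sizes], dtype=float)
    out = []
    for d in dist_maps:
        size = int(rng.choice(sizes, p=probs))
        if d.shape[0] >= size:  # skip maps too small for the drawn crop size
            out.append(crop_distance_map(d, size, rng))
    return out
```

Cropping along the diagonal keeps each submatrix a valid distance map over a contiguous stretch of residues.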

**Table 2.** Statistical significance test over 3 groups of EMD values (over LR-Scores of generated vs. training dataset distributions) corresponding to the three different input datasets. Each group includes EMD values over LR-Score distributions (generated versus training) obtained from a model trained over each of the five training dataset configurations. The test is repeated separately for CVAE-SPP and $\beta $CVAE-SPP. The non-parametric counterpart of the one-way ANOVA test, the Kruskal–Wallis test, is included in the last column. p-values are shown. Those no higher than $0.005$ are highlighted in bold, indicating statistically-significant differences among the means of the three groups.

**res0.0-2.0 vs. res0.0-2.5 vs. res0.0-3.0; LR-Score**

| Model | One-way ANOVA (p value) | Kruskal–Wallis (p value) |
| --- | --- | --- |
| CVAE-SPP: 5 training dataset configs | **0.0017** | 0.0131 |
| $\beta $CVAE-SPP: 5 training dataset configs | **0.0018** | 0.0103 |
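Both tests in Table 2 are available in SciPy. The sketch below shows how such p-values would be obtained; the three EMD groups are illustrative made-up numbers, not the paper's values.

```python
from scipy.stats import f_oneway, kruskal

# Illustrative EMD-over-LR-Score groups, one per input dataset (made-up numbers).
res_2_0 = [0.10, 0.11, 0.12, 0.13, 0.14]
res_2_5 = [0.16, 0.17, 0.18, 0.19, 0.20]
res_3_0 = [0.24, 0.25, 0.26, 0.27, 0.28]

f_stat, p_anova = f_oneway(res_2_0, res_2_5, res_3_0)  # parametric one-way ANOVA
h_stat, p_kw = kruskal(res_2_0, res_2_5, res_3_0)      # non-parametric counterpart
```

With five EMD values per group (one per training dataset configuration), this mirrors the setup behind Table 2: a small p-value in either test indicates the choice of input dataset shifts the EMD distribution.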

**Table 3.** Post-hoc analysis over EMD values (over LR-Score distributions, generated vs. training) obtained over the training dataset configurations, comparing all pairs of input datasets. The analysis is carried out separately for CVAE-SPP and $\beta $CVAE-SPP. Dunn’s test p-values are reported using both the two-stage Benjamini–Hochberg FDR method and the Holm–Bonferroni method. Those no higher than $0.005$ are highlighted in bold, indicating statistically-significant differences among the means of the groups under comparison.

**Post Hoc Dunn’s Test (CVAE-SPP: 5 Different Configs on Training Dataset), $\mathit{\alpha}$ = 0.05; LR-Score (Training, Generated); p values**

Two-stage Benjamini–Hochberg FDR method:

| Dataset | res$0.0$-$2.0$ | res$0.0$-$2.5$ | res$0.0$-$3.0$ |
| --- | --- | --- | --- |
| res$0.0$-$2.0$ | 1.0000 | 0.2958 | 0.0067 |
| res$0.0$-$2.5$ | 0.2958 | 1.0000 | 0.0066 |
| res$0.0$-$3.0$ | 0.0067 | 0.0066 | 1.0000 |

Holm–Bonferroni method:

| Dataset | res$0.0$-$2.0$ | res$0.0$-$2.5$ | res$0.0$-$3.0$ |
| --- | --- | --- | --- |
| res$0.0$-$2.0$ | 1.0000 | 1.0000 | 0.0266 |
| res$0.0$-$2.5$ | 1.0000 | 1.0000 | 0.3998 |
| res$0.0$-$3.0$ | 0.0266 | 0.3998 | 1.0000 |

**Post Hoc Dunn’s Test ($\mathbf{\beta}$CVAE-SPP: 5 Different Configs on Training Dataset), $\mathit{\alpha}$ = 0.05; LR-Score (Training, Generated); p values**

Two-stage Benjamini–Hochberg FDR method:

| Dataset | res$0.0$-$2.0$ | res$0.0$-$2.5$ | res$0.0$-$3.0$ |
| --- | --- | --- | --- |
| res$0.0$-$2.0$ | 1.0000 | 0.1598 | **0.0037** |
| res$0.0$-$2.5$ | 0.1598 | 1.0000 | 0.0141 |
| res$0.0$-$3.0$ | **0.0037** | 0.0141 | 1.0000 |

Holm–Bonferroni method:

| Dataset | res$0.0$-$2.0$ | res$0.0$-$2.5$ | res$0.0$-$3.0$ |
| --- | --- | --- | --- |
| res$0.0$-$2.0$ | 1.0000 | 1.0000 | 0.0112 |
| res$0.0$-$2.5$ | 1.0000 | 1.0000 | 0.0851 |
| res$0.0$-$3.0$ | 0.0112 | 0.0851 | 1.0000 |
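Of the two corrections used in Table 3, the Holm–Bonferroni step-down adjustment is simple to implement directly from the raw pairwise p-values. A self-contained sketch (the input p-values in the test are illustrative, not the paper's values):

```python
def holm_bonferroni(pvals):
    """Holm step-down adjustment; returns adjusted p-values in input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices from smallest p upward
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        adj = min(1.0, (m - rank) * pvals[i])  # multiply by number of remaining hypotheses
        running_max = max(running_max, adj)    # enforce monotonicity of adjusted values
        adjusted[i] = running_max
    return adjusted
```

An adjusted p-value below $\alpha$ rejects the corresponding null hypothesis while controlling the family-wise error rate across all pairwise comparisons.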

**Table 4.** Statistical significance between the CVAE-SPP and $\beta $CVAE-SPP models is determined for each of the three datasets, res$0.0$-$2.0$ (first row), res$0.0$-$2.5$ (second row), and res$0.0$-$3.0$ (third row), through different statistical significance tests at $\alpha$ = $0.05$. Column 1 lists the individual dataset for the CVAE-SPP versus $\beta $CVAE-SPP comparison. Column 2 shows the p-value using the one-way ANOVA test, Column 3 using the Student’s t-test, Column 4 using the Kruskal–Wallis test, and Column 5 using the Mann–Whitney U test. We recall that LR-Score measures the number of long-range contacts in a distance matrix (normalized by the number of CA atoms). For both VAE models, all 5 different configurations on the training dataset are considered.

**CVAE-SPP vs. $\mathit{\beta}$CVAE-SPP: 5 Different Configs on Training Dataset; LR-Score; p values**

| Dataset | One-way ANOVA | t-test | Kruskal–Wallis | Mann–Whitney | Reject-hs |
| --- | --- | --- | --- | --- | --- |
| res$0.0$-$2.0$ | 0.3409 | 0.3409 | 0.3472 | 0.4033 | False |
| res$0.0$-$2.5$ | 0.5707 | 0.5707 | 0.3472 | 0.4033 | False |
| res$0.0$-$3.0$ | 0.0624 | 0.0624 | 0.0758 | 0.0946 | False |
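The per-dataset model comparisons in Table 4 follow the same SciPy pattern; the two groups of EMD values below are illustrative placeholders, not the paper's data.

```python
from scipy.stats import ttest_ind, mannwhitneyu

# Illustrative EMD-over-LR-Score values for the two models (made-up numbers).
cvae_spp = [0.10, 0.11, 0.12, 0.13, 0.14]
beta_cvae_spp = [0.11, 0.12, 0.12, 0.13, 0.14]

t_stat, p_t = ttest_ind(cvae_spp, beta_cvae_spp)                        # Student's t-test
u_stat, p_u = mannwhitneyu(cvae_spp, beta_cvae_spp, alternative="two-sided")  # Mann-Whitney U
```

Large p-values here would mean the difference between the two models is not statistically significant, matching the "False" entries in the Reject-hs column of Table 4.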

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Alam, F.F.; Shehu, A.
Data Size and Quality Matter: Generating Physically-Realistic Distance Maps of Protein Tertiary Structures. *Biomolecules* **2022**, *12*, 908.
https://doi.org/10.3390/biom12070908

**AMA Style**

Alam FF, Shehu A.
Data Size and Quality Matter: Generating Physically-Realistic Distance Maps of Protein Tertiary Structures. *Biomolecules*. 2022; 12(7):908.
https://doi.org/10.3390/biom12070908

**Chicago/Turabian Style**

Alam, Fardina Fathmiul, and Amarda Shehu.
2022. "Data Size and Quality Matter: Generating Physically-Realistic Distance Maps of Protein Tertiary Structures" *Biomolecules* 12, no. 7: 908.
https://doi.org/10.3390/biom12070908